WO2022161264A1 - Audio signal processing, conference recording and presentation methods, devices, systems and media - Google Patents

Audio signal processing, conference recording and presentation methods, devices, systems and media

Info

Publication number
WO2022161264A1
WO2022161264A1 PCT/CN2022/073092 CN2022073092W WO2022161264A1 WO 2022161264 A1 WO2022161264 A1 WO 2022161264A1 CN 2022073092 W CN2022073092 W CN 2022073092W WO 2022161264 A1 WO2022161264 A1 WO 2022161264A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
audio clips
clips
layer
clustering
Prior art date
Application number
PCT/CN2022/073092
Other languages
English (en)
French (fr)
Inventor
郑斯奇
索宏彬
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司
Publication of WO2022161264A1


Classifications

    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/06 Decision making techniques; Pattern matching strategies
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/231 Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination
    • H04N 7/00 Television systems
    • H04N 7/14 Systems for two-way working
    • H04N 7/15 Conference systems

Definitions

  • the present application relates to the technical field of audio signal processing, and in particular, to a method, device, system and medium for audio signal processing, conference recording and presentation.
  • In order to make it easy to understand, when querying the record later, which speaker each piece of speech content belongs to, the speaker needs to be identified after the voice signal is collected, that is, to identify "which speech content was said by which speaker".
  • a neural network model is used to extract voiceprint features in a speech signal, and the speech content corresponding to the same speaker is distinguished according to the voiceprint features.
  • Various aspects of the present application provide an audio signal processing, conference recording and presentation method, device, system and medium, so as to more accurately identify the audio segment corresponding to the same speaker and improve the efficiency of identification.
  • An embodiment of the present application provides an audio signal processing method, including: identifying a speaker change point in an audio signal collected in a multi-person speech scene; dividing the audio signal into multiple audio clips according to the speaker change point and extracting the voiceprint features of the multiple audio clips; performing hierarchical clustering on the multiple audio clips according to their durations and voiceprint features to obtain the audio clips corresponding to the same speaker; and adding the same user marker to the audio clips corresponding to the same speaker to obtain a user-marked audio signal.
  • The embodiment of the present application also provides an audio signal processing method, including: performing sound source localization on an audio signal collected in a multi-person speech scene to obtain change points of the sound source position; dividing the audio signal into multiple audio clips according to the change points of the sound source position and extracting the voiceprint features of the multiple audio clips; clustering the multiple audio clips according to their voiceprint features and sound source positions to obtain the audio clips corresponding to the same speaker; and adding the same user tag to the audio clips corresponding to the same speaker to obtain a user-tagged audio signal.
  • An embodiment of the present application further provides a method for recording a conference, including: collecting an audio signal in a multi-person conference scene and identifying a speaker change point in the audio signal; dividing the audio signal into multiple audio clips according to the speaker change point and extracting the voiceprint features of the multiple audio clips; performing hierarchical clustering on the multiple audio clips according to their durations and voiceprint features to obtain the audio clips corresponding to the same speaker; adding the same user mark to the audio clips corresponding to the same speaker; and generating conference record information according to the user-marked audio signal, where the conference record information includes a conference identifier.
  • An embodiment of the present application further provides a method for presenting conference records, including: receiving a conference review request, where the conference review request includes a conference identifier to be presented; obtaining the conference record information to be presented according to the conference identifier; and presenting the conference record information, where the conference record information is generated according to the user-marked audio signal of a multi-person conference scene. Among the multiple audio clips cut out according to the speaker change points in the audio signal, the audio clips corresponding to the same speaker carry the same user mark, and the audio clips corresponding to the same speaker are obtained by hierarchically clustering the multiple audio clips according to their durations and voiceprint features.
  • The embodiment of the present application further provides an audio processing system, including a sound pickup device and a server device. The sound pickup device is deployed in a multi-person speech scene, and is used to collect an audio signal in the multi-person speech scene, identify the speaker change points in the audio signal, divide the audio signal into multiple audio clips according to the speaker change points, and extract the voiceprint features corresponding to the multiple audio clips. The server device is used to perform hierarchical clustering on the multiple audio clips according to their durations and voiceprint features to obtain the audio clips corresponding to the same speaker, and to add the same user mark to the audio clips corresponding to the same speaker to obtain a user-marked audio signal.
  • The embodiment of the present application further provides another audio processing system, including a sound pickup device and a server device. The sound pickup device is deployed in a multi-person speech scene, and is used to collect an audio signal in the multi-person speech scene, identify the speaker change points in the audio signal, and divide the audio signal into multiple audio clips according to the speaker change points. The server device is used to extract the voiceprint features corresponding to the multiple audio clips, perform hierarchical clustering on the multiple audio clips to obtain the audio clips corresponding to the same speaker, and add the same user mark to the audio clips corresponding to the same speaker to obtain a user-marked audio signal.
  • The embodiment of the present application also provides a sound pickup device, including a processor and a memory; the memory is used for storing a computer program; the processor is coupled with the memory and is used for executing the computer program so as to: identify the speaker change points in an audio signal collected in a multi-person speech scene; divide the audio signal into multiple audio clips according to the speaker change points and extract the voiceprint features of the multiple audio clips; perform hierarchical clustering on the multiple audio clips according to their durations and voiceprint features to obtain the audio clips corresponding to the same speaker; and add the same user mark to the audio clips corresponding to the same speaker to obtain a user-marked audio signal.
  • The embodiment of the present application also provides another sound pickup device, including a processor and a memory; the memory is used for storing a computer program; the processor is coupled with the memory and is used for executing the computer program so as to: perform sound source localization on an audio signal collected in a multi-person speech scene to obtain the change points of the sound source position; divide the audio signal into multiple audio clips according to the change points of the sound source position and extract the voiceprint features of the multiple audio clips; cluster the multiple audio clips according to their voiceprint features and sound source positions to obtain the audio clips corresponding to the same speaker; and add the same user tag to the audio clips corresponding to the same speaker to obtain a user-tagged audio signal.
  • The embodiment of the present application also provides a server device, including a processor and a memory; the memory is used to store a computer program; the processor is coupled to the memory and used to execute the computer program so as to: acquire multiple audio clips and their corresponding voiceprint features; perform hierarchical clustering on the multiple audio clips according to their durations and voiceprint features to obtain the audio clips corresponding to the same speaker; and add the same user mark to the audio clips corresponding to the same speaker to obtain a user-marked audio signal.
  • The embodiment of the present application also provides another server device, including a processor and a memory; the memory is used to store a computer program; the processor is coupled to the memory and used to execute the computer program so as to: acquire multiple audio clips; extract the voiceprint features corresponding to the multiple audio clips and perform hierarchical clustering on the multiple audio clips according to their durations and voiceprint features to obtain the audio clips corresponding to the same speaker; and add the same user tag to the audio clips corresponding to the same speaker to obtain a user-tagged audio signal.
  • Embodiments of the present application further provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the steps in the methods provided by the embodiments of the present application.
  • Embodiments of the present application also provide a computer program product, including computer programs/instructions which, when executed by a processor, cause the processor to implement the steps in each method provided by the embodiments of the present application.
  • In the solutions provided by the embodiments of the present application, the audio signal is first cut into multiple audio segments based on the speaker change points, and the multiple audio segments are then hierarchically clustered according to their durations and voiceprint features, so that the audio segments corresponding to the same speaker are identified and user marks are added. In this process, hierarchical clustering is performed by combining the durations of the audio clips with their voiceprint features. Hierarchical clustering can first cluster the audio clips with more stable voiceprint characteristics; compared with clustering all audio clips at the same time, it reduces the error introduced by audio clips with unstable voiceprint characteristics, identifies the audio clips corresponding to the same speaker more accurately, improves recognition efficiency, and makes the user marking results more accurate.
  • FIG. 1a is a schematic flowchart of an audio signal processing method provided by an exemplary embodiment of the present application
  • FIG. 1b is a schematic flowchart of another audio signal processing method provided by an exemplary embodiment of the present application.
  • FIG. 1c is a schematic flowchart of still another audio signal processing method provided by an exemplary embodiment of the present application.
  • FIG. 2a is a schematic diagram of clustering audio clips in each layer
  • FIG. 2b is a schematic diagram of clustering audio clips in each layer
  • FIG. 2c is a schematic diagram of clustering audio clips in the first layer
  • FIG. 3a is a schematic diagram of the use state of the pickup device in a multi-person conference scenario
  • FIG. 3b is a schematic diagram of the use state of the pickup device in a business cooperation negotiation scenario
  • FIG. 3c is a schematic diagram of the use state of the pickup device in a teaching scenario
  • FIG. 3d is a schematic flowchart of a method for recording a conference provided by an exemplary embodiment of the present application
  • FIG. 3e is a schematic flowchart of a method for presenting conference records according to an exemplary embodiment of the present application
  • FIG. 4a is a schematic structural diagram of an audio processing system provided by an exemplary embodiment of the present application.
  • FIG. 4b is a schematic structural diagram of another audio processing system provided by an exemplary embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a sound pickup device provided by an exemplary embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a server device according to an exemplary embodiment of the present application.
  • In the embodiments of the present application, the audio signal of a multi-person speech scene is first cut into multiple audio segments based on the speaker change points, and the multiple audio segments are then hierarchically clustered according to their durations and voiceprint features, so that the audio clips corresponding to the same speaker are identified and user tags are added. In this process, hierarchical clustering is performed by combining the durations of the audio clips with their voiceprint features. Hierarchical clustering can first cluster the audio clips with more stable voiceprint characteristics; compared with clustering all audio clips at the same time, it reduces the error introduced by audio clips with unstable voiceprint characteristics, identifies the audio clips corresponding to the same speaker more accurately, improves recognition efficiency, and makes the user marking results more accurate.
  • FIG. 1a is a schematic flowchart of an audio signal processing method provided by an exemplary embodiment of the present application. As shown in Figure 1a, the method includes:
  • The speaker change point refers to a position in the audio signal that distinguishes different speakers, that is, the location where a speaker change event occurs; there may be one or more such points, for example two, three or five.
  • the identification method of the speaker change point is not limited, and an example is described below.
  • speaker change points in the audio signal can be identified through Voice Activity Detection (VAD) technology.
  • Endpoints in VAD refer to the transition point between silence and valid speech signal.
  • VAD technology can be used to find the corresponding start and end points of each speech segment, distinguish the speech period from the non-speech period, and also remove mute and noise.
  • The speaker change point may be determined in combination with the pause duration between speech endpoints. For example, when the pause interval between adjacent speech endpoints (i.e., the end point of one speech period and the start point of the next) is greater than a set threshold, the position of that speech endpoint can be regarded as a speaker change point.
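  • As an illustration of this pause-based criterion, the following sketch (an illustration only; the function name, segment format and the 1.0 s threshold are assumptions, not values from this application) marks a speaker change point wherever the pause between two VAD speech segments exceeds a threshold.

    from typing import List, Tuple

    def speaker_change_points(vad_segments: List[Tuple[float, float]],
                              pause_threshold: float = 1.0) -> List[float]:
        """Treat a speech endpoint as a speaker change point when the pause between
        one segment's end and the next segment's start exceeds the threshold."""
        change_points = []
        for (_, prev_end), (next_start, _) in zip(vad_segments, vad_segments[1:]):
            if next_start - prev_end > pause_threshold:
                # the endpoint position is taken as the candidate speaker change point
                change_points.append(prev_end)
        return change_points

    # Example: a 1.7 s pause between 12.4 s and 14.1 s yields one change point at 12.4 s.
    print(speaker_change_points([(0.0, 12.4), (14.1, 20.0), (20.3, 25.0)]))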
  • voiceprint feature extraction may also be performed on the audio signal collected in a multi-person speech scene, and according to the change of the voiceprint feature in the audio signal, the point where the voiceprint in the audio signal changes is used as the speaker change point.
  • VAD technology can also be combined with voiceprint features: for the start and end points of each speech period detected by VAD, the voiceprint features around adjacent start and end points are further compared, and if the voiceprint feature changes across an adjacent end point and start point, the position of that speech endpoint can be determined as a speaker change point.
  • sound source localization can be performed on the audio signal based on the microphone array to obtain the change point of the sound source position, and the speaker change point in the audio signal can be determined according to the change point of the sound source position. For example, in a speech scene where the position of each speaker is fixed, the change point of the sound source position can be used as the speaker change point.
  • the speaker may move, that is, the speaking position is not fixed.
  • In this case, sound source localization can be combined with VAD technology: sound source localization is used to locate the change points of the sound source position, VAD technology is used to determine the start and end points of each speech period in the audio signal, and the change points of the sound source position are then corrected according to the start and end points determined by VAD, so as to obtain accurate speaker change points.
  • For example, the change points of the sound source position and the VAD detection results can be aligned on the time axis, and it is determined whether a detected voice endpoint, such as a start point or an end point, exists within a certain period before and after each change point of the sound source position. If so, the location of that voice endpoint is determined as the speaker change point. In this way, speaker change points can be determined more accurately and the audio can be cut more precisely, avoiding missing prefixes and suffixes in the speech recognition results.
  • After the speaker change points are identified, the audio signal can be divided into a plurality of speech segments accordingly. For example, for a segment of audio signal whose start position is marked as A1 and end position as A2, when a speaker change point B1 is found in the signal, the audio signal can be divided into audio segment A1→B1 and audio segment B1→A2 according to the speaker change point B1.
  • voiceprint features of the multiple audio clips can be extracted, and the voiceprint features can be represented by feature vectors.
  • the voiceprint feature is the feature representation of the audio clip, and the voiceprint features of the audio clips corresponding to different speakers are generally different.
  • the implementation manner of extracting the voiceprint features of multiple speech segments is not limited.
  • For example, a neural network model for extracting voiceprint features may be pre-trained, and the pre-trained model may be used to extract the voiceprint features of the multiple speech segments, wherein the model may be, but is not limited to, a model based on Mel-scale Frequency Cepstral Coefficients (MFCC) or a Gaussian Mixture Model-Universal Background Model (GMM-UBM), and so on.
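  • For illustration, the following sketch extracts a simple clip-level voiceprint vector; it is a simplified stand-in for the pre-trained models mentioned above (it averages MFCC frames rather than using a trained neural network or GMM-UBM), and it assumes the librosa library is available.

    import numpy as np
    import librosa

    def voiceprint_embedding(clip: np.ndarray, sample_rate: int, n_mfcc: int = 20) -> np.ndarray:
        """Return a fixed-length feature vector for one audio clip."""
        mfcc = librosa.feature.mfcc(y=clip, sr=sample_rate, n_mfcc=n_mfcc)  # shape (n_mfcc, frames)
        return mfcc.mean(axis=1)  # average over time to obtain an (n_mfcc,) embedding

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        """Voiceprint feature similarity used by the clustering steps described below."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))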
  • In this embodiment, each audio clip corresponds to a speaker, but different audio clips may correspond to the same speaker or to different speakers.
  • For this reason, the multiple audio clips may be clustered based on their voiceprint features, so that audio clips with the same voiceprint characteristics are grouped together as much as possible.
  • audio clips with the same or similar voiceprint characteristics are regarded as audio clips corresponding to the same user.
  • Considering that the longer an audio clip is, the more stable its voiceprint feature tends to be, while the voiceprint feature of a short clip is less stable, the durations of the audio clips are further taken into account, and the multiple audio clips are hierarchically clustered in combination with their durations. Clustering the layered audio clips layer by layer gives full play to the advantages of longer audio clips and reduces the possible interference caused by shorter ones. Therefore, in this embodiment, after obtaining the multiple audio clips and their voiceprint features, hierarchical clustering is performed on the multiple audio clips according to their durations and voiceprint features, so as to obtain the audio clips corresponding to the same speaker.
  • Specifically, the multiple audio clips can be layered according to their durations to obtain multi-layer audio clips corresponding to different duration ranges; then, according to the voiceprint features corresponding to the multiple audio clips, hierarchical clustering is performed on the multi-layer audio clips in order of duration range from long to short to obtain at least one clustering result, each of which includes the audio clips corresponding to the same speaker.
  • In hierarchical clustering, not only are the voiceprint features used to cluster the multiple audio clips, but the durations of the audio clips are also taken into account. When audio clips of short duration are clustered by voiceprint feature, the clustering results of the longer clips are mainly relied upon, which reduces the recognition error caused by the unstable voiceprint features of short audio clips and improves the accuracy of identifying the audio clips corresponding to the same speaker.
  • the same user tag may be added to the audio segment corresponding to the same speaker, so as to obtain an audio signal to which the user tag is added.
  • the implementation of adding the user mark is not limited.
  • For example, a speech segment with a user mark may be inserted before each audio segment; e.g., a speech segment "User C1 please speak" may be inserted before the audio segment corresponding to user C1.
  • Alternatively, the same user marker is added on the audio track to the audio clips corresponding to the same speaker. For example, a red marker is added to the audio clips corresponding to speaker C2, a green marker is added to the audio clips corresponding to speaker C3, a yellow marker is added to the audio clips corresponding to speaker E, and so on.
  • In this way, the audio signal is first cut into multiple audio segments based on the speaker change points, and the multiple audio segments are then hierarchically clustered according to their durations and voiceprint features, so that the audio segments corresponding to the same speaker are identified and user marks are added. Hierarchical clustering is performed by combining the durations of the audio clips with their voiceprint features: the audio clips with more stable voiceprint characteristics are clustered first, which, compared with clustering all audio clips at the same time, reduces the error introduced by audio clips with unstable voiceprint characteristics, identifies the audio clips corresponding to the same speaker more accurately, improves recognition efficiency, and makes the user marking results more accurate.
  • The implementation of layering the multiple audio segments according to their durations to obtain multi-layer audio segments corresponding to different duration ranges is not limited.
  • For example, a number threshold can be set for each layer, the multiple audio clips are sorted by duration, and the sorted audio clips are layered according to the preset number threshold of each layer to obtain multiple layers of audio clips.
  • Alternatively, preset duration thresholds of each layer may be used: the multiple audio clips are layered according to their durations and the preset duration thresholds of each layer to obtain multi-layer audio clips corresponding to different duration ranges; the smaller the layer number, the larger the corresponding duration threshold, and the duration of the audio clips in each layer is greater than or equal to the duration threshold of that layer.
  • For example, audio clips with a duration of more than 20 s can be divided into the first layer, audio clips of 10 s to 20 s into the second layer, audio clips of 5 s to 10 s into the third layer, and audio clips of less than 5 s into the fourth layer.
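  • A minimal sketch of this duration-based layering, using the illustrative 20 s / 10 s / 5 s thresholds from the example above (clip durations are in seconds; function and variable names are assumptions):

    from typing import Dict, List, Sequence

    def layer_by_duration(durations: List[float],
                          thresholds: Sequence[float] = (20.0, 10.0, 5.0)) -> Dict[int, List[int]]:
        """Assign each clip index to a layer; smaller layer numbers hold longer clips."""
        layers: Dict[int, List[int]] = {i + 1: [] for i in range(len(thresholds) + 1)}
        for idx, duration in enumerate(durations):
            layer = len(thresholds) + 1          # default: last layer (shortest clips)
            for i, threshold in enumerate(thresholds):
                if duration >= threshold:
                    layer = i + 1
                    break
            layers[layer].append(idx)
        return layers

    # Example: clips of 25 s, 12 s, 7 s and 3 s fall into layers 1, 2, 3 and 4.
    print(layer_by_duration([25.0, 12.0, 7.0, 3.0]))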
  • the implementation manner of performing hierarchical clustering on the multi-layer audio clips to obtain at least one clustering result is not limited. Details are described below.
  • In one manner, the audio clips in each layer can first be clustered according to their voiceprint features to obtain the clustering results of each layer; then, in order of layer number from small to large, the clustering results of every two adjacent layers are clustered in turn according to the voiceprint features of the clustering results of each layer, to obtain at least one clustering result.
  • the clustering result of each layer may be one or multiple, for example, 2, 3, or 5, etc., which are not limited.
  • As shown in Figure 2a, the multiple audio clips are divided into three layers according to their durations, and the audio clips of each layer are clustered according to their voiceprint features to obtain the clustering results of each layer. Assume that the first layer has two clustering results D1 and D2, the second layer has three clustering results D3, D4 and D5, and the third layer has two clustering results D6 and D7. First, according to the voiceprint features of the clustering results D3, D4 and D5, it is judged into which of the first-layer clustering results D1 or D2 each of them can be clustered, and the second-layer results are merged into the first layer accordingly; assume that this yields two clustering results E1 and E2. Then, according to the voiceprint features of the third-layer clustering results D6 and D7, it is judged whether they can be clustered into E1 or E2; assuming that D6 is clustered into E1 to obtain the clustering result E3 and D7 is clustered into E2 to obtain the clustering result E4, two clustering results E3 and E4 are finally obtained, that is, the audio clips corresponding to two speakers are obtained.
  • In another manner, the audio clips of the first layer are clustered first, and then, based on the clustering results of the first layer, the audio clips of each subsequent layer are clustered into the existing clustering results in order of layer number from small to large. Specifically, for the audio clips in the first layer, the audio clips are clustered according to their voiceprint features to obtain at least one clustering result; for the audio clips in each non-first layer, in order of layer number from small to large, the audio clips are clustered into the existing clustering results according to their voiceprint features; and audio clips that cannot be clustered into the existing clustering results are clustered among themselves to generate new clustering results.
  • For example, assume that the audio clips of the first layer are clustered to obtain two clustering results F1 and F2, where the clustering result F1 contains the audio clips g1 and g2 and the clustering result F2 contains the audio clip g3. Then, the audio clips of the second layer are clustered into the existing first-layer clustering results F1 and F2. Assume the second layer contains three audio clips g4, g5 and g6; according to the voiceprint features of g4, g5 and g6, it is determined whether each of them can be clustered into the clustering result F1 or F2. Assuming that the audio clips g5 and g6 are clustered into F2 while the audio clip g4 cannot be clustered into either F1 or F2, the audio clip g4 then forms a new clustering result on its own.
  • The implementation of clustering multiple audio clips is not limited; for example, K-Means clustering, mean-shift clustering, density-based clustering (DBSCAN), Expectation-Maximization (EM) clustering with Gaussian Mixture Models (GMM), agglomerative hierarchical clustering, or graph community detection clustering may be used.
  • The implementation of clustering the audio clips in the first layer according to their voiceprint features to obtain at least one clustering result is also not limited.
  • One implementation of clustering the audio clips in the first layer according to their voiceprint features to obtain at least one clustering result is as follows, assuming the first layer contains at least two audio clips: the voiceprint feature similarity between the at least two audio clips in the first layer is calculated according to their voiceprint features, and this voiceprint feature similarity is used as the overall similarity between the audio clips; according to the overall similarity between the at least two audio clips in the first layer, the audio clips are divided into at least one clustering result; and, according to the voiceprint features of the audio clips included in each clustering result, the cluster center of each clustering result is calculated, where the cluster center includes a central voiceprint feature.
  • Specifically, for each audio clip in the first layer, the overall similarity between the audio clip and the other audio clips in the first layer can be calculated according to their voiceprint features; if the overall similarity between the audio clip and a target audio clip meets a set condition, the two are clustered into a target clustering result, and the cluster center of the target clustering result is updated according to the voiceprint features of the audio clip and the target audio clip. Further, it can be calculated whether the remaining audio clips in the first layer can be clustered into the target clustering result; the remaining audio clips that cannot be clustered into the target clustering result are clustered among themselves according to their voiceprint features to produce new clustering results, until every audio clip in the first layer belongs to some clustering result.
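  • A sketch of this first-layer procedure (greedy assignment with a running cluster center; the 0.9 similarity threshold and the data structures are assumptions for illustration, not values from this application):

    import numpy as np
    from typing import Dict, List

    def _cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def cluster_first_layer(embeddings: List[np.ndarray], threshold: float = 0.9) -> List[Dict]:
        """Cluster first-layer clips by voiceprint similarity; each cluster keeps a center."""
        clusters: List[Dict] = []   # each cluster: {"members": [clip indices], "center": vector}
        for idx, emb in enumerate(embeddings):
            best, best_sim = None, threshold
            for cluster in clusters:
                sim = _cosine(emb, cluster["center"])   # overall similarity = voiceprint similarity
                if sim >= best_sim:
                    best, best_sim = cluster, sim
            if best is None:
                clusters.append({"members": [idx], "center": emb.copy()})  # new clustering result
            else:
                best["members"].append(idx)
                member_embs = [embeddings[i] for i in best["members"]]
                best["center"] = np.mean(member_embs, axis=0)              # update the cluster center
        return clusters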
  • For example, assume the first layer contains three audio clips h1, h2 and h3. The voiceprint feature similarity of audio clips h1 and h2 can be calculated first and taken as the overall similarity between them. If the overall similarity meets the set condition, it is considered that h1 and h2 come from the same speaker, so the audio clips h1 and h2 are clustered into a clustering result H1, and the cluster center of H1, that is, its central voiceprint feature, is calculated.
  • For example, the voiceprint feature of audio clip h1 can be used directly as the central voiceprint feature, the voiceprint feature of audio clip h2 can be used as the central voiceprint feature, or the voiceprint features of h1 and h2 can be averaged to obtain the central voiceprint feature, which is not limited.
  • Then, the similarity between the central voiceprint feature of the clustering result H1 and the voiceprint feature of audio clip h3 can be calculated and regarded as the overall similarity between H1 and h3. If it meets the set condition, it is considered that the clustering result H1 and the audio clip h3 come from the same speaker, so they are clustered into one clustering result H2, and the central voiceprint feature of H2 is calculated according to the voiceprint features of H1 and h3; if the similarity does not meet the set condition, it is considered that H1 and h3 do not come from the same speaker, and the audio clip h3 forms a clustering result H3 on its own.
  • The implementation of clustering the audio clips in the non-first layers into the existing clustering results, in order of layer number from small to large and according to the voiceprint features of those audio clips, is not limited either. For example, for each audio clip in any non-first layer, the overall similarity between the audio clip and each existing clustering result is calculated according to the voiceprint feature of the audio clip and the cluster centers of the existing clustering results; if there is a target clustering result among the existing clustering results whose overall similarity with the audio clip meets the set similarity condition, the audio clip is added to the target clustering result, and the cluster center of the target clustering result is updated according to the voiceprint feature of the audio clip.
  • This embodiment does not limit the implementation of updating the cluster center of the target clustering result according to the voiceprint feature corresponding to the audio segment.
  • For example, the voiceprint features of all audio clips included in the target clustering result can be directly averaged, and the resulting new central voiceprint feature is used as the updated cluster center of the target clustering result.
  • Alternatively, the layer number to which each audio clip included in the target clustering result belongs is determined, and different weights are set for different layers, with smaller layer numbers corresponding to larger weights; then, according to the weight of the layer each audio clip belongs to, the voiceprint features of the audio clips are weighted and summed to obtain a new central voiceprint feature as the updated cluster center of the target clustering result.
  • For example, assume that the target clustering result includes audio clips j1 and j2 of the first layer and audio clip j3 of the second layer; the weight of the first-layer audio clips is set to k1 and the weight of the second-layer audio clip is set to k2, with k1 greater than k2, and the new central voiceprint feature is obtained by weighting the voiceprint features of j1 and j2 with k1 and the voiceprint feature of j3 with k2 and summing the results.
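  • A sketch of this weighted center update (the weight values and function names are illustrative assumptions; smaller layer numbers receive larger weights, as described above):

    import numpy as np
    from typing import Dict, List, Optional

    def weighted_center(embeddings: List[np.ndarray], layers: List[int],
                        layer_weights: Optional[Dict[int, float]] = None) -> np.ndarray:
        """embeddings: voiceprint vectors of the clips in one cluster; layers: layer number of each clip."""
        if layer_weights is None:
            layer_weights = {1: 0.6, 2: 0.3, 3: 0.1}      # assumed weights, layer 1 weighted most
        weights = np.array([layer_weights[layer] for layer in layers], dtype=float)
        weights /= weights.sum()                           # normalise so the weights sum to 1
        return np.average(np.stack(embeddings), axis=0, weights=weights)

    # Example: j1 and j2 from the first layer and j3 from the second layer, as in the passage above.
    j1, j2, j3 = np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([0.5, 0.5])
    print(weighted_center([j1, j2, j3], [1, 1, 2]))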
  • an audio signal processing method is also provided in the embodiment of the present application, as shown in FIG. 1b, the method includes:
  • First, sound source localization is performed on the audio signal to obtain the change points of the sound source position; the audio signal is then divided according to the change points of the sound source position to obtain a plurality of audio clips, each of which corresponds to a unique sound source position.
  • each sound source position corresponds to a speaker, that is, each audio segment corresponds to a speaker.
  • If a speaker moves while speaking, the audio clip of that speaker may be divided into two audio clips, one corresponding to before the position change and one to after it; in this case, each audio clip still corresponds to one speaker.
  • the voiceprint features of the multiple audio segments can be extracted.
  • In this embodiment, each audio clip corresponds to a speaker, but different audio clips may correspond to the same speaker or to different speakers.
  • For this reason, the multiple audio clips may be clustered based on their voiceprint features, so that audio clips with the same voiceprint characteristics are grouped together as much as possible.
  • audio clips with the same or similar voiceprint characteristics are regarded as audio clips corresponding to the same user.
  • During clustering, the sound source positions corresponding to the audio clips can also be taken into account: if two audio clips have the same or similar voiceprint features and come from the same sound source position, the probability that they correspond to the same user is higher. In addition, the longer an audio clip is, the more stable its voiceprint feature tends to be; conversely, the shorter the audio clip, the less stable its voiceprint feature and the less distinct the differences between speakers. Therefore, in this embodiment, the durations of the audio clips are further considered, and the multiple audio clips are hierarchically clustered in combination with their durations.
  • Clustering the layered audio clips layer by layer gives full play to the advantages of longer audio clips and reduces the possible interference caused by shorter ones. Therefore, in this embodiment, after obtaining the multiple audio clips, their voiceprint features and their sound source positions, the multiple audio clips are hierarchically clustered according to their durations, voiceprint features and sound source positions to obtain the audio clips corresponding to the same speaker.
  • One implementation of hierarchically clustering the multiple audio clips according to their durations, voiceprint features and sound source positions includes: layering the multiple audio clips according to their durations to obtain multi-layer audio clips corresponding to different duration ranges; and, according to the voiceprint features and sound source positions corresponding to the multiple audio clips, performing hierarchical clustering on the multi-layer audio clips in order of duration range from long to short to obtain at least one clustering result, each of which includes the audio clips corresponding to the same speaker.
  • the implementation manner of layering multiple audio segments according to the durations of the multiple audio segments to obtain multi-layered audio segments corresponding to different duration ranges can be referred to the foregoing embodiments, which will not be repeated here.
  • The implementation of performing hierarchical clustering on the multi-layer audio clips according to the voiceprint features and sound source positions of the multiple audio clips, in order of duration range from long to short, to obtain at least one clustering result is not limited. An example is given below.
  • In one manner, the audio clips in each layer can be clustered according to their voiceprint features and sound source positions to obtain the clustering results of each layer; then, in order of layer number from small to large, the clustering results of every two adjacent layers are clustered in turn according to the voiceprint features and sound source positions of the clustering results of each layer, to obtain at least one clustering result.
  • In another manner, the audio clips of the first layer may be clustered first, and then, in order of layer number from small to large, the audio clips of each subsequent layer are clustered into the existing clustering results. Specifically, for the audio clips of the first layer, the audio clips are clustered according to their voiceprint features and sound source positions to obtain at least one clustering result; then, for the audio clips in each non-first layer, in order of layer number from small to large, the audio clips are clustered into the existing clustering results according to their voiceprint features and sound source positions; and if there are remaining audio clips that have not been clustered into the existing clustering results, the remaining audio clips are clustered among themselves according to their voiceprint features and sound source positions to generate new clustering results, until every audio clip in all layers belongs to at least one clustering result.
  • The implementation of clustering the audio clips in the first layer according to their voiceprint features and sound source positions includes: if the first layer only includes one audio clip, that audio clip itself forms a clustering result; if the first layer contains at least two audio clips, the voiceprint features and sound source positions of those audio clips are used to calculate the overall similarity between the at least two audio clips in the first layer. For example, the voiceprint feature similarity of the audio clips can be calculated first, then their sound source location similarity, and the voiceprint feature similarity and the sound source location similarity can be weighted and combined to obtain the overall similarity between the at least two audio clips in the first layer.
  • Then, the at least two audio clips in the first layer are divided into at least one clustering result according to the overall similarity between them. Further, the cluster center of each clustering result is calculated according to the voiceprint features and sound source positions of the audio clips it includes; the cluster center includes a central voiceprint feature and a central sound source position, which provide the basis for clustering the audio clips not in the first layer into the at least one clustering result.
  • For example, the average of the voiceprint features of the audio clips included in a clustering result can be taken as its central voiceprint feature, and the average of the sound source positions of those audio clips can be taken as its central sound source position.
  • Alternatively, the voiceprint feature of any audio clip included in the clustering result can be used directly as its central voiceprint feature, and the sound source position of any audio clip included in the clustering result can be used as its central sound source position.
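  • A sketch of the overall similarity used in this variant, combining voiceprint similarity with sound-source-position similarity by a weighted sum (the 0.7/0.3 weights and the distance-to-similarity mapping are assumptions for illustration):

    import numpy as np

    def position_similarity(p1, p2, scale: float = 1.0) -> float:
        """Map the distance between two sound source positions (e.g. x/y coordinates) into [0, 1]."""
        return float(np.exp(-np.linalg.norm(np.asarray(p1, float) - np.asarray(p2, float)) / scale))

    def overall_similarity(emb1, emb2, pos1, pos2,
                           w_voice: float = 0.7, w_pos: float = 0.3) -> float:
        """Weighted combination of voiceprint feature similarity and sound source location similarity."""
        voice_sim = float(np.dot(emb1, emb2) /
                          (np.linalg.norm(emb1) * np.linalg.norm(emb2) + 1e-9))
        return w_voice * voice_sim + w_pos * position_similarity(pos1, pos2)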
  • For example, assume the audio clips in the first layer are audio clips m1 to m6. The overall similarity between any two audio clips can be calculated, and two audio clips whose overall similarity is higher than a set similarity threshold (for example, 90%) are clustered together. For example, if the overall similarity of audio clips m1 and m3 is 91%, the overall similarity of audio clips m2 and m4 is 93%, and the overall similarity of audio clips m5 and m6 is 95%, then audio clips m1 and m3 can be clustered to obtain the clustering result M1, audio clips m2 and m4 can be clustered to obtain the clustering result M2, audio clips m5 and m6 can be clustered to obtain the clustering result M3, and the cluster centers of M1, M2 and M3 are calculated respectively.
  • Further, the overall similarity between the clustering results can be calculated according to their cluster centers; if the overall similarity between two clustering results is not lower than the set threshold (for example, 90%), the two clustering results continue to be clustered together. For example, if the overall similarity between clustering results M1 and M2 is 90%, the overall similarity between M1 and M3 is 85%, and the overall similarity between M2 and M3 is 80%, then M1 and M2 continue to be clustered into the clustering result M4, while M3 remains a clustering result on its own. In this way, the audio clips of the first layer yield two clustering results, M3 and M4.
  • For each audio clip in a non-first layer, the process of clustering it into the existing clustering results includes: calculating the overall similarity between the audio clip and each existing clustering result according to the voiceprint feature and sound source position of the audio clip and the cluster centers of the existing clustering results; if there is a target clustering result among the existing clustering results whose overall similarity with the audio clip meets the set similarity condition, it can be considered that the audio clip and the audio clips in the target clustering result come from the same speaker, so the audio clip is added to the target clustering result, and the cluster center of the target clustering result is updated according to the voiceprint feature and sound source position of the audio clip.
  • the cluster center of the target clustering result may be updated in but not limited to the following manner.
  • For example, the voiceprint features of all audio clips included in the target clustering result can be averaged, and the average is taken as the central voiceprint feature of the cluster center; similarly, the sound source positions of those audio clips are averaged, and the average is taken as the central sound source position of the cluster center.
  • Alternatively, the layer number to which each audio clip in the target clustering result belongs can be determined, and different weights set for different layers, with smaller layer numbers corresponding to larger weights; the voiceprint features of the audio clips are then weighted and summed according to the weights of the layers they belong to, to obtain a new central voiceprint feature, and the sound source positions of the audio clips are weighted and summed in the same way to obtain a new central sound source position; the new central voiceprint feature and the new central sound source position form the updated cluster center of the target clustering result.
  • In this method, for the audio signal of a multi-person speech scene, the audio signal is first cut into multiple audio segments based on the sound source positions, and the multiple audio segments are then hierarchically clustered according to their durations, voiceprint features and sound source positions, so that the audio segments corresponding to the same speaker are identified and user tags are added. In this process, clustering no longer relies on the voiceprint feature alone, but combines the sound source position, the voiceprint feature and hierarchical clustering. The sound source position allows the audio signal to be segmented accurately, and hierarchical clustering reduces the impact of short speech on the recognition results while the voiceprint feature is used to identify the audio clips corresponding to the same speaker, which greatly improves recognition efficiency and makes the user marking results more accurate.
  • This embodiment also provides an audio signal processing method, as shown in FIG. 1c, the method includes:
  • First, sound source localization is performed on the audio signal collected in a multi-person speech scene to obtain the change points of the sound source position; the audio signal is divided into multiple audio clips according to the change points of the sound source position, and the voiceprint features of the multiple audio clips are extracted;
  • Dividing the audio signal into multiple audio clips according to the change points of the sound source position includes: using the change points of the sound source position directly as the speaker change points and dividing the audio signal accordingly; or, in combination with VAD technology, using VAD to detect the start and end points in the audio signal, correcting the change points of the sound source position according to those start and end points to obtain the speaker change points, and then cutting the audio signal into multiple audio clips according to the speaker change points.
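  • A sketch of this correction step: each sound-source change point is snapped to the nearest VAD endpoint found within a small time window, so that the cut does not fall in the middle of a word (the 0.5 s window and the function names are assumptions):

    from typing import List

    def correct_change_points(source_changes: List[float],
                              vad_endpoints: List[float],
                              window: float = 0.5) -> List[float]:
        """Replace each sound-source change point with the nearest VAD endpoint within the window."""
        corrected = []
        for t in source_changes:
            candidates = [e for e in vad_endpoints if abs(e - t) <= window]
            if candidates:
                corrected.append(min(candidates, key=lambda e: abs(e - t)))  # snap to nearest endpoint
        return corrected

    # Example: a change point at 10.2 s with VAD endpoints at 10.0 s and 14.0 s is corrected to 10.0 s.
    print(correct_change_points([10.2], [10.0, 14.0]))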
  • Clustering the multiple audio clips to obtain the audio clips corresponding to the same speaker includes: layering the multiple audio clips according to their durations to obtain multi-layer audio clips corresponding to different duration ranges; and performing hierarchical clustering on the multi-layer audio clips, in order of duration range from long to short, according to the voiceprint features and sound source positions corresponding to the multiple audio clips, to obtain at least one clustering result, each of which includes the audio clips corresponding to the same speaker.
  • In this method as well, the audio signal is first cut into multiple audio segments based on the speaker change points, and the multiple audio segments are then hierarchically clustered according to their durations and voiceprint features. Hierarchical clustering is performed by combining the durations of the audio clips with their voiceprint features: the audio clips with more stable voiceprint characteristics are clustered first, which, compared with clustering all audio clips at the same time, reduces the error introduced by audio clips with unstable voiceprint characteristics, identifies the audio clips corresponding to the same speaker more accurately, improves recognition efficiency, and makes the user marking results more accurate.
  • the audio signal processing methods provided by the embodiments of the present application can be applied to various multi-person speech scenarios, such as multi-person conference scenarios, business meeting scenarios, or teaching scenarios.
  • The sound pickup device of this embodiment is deployed in these scenarios to collect audio signals in a multi-person speech scenario and to implement the other functions described in the above method embodiments and the following system embodiments of the present application.
  • the placement position of the sound pickup device can be reasonably determined according to the specific deployment situation of the multi-person speech scene.
  • As shown in Figure 3a, in a multi-person conference scenario, the pickup device is deployed in the center of the conference table, and multiple speakers are distributed in different directions around the pickup device, which makes it convenient to pick up the voice of each speaker;
  • As shown in Figure 3b, in a business cooperation negotiation scenario, the first business party and the second business party are seated opposite each other, the conference organizer is located between them and is responsible for organizing the negotiation, and the pickup device is deployed at the conference organizer's position;
  • As shown in Figure 3c, in a teaching scenario, the pickup device is deployed on the desk, and the teacher and students are located in different positions around the pickup device, so that the voices of the teacher and the students can be picked up at the same time.
  • a conference recording can be performed for the multi-person conference scenario, and further, the conference recording can be presented or reproduced.
  • a method for recording a meeting provided by an exemplary embodiment of the present application includes the following steps:
  • For a detailed description of steps 301d-304d, reference may be made to the foregoing embodiments, and details are not repeated here. In this embodiment, the description focuses on step 305d.
  • conference record information can be generated according to the user-marked audio signal, and a corresponding conference identifier is added to the conference record information.
  • the conference identifier is unique and can uniquely identify a multi-person meeting.
  • the user-marked audio signal can be directly used as the conference record information.
  • Alternatively, the user-marked audio signal can be converted into text information with speaker information, and the text information can include content similar to, but not limited to, the following format: Speaker A: xxxx; Speaker B: yyy; and so on. The text information with speaker information is then used as the meeting record information. Whatever the form of the meeting record information, the meeting scene can be reproduced based on it, which is convenient for querying or consulting the meeting content.
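  • A minimal sketch of turning the user-marked clips into meeting record text of the form above; the transcribe() callback stands in for whatever speech-to-text step is used and is an assumption, as are the meeting_id field and the record layout:

    from typing import Callable, List, Tuple

    def build_meeting_record(marked_clips: List[Tuple[str, bytes]],
                             transcribe: Callable[[bytes], str],
                             meeting_id: str) -> str:
        """marked_clips: (user mark, clip audio) pairs in time order."""
        lines = [f"Meeting ID: {meeting_id}"]
        for user_mark, clip_audio in marked_clips:
            lines.append(f"{user_mark}: {transcribe(clip_audio)}")  # e.g. "Speaker A: ..."
        return "\n".join(lines)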
  • FIG. 3e is a schematic flowchart of a method for presenting conference records provided by an exemplary embodiment of the present application. As shown in FIG. 3e, the method includes:
  • 301e Receive a conference review request, where the conference review request includes a conference identifier to be presented;
  • Then, the meeting record information is presented; the meeting record information is generated according to the user-marked audio signal of the multi-person conference scene. Among the multiple audio clips cut out according to the speaker change points in the audio signal, the audio clips corresponding to the same speaker carry the same user tag, and the audio clips corresponding to the same speaker are obtained by hierarchically clustering the multiple audio clips according to their durations and voiceprint features.
  • for the multi-person conference scenario, conference recording can be performed, and the conference recording process is as follows: collect audio signals in the multi-person conference scene and identify the speaker change points in the audio signal; segment the audio signal into multiple audio clips according to the speaker change points, and extract the voiceprint features of the multiple audio clips; further, perform hierarchical clustering on the multiple audio clips according to their durations and voiceprint features to obtain the audio clips corresponding to the same speaker; add the same user mark to the audio clips corresponding to the same speaker to obtain a user-marked audio signal; generate conference record information according to the user-marked audio signal, and add the corresponding conference identifier to the conference record information.
  • after the conference record information is obtained, the relevant conference content can be consulted through it, so a conference review service can be provided externally.
  • based on this, a conference review request sent from outside can be received, the request carrying the conference identifier to be presented; based on this conference identifier and the conference identifier in each piece of conference record information, the conference record information to be presented can be obtained, and the conference record information is then presented.
  • optionally, if the conference record information is a user-marked audio signal, the user-marked audio signal can be played through a player, or it can be converted into text information and then displayed;
  • if the conference record information is text information with speaker information converted from a user-marked audio signal, the text information with speaker information can be displayed on a display, or it can be played back through a player. In this way, the query or consultation requirements for the conference content can be met.
  • in addition to the conference identifier to be presented, the conference review request may also include speaker information or a user tag, where the speaker information and the user tag have a corresponding relationship. In this case, the conference record information to be presented can be obtained according to the conference identifier; according to the speaker information or user tag, the part of the conference content corresponding to that speaker information or user tag is obtained from the conference record information, and that part of the conference content is presented.
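As a rough illustration of the review flow described above, the sketch below looks up conference record information by conference identifier and optionally filters it to one speaker's content via the user tag. The in-memory store and field names are assumptions for illustration only.

```python
# Hypothetical in-memory store keyed by conference identifier; each record holds
# (user_tag, text) entries. All names here are illustrative assumptions.
RECORDS = {
    "conf-001": [("spk_0", "Opening remarks."), ("spk_1", "Budget update.")],
}

def handle_review_request(conference_id, user_tag=None):
    record = RECORDS.get(conference_id)
    if record is None:
        return None                          # unknown conference identifier
    if user_tag is not None:                 # present only one speaker's content
        record = [entry for entry in record if entry[0] == user_tag]
    return record

print(handle_review_request("conf-001"))                     # full record
print(handle_review_request("conf-001", user_tag="spk_1"))   # one speaker only
```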
  • the methods provided in the embodiments of the present application may all be completed by a sound pickup device, or some functions may be implemented on a server device, which is not limited.
  • the sound pickup device may be implemented as a voice recorder pen, a recording stick, a recorder or a sound pickup, etc., or may be implemented as a terminal device with a recording function or an audio and video conference device, and the like.
  • this embodiment provides an audio processing system, and describes a process in which the audio signal processing method is jointly implemented based on a sound pickup device and a server device.
  • the audio processing system 400 includes: a sound pickup device 401 and a server device 402 .
  • the audio processing system 400 can be applied to a multi-person speech scene, such as the multi-person conference scene shown in FIG. 3a, the business cooperation negotiation scene shown in FIG. 3b, and the teaching scene shown in FIG. 3c.
  • the sound pickup device 401 can cooperate with the server device 402 to implement the above method embodiments of the present application, and the server device 402 is not shown in the multi-person speech scenarios shown in FIGS. 3a to 3c.
  • the sound pickup device 401 in this embodiment has functional modules such as a power-on button, an adjustment button, a microphone array, and a speaker, and further optionally, may also include a display screen.
  • the pickup device 401 can realize functions such as automatic recording, MP3 playback, FM radio, digital camera functions, telephone recording, timed recording, external transcription, repeat playback, or editing.
  • as shown in FIG. 4a, the sound pickup device 401 can collect audio signals in the multi-person speech scene, identify the speaker change points in the audio signal, segment the audio signal into multiple audio clips according to the speaker change points, extract the voiceprint features corresponding to the multiple audio clips, and send the multiple audio clips and their corresponding voiceprint features to the server device 402.
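The voiceprint feature extraction step can be illustrated as follows. In practice a trained speaker-embedding model would normally be used; the sketch below substitutes a mean-MFCC vector (via librosa) as a simple, runnable stand-in, and the 16 kHz sample rate is an assumption.

```python
# Stand-in for voiceprint feature extraction: a unit-normalized mean MFCC vector.
import numpy as np
import librosa

def voiceprint_feature(clip: np.ndarray, sr: int = 16000) -> np.ndarray:
    mfcc = librosa.feature.mfcc(y=clip, sr=sr, n_mfcc=20)   # shape (20, n_frames)
    feat = mfcc.mean(axis=1)                                # average over time
    return feat / (np.linalg.norm(feat) + 1e-9)             # unit-normalize

# Two dummy clips (3 s and 8 s of noise) standing in for segmented audio.
clips = [np.random.randn(16000 * 3), np.random.randn(16000 * 8)]
features = [voiceprint_feature(c) for c in clips]
print([f.shape for f in features])   # [(20,), (20,)]
```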
  • the server device 402 may receive the multiple audio clips sent by the sound pickup device 401 and their corresponding voiceprint features, perform hierarchical clustering on the multiple audio clips according to their durations and voiceprint features to obtain the audio clips corresponding to the same speaker, and add the same user mark to the audio clips corresponding to the same speaker to obtain a user-marked audio signal.
  • the sound pickup device 401 can use a microphone array to pick up audio signals in the multi-person speech scene, and, based on the intensities of the same sound signal picked up by microphones at different positions in the microphone array, the sound source position of that sound signal can be calculated. Based on this, in an optional embodiment of the present application, when identifying the speaker change points in the audio signal, the sound pickup device 401 can perform sound source localization on the audio signal to obtain change points of the sound source position, and determine the speaker change points in the audio signal according to the change points of the sound source position. Further, multiple audio clips can be segmented according to the speaker change points, and each audio clip corresponds to a unique sound source position.
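As an illustration of deriving speaker change points from sound source localization, the sketch below assumes the microphone-array front end already produces a per-frame direction-of-arrival track, and declares a change point wherever the direction jumps by more than a threshold; the frame hop and the 30-degree threshold are illustrative assumptions.

```python
# Turn per-frame sound-source directions into candidate speaker change points.
import numpy as np

def change_points_from_doa(doa_deg, frame_hop_s=0.1, jump_deg=30.0):
    doa = np.asarray(doa_deg, dtype=float)
    diffs = np.abs(np.diff(doa))
    diffs = np.minimum(diffs, 360.0 - diffs)   # wrap-around distance on the circle
    idx = np.where(diffs > jump_deg)[0] + 1    # first frame after each jump
    return [i * frame_hop_s for i in idx]      # change points in seconds

doa_track = [10] * 40 + [95] * 25 + [180] * 35  # three stable source directions
print(change_points_from_doa(doa_track))        # [4.0, 6.5]
```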
  • correspondingly, when the server device 402 performs hierarchical clustering on the multiple audio clips according to their durations and voiceprint features to obtain the audio clips corresponding to the same speaker, it can perform the hierarchical clustering on the multiple audio clips according to their durations, voiceprint features and sound source positions, to obtain the audio clips corresponding to the same speaker.
  • an audio processing system is also provided.
  • the difference between the embodiment shown in Fig. 4b and the embodiment shown in Fig. 4a is that, in Fig. 4a, the process of extracting the voiceprint features corresponding to the multiple audio clips is implemented on the sound pickup device 401, whereas in Fig. 4b that process is implemented on the server device 402. The other content of Fig. 4b is the same as or similar to that shown in Fig. 4a; for details, reference may be made to the foregoing embodiments, which are not repeated here.
  • after the server device 402 adds the same user tag to the audio clips corresponding to the same speaker, it can store the user-marked audio signal for subsequent query and use.
  • in an optional embodiment, as shown in FIG. 4a, the audio processing system further includes a transcription device 403; the server device 402 can send the user-marked audio signal to the transcription device 403, and the transcription device 403 receives the user-marked audio signal, converts it into text information with the user mark, and returns the user-marked text information to the server device 402 or stores it in the database 406.
  • further, as shown in FIG. 4a, the audio processing system also includes a query terminal 404; the query terminal 404 can send a first query request to the server device 402, the first query request including the user tag to be queried; the server device 402 receives the first query request, obtains the text information corresponding to the user tag to be queried from the text information with user tags, and returns it to the query terminal 404.
  • in another optional embodiment, after the server device 402 generates the user-marked audio signal, it can output the user-marked audio signal to an upper-layer application on the server device 402; the upper-layer application can be, for example, a remote conference application or a social application.
  • the upper-layer application can obtain user information in the multi-person speech scenario, for example the user's identification information such as a name, a nickname or voiceprint features, and can associate the user information with the user-marked audio signal.
  • the association between the user information and the audio signal with the user mark is not limited.
  • for example, the upper-layer application stores the correspondence between user marks and user information; based on this correspondence, the user information corresponding to a user mark can be found, and that user information is associated with the user-marked audio signal.
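A minimal sketch of such an association, assuming a simple correspondence table from user tags to user information (the field names are illustrative assumptions):

```python
# Hypothetical correspondence table maintained by the upper-layer application.
USER_INFO = {"spk_0": {"name": "Alice"}, "spk_1": {"name": "Bob"}}

def attach_user_info(tagged_segments, user_info=USER_INFO):
    """tagged_segments: list of {'user_tag': ..., ...}; adds a 'user' field."""
    return [{**seg, "user": user_info.get(seg["user_tag"])} for seg in tagged_segments]

print(attach_user_info([{"user_tag": "spk_0", "start": 0.0, "end": 4.2}]))
# [{'user_tag': 'spk_0', 'start': 0.0, 'end': 4.2, 'user': {'name': 'Alice'}}]
```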
  • further, the query terminal 404 can send a second query request to the server device 402, the second query request including the audio clip to be queried; the server device 402 receives the second query request, extracts the user tag corresponding to the audio clip to be queried from the user-marked audio signal, and returns the user tag and/or the user information corresponding to the user tag to the query terminal 404.
  • in yet another optional embodiment, as shown in FIG. 4b, the audio processing system further includes a playback device 405; the playback device 405 can send an audio signal acquisition request to the server device 402, and the server device 402 can, based on the request, output the user-marked audio signal to the playback device 405, which receives and plays the user-marked audio signal.
  • the execution subject of each step of the method provided in the above-mentioned embodiments may be the same device, or the method may also be executed by different devices.
  • the execution body of steps 101a to 103a may be device A; for another example, the execution body of steps 101a and 102a may be device A, and the execution body of step 103a may be device B; and so on.
  • FIG. 5 is a schematic structural diagram of a sound pickup device provided by an exemplary embodiment of the present application. As shown in FIG. 5 , the sound pickup device includes: a processor 55 and a memory 54 .
  • Memory 54 stores computer programs and may be configured to store various other data to support operation on the pickup device. Examples of such data include instructions for any application or method operating on the pickup device, contact data, phonebook data, messages, pictures, videos, etc.
  • Memory 54 may be implemented by any type of volatile or non-volatile storage device or combination thereof, such as static random access memory (SRAM), electrically erasable programmable read only memory (EEPROM), erasable Programmable Read Only Memory (EPROM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), Magnetic Memory, Flash Memory, Magnetic or Optical Disk.
  • the processor 55, coupled with the memory 54, is used for executing the computer program in the memory 54 to: identify the speaker change points in the audio signal collected in the multi-person speech scene; segment the audio signal into multiple audio clips according to the speaker change points, and extract the voiceprint features of the multiple audio clips; perform hierarchical clustering on the multiple audio clips according to their durations and voiceprint features to obtain the audio clips corresponding to the same speaker; and add the same user tag to the audio clips corresponding to the same speaker to obtain a user-tagged audio signal.
  • the above process can be completed entirely on the pickup device, or some functions can be performed on the server device; for example, extracting the voiceprint features of the multiple audio clips, performing hierarchical clustering on the multiple audio clips according to their durations and voiceprint features to obtain the audio clips corresponding to the same speaker, and adding the same user mark to the audio clips corresponding to the same speaker to obtain the user-marked audio signal can be completed in cooperation with the server device.
  • when the processor 55 performs hierarchical clustering on the multiple audio clips according to their durations and voiceprint features to obtain the audio clips corresponding to the same speaker, it is specifically used for: layering the multiple audio clips according to their durations to obtain multi-layer audio clips corresponding to different duration ranges; and hierarchically clustering the multi-layer audio clips, in order of duration range from long to short and according to the voiceprint features corresponding to the multiple audio clips, to obtain at least one clustering result, where each clustering result includes audio clips corresponding to the same speaker.
  • when layering the multiple audio clips according to their durations to obtain multi-layer audio clips corresponding to different duration ranges, the processor 55 is specifically configured to: layer the multiple audio clips according to the durations of the multiple audio clips and the preset duration threshold of each layer, to obtain multi-layer audio clips corresponding to different duration ranges; wherein the smaller the layer number, the larger the corresponding duration threshold, and the duration of the audio clips in each layer is greater than or equal to the duration threshold of that layer.
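The layering rule can be illustrated with the following sketch, which assigns each clip to the first layer whose duration threshold it meets; the threshold values used here (20 s, 10 s, 5 s, 0 s) are assumptions for illustration rather than values fixed by this embodiment.

```python
# Layer clips by duration: smaller layer numbers carry larger thresholds, and
# every clip in a layer is at least as long as that layer's threshold.
def layer_clips(clips, thresholds=(20.0, 10.0, 5.0, 0.0)):
    """clips: list of dicts with a 'duration' key (seconds). Returns a list of
    layers, with layers[0] holding the longest clips."""
    layers = [[] for _ in thresholds]
    for clip in clips:
        for i, t in enumerate(thresholds):
            if clip["duration"] >= t:
                layers[i].append(clip)
                break
    return layers

clips = [{"id": k, "duration": d} for k, d in enumerate((32.0, 12.5, 7.0, 2.1))]
for i, layer in enumerate(layer_clips(clips), start=1):
    print(f"layer {i}: {[c['id'] for c in layer]}")
# layer 1: [0], layer 2: [1], layer 3: [2], layer 4: [3]
```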
  • when the processor 55 performs hierarchical clustering on the multi-layer audio clips, in order of duration range from long to short and according to the voiceprint features corresponding to the multiple audio clips, to obtain at least one clustering result, it is specifically used for: for the audio clips in the first layer, clustering the audio clips in the first layer according to their corresponding voiceprint features to obtain at least one clustering result; for the audio clips in the non-first layers, in order of layer number from small to large, clustering the audio clips in each non-first layer into the existing clustering results according to their corresponding voiceprint features; and, if a non-first layer contains remaining audio clips that are not clustered into the existing clustering results, clustering the remaining audio clips according to their corresponding voiceprint features to generate new clustering results, until every audio clip on all layers is clustered into a clustering result.
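The layer-by-layer clustering flow just described might look roughly like the following sketch: clips in the first (longest-duration) layer seed the clusters, and clips in lower layers either join the most similar existing cluster or start a new one. The cosine similarity measure, the 0.75 threshold and the plain-mean cluster centre are illustrative assumptions, not values specified by the embodiment.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def hierarchical_cluster(layers, threshold=0.75):
    """layers: layered clips, longest-duration layer first; each clip is a dict
    with an 'id' and a voiceprint feature vector under 'feat'."""
    clusters = []  # each cluster: {'members': [ids], 'feats': [vectors], 'center': vector}
    for layer in layers:
        for clip in layer:
            sims = [cosine(clip["feat"], c["center"]) for c in clusters]
            best = int(np.argmax(sims)) if sims else -1
            if sims and sims[best] >= threshold:
                c = clusters[best]                     # join the most similar cluster
                c["members"].append(clip["id"])
                c["feats"].append(clip["feat"])
                c["center"] = np.mean(c["feats"], axis=0)
            else:                                      # otherwise start a new cluster
                clusters.append({"members": [clip["id"]],
                                 "feats": [clip["feat"]],
                                 "center": np.array(clip["feat"], dtype=float)})
    return clusters

spk_a, spk_b = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
layers = [
    [{"id": "a1", "feat": spk_a}, {"id": "b1", "feat": spk_b}],      # long clips
    [{"id": "a2", "feat": np.array([0.9, 0.1, 0.0])}],               # shorter clip
]
print([c["members"] for c in hierarchical_cluster(layers)])          # [['a1', 'a2'], ['b1']]
```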
  • when the processor 55 identifies the speaker change points in the audio signal collected in the multi-person speech scene, it is specifically configured to: perform sound source localization on the audio signal to obtain change points of the sound source position; and determine the speaker change points in the audio signal according to the change points of the sound source position; wherein each audio clip segmented by the speaker change points corresponds to a unique sound source position.
  • when the processor 55 performs hierarchical clustering on the multi-layer audio clips, in order of duration range from long to short and according to the voiceprint features corresponding to the multiple audio clips, to obtain at least one clustering result, it is specifically used to: perform hierarchical clustering on the multi-layer audio clips, in order of duration range from long to short, according to the voiceprint features and sound source positions corresponding to the multiple audio clips, so as to obtain at least one clustering result, where each clustering result includes audio clips corresponding to the same speaker.
  • when the processor 55 performs hierarchical clustering on the multi-layer audio clips, in order of duration range from long to short and according to the voiceprint features and sound source positions corresponding to the multiple audio clips, to obtain at least one clustering result, it is specifically used for: for the audio clips in the first layer, clustering the audio clips in the first layer according to their corresponding voiceprint features and sound source positions to obtain at least one clustering result; for the audio clips in the non-first layers, in order of layer number from small to large, clustering the audio clips in each non-first layer into the existing clustering results according to their corresponding voiceprint features and sound source positions; and, if a non-first layer contains remaining audio clips that are not clustered into the existing clustering results, clustering the remaining audio clips according to their corresponding voiceprint features and sound source positions to produce new clustering results, until every audio clip on all layers is clustered into a clustering result.
  • when the processor 55 clusters the audio clips in the first layer according to the voiceprint features and sound source positions corresponding to the audio clips in the first layer to obtain at least one clustering result, it is specifically used for: in the case that the first layer contains at least two audio clips, calculating the overall similarity between the at least two audio clips according to the voiceprint features and sound source positions corresponding to the at least two audio clips in the first layer; dividing the at least two audio clips in the first layer into at least one clustering result according to the overall similarity between the at least two audio clips in the first layer; and calculating a cluster center for each of the at least one clustering result according to the voiceprint features and sound source positions of the audio clips it contains, where the cluster center includes a central voiceprint feature and a central sound source position.
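One way to realize the "overall similarity" and the cluster centre described here is sketched below: the overall similarity is a weighted mix of voiceprint cosine similarity and sound-source proximity, and the cluster centre is the mean voiceprint feature together with the mean source position. The 0.7/0.3 weights and the 90-degree normalization are assumptions for illustration.

```python
import numpy as np

def overall_similarity(feat_a, pos_a, feat_b, pos_b, w_voice=0.7, w_pos=0.3):
    """Weighted mix of voiceprint cosine similarity and sound-source proximity."""
    voice_sim = float(np.dot(feat_a, feat_b) /
                      (np.linalg.norm(feat_a) * np.linalg.norm(feat_b) + 1e-9))
    angle = abs(pos_a - pos_b) % 360.0
    angle = min(angle, 360.0 - angle)            # wrap-around angular distance
    pos_sim = max(0.0, 1.0 - angle / 90.0)       # 1 when co-located, 0 beyond 90 degrees
    return w_voice * voice_sim + w_pos * pos_sim

def cluster_center(feats, positions):
    """Cluster centre: central voiceprint feature plus central sound-source position."""
    return np.mean(np.stack(feats), axis=0), float(np.mean(positions))

a, b = np.array([1.0, 0.0]), np.array([0.95, 0.05])
print(overall_similarity(a, 10.0, b, 15.0))      # high: similar voiceprints, nearby sources
print(cluster_center([a, b], [10.0, 15.0]))      # (array([0.975, 0.025]), 12.5)
```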
  • when the processor 55 clusters the audio clips in the non-first layers into the existing clustering results, in order of layer number from small to large and according to the voiceprint features and sound source positions corresponding to the audio clips in the non-first layers, it is specifically used for: for each audio clip in any non-first layer, calculating the overall similarity between the audio clip and the existing clustering results according to the voiceprint feature and sound source position corresponding to the audio clip and the cluster centers of the existing clustering results; and, if the existing clustering results contain a target clustering result whose overall similarity with the audio clip satisfies the set similarity condition, adding the audio clip to the target clustering result and updating the cluster center of the target clustering result according to the voiceprint feature and sound source position corresponding to the audio clip.
  • when updating the cluster center of the target clustering result according to the voiceprint feature and sound source position corresponding to the audio clip, the processor 55 is specifically configured to: determine the layer number to which each audio clip included in the target clustering result belongs, where different layer numbers correspond to different weights and the smaller the layer number, the greater the corresponding weight; and, according to the weight corresponding to the layer number of each audio clip, perform a weighted sum of the corresponding voiceprint features and of the corresponding sound source positions respectively, to obtain a new central voiceprint feature and a new central sound source position as the updated cluster center of the target clustering result.
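The layer-weighted centre update could be sketched as follows, where clips from smaller layer numbers (longer, more stable clips) receive larger weights; the weight table itself is an illustrative assumption, since the embodiment only requires the weight to decrease as the layer number grows.

```python
import numpy as np

# Illustrative per-layer weights: layer 1 (longest clips) weighs the most.
LAYER_WEIGHTS = {1: 0.5, 2: 0.3, 3: 0.15, 4: 0.05}

def update_cluster_center(member_feats, member_positions, member_layers):
    w = np.array([LAYER_WEIGHTS[l] for l in member_layers], dtype=float)
    w = w / w.sum()                                    # normalize over present members
    center_feat = np.average(np.stack(member_feats), axis=0, weights=w)
    center_pos = float(np.average(member_positions, weights=w))
    return center_feat, center_pos

feats = [np.array([1.0, 0.0]), np.array([0.8, 0.2]), np.array([0.0, 1.0])]
center = update_cluster_center(feats, [10.0, 12.0, 40.0], member_layers=[1, 1, 3])
print(center)   # dominated by the two layer-1 clips
```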
  • when the processor 55 performs hierarchical clustering on the multi-layer audio clips, in order of duration range from long to short and according to the voiceprint features and sound source positions corresponding to the multiple audio clips, to obtain at least one clustering result, it is specifically used to: cluster the audio clips in each layer according to the voiceprint features corresponding to the audio clips in that layer, to obtain a clustering result for each layer; and, in order of layer number from small to large, cluster the clustering results of every two adjacent layers in turn according to the voiceprint features of each layer's clustering results, to obtain at least one clustering result.
  • the processor 55 is further configured to: output the user-marked audio signal to the transcription device, so that the transcription device converts the user-marked audio signal into text information with the user mark; or output the user-marked audio signal to the playback device, so that the playback device can play the user-marked audio signal; or output the user-marked audio signal to the upper-layer application, so that the upper-layer application can obtain the user information corresponding to the user mark and associate it with the user-marked audio clips.
  • the processor 55 is further configured to: receive a first query request, where the first query request includes the user tag to be queried, obtain the text information corresponding to the user tag to be queried from the text information with user tags, and return it to the query end that initiated the first query request; or receive a second query request, where the second query request includes the audio clip to be queried, extract the user tag corresponding to the audio clip to be queried from the user-marked audio signal, and return the user tag and/or the user information corresponding to the user tag to the query end that initiated the second query request.
  • the sound pickup device further includes: a communication component 56 , a display 57 , a power supply component 58 , an audio component 59 and other components. Only some components are schematically shown in FIG. 5 , which does not mean that the sound pickup device only includes the components shown in FIG. 5 .
  • an embodiment of the present application further provides a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the processor can implement the steps that can be performed by the sound pickup device in the method embodiments shown in FIG. 1a and FIG. 1b.
  • an embodiment of the present application also provides a computer program product, including a computer program/instructions; when the computer program/instructions are executed by a processor, the processor can implement the steps performed by the sound pickup device in the method embodiments shown in FIG. 1a and FIG. 1b.
  • An embodiment of the present application further provides a sound pickup device, and the implementation structure of the sound pickup device is the same as or similar to that of the sound pickup device shown in FIG. 5 , and can be implemented with reference to the structure of the sound pickup device shown in FIG. 5 .
  • the difference between the sound pickup device provided in this embodiment and the sound pickup device in the embodiment shown in FIG. 5 is mainly that the functions implemented by the processor executing the computer program stored in the memory are different.
  • for the sound pickup device provided in this embodiment, its processor executes the computer program stored in the memory to: perform sound source localization on the audio signal collected in the multi-person speech scene to obtain change points of the sound source position; segment the audio signal into multiple audio clips according to the change points of the sound source position, and extract the voiceprint features of the multiple audio clips; cluster the multiple audio clips according to their voiceprint features and sound source positions to obtain the audio clips corresponding to the same speaker; and add the same user mark to the audio clips corresponding to the same speaker to obtain a user-marked audio signal.
  • the above process can be completed entirely on the sound pickup device, or some functions can be performed on the server device; for example, extracting the voiceprint features of the multiple audio clips, clustering the multiple audio clips according to their voiceprint features and sound source positions to obtain the audio clips corresponding to the same speaker, and adding the same user mark to the audio clips corresponding to the same speaker to obtain the user-marked audio signal can be completed in cooperation with the server device.
  • an embodiment of the present application further provides a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the processor can implement each step that can be executed by the sound pickup device in the method embodiment shown in FIG. 1c.
  • an embodiment of the present application also provides a computer program product, including a computer program/instructions; when the computer program/instructions are executed by a processor, the processor can implement each step that can be executed by the sound pickup device in the method embodiment shown in FIG. 1c.
  • FIG. 6 is a schematic structural diagram of a server device according to an exemplary embodiment of the present application. As shown in FIG. 6 , the server device includes: a processor 65 and a memory 64 .
  • Memory 64 stores computer programs and may be configured to store various other data to support operations on the server device. Examples of such data include instructions for any application or method operating on the server device, contact data, phonebook data, messages, pictures, videos, etc.
  • Memory 64 may be implemented by any type of volatile or nonvolatile storage device or combination thereof, such as static random access memory (SRAM), electrically erasable programmable read only memory (EEPROM), erasable Programmable Read Only Memory (EPROM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), Magnetic Memory, Flash Memory, Magnetic or Optical Disk.
  • the processor 65, coupled with the memory 64, is used for executing the computer program in the memory 64 to: receive the multiple audio clips sent by the sound pickup device and their corresponding voiceprint features; perform hierarchical clustering on the multiple audio clips according to their durations and voiceprint features to obtain the audio clips corresponding to the same speaker; and add the same user mark to the audio clips corresponding to the same speaker to obtain a user-marked audio signal.
  • the server device further includes: a communication component 66 , a power supply component 68 and other components.
  • FIG. 6 only schematically shows some components, which does not mean that the server device only includes the components shown in FIG. 6 .
  • the embodiment of the present application further provides a server device, the implementation structure of the server device is the same as or similar to that of the server device shown in FIG. 6 , and can be implemented with reference to the structure of the server device shown in FIG. 6 .
  • the difference between the server device provided in this embodiment and the server device in the embodiment shown in FIG. 6 is mainly that the functions implemented by the processor executing the computer program stored in the memory are different.
  • for the server device provided in this embodiment, its processor executes the computer program stored in the memory to: receive the multiple audio clips sent by the sound pickup device; extract the voiceprint features corresponding to the multiple audio clips; perform hierarchical clustering on the multiple audio clips according to their durations and voiceprint features to obtain the audio clips corresponding to the same speaker; and add the same user tag to the audio clips corresponding to the same speaker to obtain a user-marked audio signal.
  • the embodiments of the present application also provide a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the processor can implement the steps that can be executed by the server device in the above audio signal processing method embodiments.
  • an embodiment of the present application also provides a computer program product, including a computer program/instructions; when the computer program/instructions are executed by a processor, the processor can implement each step that can be executed by the server device in the above audio signal processing method embodiments.
  • an embodiment of the present application also provides a conference recording device. The conference recording device includes: a memory and a processor; the memory is used to store a computer program; the processor is coupled with the memory and is used to execute the computer program to: collect audio signals in a multi-person conference scene and identify the speaker change points in the audio signal; segment the audio signal into multiple audio clips according to the speaker change points, and extract the voiceprint features of the multiple audio clips; perform hierarchical clustering on the multiple audio clips according to their durations and voiceprint features to obtain the audio clips corresponding to the same speaker; add the same user mark to the audio clips corresponding to the same speaker to obtain a user-marked audio signal; and generate conference record information according to the user-marked audio signal, where the conference record information includes a conference identifier.
  • An embodiment of the present application further provides a conference record presentation device. The conference record presentation device includes: a memory and a processor; the memory is used for storing a computer program; the processor is coupled with the memory and is used for executing the computer program stored in the memory to: receive a conference review request, where the conference review request includes a conference identifier to be presented; obtain, according to the conference identifier, the conference record information to be presented; and present the conference record information, the conference record information being generated according to a user-marked audio signal in a multi-person conference scene; wherein, among the multiple audio clips segmented according to the speaker change points in the audio signal, the audio clips corresponding to the same speaker are added with the same user mark, and the audio clips corresponding to the same speaker are obtained by hierarchically clustering the multiple audio clips according to the durations and voiceprint features of the multiple audio clips.
  • an embodiment of the present application further provides a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the processor can implement each step in the method embodiments shown in FIG. 3d or FIG. 3e.
  • an embodiment of the present application also provides a computer program product, including a computer program/instructions; when the computer program/instructions are executed by a processor, the processor can implement each step in the method embodiments shown in FIG. 3d or FIG. 3e.
  • the communication components in FIGS. 5 and 6 described above are configured to facilitate wired or wireless communication between the device where the communication component is located and other devices.
  • the device where the communication component is located can access a wireless network based on a communication standard, such as WiFi, a mobile communication network such as 2G, 3G, 4G/LTE, 5G, or a combination thereof.
  • the communication component receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication assembly further includes a near field communication (NFC) module to facilitate short-range communication.
  • the NFC module may be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
  • the above-mentioned display in FIG. 5 includes a screen, and the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user.
  • the touch panel includes one or more touch sensors to sense touch, swipe, and gestures on the touch panel. The touch sensor can sense not only the boundaries of a touch or swipe action, but also the duration and pressure associated with the touch or swipe action.
  • a power supply assembly may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to the equipment in which the power supply assembly is located.
  • the audio components described above in FIG. 5 may be configured to output and/or input audio signals.
  • the audio component includes a microphone (MIC) that is configured to receive external audio signals when the device in which the audio component is located is in operating modes, such as call mode, recording mode, and speech recognition mode.
  • the received audio signal may be further stored in memory or transmitted via the communication component.
  • the audio assembly further includes a speaker for outputting audio signals.
  • the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory result in an article of manufacture comprising instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • Memory may include forms of non-persistent memory, random access memory (RAM) and/or non-volatile memory in computer readable media, such as read only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
  • Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology.
  • Information may be computer readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Flash Memory or other memory technology, Compact Disc Read Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage, Magnetic tape cassettes, magnetic tape magnetic disk storage or other magnetic storage devices or any other non-transmission medium that can be used to store information that can be accessed by a computing device.
  • computer-readable media does not include transitory computer-readable media, such as modulated data signals and carrier waves.

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

An audio signal processing, conference recording and presentation method, device, system and medium. The method includes: performing sound source localization on an audio signal collected in a multi-person speech scene to obtain change points of the sound source position (101b); segmenting the audio signal into multiple audio clips according to the change points of the sound source position, and extracting voiceprint features of the multiple audio clips (102b); performing hierarchical clustering on the multiple audio clips according to the durations, voiceprint features and sound source positions of the multiple audio clips to obtain audio clips corresponding to the same speaker (103b); and adding the same user mark to the audio clips corresponding to the same speaker to obtain a user-marked audio signal (104b).

Description

音频信号处理、会议记录与呈现方法、设备、系统及介质
本申请要求2021年01月26日递交的申请号为202110105959.1、发明名称为“音频信号处理、会议记录与呈现方法、设备、系统及介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及音频信号处理技术领域,尤其涉及一种音频信号处理、会议记录与呈现方法、设备、系统及介质。
背景技术
在会议、庭审现场等多人发言场景中,为了满足对会议内容进行记录的需求,通常采用一些具有语音采集功能的产品,例如拾音器、录音笔等,实时采集多人发言场景中的语音信号。基于这些产品采集的语音信号,可以直接基于语音信号查询多人发言场景中的发言内容,或者,也可以将语音信号转写为文字后进行查询。
为了便于在查询时能够了解发言内容对应的发言人信息,在采集到语音信号之后,还需要对发言人进行识别,即识别出“哪些发言内容是哪个发言人说的”。在现有技术中,采用神经网络模型提取语音信号中的声纹特征,根据声纹特征来区分同一发言人对应的发言内容。
但是,在实际应用中,多人发言场景中可能存在较强的噪音干扰,发言人的声纹特征也可能受情绪影响而发生变化,这些会导致基于声纹特征的识别结果存在误判,识别准确率较低。
发明内容
本申请的多个方面提供一种音频信号处理、会议记录与呈现方法、设备、系统及介质,用以能够更加准确地识别同一发言人对应的音频片段,提高识别的效率。
本申请实施例提供一种音频信号处理方法,包括:识别在多人发言场景中采集到的音频信号中的发言人变更点;根据发言人变更点将音频信号切分为多个音频片段,并提取多个音频片段的声纹特征;根据多个音频片段的时长和声纹特征,对多个音频片段进行分层次聚类,以得到对应同一发言人的音频片段;为对应同一发言人的音频片段添加相同的用户标记,以得到添加用户标记的音频信号。
本申请实施例还提供一种音频信号处理方法,包括:对在多人发言场景中采集到的音频信号进行声源定位,以得到声源位置的变更点;根据声源位置的变更点将音频信号切分为多个音频片段,并提取多个音频片段的声纹特征;根据多个音频片段的声纹特征和声源位置,对多个音频片段进行聚类,得到对应同一发言人的音频片段;为对应同一 发言人的音频片段添加相同的用户标记,以得到添加用户标记的音频信号。
本申请实施例还提供一种会议记录方法,包括:采集多人会议场景中的音频信号,识别所述音频信号中的发言人变更点;根据所述发言人变更点将所述音频信号切分为多个音频片段,并提取所述多个音频片段的声纹特征;根据所述多个音频片段的时长和声纹特征,对所述多个音频片段进行分层次聚类,以得到对应同一发言人的音频片段;为对应同一发言人的音频片段添加相同的用户标记,根据添加用户标记的音频信号生成会议记录信息,所述会议记录信息包括会议标识。
本申请实施例还提供一种会议记录呈现方法,包括:接收会议查阅请求,所述会议查阅请求包含待呈现的会议标识;根据所述会议标识,获取待呈现的会议记录信息;呈现所述会议记录信息,所述会议记录信息是根据多人会议场景中添加用户标记的音频信号生成的;其中,根据所述音频信号中的发言人变更点所切分出的多个音频片段中,对应同一发言人的音频片段添加有相同的有用户标记,对应同一发言人的音频片段是根据所述多个音频片段的时长和声纹特征对所述多个音频片段进行分层次聚类得到的。本申请实施例还提供一种音频处理系统,包括:拾音设备和服务端设备;拾音设备部署在多人发言场景中,用于采集多人发言场景中的音频信号,识别音频信号中的发言人变更点,根据发言人变更点将音频信号切分为多个音频片段,并提取多个音频片段对应的声纹特征;服务端设备,用于根据多个音频片段的时长和声纹特征,对多个音频片段进行分层次聚类,以得到对应同一发言人的音频片段;为对应同一发言人的音频片段添加相同的用户标记,以得到添加用户标记的音频信号。
本申请实施例还提供一种音频处理系统,包括:拾音设备和服务端设备;拾音设备部署在多人发言场景中,用于采集多人发言场景中的音频信号,识别音频信号中的发言人变更点,根据发言人变更点将音频信号切分为多个音频片段;服务端设备,用于提取多个音频片段对应的声纹特征,根据多个音频片段的时长和声纹特征,对多个音频片段进行分层次聚类,以得到对应同一发言人的音频片段;为对应同一发言人的音频片段添加相同的用户标记,以得到添加用户标记的音频信号。
本申请实施例还提供一种拾音设备,包括:处理器和存储器;存储器,用于存储计算机程序;处理器与存储器耦合,用于执行计算机程序,以用于:识别在多人发言场景中采集到的音频信号中的发言人变更点;根据发言人变更点将音频信号切分为多个音频片段,并提取多个音频片段的声纹特征;根据多个音频片段的时长和声纹特征,对多个音频片段进行分层次聚类,以得到对应同一发言人的音频片段;为对应同一发言人的音频片段添加相同的用户标记,以得到添加用户标记的音频信号。
本申请实施例还提供一种拾音设备,包括:处理器和存储器;存储器,用于存储计算机程序;处理器与存储器耦合,用于执行计算机程序,以用于:对在多人发言场景中采集到的音频信号进行声源定位,以得到声源位置的变更点;根据声源位置的变更点将 音频信号切分为多个音频片段,并提取多个音频片段的声纹特征;根据多个音频片段的声纹特征和声源位置,对多个音频片段进行聚类,得到对应同一发言人的音频片段;为对应同一发言人的音频片段添加相同的用户标记,以得到添加用户标记的音频信号。
本申请实施例还提供一种服务端设备,包括:处理器和存储器;存储器,用于存储计算机程序;处理器与存储器耦合,用于执行计算机程序,以用于:接收拾音设备发送的多个音频片段及其对应的声纹特征;根据多个音频片段的时长和声纹特征,对多个音频片段进行分层次聚类,以得到对应同一发言人的音频片段;为对应同一发言人的音频片段添加相同的用户标记,以得到添加用户标记的音频信号。
本申请实施例还提供一种服务端设备,包括:处理器和存储器;存储器,用于存储计算机程序;处理器与存储器耦合,用于执行计算机程序,以用于:接收拾音设备发送的多个音频片段;提取多个音频片段对应的声纹特征,根据多个音频片段的时长和声纹特征,对多个音频片段进行分层次聚类,以得到对应同一发言人的音频片段;为对应同一发言人的音频片段添加相同的用户标记,以得到添加用户标记的音频信号。
本申请实施例还提供一种存储有计算机程序的计算机可读存储介质,当计算机程序被处理器执行时,致使处理器实现本申请实施例提供的各方法中的步骤。
本申请实施例还提供一种计算机程序产品,包括计算机程序/指令,当所述计算机程序/指令被处理器执行时,致使处理器实现本申请实施例提供的各方法中的步骤。
在本申请实施例中,针对多人发言场景的音频信号,先基于发言人变更点将音频信号切为多个音频片段,再根据多个音频片段的时长和声纹特征,对多个音频片段进行分层次聚类,识别出对应同一发言人的音频片段并添加用户标记。其中,不再单纯利用声纹特征进行聚类,而是结合了音频片段的时长和声纹特征进行分层次聚类,分层次聚类可以先对声纹特征更加稳定的音频片段进行聚类,相比于同时对所有音频片段进行聚类,分层次聚类可以减少声纹特征不稳定的音频片段带来的误差,能够更加准确地识别同一发言人对应的音频片段,提高识别的效率,用户标记结果更加准确。
附图说明
此处所说明的附图用来提供对本申请的进一步理解,构成本申请的一部分,本申请的示意性实施例及其说明用于解释本申请,并不构成对本申请的不当限定。在附图中:
图1a为本申请示例性实施例提供的一种音频信号处理方法的流程示意图;
图1b为本申请示例性实施例提供的另一种音频信号处理方法的流程示意图;
图1c为本申请示例性实施例提供的又一种音频信号处理方法的流程示意图;
图2a为对每层中的音频片段进行聚类的示意图;
图2b为对每层中的音频片段进行聚类的示意图;
图2c为对第一层中的音频片段进行聚类的示意图;
图3a为拾音设备在多人会议场景下的使用状态示意图;
图3b为拾音设备在商务合作商谈场景下的使用状态示意图;
图3c为拾音设备在教学场景下的使用状态示意图;
图3d为本申请示例性实施例提供的一种会议记录方法的流程示意图;
图3e为本申请示例性实施例提供的一种会议记录呈现方法的流程示意图;
图4a为本申请示例性实施例提供的一种音频处理系统的结构示意图;
图4b为本申请示例性实施例提供的另一种音频处理系统的结构示意图;
图5为本申请示例性实施例提供的一种拾音设备的结构示意图;
图6为本申请示例性实施例提供的一种服务端设备的结构示意图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合本申请具体实施例及相应的附图对本申请技术方案进行清楚、完整地描述。显然,所描述的实施例仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
针对实际应用中,多人发言场景中可能存在较强的噪音干扰,发言人的声纹特征也可能受情绪影响而发生变化,这些会导致基于声纹特征的识别结果存在误判,识别准确率较低的技术问题。针对该问题,在本申请一些实施例中,针对多人发言场景的音频信号,先基于发言人变更点将音频信号切为多个音频片段,再根据多个音频片段的时长和声纹特征,对多个音频片段进行分层次聚类,识别出对应同一发言人的音频片段并添加用户标记。其中,不再单纯利用声纹特征进行聚类,而是结合了音频片段的时长和声纹特征进行分层次聚类,分层次聚类可以先对声纹特征更加稳定的音频片段进行聚类,相比于同时对所有音频片段进行聚类,分层次聚类可以减少声纹特征不稳定的音频片段带来的误差,能够更加准确地识别同一发言人对应的音频片段,提高识别的效率,用户标记结果更加准确。
以下结合附图,详细说明本申请各实施例提供的技术方案。
图1a为本申请示例性实施例提供的一种音频信号处理方法的流程示意图。如图1a所示,该方法包括:
101a、识别在多人发言场景中采集到的音频信号中的发言人变更点;
102a、根据发言人变更点将音频信号切分为多个音频片段,并提取多个音频片段的声纹特征;
103a、根据多个音频片段的时长和声纹特征,对多个音频片段进行分层次聚类,以得到对应同一发言人的音频片段;
104a、为对应同一发言人的音频片段添加相同的用户标记,以得到添加用户标记的 音频信号。
在本实施例中,发言人变更点是指音频信号中区分不同发言人的位置点,也就是发言人变更事件的发生位置,其数量可以是1个,也可以是多个,例如2个,3个或者5个。在本实施例中,并不限定发言人变更点的识别方式,下面举例说明。
例如,可以通过语音端点检测(Voice Activity Detection,VAD)技术,识别音频信号中的发言人变更点。VAD中的端点是指静音和有效语音信号变化临界点。对于在多人发言场景中采集到的音频信号,采用VAD技术,可以找出每一语音段对应的起点、尾点,区分出语音时段与非语音时段,而且还可以去除静音、噪音等。在本实施例中,可结合这些起点、尾点之间的停顿时长,确定发言人变更点。例如,对于起点、尾点之间停顿时间间隔大于设定阈值的情况,可以将该情况下的语音端点(即起点和尾点)位置视为发言人变更点。
又例如,还可以对在多人发言场景中采集到的音频信号进行声纹特征提取,根据音频信号中声纹特征的变化,将音频信号中声纹发生变化的位置点作为发言人变更点。或者,可以将VAD技术与声纹特征相结合,针对VAD检测出的每一语音时段对应的起点、尾点,进一步结合相邻起点和尾点处的声纹特征,若相邻起点和尾点处的声纹特征发生变化,则可确定该语音端点(即起点和尾点)位置为发言人变更点。
又例如,在采集音频信号时,可以基于麦克风阵列对音频信号进行声源定位,以得到声源位置的变更点,根据声源位置的变更点,可以确定音频信号中的发言人变更点。例如,在每个发言人位置固定不变的发言场景中,可以将声源位置的变更点作为发言人变更点。
当然,在一些发言场景中,发言人可能会走动,即其发言位置不是固定的,对于这种情况,可以将声源定位与VAD技术相结合,利用声源定位技术定位声源位置的变更点,利用VAD技术确定出音频信号中,每一语音时段对应的起点、尾点;根据VAD确定出的起点、尾点对声源位置的变更点进行修正,从而得到准确地的发言人变更点。具体地,可以将声源位置的变更点与VAD检测结果在时间轴上进行对齐,判断该声源位置的变更点前后一定时间内,是否存在检测出的语音端点,例如起点或尾点,如果存在,则可以将该语音端点所在的位置确定为发言人变更点。通过上述方式,可以更准确地确定发言人变更点,进而可以更准确地对语音识别结果进行截断,避免出现丢字首、丢字尾等现象。
在本实施例中,可以根据发言人变更点将音频信号切分为多个语音片段,例如,对于一段音频信号来说,将其开始位置记为A1,将其结束位置记为A2,在识别出该音频信号包含有一个发言人变更点B1的情况下,根据该发言人变更点B1可将该音频信号切分为音频片段A1—>B1和音频片段B1—>A2。
在本实施例中,根据发言人变更点切分出多个音频片段之后,可以提取多个音频片 段的声纹特征,声纹特征可以用特征向量来表示。声纹特征是音频片段的特征体现,对应不同发言人的音频片段的声纹特征一般不同。在本实施例中,并不限定提取多个语音片段声纹特征的实施方式。例如,可以预先训练用于提取声纹特征的神经网络模型,采用预先训练出的神经网络模型来提取多个语音片段的声纹特征,其中,神经网络模型可以是但不限于:基于梅尔倒谱系数(Mel-scale Frequency Cepstral Coefficients,简称MFCC)的模型或者高斯混合-通用背景模型(Gaussian Mixture Model-Universal Background Model,GMM-UBM)等。
在本实施例中,被发言人变更点切分出的多个音频片段中,每个音频片段均对应一个发言人,不同的音频片段可能对应于同一个发言人,也可能对应于不同发言人。在需要为音频片段添加用户标记的应用中,需要识别对应同一发言人的音频片段,从而为对应同一发言人的音频片段添加相同用户标记。为了更加准确地识别同一发言人对应的语音片段,在本实施例中,可以以多个音频片段的声纹特征为基础,对多个音频片段进行聚类,尽量将声纹特征相同的音频片段聚类在一起。在本实施例中,将声纹特征相同或相近的音频片段视为同一用户对应的音频片段。另外,由于发言人说话习惯、方式和特殊需求等因素,多人发言场景中可能存在特别短的发言,例如嗯、啊、是的,好等,这样被切分出的音频片段中可能存在一些较短的音频片段。音频片段的时长越长,其对应的声纹特征也就越稳定,反之,音频片段的时长越短,其对应的声纹特征的稳定性也就会降低,区分性也就没那么明显了。例如,对于用户A说的“啊”和用户B说的“啊”在声纹特征上区别不是很明显。鉴于此,在本实施例中,进一步考虑音频片段的时长,结合多个音频片段的时长对多个音频片段进行分层次聚类,分层次聚类是指对多个音频片段进行分层后,再对分层后的音频片段逐层进行聚类的过程,用以充分发挥较长的音频片段的优势,降低较短音频片段可能造成的干扰。因此,在本实施例中,在得到多个音频片段及其声纹特征之后,根据多个音频片段的时长和声纹特征,对多个音频片段进行分层次聚类,以得到对应同一发言人的音频片段。
由于音频片段的时长越长,其对应的声纹特征也就越稳定,基于此,在本申请一些可选实施例中,可以根据多个音频片段的时长,对多个音频片段进行分层,以得到对应不同时长范围的多层音频片段;根据多个音频片段对应的声纹特征,按照时长范围由长到短的顺序对多层音频片段进行分层次聚类,以得到至少一个聚类结果,每个聚类结果中包括对应同一发言人的音频片段。在分层次聚类中,不单单利用声纹特征对多个音频片段进行聚类,而是结合了音频片段的时长,先根据声纹特征对时长范围较长的音频片段进行聚类,再根据声纹特征对时长较短的音频片段进行聚类,在对时长较短的音频片段进行聚类的过程中,需要判断时长较短的音频片段是否属于前面由时长较长的音频片段聚类出的结果,在不属于的情况下可建立新的聚类结果,以此类推完成所有层上音频片段的聚类,这样按照时长由长到短的顺序分层聚类,可以时长较长的音频片段的聚类 结果为主,减少了由于时长较短音频片段发的声纹特征不稳定,而带来的识别误差,提高了识别对应同一发言人的音频片段的准确率。
在本实施例中,得到对应同一发言人的音频片段之后,可以为对应同一发言人的音频片段添加相同的用户标记,以得到添加用户标记的音频信号。在本实施例中,并不限定添加用户标记的实施方式。例如,可以在每一音频片段之前,插入带有用户标记的语音片段,例如,在对应于用户C1的音频片段之前,可以插入语音片段“用户C1请发言”。又例如,在音轨上为对应同一发言人的音频片段添加相同的用户标记点,例如,对应发言人C2的音频片段添加红色标记点,对应发言人C3的音频片段添加绿色标记点,对应发言人E的音频片段添加黄色标记点等。
在本申请实施例中,针对多人发言场景的音频信号,先基于发言人变更点将音频信号切为多个音频片段,再根据多个音频片段的时长和声纹特征,对多个音频片段进行分层次聚类,识别出对应同一发言人的音频片段并添加用户标记。其中,不再单纯利用声纹特征进行聚类,而是结合了音频片段的时长和声纹特征进行分层次聚类,分层次聚类可以先对声纹特征更加稳定的音频片段进行聚类,相比于同时对所有音频片段进行聚类,分层次聚类可以减少声纹特征不稳定的音频片段带来的误差,能够更加准确地识别同一发言人对应的音频片段,提高识别的效率,用户标记结果更加准确。
在本实施例中,并不限定根据多个音频片段的时长,对多个音频片段进行分层,以得到对应不同时长范围的多层音频片段的实施方式。在一可选实施例中,可以设定各层的数量阈值,将多个音频片段按照时长进行排序,将排序后的多个音频片段,按照预先设定的各层的数量阈值进行分层,以得到多层音频片段。在又一可选实施例中,可以预先设定的各层的时长阈值,根据多个音频片段的时长和预先设定的各层的时长阈值,对多个音频片段进行分层,以得到对应不同时长范围的多层音频片段;其中,层数越小,对应的时长阈值越大,且每层中音频片段的时长大于或等于该层的时长阈值。例如,可以将时长超过20s的音频片段划分为第一层,将时长为10s~20s的音频片段划分为第二层,将时长为5s~10s的音频片段划分为第三层,将时长小于5s的音频片段划分为第四层。
在本实施例中,得到多层音频片段之后,并不限定对多层音频片段进行分层次聚类,以得到至少一个聚类结果的实施方式。下面详细说明。
在一可选实施例中,可以对根据每层中音频片段对应的声纹特征,对每层中的音频片段进行聚类,得到每层的聚类结果;按照层数由小到大的顺序,根据每层聚类结果的声纹特征,依次对相邻两层的聚类结果进行聚类,以得到至少一个聚类结果。其中,每层的聚类结果可以是一个,也可以是多个,例如,2个、3个或者5个等,对此不做限定。如图2a所示,根据多个音频片段的时长,将多个音频片段分为三层,根据每层音频片段的声纹特征,对每层音频片段进行聚类,得到每层的聚类结果,第一层有两个聚类结果 D1和D2,第二层有三个聚类结果D3、D4和D5,第三层有两个聚类结果D6和D7;接着根据第二层聚类结果的声纹特征,将第二层的聚类结果向第一层的聚类结果D1或D2进行聚类,其中,可以根据聚类结果D3、D4和D5的声纹特征,判断聚类结果D3、D4和D5是否可以聚类到聚类结果D1和D2中,假设聚类结果D3和D4可以聚类到聚类结果D1中,得到聚类结果E1,聚类结果D5可以聚类到聚类结果D2中,得到聚类结果E2,这样第二层的聚类结果向第一层的聚类结果进行聚类后,可以得到两个聚类结果E1和E2;最后,根据第三层聚类结果的声纹特征,将第三层的聚类结果向已有聚类结果E1和E2进行聚类,其中,可以根据聚类结果D6和D7的声纹特征,判断聚类结果D6和D7是否可以聚类到聚类结果E1或E2中,假设聚类结果D6可以聚类到聚类结果E1中,得到聚类结果E3,聚类结果D7可以聚类到聚类结果E2,得到聚类结果E4,最终得到两个聚类结果E3和E4,也即得到两个发言人对应的音频片段。
在另一可选实施例中,先对第一层的音频片段进行聚类,然后以第一层的聚类结果为基础,按照层次由小到大的顺序,将每一层的音频片段都向已有的聚类结果中进行聚类。具体地,首先对于第一层中的音频片段,根据第一层中音频片段对应的声纹特征,对第一层中的音频片段进行聚类,得到至少一个聚类结果;然后,对于非第一层中的音频片段,按照层数由小到大的顺序,依次根据非第一层中音频片段对应的声纹特征,将非第一层中的音频片段向已有的聚类结果进行聚类;以及若非第一层中存在未被聚类到已有聚类结果中的剩余音频片段,则根据剩余音频片段对应的声纹特征对剩余音频片段进行聚类,以产生新的聚类结果,直至所有层上的每个音频片段均被聚类到一个聚类结果中为止。下面以音频片段被切分为三层为例,对整个分层聚类过程进行举例说明。
如图2b所示,首先,根据第一层音频片段的声纹特征,对第一层的音频片段进行聚类得到两个聚类结果F1和F2,聚类结果F1中包含音频片段g1和音频片段g2,聚类结果F2中包含音频片段g3;接着,将第二层的音频片段聚类向第一层已有的聚类结果F1和F2进行聚类,其中,第二层包含三个音频片段,分别为音频片段g4、音频片段g5和音频片段g6,则根据音频片段g4、音频片段g5和音频片段g6的声纹特征,判断音频片段g4、音频片段g5和音频片段g6是否可以聚类到聚类结果F1或F2中,假设音频片段g5和音频片段g6聚类到聚类结果F2中,音频片段g4无法聚类到第一层的聚类结果F1和F2中,则将音频片段g4单独作为一个聚类结果F3,这样,将第二层的音频片段向第一层的聚类结果进行聚类之后得到三个聚类结果F1、F2和F3;最后,将第三层的聚类结果向已有的聚类结果F1、F2和F3进行聚类,其中,第三层包含两个音频片段,分别为音频片段g7和音频片段g8;可以根据音频片段g7和音频片段g8的声纹特征,判断音频片段g7和音频片段g8是否可以聚类到已有的聚类结果F1、F2或F3中,假设将音频片段g7聚类到聚类结果F1中,将音频片段g8聚类到聚类结果F2中;最后可以得到三个聚类结果F1、F2和F3,也即三个发言人对应的音频片段。
在本实施例中,并不限定对多个音频片段进行聚类的实施方式,例如可以采用但不限定于:K均值(K-Means)聚类、均值漂移聚类、基于密度的聚类(DBSCAN)、用高斯混合模型(GMM)的最大期望(EM)聚类、凝聚层次聚类或者图团体检测(Graph Community Detection)聚类等。
在本实施例中,并不限定对于第一层的音频片段,根据第一层中音频片段对应的声纹特征,对第一层中的音频片段进行聚类,得到至少一个聚类结果的实施方式。一种根据第一层中音频片段对应的声纹特征,对第一层中的音频片段进行聚类,得到至少一个聚类结果的实施方式,包括:在第一层至少包含两个音频片段的情况下,根据第一层中至少两个音频片段对应的声纹特征,计算第一层中至少两个音频片段之间的整体相似度,可选地,可以将至少两个音频片段之间的声纹特征相似度作为至少两个音频片段之间的整体相似度;根据第一层中至少两个音频片段之间的整体相似度,将第一层中至少两个音频片段划分至少一个聚类结果中;以及根据至少一个聚类结果中包含的音频片段对应的声纹特征,分别计算至少一个聚类结果的聚类中心,聚类中心包括中心声纹特征。具体地,对于第一层中的任一音频片段,可以根据该音频片段对应的声纹特征和第一层中其它音频片段对应的声纹特征,计算该音频片段与第一层中其它音频片段的整体相似度;若第一层中其它音频片段中存在与该音频片段的整体相似度满足设定相似度条件的目标音频片段,则将该音频片段与目标音频片段进行聚类,得到一个目标聚类结果,并根据该音频片段与目标音频片段对应的声纹特征更新该目标聚类结果的聚类中心。进一步,可以计算该目标聚类结果与第一层中剩余音频片段是否可以聚类,对于无法聚类到目标聚类结果的剩余音频片段,可以根据剩余音频片段对应的声纹特征,对剩余音频片段进行聚类,以产生新的聚类结果,直至所有第一层上的每个音频片段均被聚类到一个聚类结果中为止。
例如,第一层中包含三个音频片段的情况下,三个音频片段分别为音频片段h1、音频片段h2以及音频片段h3,可以先计算音频片段h1和音频片段h2的声纹特征相似度,将该声纹特征相似度作为两个音频片段h1和h2之间的整体相似度,若该整体相似度满足设定条件,则认为音频片段h1和音频片段h2来自于同一发言人,可以将音频片段h1和音频片段h2聚类一个聚类结果H1中,并计算该聚类结果H1的聚类中心,也即中心声纹特征,例如,可以直接将音频片段h1的声纹特征作为中心声纹特征,也可以将音频片段h2的声纹特征作为中心声纹特征,还可以对音频片段h1的声纹特征和音频片段h2的声纹特征取平均得到中心声纹特征,对此不做限定;在获取到聚类结果H1之后,可以计算聚类结果H1的中心声纹特征和音频片段h3的声纹特征的相似度,将该声纹特征的相似度作为聚类结果H1与音频片段h3之间的整体相似度,若该整体相似度满足设定条件,则认为聚类结果H1与音频片段h3来自于同一发言人,则可以将聚类结果H1与音频片段h3聚类到一个聚类结果H2,并根据聚类结果H1与音频片段h3的声纹特征, 计算聚类结果H2的中心声纹特征;若该相似度阈值不满足设定条件,则认为聚类结果H1与音频片段h3不是来自于同一发言人,则可以将音频片段h3单独作为一个聚类结果H3。
在本实施例中,也不限定对于非第一层中的音频片段,按照层数由小到大的顺序,依次根据非第一层中音频片段对应的声纹特征,将非第一层中的音频片段向已有的聚类结果进行聚类的实施方式,例如,对任意一个非第一层中的每个音频片段,根据该音频片段对应的声纹特征和已有聚类结果的聚类中心,计算该音频片段与已有聚类结果的整体相似度;若已有聚类结果中存在与该音频片段的整体相似度满足设定相似度条件的目标聚类结果,将该音频片段加入目标聚类结果中,并根据该音频片段对应的声纹特征更新目标聚类结果的聚类中心。
在本实施例中,并不限定根据音频片段对应的声纹特征更新目标聚类结果的聚类中心的实施方式,在一可选实施例中,直接对目标聚类结果中包含的各音频片段的声纹特征取平均,得到新的中心声纹特征作为目标聚类结果更新后的聚类中心。在另一可选实施例中,确定目标聚类结果中包含的各音频片段所属的层数,不同层数设定不同的权重,且层数越小,对应的权重越大;根据各音频片段所属的层数对应的权重,对各音频片段对应的声纹特征进行加权求和,得到新的中心声纹特征作为目标聚类结果更新后的聚类中心。例如,目标聚类结果中包含有第一层的音频片段j1和音频片段j2,第二层的音频片段j3,计算聚类中心时,为第一层的音频片段设定权重为k1,为第二层的音频片段设定权重为k2,k1>k2且k1+k2=1,则目标聚类结果的中心声纹特征为:(j1的声纹特征)*k1+(j2的声纹特征)*k1+(j3的声纹特征)*k2。
在本申请实施例中,由于具体的多人发言场景中,例如,多人会议等,具体发言人通常可以在自己的座位等处进行发言,在会议过程中,发言人的位置通常不会发生变化,因此,可以通过识别出声源方向的突变,来判断是否存在发言人变更的事件。基于此,本申请实施例还提供的一种音频信号处理方法,如图1b所示,该方法包括:
101b、对在多人发言场景中采集到的音频信号进行声源定位,以得到声源位置的变更点;
102b、根据所述声源位置的变更点将音频信号切分为多个音频片段,并提取多个音频片段的声纹特征;
103b、根据多个音频片段的时长、声纹特征和声源位置,对多个音频片段进行分层次聚类,以得到对应同一发言人的音频片段;
104b、为对应同一发言人的音频片段添加相同的用户标记,以得到添加用户标记的音频信号。
在本实施例中,对音频信号进行声源定位,以得到声源位置的变更点;根据声源位置的变更点切分音频信号,以得到多个音频片段,每个音频片段对应有唯一的声源位置, 对于发言人的位置不发生变化的情况,可认为每个声源位置对应有一个发言人,也即每个音频片段对应有一个发言人,对于发言人的位置发生变化的情况,发言人的音频片段可能会分为两个音频片段,一个音频片段对应于位置变化前,一个音频片段对应于位置变化后,此时,每个音频片段也对应有一个发言人。
在本实施例中,将音频片段按照声源位置的变更点切分为多个音频片段之后,可以提取多个音频片段的声纹特征,关于提取声纹特征的实施方式,可参见前述实施例,在此不再赘述。
在本实施例中,被声源位置的变更点切分出的多个音频片段中,每个音频片段均对应一个发言人,不同的音频片段可能对应于同一个发言人,也可能对应于不同发言人。在需要为音频片段添加用户标记的应用中,需要识别对应同一发言人的音频片段,从而为对应同一发言人的音频片段添加相同用户标记。为了更加准确地识别同一发言人对应的语音片段,在本实施例中,可以以多个音频片段的声纹特征为基础,对多个音频片段进行聚类,尽量将声纹特征相同的音频片段聚类在一起。在本实施例中,将声纹特征相同或相近的音频片段视为同一用户对应的音频片段。进一步,还可以结合音频片段对应的声源位置,如果两个音频片段的声纹特征相同或相似且来自同一声源位置,则这两个音频片段对应同一用户的概率会更高。另外,考虑到音频片段的时长越长,其对应的声纹特征也就越稳定,反之,音频片段的时长越短,其对应的声纹特征的稳定性也就会降低,区分性也就没那么明显了。因此,在本实施例中,进一步考虑音频片段的时长,结合多个音频片段的时长对多个音频片段进行分层次聚类,分层次聚类是指对多个音频片段进行分层后,再对分层后的音频片段逐层进行聚类的过程,用以充分发挥较长的音频片段的优势,降低较短音频片段可能造成的干扰。因此,在本实施例中,在得到多个音频片段及其声纹特征和声源位置之后,根据多个音频片段的时长、声纹特征以及声源位置,对多个音频片段进行分层次聚类,以得到对应同一发言人的音频片段。
在本申请一可选实施例中,一种根据多个音频片段的时长、声纹特征以及声源位置,对多个音频片段进行分层次聚类的实施方式,包括:根据多个音频片段的时长,对多个音频片段进行分层,以得到对应不同时长范围的多层音频片段;根据多个音频片段对应的声纹特征和声源位置,按照时长范围由长到短的顺序对多层音频片段进行分层次聚类,以得到至少一个聚类结果,每个聚类结果中包括对应同一发言人的音频片段。
其中,根据多个音频片段的时长,对多个音频片段进行分层,以得到对应不同时长范围的多层音频片段的实施方式,可参见前述实施例,在此不再赘述。在本实施例中,并不限定根据多个音频片段对应的声纹特征和声源位置,按照时长范围由长到短的顺序对多层音频片段进行分层次聚类,以得到至少一个聚类结果的实施方式。下面举例说明。
在一可选实施例中,可以根据每层中音频片段对应的声纹特征和声源位置,对每层中的音频片段进行聚类,得到每层的聚类结果;按照层数由小到大的顺序,根据每层聚 类结果的声纹特征和声源位置,依次对相邻两层的聚类结果进行聚类,以得到至少一个聚类结果。
在另一可选实施例中,可以先对第一层的音频片段进行聚类,然后以第一层的聚类结果为基础,按照层次由小到大的顺序,将每一层的音频片段都向已有的聚类结果中进行聚类。具体地,首先对于第一层的音频片段,根据第一层中音频片段对应的声纹特征和声源位置,对第一层中的音频片段进行聚类,得到至少一个聚类结果;然后,对于非第一层中的音频片段,按照层数由小到大的顺序,依次根据非第一层中音频片段对应的声纹特征和声源位置,将非第一层中的音频片段向已有的聚类结果进行聚类;以及若非第一层中存在未被聚类到已有聚类结果中的剩余音频片段,则根据剩余音频片段对应的声纹特征和声源位置对剩余音频片段进行聚类,以产生新的聚类结果,直至所有层上的每个音频片段均被聚类到至少一个聚类结果中为止。
在本申请一可选实施例中,根据第一层中音频片段对应的声纹特征和声源位置,对第一层中的音频片段进行聚类的实施方式,包括:如果第一层只包括一个音频片段,则该音频片段自己形成一个聚类结果;如果第一层上至少包含两个音频片段,则在第一层至少包含两个音频片段的情况下,根据第一层中至少两个音频片段对应的声纹特征和声源位置,计算第一层中至少两个音频片段之间的整体相似度。例如,可以先计算至少两个音频片段的声纹特征相似度,再计算至少两个音频片段的声源位置相似度,对声纹特征相似度和声源位置相似度进行加权,得到第一层中至少两个音频片段之间的整体相似度。进一步,可以根据第一层中至少两个音频片段之间的整体相似度,将第一层中至少两个音频片段划分至少一个聚类结果中。进一步,还需要根据至少一个聚类结果中包含的音频片段对应的声纹特征和声源位置,分别计算至少一个聚类结果的聚类中心,该聚类中心包括中心声纹特征和中心声源位置,为非第一层上的音频片段向所述至少一个聚类结果进行聚类提供基础。例如,对每个聚类结果,可以将该聚类结果中包含的音频片段对应的声纹特征的平均值作为该聚类结果的中心声纹特征,将该聚类结果中包含的音频片段对应的声源位置的平均值作为该聚类结果的中心声源位置。又例如,可以直接将该聚类结果中包含的任一音频片段的声纹特征作为该聚类结果的中心声纹特征,将该聚类结果中包含的任一音频片段的声源位置作为该聚类结果的中心声纹位置。
如图2c所示,第一层的音频片段包括:音频片段m1-音频片段m6,可以计算任意两个音频片段之间的整体相似度,将整体相似度高于设定相似度阈值(例如,90%)的两个音频片段进行聚类,例如,音频片段m1与音频片段m3的整体相似度阈值为91%,音频片段m2与音频片段m4的整体相似度阈值为93%,音频片段m3与音频片段m6的整体相似度阈值为95%,则可以将音频片段m1与音频片段m3进行聚类得到聚类结果M1,音频片段m2与音频片段m4进行聚类得到聚类结果M2,将音频片段m3与音频片段m6进行聚类得到聚类结果M3,分别计算聚类结果M1、聚类结果M2以及聚类结果 M3的聚类中心,根据两两聚类结果的聚类中心,计算两个聚类结果的整体相似度,若该整体相似度超过设定阈值(例如90%),则将继续将两个聚类结果进行聚类。例如,聚类结果M1和聚类结果M2的整体相似度为90%,聚类结果M1和聚类结果M3的整体相似度为85%,聚类结果M2和聚类结果M3的整体相似度为80%,则将聚类结果M1和聚类结果M2继续聚类为聚类结果M4,将聚类结果M3单独作为一个聚类结果,最终,第一层的音频片段得到两个聚类结果M3和M4。
进一步可选地,对任意一个非第一层中的每个音频片段,一种将其向已有聚类结果进行聚类的过程包括:根据该音频片段对应的声纹特征和声源位置和已有聚类结果的聚类中心,计算该音频片段与已有聚类结果的整体相似度;若已有聚类结果中存在与该音频片段的整体相似度满足设定相似度条件的目标聚类结果,则可以认为该音频片段与目标聚类结果中的音频片段来自于同一发言人,将该音频片段加入目标聚类结果中,并根据该音频片段对应的声纹特征和声源位置更新目标聚类结果的聚类中心。
对目标聚类结果,在有新的音频片段加入该目标聚类结果时,可以采用但不限于下述方式更新该目标聚类结果的聚类中心。例如,可以对目标聚类结果中包含的所有音频片段的声纹特征取平均,将平均值作为目标聚类结果的聚类中心的中心声纹特征;对目标聚类结果中音频片段的声源位置取平均,将平均值作为目标聚类结果的聚类中心的中心声源位置。又例如,可以确定目标聚类结果中包含的各音频片段所属的层数,为不同层数设定不同的权重,且层数越小,对应的权重越大;根据各音频片段所属的层数对应的权重,对目标聚类结果中包含的各音频片段对应的声纹特征进行加权求和,得到新的中心声纹特征;根据各音频片段所属的层数对应的权重,对目标聚类结果中包含的各音频片段对应的声源位置进行加权求和,得到新的中心声源位置;新的中心声纹特征和新的中心声源位置形成目标聚类结果更新后的聚类中心。
在本申请实施例中,针对多人发言场景的音频信号,先基于声源位置将音频信号切为多个音频片段,再根据多个音频片段的时长、声纹特征以及声源位置,对多个音频片段进行分层次聚类,识别出对应同一发言人的音频片段并添加用户标记。其中,不再单纯利用声纹特征进行聚类,而是将声源位置、声纹特征以及分层次聚合进行结合,其中,声源位置可以准确地对音频信号进行分段,分层次聚合可以减少短语音对识别结果的影响,在此基础上,再利用声纹特征识别同一发言人对应的音频片段,可大幅提高识别的效率,用户标记结果更加准确。
本实施例还提供一种音频信号处理方法,如图1c所示,该方法包括:
101c、对在多人发言场景中采集到的音频信号进行声源定位,以得到声源位置的变更点;
102c、根据声源位置的变更点将音频信号切分为多个音频片段,并提取多个音频片 段的声纹特征;
103c、根据多个音频片段的声纹特征和声源位置,对多个音频片段进行聚类,得到对应同一发言人的音频片段;
104c、为对应同一发言人的音频片段添加相同的用户标记,以得到添加用户标记的音频信号。
其中,根据声源位置的变更点将音频信号切分为多个音频片段,包括:将声源位置的变更点作为发言人变更点,从而将音频信号切分为多个音频片段;或者,结合VAD技术,利用VAD技术检测出该音频信号的起点、尾点;根据起点、尾点对声源位置的变更点进行修正,得到发言人变更点,进而根据发言人变更点,从而将音频信号切分为多个音频片段。
在一可选实施例中,根据多个音频片段的声纹特征和声源位置,对多个音频片段进行聚类,得到对应同一发言人的音频片段,包括:根据多个音频片段的时长,对多个音频片段进行分层,以得到对应不同时长范围的多层音频片段;根据多个音频片段对应的声纹特征和声源位置,按照时长范围由长到短的顺序对多层音频片段进行层次聚类,以得到至少一个聚类结果,每个聚类结果中包括对应同一发言人的音频片段。关于本实施例中,各步骤的详细描述可参见前述实施例,在此不再赘述。
在本申请实施例中,针对多人发言场景的音频信号,先基于发言人变更点将音频信号切为多个音频片段,再根据多个音频片段的时长和声纹特征,对多个音频片段进行分层次聚类,识别出对应同一发言人的音频片段并添加用户标记。其中,不再单纯利用声纹特征进行聚类,而是结合了音频片段的时长和声纹特征进行分层次聚类,分层次聚类可以先对声纹特征更加稳定的音频片段进行聚类,相比于同时对所有音频片段进行聚类,分层次聚类可以减少声纹特征不稳定的音频片段带来的误差,能够更加准确地识别同一发言人对应的音频片段,提高识别的效率,用户标记结果更加准确。
本申请各实施例提供的音频信号处理方法,可应用于到各种多人发言场景中,例如多人会议场景、商务会谈场景或者教学场景等。在这些应用场景中,本实施例的拾音设备会被部署在这些场景中,用于采集多人发言场景中的音频信号,并实现本申请上述各方法实施例以及下述系统实施例中所描述的其它功能。为了有更好的采集效果,便于对音频信号进行声源定位,可以根据多人发言场景的具体部署情况合理确定拾音设备的放置位置。如图3a所示,在多人会议场景中,拾音设备部署在会议桌的中央,多个发言人分布在拾音设备的不同方位,方便拾取每个发言人的语音;如图3b所示,在商务合作会谈场景下,第一商务方和第二商务方相对落座,会议组织方位于第一商务方和第二商务方之间,负责组织两方商谈,拾音设备部署在会议组织方、第一商务方、第二商务方的中心位置,第一商务方、第二商务方和会议组织方拾音设备的不同方位上,方便拾音设备拾音;如图3c所示,在教学场景中,拾音设备部署在讲课桌上,教师与学生位于拾音 设备的不同方位上,方便同时拾取教师与学生的语音。
以上述实施例提供的音频信号处理方法在多人会议场景中的应用为例,则可以针对多人会议场景进行会议记录,进一步还可以针对会议记录呈现或再现。如图3d所示,本申请示例性实施例提供的一种会议记录方法,包括以下步骤:
301d、采集多人会议场景中的音频信号,识别所述音频信号中的发言人变更点;
302d、根据所述发言人变更点将所述音频信号切分为多个音频片段,并提取所述多个音频片段的声纹特征;
303d、根据所述多个音频片段的时长和声纹特征,对所述多个音频片段进行分层次聚类,以得到对应同一发言人的音频片段;
304d、为对应同一发言人的音频片段添加相同的用户标记,以得到添加用户标记的音频信号;
305d、根据添加用户标记的音频信号生成会议记录信息,所述会议记录信息包括会议标识。
关于步骤301d-304d的详细描述可参见前述实施例,在此不再赘述。在本实施例中,重点针对步骤305d进行描述。具体地,在得到添加用户标记的音频信号之后,可以根据该添加用户标记的音频信号生成会议记录信息,并为该会议记录信息添加对应的会议标识,该会议标识具有唯一性,可唯一标识一场多人会议。在一可选实施例中,可以直接将添加用户标记的音频信号作为会议记录信息。在另一可选实施例中,可以将添加用户标记的音频信号转换为带有发言人信息的文本信息,该文本信息中可以包括类似但不限于下述格式的内容:A发言人:xxxx;B发言人:yyy等;之后将带有发言人信息的文本信息作为会议记录信息。无论是哪种形式的会议记录信息,基于会议记录信息,可以再现会议场景,便于进行会议内容的查询或查阅。
图3e为本申请示例性实施例提供的一种会议记录呈现方法的流程示意图,如图3e所示,该方法包括:
301e、接收会议查阅请求,该会议查阅请求包括待呈现的会议标识;
302e、根据上述会议标识,获取待呈现的会议记录信息;
303e、呈现会议记录信息,该会议记录信息是根据多人会议场景中添加用户标记的音频信号生成的;其中,根据音频信号中的发言人变更点所切分出的多个音频片段中,对应同一发言人的音频片段添加有相同的用户标记,对应同一发言人的音频片段是根据多个音频片段的时长和声纹特征对多个音频片段进行分层次聚类得到的。
在本实施例中,针对多人会议场景,可进行会议记录,会议记录过程为:采集多人会议场景中的音频信号,识别出音频信号中的发言人变更点;根据音频信号中的发言人变更点将音频信号切分为多个音频片段,并提取多个音频片段的声纹特征;进而,根据音频片段的时长和声纹特征对多个音频片段进行分层次聚类,得到对应同一发言人的音 频片段;为对应同一发言人的音频片段添加相同用户标记,得到添加用户标记的音频信号;根据添加用户标记的音频信号生成会议记录信息,并为该会议记录信息添加对应的会议标识。关于会议记录的相关过程可参见前述实施例,在此不再赘述。
在得到会议记录信息之后,可以通过会议记录信息查阅相关会议内容,于是对外提供会议查阅服务。基于此,可接收外部发出的会议查阅请求,该请求中携带待呈现的会议标识;基于该会议标识和各会议记录信息中的会议标识,可以从中得到待呈现的会议记录信息,并呈现该会议记录信息。可选地,若该会议记录信息是添加用户标记的音频信号,则可以通过播放器播放添加用户标记的音频信号,或者,也可以将添加用户标记的音频信号转换为文本信息之后进行显示;若该会议记录信息是由添加用户标记的音频信号转换成的带有发言人信息的文本信息,则可以通过显示器显示该带有发言人信息的文本信息,或者,也可以通过播放器播放带有发言人信息的文本信息。这样,可满足会议内容的查询或查阅需求。
In addition, it should be noted that, since the conference record information reflects the speaker information or the corresponding user labels, when consulting the conference record, the conference content corresponding to an individual speaker can be consulted or replayed separately, instead of being mixed with the content of other speakers, which improves the recognizability of the conference speakers and the conference content. For example, in addition to the conference identifier to be presented, the conference consultation request may also contain speaker information or a user label, where the speaker information corresponds to the user label; in this way, the conference record information to be presented can be obtained according to the conference identifier, the part of the conference content corresponding to that speaker information or user label can be obtained from the conference record information, and that part of the conference content can be presented.
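A small sketch of consulting a conference record by conference identifier and, optionally, by speaker is given below; the record structure follows the build_meeting_record() sketch above and is likewise assumed for illustration.

```python
def query_meeting(records, meeting_id, speaker=None):
    """Return the record matching meeting_id, optionally filtered by speaker.

    'records' is an iterable of records produced by build_meeting_record();
    'speaker' is the user label (or the speaker information mapped to it).
    """
    for record in records:
        if record["meeting_id"] != meeting_id:
            continue
        if speaker is None:
            return record
        filtered = [e for e in record["entries"] if e["speaker"] == speaker]
        return {"meeting_id": meeting_id, "entries": filtered}
    return None
```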
It should be noted that the methods provided by the embodiments of the present application may be performed entirely by the sound pickup device, or some of the functions may be implemented on a server device; this is not limited. The sound pickup device may be implemented as a voice recorder pen, a recording stick, a voice recorder or a sound pickup, or as a terminal device or an audio/video conference device with a recording function. On this basis, this embodiment provides an audio processing system and describes the process in which the audio signal processing method is implemented jointly by a sound pickup device and a server device. As shown in FIG. 4a, the audio processing system 400 includes a sound pickup device 401 and a server device 402. The audio processing system 400 can be applied to multi-speaker scenarios, such as the multi-person conference scenario shown in FIG. 3a, the business negotiation scenario shown in FIG. 3b and the teaching scenario shown in FIG. 3c. In these scenarios, the sound pickup device 401 can cooperate with the server device 402 to implement the foregoing method embodiments of the present application; the server device 402 is not shown in the multi-speaker scenarios of FIG. 3a to FIG. 3c.
The sound pickup device 401 of this embodiment has functional modules such as a power button, adjustment buttons, a microphone array and a speaker, and optionally may also include a display screen. The sound pickup device 401 can implement functions such as automatic recording, MP3 playback, FM radio, digital camera, telephone recording, timed recording, external transcription, repeated playback and editing. As shown in FIG. 4a, in a multi-speaker scenario, the sound pickup device 401 can collect the audio signal of the multi-speaker scenario, identify speaker change points in the audio signal, split the audio signal into multiple audio clips according to the speaker change points, extract the voiceprint features corresponding to the multiple audio clips, and send the multiple audio clips and their corresponding voiceprint features to the server device 402.
In this embodiment, the server device 402 can receive the multiple audio clips and their corresponding voiceprint features sent by the sound pickup device 401, hierarchically cluster the multiple audio clips according to their durations and voiceprint features to obtain audio clips corresponding to the same speaker, and add the same user label to the audio clips corresponding to the same speaker to obtain a user-labeled audio signal.
In this embodiment, the sound pickup device 401 can pick up the audio signal of the multi-speaker scenario with a microphone array, and the sound source position of a sound signal can be calculated from the intensities of the same sound signal picked up by microphones at different positions in the array. On this basis, in an optional embodiment of the present application, when identifying speaker change points in the audio signal, the sound pickup device 401 can perform sound source localization on the audio signal to obtain change points of the sound source position, determine the speaker change points in the audio signal according to the change points of the sound source position, and further split out multiple audio clips according to the speaker change points, each audio clip corresponding to a unique sound source position. Accordingly, when hierarchically clustering the multiple audio clips according to their durations and voiceprint features to obtain audio clips corresponding to the same speaker, the server device 402 can hierarchically cluster the multiple audio clips according to their durations, voiceprint features and sound source positions to obtain the audio clips corresponding to the same speaker.
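Purely as an illustration of intensity-based localization with a microphone array, the sketch below estimates a coarse direction of arrival from per-microphone energies and flags change points when that direction jumps. This is a simplified stand-in, not the localization method of the application; practical systems often rely on time-difference-of-arrival techniques instead.

```python
import numpy as np

def estimate_source_direction(frame_energies, mic_angles_deg):
    """Rough direction-of-arrival estimate from per-microphone energy.

    frame_energies: energy of the current frame at each microphone.
    mic_angles_deg: azimuth (degrees) at which each microphone points
        relative to the device center.
    Returns the estimated azimuth in degrees.
    """
    weights = np.asarray(frame_energies, dtype=float)
    weights = weights / weights.sum()
    angles = np.deg2rad(np.asarray(mic_angles_deg, dtype=float))
    # average directions on the unit circle so 350 deg and 10 deg combine correctly
    x = (weights * np.cos(angles)).sum()
    y = (weights * np.sin(angles)).sum()
    return float(np.rad2deg(np.arctan2(y, x)) % 360.0)

def detect_change_points(azimuths_per_frame, frame_step_s, min_jump_deg=30.0):
    """Mark a change point whenever the estimated azimuth jumps."""
    change_points = []
    for i in range(1, len(azimuths_per_frame)):
        diff = abs(azimuths_per_frame[i] - azimuths_per_frame[i - 1]) % 360.0
        diff = min(diff, 360.0 - diff)
        if diff >= min_jump_deg:
            change_points.append(i * frame_step_s)
    return change_points
```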
In this embodiment, as shown in FIG. 4b, another audio processing system is provided. The embodiment shown in FIG. 4b differs from the embodiment shown in FIG. 4a in that, in FIG. 4a, the voiceprint features corresponding to the multiple audio clips are extracted on the sound pickup device 401, whereas in FIG. 4b they are extracted on the server device 402. The remaining content of FIG. 4a and FIG. 4b is the same or similar; for details, reference may be made to the foregoing embodiments, which is not repeated here.
In this embodiment, after adding the same user label to the audio clips corresponding to the same speaker, the server device 402 can store the user-labeled audio signal for subsequent query and use. In an optional embodiment, as shown in FIG. 4a, the audio processing system further includes a transcription device 403. The server device 402 can send the user-labeled audio signal to the transcription device 403; the transcription device 403 receives the user-labeled audio signal, converts it into text information carrying the user labels, and returns the text carrying the user labels to the server device 402 or stores it in a database 406. Further, as shown in FIG. 4a, the audio processing system also includes a query terminal 404. The query terminal 404 can send a first query request to the server device 402, the first query request including the user label to be queried; the server device 402 receives the first query request, obtains from the text information carrying user labels the text corresponding to the queried user label, and returns it to the query terminal 404.
In another optional embodiment, as shown in FIG. 4a, after generating the user-labeled audio signal, the server device 402 can output it to an upper-layer application on the server device 402, for example a remote conference application or a social application. The upper-layer application can obtain the user information of the multi-speaker scenario, for example the users' identification information such as names, nicknames or voiceprint features, and can associate the user information with the user-labeled audio signal. The way in which the user information is associated with the user-labeled audio signal is not limited; for example, the upper-layer application stores a correspondence between user labels and user information, and based on this correspondence the user information corresponding to a user label can be found and associated with the user-labeled audio signal.
Further, as shown in FIG. 4a, the query terminal 404 can send a second query request to the server device 402, the second query request including the audio clip to be queried; the server device 402 receives the second query request, extracts from the user-labeled audio signal the user label corresponding to the queried audio clip, and returns the user label and/or the user information corresponding to the user label to the query terminal 404.
In yet another optional embodiment, as shown in FIG. 4b, the audio processing system further includes a playback device 405. The playback device 405 can send an audio signal acquisition request to the server device 402; based on this request, the server device 402 can output the user-labeled audio signal to the playback device 405, and the playback device 405 receives and plays the user-labeled audio signal.
It should be noted that the steps of the methods provided by the above embodiments may all be executed by the same device, or the method may be executed by different devices. For example, steps 101a to 103a may be executed by device A; or, steps 101a and 102a may be executed by device A and step 103a by device B; and so on.
In addition, some of the flows described in the above embodiments and the accompanying drawings contain multiple operations appearing in a specific order, but it should be clearly understood that these operations may be executed out of the order in which they appear herein or in parallel. Operation numbers such as 101a and 102a are merely used to distinguish different operations; the numbers themselves do not represent any execution order. In addition, these flows may include more or fewer operations, and these operations may be executed sequentially or in parallel. It should be noted that descriptions such as "first" and "second" herein are used to distinguish different messages, devices, modules and the like; they do not represent a sequence, nor do they limit "first" and "second" to being of different types.
FIG. 5 is a schematic structural diagram of a sound pickup device provided by an exemplary embodiment of the present application. As shown in FIG. 5, the sound pickup device includes a processor 55 and a memory 54.
The memory 54 is configured to store a computer program, and may be configured to store various other data to support operations on the sound pickup device. Examples of such data include instructions for any application or method operating on the sound pickup device, contact data, phonebook data, messages, pictures, videos, and the like.
The memory 54 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as a static random access memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, a magnetic disk or an optical disk.
The processor 55 is coupled to the memory 54 and is configured to execute the computer program in the memory 54 in order to: identify speaker change points in an audio signal collected in a multi-speaker scenario; split the audio signal into multiple audio clips according to the speaker change points and extract voiceprint features of the multiple audio clips; hierarchically cluster the multiple audio clips according to their durations and voiceprint features to obtain audio clips corresponding to the same speaker; and add the same user label to the audio clips corresponding to the same speaker to obtain a user-labeled audio signal.
The above process may be completed entirely on the sound pickup device, or part of the functions may be executed on the server device. For example, the parts of extracting the voiceprint features of the multiple audio clips, hierarchically clustering the multiple audio clips according to their durations and voiceprint features to obtain audio clips corresponding to the same speaker, and adding the same user label to the audio clips corresponding to the same speaker to obtain a user-labeled audio signal may be completed with the cooperation of the server device.
In an optional embodiment, when hierarchically clustering the multiple audio clips according to their durations and voiceprint features to obtain audio clips corresponding to the same speaker, the processor 55 is specifically configured to: layer the multiple audio clips according to their durations, to obtain multiple layers of audio clips corresponding to different duration ranges; and hierarchically cluster the multiple layers of audio clips in order of duration range from long to short according to the voiceprint features corresponding to the multiple audio clips, to obtain at least one clustering result, each clustering result including audio clips corresponding to the same speaker.
In an optional embodiment, when layering the multiple audio clips according to their durations to obtain multiple layers of audio clips corresponding to different duration ranges, the processor 55 is specifically configured to: layer the multiple audio clips according to their durations and a preset duration threshold for each layer, to obtain multiple layers of audio clips corresponding to different duration ranges, where the smaller the layer number, the larger the corresponding duration threshold, and the duration of the audio clips in each layer is greater than or equal to the duration threshold of that layer.
In an optional embodiment, when hierarchically clustering the multiple layers of audio clips in order of duration range from long to short according to the voiceprint features corresponding to the multiple audio clips to obtain at least one clustering result, the processor 55 is specifically configured to: for the audio clips in the first layer, cluster the audio clips in the first layer according to their voiceprint features, to obtain at least one clustering result; for the audio clips in the layers other than the first layer, in ascending order of layer number, cluster the audio clips of each such layer into the existing clustering results according to their voiceprint features; and if there are remaining audio clips in a layer other than the first layer that have not been clustered into the existing clustering results, cluster the remaining audio clips according to their voiceprint features to generate new clustering results, until every audio clip on all layers has been clustered into a clustering result.
In an optional embodiment, when identifying speaker change points in the audio signal collected in the multi-speaker scenario, the processor 55 is specifically configured to: perform sound source localization on the audio signal, to obtain change points of the sound source position; and determine the speaker change points in the audio signal according to the change points of the sound source position, where each audio clip split out by the speaker change points corresponds to a unique sound source position.
In an optional embodiment, when hierarchically clustering the multiple layers of audio clips in order of duration range from long to short according to the voiceprint features corresponding to the multiple audio clips to obtain at least one clustering result, the processor 55 is specifically configured to: hierarchically cluster the multiple layers of audio clips in order of duration range from long to short according to the voiceprint features and sound source positions corresponding to the multiple audio clips, to obtain at least one clustering result, each clustering result including audio clips corresponding to the same speaker.
In an optional embodiment, when hierarchically clustering the multiple layers of audio clips in order of duration range from long to short according to the voiceprint features and sound source positions corresponding to the multiple audio clips to obtain at least one clustering result, the processor 55 is specifically configured to: for the audio clips in the first layer, cluster the audio clips in the first layer according to their voiceprint features and sound source positions, to obtain at least one clustering result; for the audio clips in the layers other than the first layer, in ascending order of layer number, cluster the audio clips of each such layer into the existing clustering results according to their voiceprint features and sound source positions; and if there are remaining audio clips in a layer other than the first layer that have not been clustered into the existing clustering results, cluster the remaining audio clips according to their voiceprint features and sound source positions to generate new clustering results, until every audio clip on all layers has been clustered into a clustering result.
In an optional embodiment, for the audio clips in the first layer, when clustering the audio clips in the first layer according to their voiceprint features and sound source positions to obtain at least one clustering result, the processor 55 is specifically configured to: in the case that the first layer contains at least two audio clips, calculate the overall similarity between the at least two audio clips in the first layer according to their voiceprint features and sound source positions; divide the at least two audio clips in the first layer into at least one clustering result according to the overall similarity between them; and calculate the cluster center of each of the at least one clustering result according to the voiceprint features and sound source positions of the audio clips it contains, the cluster center including a center voiceprint feature and a center sound source position.
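A minimal sketch of this first-layer step is given below: it computes pairwise overall similarities from voiceprint features and sound source positions and groups the clips greedily. The 0.7/0.3 weights and the 0.75 threshold are illustrative assumptions rather than values taken from the application.

```python
import numpy as np

def pairwise_overall_similarity(clips, w_voice=0.7, w_pos=0.3):
    """Overall similarity between every pair of first-layer clips.

    Combines cosine similarity of voiceprint features with a closeness
    score of the sound source positions.
    """
    n = len(clips)
    sims = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            vi = np.asarray(clips[i]["voiceprint"], float)
            vj = np.asarray(clips[j]["voiceprint"], float)
            v_sim = np.dot(vi, vj) / (np.linalg.norm(vi) * np.linalg.norm(vj))
            pi = np.asarray(clips[i]["position"], float)
            pj = np.asarray(clips[j]["position"], float)
            p_sim = 1.0 / (1.0 + np.linalg.norm(pi - pj))
            sims[i, j] = sims[j, i] = w_voice * v_sim + w_pos * p_sim
    return sims

def group_first_layer(clips, threshold=0.75):
    """Greedy grouping of first-layer clips by overall similarity."""
    sims = pairwise_overall_similarity(clips)
    groups = []                                  # each group is a list of clip indices
    for i in range(len(clips)):
        for group in groups:
            if all(sims[i, j] >= threshold for j in group):
                group.append(i)
                break
        else:
            groups.append([i])
    return groups
```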
In an optional embodiment, for the audio clips in the layers other than the first layer, when clustering, in ascending order of layer number, the audio clips of each such layer into the existing clustering results according to their voiceprint features and sound source positions, the processor 55 is specifically configured to: for each audio clip in any layer other than the first layer, calculate the overall similarity between the audio clip and an existing clustering result according to the voiceprint feature and sound source position corresponding to the audio clip and the cluster center of the existing clustering result; and if there is a target clustering result among the existing clustering results whose overall similarity with the audio clip satisfies a set similarity condition, add the audio clip to the target clustering result and update the cluster center of the target clustering result according to the voiceprint feature and sound source position corresponding to the audio clip.
In an optional embodiment, when updating the cluster center of the target clustering result according to the voiceprint feature and sound source position corresponding to the audio clip, the processor 55 is specifically configured to: determine the layer to which each audio clip contained in the target clustering result belongs, where different layers correspond to different weights and the smaller the layer number, the larger the corresponding weight; and perform, according to the weights corresponding to the layers to which the audio clips belong, a weighted summation of the voiceprint features and of the sound source positions of those audio clips, respectively, to obtain a new center voiceprint feature and a new center sound source position as the updated cluster center of the target clustering result.
In an optional embodiment, when hierarchically clustering the multiple layers of audio clips in order of duration range from long to short according to the voiceprint features and sound source positions corresponding to the multiple audio clips to obtain at least one clustering result, the processor 55 is specifically configured to: cluster the audio clips in each layer according to the voiceprint features of the audio clips in that layer, to obtain a clustering result for each layer; and, in ascending order of layer number, cluster the clustering results of every two adjacent layers in turn according to the voiceprint features of each layer's clustering results, to obtain at least one clustering result.
In an optional embodiment, the processor 55 is further configured to: output the user-labeled audio signal to a transcription device, so that the transcription device converts the user-labeled audio signal into text information carrying the user labels; or output the user-labeled audio signal to a playback device, so that the playback device plays the user-labeled audio signal; or output the user-labeled audio signal to an upper-layer application, so that the upper-layer application obtains the user information corresponding to the user labels and associates it with the user-labeled audio clips.
In an optional embodiment, the processor 55 is further configured to: receive a first query request that includes the user label to be queried, obtain from the text information carrying user labels the text corresponding to the queried user label, and return it to the query terminal that initiated the first query request; or receive a second query request that includes the audio clip to be queried, extract from the user-labeled audio signal the user label corresponding to the queried audio clip, and return the user label and/or the user information corresponding to the user label to the query terminal that initiated the second query request.
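The two query paths can be sketched as follows; the data structures (a mapping from user label to transcript segments, and time-stamped labeled clips) are assumptions made for illustration only.

```python
def handle_first_query(labeled_texts, user_label):
    """First query: look up the transcript segments for one user label.

    'labeled_texts' maps a user label to the list of text segments
    attributed to that label by the transcription device.
    """
    return labeled_texts.get(user_label, [])

def handle_second_query(labeled_clips, user_info, query_start, query_end):
    """Second query: find the user label (and user info) of the labeled
    clip that covers the queried time range in the audio signal."""
    for clip in labeled_clips:
        if clip["start"] <= query_start and query_end <= clip["end"]:
            label = clip["user_label"]
            return {"user_label": label, "user_info": user_info.get(label)}
    return None
```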
For a detailed description of each operation, reference may be made to the description in the foregoing method embodiments, which is not repeated here.
Further, as shown in FIG. 5, the sound pickup device also includes other components such as a communication component 56, a display 57, a power supply component 58 and an audio component 59. FIG. 5 only schematically shows some of the components, which does not mean that the sound pickup device includes only the components shown in FIG. 5.
Correspondingly, an embodiment of the present application further provides a computer-readable storage medium storing a computer program, which, when executed by a processor, causes the processor to implement the steps executable by the sound pickup device in the method embodiments shown in FIG. 1a and FIG. 1b.
Correspondingly, an embodiment of the present application further provides a computer program product including a computer program/instructions, which, when executed by a processor, cause the processor to implement the steps executable by the sound pickup device in the method embodiments shown in FIG. 1a and FIG. 1b.
An embodiment of the present application further provides a sound pickup device whose implementation structure is the same as or similar to that of the sound pickup device shown in FIG. 5 and can be implemented with reference to the structure shown in FIG. 5. The main difference between the sound pickup device provided by this embodiment and the one in the embodiment shown in FIG. 5 lies in the functions implemented by the processor executing the computer program stored in the memory. For the sound pickup device provided by this embodiment, its processor executes the computer program stored in the memory in order to: perform sound source localization on an audio signal collected in a multi-speaker scenario, to obtain change points of the sound source position; split the audio signal into multiple audio clips according to the change points of the sound source position and extract voiceprint features of the multiple audio clips; cluster the multiple audio clips according to their voiceprint features and sound source positions, to obtain audio clips corresponding to the same speaker; and add the same user label to the audio clips corresponding to the same speaker, to obtain a user-labeled audio signal. For a detailed description of each operation, reference may be made to the description in the foregoing method embodiments, which is not repeated here.
The above process may be completed entirely on the sound pickup device, or part of the functions may be executed on the server device. For example, the processes of extracting the voiceprint features of the multiple audio clips, clustering the multiple audio clips according to their voiceprint features and sound source positions to obtain audio clips corresponding to the same speaker, and adding the same user label to the audio clips corresponding to the same speaker to obtain a user-labeled audio signal may be completed with the cooperation of the server device.
Correspondingly, an embodiment of the present application further provides a computer-readable storage medium storing a computer program, which, when executed by a processor, causes the processor to implement the steps executable by the sound pickup device in the method embodiment shown in FIG. 1c.
Correspondingly, an embodiment of the present application further provides a computer program product including a computer program/instructions, which, when executed by a processor, cause the processor to implement the steps executable by the sound pickup device in the method embodiment shown in FIG. 1c.
FIG. 6 is a schematic structural diagram of a server device provided by an exemplary embodiment of the present application. As shown in FIG. 6, the server device includes a processor 65 and a memory 64.
The memory 64 is configured to store a computer program, and may be configured to store various other data to support operations on the server device. Examples of such data include instructions for any application or method operating on the server device, contact data, phonebook data, messages, pictures, videos, and the like.
The memory 64 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as a static random access memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, a magnetic disk or an optical disk.
The processor 65 is coupled to the memory 64 and is configured to execute the computer program in the memory 64 in order to: receive multiple audio clips and their corresponding voiceprint features sent by a sound pickup device; hierarchically cluster the multiple audio clips according to their durations and voiceprint features, to obtain audio clips corresponding to the same speaker; and add the same user label to the audio clips corresponding to the same speaker, to obtain a user-labeled audio signal. For a detailed description of each operation, reference may be made to the description in the foregoing method embodiments, which is not repeated here.
Further, as shown in FIG. 6, the server device also includes other components such as a communication component 66 and a power supply component 68. FIG. 6 only schematically shows some of the components, which does not mean that the server device includes only the components shown in FIG. 6.
An embodiment of the present application further provides a server device whose implementation structure is the same as or similar to that of the server device shown in FIG. 6 and can be implemented with reference to the structure shown in FIG. 6. The main difference between the server device provided by this embodiment and the one in the embodiment shown in FIG. 6 lies in the functions implemented by the processor executing the computer program stored in the memory. For the server device provided by this embodiment, its processor executes the computer program stored in the memory in order to: receive multiple audio clips sent by a sound pickup device; extract the voiceprint features corresponding to the multiple audio clips; hierarchically cluster the multiple audio clips according to their durations and voiceprint features, to obtain audio clips corresponding to the same speaker; and add the same user label to the audio clips corresponding to the same speaker, to obtain a user-labeled audio signal. For a detailed description of each operation, reference may be made to the description in the foregoing method embodiments, which is not repeated here.
Correspondingly, an embodiment of the present application further provides a computer-readable storage medium storing a computer program, which, when executed by a processor, causes the processor to implement the steps executable by the server device in the audio signal processing method embodiments.
Correspondingly, an embodiment of the present application further provides a computer program product including a computer program/instructions, which, when executed by a processor, cause the processor to implement the steps executable by the server device in the above audio signal processing method embodiments.
In addition to the above devices, an embodiment of the present application further provides a conference recording device, which includes a memory and a processor. The memory is configured to store a computer program; the processor is coupled to the memory and is configured to execute the computer program stored in the memory in order to: collect an audio signal in a multi-person conference scenario and identify speaker change points in the audio signal; split the audio signal into multiple audio clips according to the speaker change points and extract voiceprint features of the multiple audio clips; hierarchically cluster the multiple audio clips according to their durations and voiceprint features, to obtain audio clips corresponding to the same speaker; add the same user label to the audio clips corresponding to the same speaker, to obtain a user-labeled audio signal; and generate conference record information according to the user-labeled audio signal, the conference record information including a conference identifier.
An embodiment of the present application further provides a conference record presentation device, which includes a memory and a processor. The memory is configured to store a computer program; the processor is coupled to the memory and is configured to execute the computer program stored in the memory in order to: receive a conference consultation request that contains a conference identifier to be presented; obtain, according to the conference identifier, the conference record information to be presented; and present the conference record information, the conference record information being generated from a user-labeled audio signal of a multi-person conference scenario, where, among the multiple audio clips split out according to the speaker change points in the audio signal, the audio clips corresponding to the same speaker carry the same user label, and the audio clips corresponding to the same speaker are obtained by hierarchically clustering the multiple audio clips according to their durations and voiceprint features.
Correspondingly, an embodiment of the present application further provides a computer-readable storage medium storing a computer program, which, when executed by a processor, causes the processor to implement the steps of the method embodiments shown in FIG. 3d or FIG. 3e.
Correspondingly, an embodiment of the present application further provides a computer program product including a computer program/instructions, which, when executed by a processor, cause the processor to implement the steps of the method embodiments shown in FIG. 3d or FIG. 3e.
The communication components in FIG. 5 and FIG. 6 above are configured to facilitate wired or wireless communication between the device where the communication component is located and other devices. The device where the communication component is located may access a wireless network based on a communication standard, such as WiFi, a mobile communication network such as 2G, 3G, 4G/LTE or 5G, or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
The display in FIG. 5 above includes a screen, which may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe operation.
The power supply components in FIG. 5 and FIG. 6 above provide power for the various components of the device where the power supply component is located. The power supply component may include a power management system, one or more power supplies, and other components associated with generating, managing and distributing power for the device where the power supply component is located.
The audio component in FIG. 5 above may be configured to output and/or input audio signals. For example, the audio component includes a microphone (MIC), which is configured to receive external audio signals when the device where the audio component is located is in an operation mode such as a call mode, a recording mode or a speech recognition mode. The received audio signal may be further stored in the memory or sent via the communication component. In some embodiments, the audio component further includes a speaker for outputting audio signals.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage and the like) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the methods, devices (systems) and computer program products according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operational steps are executed on the computer or other programmable device to produce a computer-implemented process, and the instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface and memory.
The memory may include non-permanent memory, random access memory (RAM) and/or non-volatile memory in a computer-readable medium, such as read-only memory (ROM) or flash RAM. Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements, but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes the element.
The above are only embodiments of the present application and are not intended to limit the present application. For those skilled in the art, various modifications and variations of the present application are possible. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included within the scope of the claims of the present application.

Claims (26)

  1. An audio signal processing method, characterized by comprising:
    identifying speaker change points in an audio signal collected in a multi-speaker scenario;
    splitting the audio signal into multiple audio clips according to the speaker change points, and extracting voiceprint features of the multiple audio clips;
    hierarchically clustering the multiple audio clips according to the durations and voiceprint features of the multiple audio clips, to obtain audio clips corresponding to the same speaker;
    adding the same user label to the audio clips corresponding to the same speaker, to obtain a user-labeled audio signal.
  2. The method according to claim 1, characterized in that hierarchically clustering the multiple audio clips according to the durations and voiceprint features of the multiple audio clips to obtain audio clips corresponding to the same speaker comprises:
    layering the multiple audio clips according to the durations of the multiple audio clips, to obtain multiple layers of audio clips corresponding to different duration ranges;
    hierarchically clustering the multiple layers of audio clips in order of duration range from long to short according to the voiceprint features corresponding to the multiple audio clips, to obtain at least one clustering result, each clustering result including audio clips corresponding to the same speaker.
  3. The method according to claim 2, characterized in that layering the multiple audio clips according to the durations of the multiple audio clips to obtain multiple layers of audio clips corresponding to different duration ranges comprises:
    layering the multiple audio clips according to the durations of the multiple audio clips and a preset duration threshold for each layer, to obtain multiple layers of audio clips corresponding to different duration ranges;
    wherein the smaller the layer number, the larger the corresponding duration threshold, and the duration of the audio clips in each layer is greater than or equal to the duration threshold of that layer.
  4. The method according to claim 3, characterized in that hierarchically clustering the multiple layers of audio clips in order of duration range from long to short according to the voiceprint features corresponding to the multiple audio clips to obtain at least one clustering result comprises:
    for the audio clips in the first layer, clustering the audio clips in the first layer according to the voiceprint features corresponding to the audio clips in the first layer, to obtain at least one clustering result;
    for the audio clips in layers other than the first layer, in ascending order of layer number, clustering the audio clips of each layer other than the first layer into the existing clustering results according to the voiceprint features corresponding to those audio clips; and
    if there are remaining audio clips in a layer other than the first layer that have not been clustered into the existing clustering results, clustering the remaining audio clips according to the voiceprint features corresponding to the remaining audio clips to generate new clustering results, until every audio clip on all layers has been clustered into a clustering result.
  5. The method according to claim 3, characterized in that identifying speaker change points in an audio signal collected in a multi-speaker scenario comprises:
    performing sound source localization on the audio signal, to obtain change points of the sound source position;
    determining the speaker change points in the audio signal according to the change points of the sound source position, wherein each audio clip split out by the speaker change points corresponds to a unique sound source position.
  6. The method according to claim 5, characterized in that hierarchically clustering the multiple layers of audio clips in order of duration range from long to short according to the voiceprint features corresponding to the multiple audio clips to obtain at least one clustering result comprises:
    hierarchically clustering the multiple layers of audio clips in order of duration range from long to short according to the voiceprint features and sound source positions corresponding to the multiple audio clips, to obtain at least one clustering result, each clustering result including audio clips corresponding to the same speaker.
  7. The method according to claim 6, characterized in that hierarchically clustering the multiple layers of audio clips in order of duration range from long to short according to the voiceprint features and sound source positions corresponding to the multiple audio clips to obtain at least one clustering result comprises:
    for the audio clips in the first layer, clustering the audio clips in the first layer according to the voiceprint features and sound source positions corresponding to the audio clips in the first layer, to obtain at least one clustering result;
    for the audio clips in layers other than the first layer, in ascending order of layer number, clustering the audio clips of each layer other than the first layer into the existing clustering results according to the voiceprint features and sound source positions corresponding to those audio clips; and
    if there are remaining audio clips in a layer other than the first layer that have not been clustered into the existing clustering results, clustering the remaining audio clips according to the voiceprint features and sound source positions corresponding to the remaining audio clips to generate new clustering results, until every audio clip on all layers has been clustered into a clustering result.
  8. The method according to claim 7, characterized in that, for the audio clips in the first layer, clustering the audio clips in the first layer according to the voiceprint features and sound source positions corresponding to the audio clips in the first layer to obtain at least one clustering result comprises:
    in the case that the first layer contains at least two audio clips, calculating the overall similarity between the at least two audio clips in the first layer according to the voiceprint features and sound source positions corresponding to the at least two audio clips in the first layer;
    dividing the at least two audio clips in the first layer into at least one clustering result according to the overall similarity between the at least two audio clips in the first layer; and
    calculating the cluster center of each of the at least one clustering result according to the voiceprint features and sound source positions corresponding to the audio clips contained in the at least one clustering result, the cluster center including a center voiceprint feature and a center sound source position.
  9. The method according to claim 8, characterized in that, for the audio clips in layers other than the first layer, clustering, in ascending order of layer number, the audio clips of each layer other than the first layer into the existing clustering results according to the voiceprint features and sound source positions corresponding to those audio clips comprises:
    for each audio clip in any layer other than the first layer, calculating the overall similarity between the audio clip and an existing clustering result according to the voiceprint feature and sound source position corresponding to the audio clip and the cluster center of the existing clustering result;
    if there is a target clustering result among the existing clustering results whose overall similarity with the audio clip satisfies a set similarity condition, adding the audio clip to the target clustering result, and updating the cluster center of the target clustering result according to the voiceprint feature and sound source position corresponding to the audio clip.
  10. The method according to claim 9, characterized in that updating the cluster center of the target clustering result according to the voiceprint feature and sound source position corresponding to the audio clip comprises:
    determining the layer to which each audio clip contained in the target clustering result belongs, wherein different layers correspond to different weights, and the smaller the layer number, the larger the corresponding weight;
    performing, according to the weights corresponding to the layers to which the audio clips belong, a weighted summation of the voiceprint features and of the sound source positions corresponding to the audio clips, respectively, to obtain a new center voiceprint feature and a new center sound source position as the updated cluster center of the target clustering result.
  11. The method according to claim 6, characterized in that hierarchically clustering the multiple layers of audio clips in order of duration range from long to short according to the voiceprint features and sound source positions corresponding to the multiple audio clips to obtain at least one clustering result comprises:
    clustering the audio clips in each layer according to the voiceprint features corresponding to the audio clips in that layer, to obtain a clustering result for each layer;
    in ascending order of layer number, clustering the clustering results of every two adjacent layers in turn according to the voiceprint features of each layer's clustering results, to obtain at least one clustering result.
  12. The method according to any one of claims 1-11, characterized by further comprising:
    outputting the user-labeled audio signal to a transcription device, so that the transcription device converts the user-labeled audio signal into text information carrying the user labels;
    or
    outputting the user-labeled audio signal to a playback device, so that the playback device plays the user-labeled audio signal;
    or
    outputting the user-labeled audio signal to an upper-layer application, so that the upper-layer application obtains the user information corresponding to the user labels and associates it with the user-labeled audio clips.
  13. The method according to claim 12, characterized by further comprising:
    receiving a first query request, the first query request including a user label to be queried, obtaining, from the text information carrying user labels, the text information corresponding to the queried user label, and returning it to the query terminal that initiated the first query request;
    or
    receiving a second query request, the second query request including an audio clip to be queried, extracting, from the user-labeled audio signal, the user label corresponding to the queried audio clip, and returning the user label and/or the user information corresponding to the user label to the query terminal that initiated the second query request.
  14. An audio signal processing method, characterized by comprising:
    performing sound source localization on an audio signal collected in a multi-speaker scenario, to obtain change points of the sound source position;
    splitting the audio signal into multiple audio clips according to the change points of the sound source position, and extracting voiceprint features of the multiple audio clips;
    clustering the multiple audio clips according to the voiceprint features and sound source positions of the multiple audio clips, to obtain audio clips corresponding to the same speaker;
    adding the same user label to the audio clips corresponding to the same speaker, to obtain a user-labeled audio signal.
  15. The method according to claim 14, characterized in that clustering the multiple audio clips according to the voiceprint features and sound source positions of the multiple audio clips to obtain audio clips corresponding to the same speaker comprises:
    layering the multiple audio clips according to the durations of the multiple audio clips, to obtain multiple layers of audio clips corresponding to different duration ranges;
    hierarchically clustering the multiple layers of audio clips in order of duration range from long to short according to the voiceprint features and sound source positions corresponding to the multiple audio clips, to obtain at least one clustering result, each clustering result including audio clips corresponding to the same speaker.
  16. A conference recording method, characterized by comprising:
    collecting an audio signal in a multi-person conference scenario, and identifying speaker change points in the audio signal;
    splitting the audio signal into multiple audio clips according to the speaker change points, and extracting voiceprint features of the multiple audio clips;
    hierarchically clustering the multiple audio clips according to the durations and voiceprint features of the multiple audio clips, to obtain audio clips corresponding to the same speaker;
    adding the same user label to the audio clips corresponding to the same speaker, to obtain a user-labeled audio signal;
    generating conference record information according to the user-labeled audio signal, the conference record information including a conference identifier.
  17. A conference record presentation method, characterized by comprising:
    receiving a conference consultation request, the conference consultation request containing a conference identifier to be presented;
    obtaining, according to the conference identifier, the conference record information to be presented;
    presenting the conference record information, the conference record information being generated from a user-labeled audio signal of a multi-person conference scenario;
    wherein, among the multiple audio clips split out according to the speaker change points in the audio signal, the audio clips corresponding to the same speaker carry the same user label, and the audio clips corresponding to the same speaker are obtained by hierarchically clustering the multiple audio clips according to the durations and voiceprint features of the multiple audio clips.
  18. An audio processing system, characterized by comprising a sound pickup device and a server device;
    the sound pickup device is deployed in a multi-speaker scenario and is configured to collect an audio signal of the multi-speaker scenario, identify speaker change points in the audio signal, split the audio signal into multiple audio clips according to the speaker change points, and extract voiceprint features corresponding to the multiple audio clips;
    the server device is configured to hierarchically cluster the multiple audio clips according to the durations and voiceprint features of the multiple audio clips to obtain audio clips corresponding to the same speaker, and to add the same user label to the audio clips corresponding to the same speaker to obtain a user-labeled audio signal.
  19. The system according to claim 18, characterized in that
    the sound pickup device is specifically configured to: perform sound source localization on the audio signal, to obtain change points of the sound source position; and determine the speaker change points in the audio signal according to the change points of the sound source position, wherein each audio clip split out by the speaker change points corresponds to a unique sound source position;
    the server device is specifically configured to: hierarchically cluster the multiple audio clips according to the durations, voiceprint features and sound source positions of the multiple audio clips, to obtain audio clips corresponding to the same speaker.
  20. An audio processing system, characterized by comprising a sound pickup device and a server device;
    the sound pickup device is deployed in a multi-speaker scenario and is configured to collect an audio signal of the multi-speaker scenario, identify speaker change points in the audio signal, and split the audio signal into multiple audio clips according to the speaker change points;
    the server device is configured to extract voiceprint features corresponding to the multiple audio clips, hierarchically cluster the multiple audio clips according to the durations and voiceprint features of the multiple audio clips to obtain audio clips corresponding to the same speaker, and add the same user label to the audio clips corresponding to the same speaker to obtain a user-labeled audio signal.
  21. A sound pickup device, characterized by comprising a processor and a memory;
    the memory is configured to store a computer program;
    the processor is coupled to the memory and is configured to execute the computer program in order to: identify speaker change points in an audio signal collected in a multi-speaker scenario; split the audio signal into multiple audio clips according to the speaker change points, and extract voiceprint features of the multiple audio clips; hierarchically cluster the multiple audio clips according to the durations and voiceprint features of the multiple audio clips, to obtain audio clips corresponding to the same speaker; and add the same user label to the audio clips corresponding to the same speaker, to obtain a user-labeled audio signal.
  22. A sound pickup device, characterized by comprising a processor and a memory;
    the memory is configured to store a computer program;
    the processor is coupled to the memory and is configured to execute the computer program in order to: perform sound source localization on an audio signal collected in a multi-speaker scenario, to obtain change points of the sound source position; split the audio signal into multiple audio clips according to the change points of the sound source position, and extract voiceprint features of the multiple audio clips; cluster the multiple audio clips according to the voiceprint features and sound source positions of the multiple audio clips, to obtain audio clips corresponding to the same speaker; and add the same user label to the audio clips corresponding to the same speaker, to obtain a user-labeled audio signal.
  23. A server device, characterized by comprising a processor and a memory;
    the memory is configured to store a computer program;
    the processor is coupled to the memory and is configured to execute the computer program in order to: receive multiple audio clips and their corresponding voiceprint features sent by a sound pickup device; hierarchically cluster the multiple audio clips according to the durations and voiceprint features of the multiple audio clips, to obtain audio clips corresponding to the same speaker; and add the same user label to the audio clips corresponding to the same speaker, to obtain a user-labeled audio signal.
  24. A server device, characterized by comprising a processor and a memory;
    the memory is configured to store a computer program;
    the processor is coupled to the memory and is configured to execute the computer program in order to: receive multiple audio clips sent by a sound pickup device; extract voiceprint features corresponding to the multiple audio clips; hierarchically cluster the multiple audio clips according to the durations and voiceprint features of the multiple audio clips, to obtain audio clips corresponding to the same speaker; and add the same user label to the audio clips corresponding to the same speaker, to obtain a user-labeled audio signal.
  25. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it causes the processor to implement the steps of the method according to any one of claims 1-17.
  26. A computer program product comprising a computer program/instructions, characterized in that, when the computer program/instructions are executed by a processor, they cause the processor to implement the steps of the method according to any one of claims 1-17.
PCT/CN2022/073092 2021-01-26 2022-01-21 Audio signal processing, conference recording and presentation method, device, system and medium WO2022161264A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110105959.1 2021-01-26
CN202110105959.1A CN114792522A (zh) 2021-01-26 Audio signal processing, conference recording and presentation method, device, system and medium

Publications (1)

Publication Number Publication Date
WO2022161264A1 true WO2022161264A1 (zh) 2022-08-04

Family

ID=82460469

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/073092 WO2022161264A1 (zh) 2021-01-26 2022-01-21 Audio signal processing, conference recording and presentation method, device, system and medium

Country Status (2)

Country Link
CN (1) CN114792522A (zh)
WO (1) WO2022161264A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115168643A (zh) * 2022-09-07 2022-10-11 腾讯科技(深圳)有限公司 音频处理方法、装置、设备及计算机可读存储介质

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115828907B (zh) * 2023-02-16 2023-04-25 南昌航天广信科技有限责任公司 智能会议管理方法、系统、可读存储介质及计算机设备

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104021785A (zh) * 2014-05-28 2014-09-03 华南理工大学 一种提取会议中最重要嘉宾语音的方法
CN106657865A (zh) * 2016-12-16 2017-05-10 联想(北京)有限公司 会议纪要的生成方法、装置及视频会议系统
CN107733666A (zh) * 2017-10-31 2018-02-23 珠海格力电器股份有限公司 一种会议实现方法、装置及电子设备
CN112088315A (zh) * 2018-05-07 2020-12-15 微软技术许可有限责任公司 多模式语音定位
CN111128223A (zh) * 2019-12-30 2020-05-08 科大讯飞股份有限公司 一种基于文本信息的辅助说话人分离方法及相关装置
CN111613249A (zh) * 2020-05-22 2020-09-01 云知声智能科技股份有限公司 一种语音分析方法和设备
CN111739553A (zh) * 2020-06-02 2020-10-02 深圳市未艾智能有限公司 会议声音采集、会议记录以及会议记录呈现方法和装置
CN112165599A (zh) * 2020-10-10 2021-01-01 广州科天视畅信息科技有限公司 一种用于视频会议的会议纪要自动生成方法

Also Published As

Publication number Publication date
CN114792522A (zh) 2022-07-26

Similar Documents

Publication Publication Date Title
US20230402038A1 (en) Computerized intelligent assistant for conferences
US10819811B2 (en) Accumulation of real-time crowd sourced data for inferring metadata about entities
WO2022161264A1 (zh) 音频信号处理、会议记录与呈现方法、设备、系统及介质
US8219404B2 (en) Method and apparatus for recognizing a speaker in lawful interception systems
US7995732B2 (en) Managing audio in a multi-source audio environment
US20080235018A1 (en) Method and System for Determing the Topic of a Conversation and Locating and Presenting Related Content
US8086461B2 (en) System and method for tracking persons of interest via voiceprint
TWI536365B (zh) 聲紋辨識
WO2020238209A1 (zh) 音频处理的方法、系统及相关设备
US20160162844A1 (en) Automatic detection and analytics using sensors
US11011159B2 (en) Detection of potential exfiltration of audio data from digital assistant applications
US20120035919A1 (en) Voice recording device and method thereof
CN111739530A (zh) 一种交互方法、装置、耳机和耳机收纳装置
CN108320761B (zh) 音频录制方法、智能录音设备及计算机可读存储介质
CN110570847A (zh) 一种多人场景的人机交互系统及方法
US10037756B2 (en) Analysis of long-term audio recordings
JP2006279111A (ja) 情報処理装置、情報処理方法およびプログラム
JP2015094811A (ja) 通話録音可視化システムおよび通話録音可視化方法
WO2019155716A1 (ja) 情報処理装置、情報処理システム、および情報処理方法、並びにプログラム
CN110415703A (zh) 语音备忘信息处理方法及装置
CN111739529A (zh) 一种交互方法、装置、耳机和服务器
WO2021134720A1 (zh) 一种会议数据处理方法及相关设备
CN113889081A (zh) 语音识别方法、介质、装置和计算设备
KR102291113B1 (ko) 회의록 작성 장치 및 방법
US6934364B1 (en) Handset identifier using support vector machines

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22745139

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22745139

Country of ref document: EP

Kind code of ref document: A1