WO2023088448A1 - Speech processing method, device and storage medium - Google Patents

Speech processing method, device and storage medium

Info

Publication number
WO2023088448A1
WO2023088448A1 (application PCT/CN2022/133015, CN2022133015W)
Authority
WO
WIPO (PCT)
Prior art keywords
speech
segment
voice
segments
clustering
Prior art date
Application number
PCT/CN2022/133015
Other languages
English (en)
French (fr)
Inventor
王宪亮 (Wang Xianliang)
索宏彬 (Suo Hongbin)
Original Assignee
阿里巴巴达摩院(杭州)科技有限公司 (Alibaba Damo Academy (Hangzhou) Technology Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 阿里巴巴达摩院(杭州)科技有限公司 (Alibaba Damo Academy (Hangzhou) Technology Co., Ltd.)
Publication of WO2023088448A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 - Voice signal separating
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems

Definitions

  • the present application relates to the technical field of audio processing, in particular to a voice processing method, device and storage medium.
  • Role separation technology can determine which role speaks each part of the speech, and has wide application in conference systems and other fields.
  • In the prior art, the speech is usually segmented first into multiple segments of a preset duration, the similarity between every two segments is calculated, and segments are gradually merged in descending order of similarity score.
  • When the similarity score falls below a preset threshold, merging stops, yielding the role separation result.
  • The disadvantage of the existing technology is that clustering speech segments of a preset duration yields a severely fragmented result and poor role separation accuracy, which degrades the user experience.
  • the main purpose of the embodiments of the present application is to provide a voice processing method, device, and storage medium, so as to reduce fragmentation of role separation results and improve role separation effects.
  • the embodiment of the present application provides a voice processing method, including:
  • segmenting the single-channel speech according to role change point information in the single-channel speech to obtain multiple speech segments, wherein the role change point information is used to indicate the positions where the speaking role in the single-channel speech changes;
  • the plurality of speech segments includes a plurality of first segments and at least one second segment, and the length of any first segment is greater than the length of any second segment;
  • clustering the plurality of first segments, and assigning the at least one second segment to a category obtained after clustering, to obtain a role separation result of the single-channel speech;
  • outputting the speech text corresponding to each participating role according to the role separation result and the text information corresponding to the single-channel speech.
  • the embodiment of the present application provides a voice processing method, including:
  • segmenting the speech to be processed according to the role change point information in the speech to be processed to obtain a plurality of speech segments, wherein the role change point information is used to indicate the positions where the speaking role in the speech to be processed changes;
  • the plurality of speech segments includes a plurality of first segments and at least one second segment, and the length of any first segment is greater than the length of any second segment;
  • the embodiment of the present application provides a voice processing method, including:
  • the plurality of speech segments include a plurality of first segments and at least one second segment whose credibility is lower than that of the first segments;
  • the credibility of the speech segment is used to characterize the credibility of a clustering result obtained by clustering based on the speech segment.
  • the embodiment of the present application provides a voice processing device, including:
  • the obtaining module is used to obtain single-channel voices corresponding to multiple participating roles collected by the conference system;
  • the first segmentation module is configured to segment the single-channel speech according to the role change point information in the single-channel speech to obtain a plurality of speech segments, wherein the role change point information is used to indicate the positions where the speaking role in the single-channel speech changes; the plurality of speech segments include a plurality of first segments and at least one second segment, and the length of any first segment is greater than the length of any second segment;
  • a first processing module configured to cluster the plurality of first segments, and assign the at least one second segment to a category obtained after clustering, to obtain a role separation result of the single-channel speech
  • An output module configured to output the speech text corresponding to each participating role according to the role separation result and the text information corresponding to the single-channel voice.
  • the embodiment of the present application provides a voice processing device, including:
  • the second segmentation module is used to segment the speech to be processed according to the role change point information in the speech to be processed to obtain a plurality of speech segments, wherein the role change point information is used to indicate the positions where the speaking role in the speech to be processed changes; the plurality of speech segments include a plurality of first segments and at least one second segment, and the length of any first segment is greater than the length of any second segment;
  • the second processing module is configured to cluster the plurality of first segments, assign the at least one second segment to a category obtained after clustering, and obtain a role separation result of the speech to be processed.
  • the embodiment of the present application provides a voice processing device, including:
  • the third segmentation module is configured to segment the speech to be processed to obtain a plurality of speech segments; wherein, the plurality of speech segments include a plurality of first segments and at least one second segment whose credibility is less than the first segment;
  • a third processing module configured to cluster the plurality of first segments, and assign the at least one second segment to a category obtained after clustering, to obtain a role separation result of the speech to be processed;
  • the credibility of the speech segment is used to characterize the credibility of a clustering result obtained by clustering based on the speech segment.
  • the embodiment of the present application provides a voice processing device, including:
  • a memory and at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, cause the speech processing device to perform the method described in the first, second, or third aspect.
  • the embodiment of the present application provides a voice processing device, including: a processing device and at least one of the following communication-connected with the processing device: a voice input device, a display device;
  • the voice input device is used to collect the voice to be analyzed and send it to the processing device;
  • the display device is used to display the role separation result determined by the processing device and/or the voice-to-text information determined through the role separation result;
  • the processing device is configured to execute the method described in the first aspect or the second aspect or the third aspect.
  • the embodiment of the present application provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the method described in the first, second, or third aspect.
  • an embodiment of the present application provides a computer program product, including a computer program, and when the computer program is executed by a processor, implements the method described in the first aspect or the second aspect or the third aspect.
  • The speech processing method, device, and storage medium provided by the present application segment the speech to be processed according to the role change point information in the speech to be processed to obtain a plurality of speech segments, wherein the role change point information is used to indicate the positions where the speaking role in the speech to be processed changes, the plurality of speech segments include a plurality of first segments and at least one second segment, and the length of any first segment is greater than the length of any second segment. The plurality of first segments are clustered and the at least one second segment is assigned to the clustered categories to obtain the role separation result of the speech to be processed, so that the clustering result based on the first segments guides the classification of the second segments. This greatly reduces fragmentation, significantly improves the user experience, and does not depend on a threshold to determine the clustering termination condition; the method is more robust in different environments and effectively improves the accuracy and stability of role separation.
  • FIG. 1 is a schematic diagram of an application scenario of an embodiment of the present application
  • FIG. 2 is a schematic flow chart of a speech processing method provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of the application of a role separation result provided by the embodiment of the present application.
  • FIG. 4 is a schematic flow chart of another voice processing method provided in the embodiment of the present application.
  • Fig. 5 is a schematic diagram of the principle of role separation provided by the embodiment of the present application.
  • FIG. 6 is a schematic diagram of a principle of determining a speech window provided by an embodiment of the present application.
  • FIG. 7 is a schematic flow chart of a clustering method provided in an embodiment of the present application.
  • FIG. 8 is a schematic flow chart of another speech processing method provided by the embodiment of the present application.
  • FIG. 9 is a schematic flowchart of another voice processing method provided in the embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a speech processing device provided by an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of another speech processing device provided by an embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of another speech processing device provided by the embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of a voice processing device provided in an embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of another speech processing device provided by an embodiment of the present application.
  • FIG. 1 is a schematic diagram of an application scenario of an embodiment of the present application.
  • multiple users A, B, and C can use the same voice input device, such as a microphone; the voice input device transmits the acquired single-channel speech to the processing device, and the processing device performs role separation on the speech to be processed to distinguish the role corresponding to each part of the speech.
  • In the prior art, the speech can be segmented according to a preset duration, such as 1 second; after multiple 1-second segments are obtained, the features of each segment are extracted, the similarity between every two segments is calculated, and a clustering algorithm gradually merges segments in descending order of similarity score, stopping when the similarity score falls below the threshold.
  • In view of this, the embodiment of the present application provides a speech processing method applicable to a conference system, which segments single-channel speech according to role change points, first clusters the long segments, and then assigns the short segments to the corresponding category centers, so that the clustering result based on the long segments guides the classification of the short segments. This greatly reduces fragmentation, significantly improves the user experience, and does not depend on a threshold to determine the clustering termination condition; the method is more robust in different environments and effectively improves the accuracy and stability of role separation.
  • FIG. 2 is a schematic flowchart of a speech processing method provided by an embodiment of the present application.
  • the method in this embodiment can be applied to the scenario shown in FIG. 1 , and the execution subject of the method can be any device with a data processing function, such as the processing device in FIG. 1 .
  • the voice input device and the processing device may be separate or integrated; for example, they may be implemented through an all-in-one conference system, or through terminals such as mobile phones, computers, and tablet devices.
  • the terminal may send the voice to be processed to the server, and the server may feed back the result to the terminal after obtaining the role separation result through the method provided in the embodiment of the present application.
  • the method may include:
  • Step 201 Obtain single-channel voices corresponding to multiple conference participants collected by the conference system.
  • the conference system may be implemented by hardware, software, or a combination of software and hardware.
  • the conference system may include the voice input device and the processing device in FIG. 1, with the voice input device collecting the voices corresponding to multiple participating roles as a single-channel voice; or the conference system may include an application program that processes the collected single-channel voice.
  • Step 202 Segment the single-channel speech according to the role change point information in the single-channel speech to obtain multiple speech segments.
  • the role change point information is used to indicate the position where the speaking role changes in the single-channel speech;
  • the plurality of speech segments include a plurality of first segments and at least one second segment, and the length of any first segment is greater than the length of any second segment.
  • Step 203 clustering a plurality of first segments, and assigning at least one second segment to a category obtained after clustering, to obtain a role separation result of the single-channel speech.
  • Step 204 According to the role separation result and the text information corresponding to the single-channel voice, output speech text corresponding to each participating role.
  • speech recognition may be performed on the single-channel speech to obtain the corresponding text information, which is combined with the role separation result to determine the speech text corresponding to each participating role.
  • different participating roles may have different identification methods. For example, multiple participating roles may be respectively marked as role ID1, role ID2, ...; or, multiple participating roles may be respectively marked as roles A, B, C, ... and so on.
  • FIG. 3 is a schematic diagram of an application of a role separation result provided by an embodiment of the present application.
  • speech-to-text recognition can be performed on the speech to be processed collected in the meeting to obtain the corresponding text information.
  • the text information does not distinguish each participating role.
  • the collected single-channel speech can be role-separated, and the role separation result indicates the role ID corresponding to each speech segment; combined with the text information of the meeting, the role to which each text part belongs can be determined and each sentence marked with its speaking role, effectively realizing meeting recording and classification and improving the user experience.
  • The speech processing method provided in this embodiment obtains the single-channel speech corresponding to multiple participating roles collected by the conference system, and segments it according to the role change point information in the single-channel speech to obtain multiple speech segments, wherein the role change point information indicates the positions where the speaking role in the single-channel speech changes, the plurality of speech segments include a plurality of first segments and at least one second segment, and the length of any first segment is greater than the length of any second segment. The plurality of first segments are clustered, the at least one second segment is assigned to the categories obtained after clustering to obtain the role separation result of the single-channel speech, and the speech text corresponding to each participating role is output according to the role separation result and the text information corresponding to the single-channel speech. This realizes role separation for single-channel speech in a conference system quickly and accurately, performs strongly in different noise environments, meets meeting needs in different environments, and improves the user experience.
  • the voice processing method provided in one or more embodiments of the present application may also be applied to any scenario that requires role separation.
  • The following examples illustrate this.
  • one or more embodiments of the present application can be applied to educational scenarios, including offline and/or online scenarios. The roles involved can have multiple identities, such as teacher, student, and teaching assistant, and each identity can correspond to at least one role; for example, there may be one teacher and multiple students.
  • Speech collected in and out of class can be processed to realize the separation of different roles.
  • In this scenario, a speech processing method may include: acquiring the speech to be processed output by multiple roles and collected by an educational assistance system, where the speech to be processed is single-channel speech; segmenting the speech to be processed according to the role change point information in the speech to be processed to obtain multiple speech segments, wherein the role change point information is used to indicate the positions where the speaking role in the speech to be processed changes, the multiple speech segments include multiple first segments and at least one second segment, and the length of any first segment is greater than the length of any second segment; clustering the multiple first segments and assigning the at least one second segment to the categories obtained after clustering to obtain the role separation result of the speech to be processed; and extracting, according to the role separation result, the speech information corresponding to at least some of the roles, where the speech information may be in speech and/or text form.
  • For example, in a classroom discussion, the corresponding speech can be collected and the roles separated by the method provided in the embodiment of this application; the speech fragments of each student are obtained, and the speech information of some or all students can be selected and displayed to teachers, making it convenient for teachers to evaluate or guide.
  • one or more embodiments of the present application may be applied to court trial scenarios.
  • the voice collected at the court trial scene can be processed, and then the separation of different roles can be realized.
  • In this scenario, a speech processing method may include: acquiring the speech to be processed output by multiple roles and collected at the court trial site, where the speech to be processed is single-channel speech; segmenting the speech to be processed according to the role change point information in the speech to be processed to obtain a plurality of speech segments, wherein the role change point information is used to indicate the positions where the speaking role in the speech to be processed changes, the plurality of speech segments include multiple first segments and at least one second segment, and the length of any first segment is greater than the length of any second segment; clustering the multiple first segments and assigning the at least one second segment to the categories obtained after clustering to obtain the role separation result of the speech to be processed; and generating a court trial record according to the role separation result and the text information corresponding to the speech to be processed.
  • the voice of the court trial scene can be collected and role-separated by the method provided by this application; combined with the text corresponding to the voice, the corresponding court trial record can be generated, improving the efficiency and accuracy of generating court trial records and providing more efficient and reliable text records for court trials.
  • one or more embodiments of the present application may be applied to audio recording arrangement.
  • One or more recordings can be sorted; the collection object of the recordings can be the voice output by a person or a machine, and the collection time of the recordings is not limited.
  • In this scenario, a speech processing method may include: obtaining at least one speech to be processed; segmenting the speech to be processed according to the role change point information in the speech to be processed to obtain a plurality of speech segments, wherein the role change point information is used to indicate the positions where the speaking role in the speech to be processed changes, the plurality of speech segments include a plurality of first segments and at least one second segment, and the length of any first segment is greater than the length of any second segment; clustering the plurality of first segments and assigning the at least one second segment to the categories obtained after clustering to obtain the role separation result of the speech to be processed; and sorting the at least one speech to be processed based on the role separation result.
  • Voice sorting may include, but is not limited to: classifying or sorting multiple voices according to roles; marking the number of roles corresponding to each voice; extracting multiple voices with a high degree of role overlap; sorting the roles appearing in at least one voice by speaking duration; and extracting the voice segments corresponding to some or all of the roles in at least one voice, or the text corresponding to those segments.
  • With role separation technology, voices or voice fragments can be organized quickly and accurately, effectively improving the voice sorting effect and meeting the needs of different users.
  • FIG. 4 is a schematic flowchart of another voice processing method provided by the embodiment of the present application. As shown in Figure 4, the method may include:
  • Step 401 Segment the speech to be processed according to the role change point information in the speech to be processed to obtain a plurality of speech segments.
  • the method in this embodiment can be applied to any scenario.
  • In a conference scenario, the voice to be processed may be a single-channel voice collected by a conference system; in an education scenario, the voice to be processed may be a single-channel voice collected by an educational assistance system; in a court trial scenario, the voice to be processed may be a single-channel voice collected at the trial site; and in the recording arrangement scenario, the voice to be processed may be at least one piece of voice to be sorted.
  • the specific implementation methods are similar and will not be repeated here.
  • the role change point information is used to indicate the position where the speaking role in the speech to be processed changes;
  • the plurality of speech segments include a plurality of first segments and at least one second segment, and the length of any first segment is greater than the length of any second segment.
  • For example, suppose the speech to be processed is 30 seconds long; the role change point information indicates at which seconds the speaking role changes within those 30 seconds.
  • For example, the role change point information may include the 5th, 15th, and 20th seconds.
  • The speech to be processed can then be divided into at least four speech segments: 0–5 s, 5–15 s, 15–20 s, and 20–30 s. Each segment corresponds to one role, but the role ID corresponding to each segment cannot yet be distinguished.
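  • As a concrete illustration, the snippet below is a minimal sketch of this segmentation step; the function and variable names are illustrative, not from the patent.

```python
# A minimal sketch: cut the speech timeline at each role change point.
def split_by_change_points(duration, change_points):
    """Cut [0, duration] at each change point, yielding one segment per speaking turn."""
    bounds = [0.0] + sorted(change_points) + [duration]
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]

# The example above: a 30-second speech with changes at seconds 5, 15, and 20.
print(split_by_change_points(30.0, [5.0, 15.0, 20.0]))
# [(0.0, 5.0), (5.0, 15.0), (15.0, 20.0), (20.0, 30.0)]
```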
  • multiple speech segments may be divided into long segments and short segments, which are respectively recorded as a first segment and a second segment.
  • the length of any first segment may be greater than the length of any second segment.
  • the length division can be set according to actual needs. For example, a segment longer than 5 seconds can be considered as the first segment, and a segment shorter than or equal to 5 seconds can be considered as the second segment.
  • Step 402 Clustering a plurality of first segments, and assigning at least one second segment to a category obtained after clustering, to obtain a role separation result of the speech to be processed.
  • multiple first segments can be clustered first, and the obtained clustering result can include multiple categories and the category center of each category, wherein the number of categories represents the number of roles corresponding to the speech to be processed, and the category center of each category represents the centroid of the first segments belonging to that category.
  • Each second segment can then be assigned to the clustering result: for each second segment, it is determined which of the multiple categories the segment is closest to, and the segment is assigned to that closest category.
  • FIG. 5 is a schematic diagram of a principle of role separation provided by an embodiment of the present application.
  • the speech to be processed can be divided into 10 segments based on the role change point information, which are respectively recorded as segment 1 to segment 10.
  • Segments 1–3, 5, and 8–10 have longer durations and are first segments; segments 4, 6, and 7 are shorter and are second segments.
  • The first segments are clustered into three categories: segments 1 and 10 belong to category 1; segments 3, 5, and 9 belong to category 2; and segments 2 and 8 belong to category 3.
  • a plurality of second segments are then assigned to these three categories, wherein segments 4 and 6 belong to category 1, and segment 7 belongs to category 2.
  • Categories 1 to 3 can correspond to roles A, B, and C respectively. From the clustering result and the allocation result, the role corresponding to each part of the speech to be processed is obtained and the speech is marked accordingly, which facilitates subsequent operations such as speech-to-text conversion and enhances the conference effect.
  • The speech processing method segments the speech to be processed according to the role change point information in the speech to be processed to obtain multiple speech segments, wherein the role change point information is used to indicate the positions where the speaking role in the speech to be processed changes, the plurality of speech segments include a plurality of first segments and at least one second segment, and the length of any first segment is greater than the length of any second segment. The plurality of first segments are clustered, and the at least one second segment is assigned to the categories obtained after clustering, so that the classification of the second segments can be guided by the clustering result of the first segments. This greatly reduces fragmentation, significantly improves the user experience, and does not depend on a threshold to determine the clustering termination condition; the method is more robust in different environments and effectively improves the accuracy and stability of role separation.
  • In an optional implementation, segmenting the speech to be processed to obtain a plurality of speech segments may include: determining at least one valid speech segment in the speech to be processed through voice activity detection; and performing role change point detection on the valid speech segments, dividing the at least one valid speech segment into the plurality of speech segments according to the obtained role change point information, wherein each speech segment is the speech corresponding to a single role.
  • Voice activity detection (VAD), also known as voice activity endpoint detection, can determine when a speaker starts and stops speaking, so that invalid speech in the speech to be processed can be eliminated to obtain at least one valid speech segment.
  • Change point detection (CPD) can detect the positions where the speaking role changes in the speech.
  • the at least one valid voice segment may be further divided into multiple voice segments, and each voice segment may be regarded as a speech segment of a single character.
  • In this way, the speech to be processed can be quickly divided into multiple speech segments: the invalid speech in the speech to be processed is eliminated, and the valid speech segments are further divided according to the positions of role changes, improving the accuracy and efficiency of subsequent clustering operations.
  • Alternatively, role change point detection may be performed first to divide the speech to be processed into at least one speech segment, which is then further segmented through voice activity endpoint detection to obtain the multiple speech segments; or voice activity endpoint detection may be omitted, and the speech to be processed divided into the plurality of speech segments directly through role change point detection.
  • In an optional implementation, performing role change point detection on the valid speech segment may include: determining at least one speech window corresponding to the valid speech segment based on a preset window length and/or sliding duration, and extracting features of each speech window; and determining the role change point information according to the similarity of the features of adjacent speech windows.
  • FIG. 6 is a schematic diagram of a principle of determining a speech window provided by an embodiment of the present application.
  • the valid speech segment can be divided into multiple speech windows according to the preset window length and sliding duration. For example, with a window length of 1.5 seconds and a sliding duration of 0.75 seconds, a valid speech segment of 4.5 seconds can be divided into five speech windows, 0–1.5 s, 0.75–2.25 s, 1.5–3 s, 2.25–3.75 s, and 3–4.5 s, recorded as speech windows 1–5. Adjacent speech windows overlap by 0.75 seconds.
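  • A sketch of this windowing scheme, assuming the 1.5 s window length and 0.75 s sliding duration from the example:

```python
# A sketch of sliding-window division; window length and hop follow the example above.
def sliding_windows(start, end, win=1.5, hop=0.75):
    windows = []
    t = start
    while t + win <= end + 1e-9:  # small tolerance for float rounding
        windows.append((t, t + win))
        t += hop
    return windows

print(sliding_windows(0.0, 4.5))
# [(0.0, 1.5), (0.75, 2.25), (1.5, 3.0), (2.25, 3.75), (3.0, 4.5)]
```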
  • the features corresponding to each speech window can be extracted.
  • For example, the embedding feature of each speech window can be extracted by methods such as xvector (a neural-network-based embedded vector representation).
  • the similarity is calculated for the features of two adjacent speech windows, and the role change point can be detected according to the similarity.
  • If the similarity between two adjacent speech windows is less than a similarity threshold, a role change may have occurred between them.
  • For example, if the similarities between speech windows 1 and 2 and between windows 2 and 3 are greater than the similarity threshold, the similarity between windows 4 and 5 is also greater than the threshold, and only the similarity between windows 3 and 4 is below the threshold, then a role change can be considered to have occurred between windows 3 and 4, and the valid speech segment is further divided into two speech segments containing windows 1–3 and windows 4–5 respectively.
  • Alternatively, the valid speech segments can be divided based only on the preset window length, with no overlap between adjacent speech windows; or the valid speech segments can be divided based only on the preset sliding duration, in which case the window length of each speech window may not be fixed.
  • the specific values of the preset window length and the sliding duration may be adjusted according to actual needs, which is not limited in this embodiment of the present application.
  • The role change point information can be determined according to the similarity of the features of adjacent speech windows, so that role change points are detected from the continuous change of features within the valid speech segment, improving detection accuracy.
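  • A minimal sketch of this detection rule follows; the embeddings array and the 0.6 threshold are illustrative assumptions, not values prescribed by the patent.

```python
import numpy as np

def detect_change_points(embeddings, sim_threshold=0.6):
    """embeddings: (N, D) array of per-window features (e.g., 512-dim xvectors).
    Returns indices i where a role change is suspected between windows i and i+1."""
    changes = []
    for i in range(len(embeddings) - 1):
        a, b = embeddings[i], embeddings[i + 1]
        cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
        if cos < sim_threshold:  # low similarity between adjacent windows
            changes.append(i)
    return changes
```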
  • features of valid speech segments may be extracted in parallel.
  • In an optional implementation, determining at least one speech window corresponding to the valid speech segment based on the preset window length and/or sliding duration, and extracting the features of the speech windows, may include: using multithreading to process the valid speech segments in parallel, where for each valid speech segment, at least one speech window is determined based on the preset window length and/or sliding duration, and the features of each speech window are extracted.
  • multiple threads may be used, and each thread processes one or more effective speech segments.
  • Each thread divides the effective speech segment to be processed into multiple speech windows and extracts the features of each speech window.
  • multiple threads may also be used to process multiple speech windows in parallel, so as to further improve the efficiency of feature extraction.
  • In an optional implementation, dividing the at least one valid speech segment into the plurality of speech segments may include: splicing the features obtained after the parallel processing in chronological order and, combined with the role change point information, dividing the at least one valid speech segment into the plurality of speech segments.
  • time information can be carried during parallel processing, and the time information can be the position or serial number of each valid speech segment in the entire speech to be processed.
  • After the parallel processing is completed, the obtained features can be spliced in time order and, combined with the role change point information, divided into multiple speech segments for clustering or allocation, thereby effectively improving the processing speed.
  • multiple effective speech segments may also be sequentially processed, so that time information is not required, and after all effective speech segments are processed, features of multiple speech windows arranged in time order are directly obtained.
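  • A sketch of the parallel scheme with index-based time information might look as follows; extract_window_features is a placeholder for the xvector-style extractor, not an API from the patent.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def extract_all(valid_segments, extract_window_features, max_workers=4):
    """Extract window features for each valid segment in parallel, then splice
    the results back in chronological (segment-index) order."""
    features_by_index = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(extract_window_features, seg): i
                   for i, seg in enumerate(valid_segments)}
        for fut in as_completed(futures):  # completion order is arbitrary
            features_by_index[futures[fut]] = fut.result()
    return [features_by_index[i] for i in range(len(valid_segments))]
```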
  • a post-processing operation may be performed on the speech segments before clustering.
  • For example, if a speech segment contains fewer speech windows than a preset threshold, the speech segment can be merged with an adjacent speech segment, and the first segments and second segments are then distinguished among the multiple speech segments obtained after the merging operation.
  • For example, the preset threshold can be 2: after multiple speech segments are obtained through VAD and CPD segmentation, any speech segment containing only a single speech window is merged with the previous or the next speech segment. After merging, the resulting speech segments are divided into first segments and second segments for clustering and allocation, which reduces fragmented speech segments and further improves clustering accuracy.
  • For each of the multiple speech segments, a number threshold can be used to determine whether it is a first segment or a second segment: if the number of speech windows contained in the speech segment reaches the number threshold, the speech segment is a first segment; if it is less than the number threshold, the speech segment is a second segment.
  • the number threshold may be 5, and if a certain speech segment contains more than 5 speech windows, then the speech segment is the first segment, otherwise, it is the second segment. Speech segments can be quickly and accurately divided by the number threshold.
  • The threshold may also be dynamically adjusted according to the result of speech segmentation. For example, if the median number of speech windows across the multiple speech segments is k, the number threshold can be set to 0.5k, so that the threshold dividing long and short segments adapts to the actual situation of different speeches to be processed, meeting application requirements in different environments and improving applicability.
  • Alternatively, the first segments and second segments can be divided by proportion; for example, the longest 70% of segments are taken as first segments and the remaining 30% as second segments, avoiding too many or too few first segments, which would affect the subsequent clustering and allocation effects.
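  • A sketch of the long/short split, covering both the fixed threshold of 5 windows and the dynamic 0.5k median variant described above (the exact tie-breaking at the threshold is an assumption):

```python
import statistics

def split_long_short(window_counts, threshold=5):
    """window_counts[i] = number of speech windows in segment i.
    Pass threshold=None to use the dynamic 0.5 * median rule."""
    if threshold is None:
        threshold = max(1, int(0.5 * statistics.median(window_counts)))
    first = [i for i, n in enumerate(window_counts) if n >= threshold]
    second = [i for i, n in enumerate(window_counts) if n < threshold]
    return first, second
```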
  • In an optional implementation, clustering the plurality of first segments and assigning the at least one second segment to the categories obtained after clustering may include: for each first segment, averaging the features of the at least one speech window corresponding to that first segment to obtain the feature of the first segment, and clustering the multiple first segments according to the features of the multiple first segments; and for each second segment, averaging the features of the at least one speech window corresponding to that second segment to obtain the feature of the second segment, and assigning the at least one second segment to the categories obtained after clustering according to the features of the at least one second segment.
  • For example, the embedding feature of each 1.5-second speech window can be a 512-dimensional vector. Each first segment contains at least one speech window, and the features of these windows are averaged to obtain a 512-dimensional vector that characterizes the first segment as a whole.
  • Similarly, the feature of a second segment as a whole may be the average of the features of the at least one speech window it contains. Extracting features per speech window and then computing segment-level features in this way makes the final features reflect the speech characteristics of the first and second segments more accurately, so clustering and allocation based on these features can be more accurate.
  • features may be directly extracted from the speech segment without using the speech window, and the step of calculating the mean value may be omitted.
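  • The segment-level feature described above reduces to a mean over window embeddings, e.g.:

```python
import numpy as np

def segment_feature(window_embeddings):
    """window_embeddings: (num_windows, 512) -> (512,) mean vector
    characterizing the segment as a whole."""
    return np.asarray(window_embeddings).mean(axis=0)
```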
  • Optionally, one or more embodiments of the present application may be applied to realize unsupervised role separation, where unsupervised role separation may refer to obtaining the number of roles in the speech and the time information of each role's speaking without knowing the number of roles in advance.
  • each optional number of categories can be traversed, the clustering results under each number of categories can be determined in turn, and the final clustering result can be selected from them, so as to realize overall unsupervised role separation.
  • FIG. 7 is a schematic flow chart of a clustering method provided by an embodiment of the present application. As shown in Figure 7, clustering the multiple first fragments may include:
  • Step 701 Traversing 2 to a preset number of categories, clustering the plurality of first segments by using a supervised clustering algorithm under the traversed number of categories, and obtaining a clustering result corresponding to the number of categories.
  • the number of preset categories can be set according to actual needs.
  • The preset number of categories is denoted as M, where M is a positive integer greater than 2. Traversing 2 to M, each traversed value is used as the number of categories for supervised clustering, and the clustering result under that number of categories is obtained; the clustering result represents the categories obtained under that number of categories and the category center of each category.
  • kmeans (k-means) clustering algorithm may be used to implement clustering of the plurality of first fragments.
  • For example, 2 can first be selected as the number of categories for the kmeans algorithm; the category centers of the two categories are initialized and clustering is performed, and the resulting clustering result indicates which of the two categories each first segment belongs to, as well as the category centers determined after clustering. Similarly, 3 is selected as the number of categories to obtain the corresponding clustering result, and so on, until the clustering results corresponding to every number of categories from 2 to M are obtained.
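  • A sketch of this traversal using scikit-learn's KMeans (the library choice is an assumption; the patent only requires a kmeans-style algorithm):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_for_each_k(first_feats, max_roles):
    """Run k-means for every candidate role count k in 2..M and keep each result."""
    results = {}
    for k in range(2, max_roles + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(first_feats)
        results[k] = (km.labels_, km.cluster_centers_)
    return results
```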
  • Step 702: According to the clustering results corresponding to different numbers of categories, determine the number of roles corresponding to the speech to be processed and the final clustering result.
  • determining the number of characters corresponding to the speech to be processed and the clustering results may be implemented in the following manner.
  • the requirement can be set according to actual needs, for example, the inter-class distance is greater than the intra-class distance, or the ratio of the inter-class distance to the intra-class distance is within a preset range.
  • First, it is checked whether the clustering result corresponding to the preset number of categories M meets the requirement. Specifically, the intra-class distance and inter-class distance of the M categories in the clustering result can be calculated; if the inter-class distance is greater than the intra-class distance, the requirement is met, this clustering result is the final clustering result, the number of roles corresponding to the speech to be processed is M, and each role corresponds to one category.
  • Otherwise, the requirement is not met, and it is next checked whether the clustering result corresponding to M-1 meets the requirement; if it does, it is the final clustering result; otherwise, M-2 is checked, and so on, until a result meeting the requirement is obtained.
  • the final clustering result can be made more accurate and the clustering accuracy can be improved.
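  • A hedged sketch of this selection rule follows; the specific distance definitions (mean point-to-own-center for intra-class, mean center-to-center for inter-class) are assumptions consistent with, but not prescribed by, the description above.

```python
import numpy as np

def pick_num_roles(results, feats):
    """Starting from the largest candidate k, accept the first clustering whose
    inter-class distance exceeds its intra-class distance."""
    for k in sorted(results, reverse=True):
        labels, centers = results[k]
        intra = np.mean([np.linalg.norm(feats[i] - centers[labels[i]])
                         for i in range(len(feats))])
        inter = np.mean([np.linalg.norm(centers[a] - centers[b])
                         for a in range(k) for b in range(a + 1, k)])
        if inter > intra:
            return k
    return 2  # fall back to the smallest candidate
```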
  • In this embodiment, the plurality of first segments are clustered by a supervised clustering algorithm under each traversed number of categories to obtain the clustering result corresponding to that number of categories, and the number of roles corresponding to the speech to be processed and the final clustering result are determined from the clustering results corresponding to different numbers of categories, so that unsupervised role separation can be realized quickly and accurately without knowing the number of roles in advance.
  • Alternatively, the traversal can be omitted: the speech to be processed can be analyzed through a neural network model to obtain the number of roles, and clustering is performed based on that number of roles, achieving role separation that is unsupervised overall.
  • one or more embodiments of the present application can also be applied to implement supervised role separation.
  • For example, the number of roles may be input by the user, or determined according to the conference information, and clustering is then performed based on that number of roles, realizing supervised role separation.
  • In an optional implementation, assigning the at least one second segment to the categories obtained after clustering may include: for each second segment, assigning the segment to the corresponding category according to the similarity between the second segment and each category center in the clustering result of the speech to be processed.
  • For example, the feature of a first segment may be a 512-dimensional vector; after clustering the multiple first segments, each obtained category center represents the centroid of the first segments in that category and can likewise be represented as a 512-dimensional vector.
  • The feature of a second segment, that is, a 512-dimensional vector, can then be used to calculate the similarity with each category center, and the category to which the second segment belongs is determined according to the similarity.
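  • A sketch of this assignment step using cosine similarity (the similarity measure is an assumption; the patent does not fix one):

```python
import numpy as np

def assign_to_centers(second_feats, centers):
    """Return, for each second segment, the index of its most similar category center."""
    s = second_feats / (np.linalg.norm(second_feats, axis=1, keepdims=True) + 1e-12)
    c = centers / (np.linalg.norm(centers, axis=1, keepdims=True) + 1e-12)
    return np.argmax(s @ c.T, axis=1)
```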
  • a post-processing operation may be performed on the speech segments after clustering.
  • After determining the role corresponding to each speech segment, if there is a speech segment whose duration is shorter than a preset duration and whose two adjacent speech segments both correspond to the same role, the role of that speech segment is changed to the role of the adjacent segments, and the segment is merged with the two adjacent speech segments.
  • For example, the preset duration may be 0.5 seconds. After the clustering and allocation operations, if a speech segment shorter than 0.5 seconds corresponds to role A while both its previous and next speech segments correspond to role B, the role of that segment can be changed from A to B, achieving smoothing of the role separation result and improving the user experience.
  • Alternatively, if the previous and next segments correspond to different roles, the speech segment can be merged with either the previous or the next speech segment according to feature similarity.
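  • A sketch of this smoothing pass, using the 0.5-second example duration; the segment dictionary layout is illustrative.

```python
def smooth_roles(segments, min_dur=0.5):
    """segments: list of {'start', 'end', 'role'} in time order; relabel a very
    short segment sandwiched between two segments of the same role."""
    for i in range(1, len(segments) - 1):
        seg, prev, nxt = segments[i], segments[i - 1], segments[i + 1]
        if (seg["end"] - seg["start"] < min_dur
                and prev["role"] == nxt["role"]
                and seg["role"] != prev["role"]):
            seg["role"] = prev["role"]  # e.g., change A to B as in the example
    return segments
```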
  • FIG. 8 is a schematic flowchart of another voice processing method provided by the embodiment of the present application. As shown in Figure 8, the method of parallel feature extraction plus first clustering and then assignment can be used to realize role separation, which may specifically include the following steps.
  • Step a Perform VAD on the speech to be processed, remove invalid speech from the speech, and obtain valid speech segments.
  • valid voice segments may include VAD segment 1, VAD segment 2, ..., VAD segment n.
  • Step b Extracting embedded features for each effective speech segment.
  • A parallel processing manner may be used: each valid speech segment is divided into windows with a window length of 1.5 seconds and a sliding duration of 0.75 seconds, and xvector is used to extract the embedding feature of each speech window.
  • Step c Perform CPD detection on each VAD segment to obtain role change point information in the VAD segment.
  • CPD detection may be implemented by using the embedding features of adjacent speech windows.
  • a post-processing operation can be performed to correct the speech segment obtained by the VAD plus CPD segmentation. After correction, the features corresponding to the speech segment can be obtained.
  • In this way, the features corresponding to each speech segment in VAD segment 1, VAD segment 2, ..., VAD segment n can be obtained.
  • Step d Splicing the parallelized features in chronological order, and combining the role change point information to obtain a plurality of speech segments, and the plurality of speech segments are classified according to the number of speech windows.
  • this step may include feature splicing, merging and segmentation.
  • splicing may refer to splicing multiple features obtained through parallel processing in chronological order
  • Merging and segmenting may refer to re-segmenting the merged features according to role change points to obtain multiple speech segments. According to the number of speech windows contained in each speech segment, the segments are divided into long segments and short segments, corresponding respectively to the aforementioned first segments and second segments.
  • Step e calculate the mean value of the long segments, and traverse from 2 to the maximum number of roles to perform supervised kmeans clustering.
  • Specifically, the features of the speech windows contained in each long segment obtained in step d can be averaged to obtain the feature of that long segment, and the kmeans clustering algorithm is then run for each candidate Speakercount to obtain the clustering results.
  • Speakercount can refer to the number of speakers, that is, the number of roles, and can be traversed from 2 to the maximum number of roles (ie, the number of preset categories) for supervised kmeans clustering.
  • Step f using the clustering result to determine the number of roles.
  • the inter-class distance and intra-class distance of the clustering results under different categories can be calculated.
  • The number of categories and the clustering result that meet the requirement are taken as the final result.
  • Step g distribute the short segments to the category center obtained in step f according to the similarity.
  • The short segments obtained in step d can have the features of their contained speech windows averaged to obtain the feature of each short segment; according to the similarity between that feature and each category center, the short segment is assigned to the corresponding category center, yielding the assignment result.
  • Step h Post-process the results, updating points that are inconsistent with the preceding and following role information.
  • the categories corresponding to each speech segment can be obtained, and each category corresponds to a role ID.
  • Then, a post-processing operation can be performed: for very short speech segments (for example, shorter than 0.5 seconds) whose preceding and following segments correspond to the same role, the corresponding role is updated accordingly.
  • To sum up, this solution classifies segments according to continuous duration (for example, with 5 speech windows as the boundary), first clusters the long segments and then assigns the short segments to the cluster centers, and updates points whose results are inconsistent with the surrounding context through post-processing operations, which greatly reduces fragmentation and improves the user experience. Moreover, this solution avoids using thresholds to determine the clustering termination condition, so its effect is more stable and it is more robust in different environments. On the same test set, the role separation accuracy of the traditional method is about 65%, while the separation accuracy of this solution can reach 92%.
  • In other embodiments, the embedding feature extraction method can adopt different neural network structures, such as TDNN (Time Delay Neural Network) or ResNet; the clustering method can use kmeans or other clustering methods, such as AHC (agglomerative hierarchical clustering) or various community clustering methods.
  • FIG. 9 is a schematic flowchart of another voice processing method provided by the embodiment of the present application. As shown in Figure 9, the method includes:
  • Step 901 Segment the speech to be processed to obtain a plurality of speech segments; wherein, the plurality of speech segments include a plurality of first segments and at least one second segment whose reliability is lower than that of the first segment.
  • the credibility of the speech segment is used to characterize the credibility of a clustering result obtained by clustering based on the speech segment.
  • the credibility of the speech segment may be determined by at least one of the length of the speech segment, the position of the speech segment in the speech to be processed, and a deep learning model.
  • Among the plurality of speech segments, those whose credibility is greater than a preset value are classified as first segments, and those whose credibility is lower than the preset value are classified as second segments.
  • the credibility may be determined by the length of the speech segment. The longer the length, the higher the trustworthiness, and the shorter the length, the lower the trustworthiness.
  • the plurality of speech segments may be divided into a plurality of first segments and at least one second segment according to length, and the length of any first segment is greater than the length of any second segment.
  • the length can be represented by the duration of the speech segment or the number of contained speech windows.
  • the speech to be processed may be segmented according to the role change point information in the speech to be processed to obtain a plurality of speech segments, and then the first segment and the second segment are distinguished.
  • the credibility of the speech segment may be determined according to the position of the speech segment in the speech to be processed. For example, it may be noisy at the beginning and end of a meeting, so the credibility of speech segments at the start and end positions may be less than that of other positions.
  • the user may also input the position of the speech segment with low reliability.
  • the user can input the position of each stage of the meeting in the speech to be processed.
  • For example, the credibility of the discussion stage is lower than that of the individual-speech stage, so the more suitable segments among the multiple speech segments can be selected for clustering, and the other segments are then assigned to the clustering result; this offers faster processing and can meet the needs of different conference scenarios.
  • the credibility of each speech segment may be calculated through a deep learning model.
  • the deep learning model can be trained through training samples, which can include speech samples and corresponding labels, and the labels can be obtained by manual marking. After the training is completed, the speech to be processed can be input into the deep learning model to determine the corresponding credibility. The credibility of speech clips can be determined more quickly and accurately through deep learning models.
  • In addition, the credibility may also be determined by combining at least two of: the duration of the speech segment, the position of the speech segment in the speech to be processed, and a deep learning model.
  • In one example, the duration and the position of a speech segment can be analyzed jointly: only if both the duration and the position meet certain requirements is the segment classified as a first segment; otherwise, it is classified as a second segment.
  • In another example, the duration can be combined with the deep learning model: only segments whose duration exceeds a certain threshold are sent to the deep learning model for credibility prediction, and the prediction result decides whether such a segment is a first segment or a second segment; shorter segments are directly classified as second segments.
  • In yet another example, the duration, the position, and the deep learning model can all be combined: only if the duration and the position meet certain requirements is the segment sent to the deep learning model for credibility prediction, with the prediction result deciding whether it is a first segment or a second segment; segments whose duration or position does not meet the requirements are directly classified as second segments (see the sketch after the next paragraph).
  • By analyzing the duration, the position, and a deep learning model together, the credibility of a speech segment can be determined more accurately, which improves the effect of the subsequent clustering and assignment.
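  • A hedged sketch of that combined decision cascade follows. The thresholds (1.0 s minimum duration, exclusion of the first and last 5% of the recording, 0.5 score cut-off) are illustrative assumptions; score_fn stands for any credibility scorer such as the model above, and Segment is the type from the first sketch.

```python
def is_first_segment(seg, total_duration, score_fn,
                     min_dur=1.0, edge_ratio=0.05, score_threshold=0.5):
    # Gate 1: duration requirement.
    if seg.duration < min_dur:
        return False
    # Gate 2: position requirement - skip the noisy head and tail.
    if (seg.start < edge_ratio * total_duration
            or seg.end > (1.0 - edge_ratio) * total_duration):
        return False
    # Gate 3: only segments passing both gates reach the model.
    return score_fn(seg) >= score_threshold
```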
  • Step 902: Cluster the plurality of first segments, and assign the at least one second segment to the categories obtained after clustering, to obtain the role separation result of the speech to be processed.
  • The speech processing method provided in this embodiment can segment the speech to be processed to obtain multiple speech segments, where the credibility of a speech segment characterizes how reliable a clustering result obtained by clustering based on that segment is, and the multiple speech segments include multiple first segments and at least one second segment whose credibility is lower than that of the first segments. The method clusters the multiple first segments and assigns the at least one second segment to the categories obtained after clustering, yielding the role separation result of the speech to be processed. In this way, the clustering result of the higher-credibility segments guides the classification of the lower-credibility segments, which greatly reduces fragmentation and noticeably improves the user experience. Moreover, the method does not depend on a threshold to decide the clustering termination condition, so it is more robust in different environments and effectively improves the accuracy and stability of role separation.
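  • A minimal sketch of the cluster-then-assign step is given below, using scikit-learn KMeans (kmeans is one of the clustering algorithms named in the embodiments) and cosine similarity to the category centers. The per-segment features are assumed to be mean window embeddings; the embedding extractor itself is outside this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

def separate_roles(first_feats, second_feats, n_roles):
    """first_feats: (n_first, d) mean embeddings of the long segments;
    second_feats: (n_second, d) mean embeddings of the short segments."""
    km = KMeans(n_clusters=n_roles, n_init=10, random_state=0)
    first_labels = km.fit_predict(first_feats)
    centers = km.cluster_centers_

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

    # Each short segment goes to the most similar category center.
    second_labels = np.array([
        np.argmax([cosine(f, c) for c in centers]) for f in second_feats
    ])
    return first_labels, second_labels
```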
  • FIG. 10 is a schematic structural diagram of a speech processing device provided by an embodiment of the present application. As shown in Figure 10, the speech processing device may include:
  • An acquisition module 1001, configured to acquire the single-channel speech corresponding to multiple participating roles collected by the conference system;
  • A first segmentation module 1002, configured to segment the single-channel speech according to the role change point information in the single-channel speech to obtain multiple speech segments; where the role change point information is used to indicate the positions in the single-channel speech at which the speaking role changes, the plurality of speech segments include a plurality of first segments and at least one second segment, and the length of any first segment is greater than the length of any second segment;
  • A first processing module 1003, configured to cluster the plurality of first segments and assign the at least one second segment to the categories obtained after clustering, to obtain the role separation result of the single-channel speech;
  • An output module 1004, configured to output the speech text corresponding to each participating role according to the role separation result and the text information corresponding to the single-channel speech.
  • the voice processing device provided in this embodiment can be used to execute the technical solutions provided in the embodiments shown in FIG. 1 to FIG. 3 , and its implementation principles and technical effects are similar, and will not be repeated here.
  • FIG. 11 is a schematic structural diagram of another speech processing device provided by an embodiment of the present application. As shown in Figure 11, the speech processing device may include:
  • A second segmentation module 1101, configured to segment the speech to be processed according to the role change point information in the speech to be processed to obtain multiple speech segments; where the role change point information is used to indicate the positions in the speech to be processed at which the speaking role changes, the plurality of speech segments include a plurality of first segments and at least one second segment, and the length of any first segment is greater than the length of any second segment;
  • A second processing module 1102, configured to cluster the plurality of first segments and assign the at least one second segment to the categories obtained after clustering, to obtain the role separation result of the speech to be processed.
  • In one or more embodiments of the present application, the second segmentation module 1101 is specifically configured to: determine at least one valid speech segment in the speech to be processed through voice activity detection; and perform role change point detection on the valid speech segments, dividing the at least one valid speech segment into the plurality of speech segments according to the obtained role change point information, where each speech segment is the speech corresponding to a single role.
  • In one or more embodiments of the present application, when performing role change point detection on a valid speech segment, the second segmentation module 1101 is specifically configured to: determine at least one speech window corresponding to the valid speech segment based on a preset window length and/or sliding duration, and extract the features of the speech windows; and determine the role change point information according to the similarity of the features of adjacent speech windows.
  • In one or more embodiments of the present application, when determining at least one speech window corresponding to a valid speech segment based on the preset window length and/or sliding duration and extracting its features, the second segmentation module 1101 is specifically configured to: use multithreading to process the valid speech segments in parallel, and, for each valid speech segment, determine at least one speech window corresponding to that segment based on the preset window length and/or sliding duration and extract the features of the speech windows. In one or more embodiments of the present application, when dividing the at least one valid speech segment into the plurality of speech segments according to the obtained role change point information, the second segmentation module 1101 is specifically configured to: splice the features obtained after the parallel processing in time order, and divide the at least one valid speech segment into the plurality of speech segments in combination with the role change point information.
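  • The sketch below illustrates this windowing and parallel feature extraction, using the 1.5 s window and 0.75 s hop mentioned in the embodiments. extract_embedding is a placeholder for the x-vector extractor, not a real API.

```python
from concurrent.futures import ThreadPoolExecutor

def window_bounds(seg_start, seg_end, win=1.5, hop=0.75):
    """A 4.5 s segment yields five windows: 0-1.5, 0.75-2.25, ..., 3-4.5."""
    t, bounds = seg_start, []
    while t + win <= seg_end:
        bounds.append((t, t + win))
        t += hop
    return bounds

def featurize_segments(valid_segments, extract_embedding, workers=4):
    """valid_segments: time-ordered (start, end) pairs from VAD."""
    def featurize(seg):
        start, end = seg
        return [extract_embedding(ws, we)
                for ws, we in window_bounds(start, end)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves submission order, i.e. time order,
        # so the per-segment features can be spliced directly.
        return list(pool.map(featurize, valid_segments))
```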
  • In one or more embodiments of the present application, if the number of speech windows contained in a speech segment is greater than a number threshold, the speech segment is a first segment; if the number of speech windows contained in the speech segment is less than the number threshold, the speech segment is a second segment.
  • In one or more embodiments of the present application, the second processing module 1102 is specifically configured to: for each first segment, average the features of the at least one speech window corresponding to the first segment to obtain the feature corresponding to the first segment, and cluster the plurality of first segments according to the features corresponding to the first segments; and, for each second segment, average the features of the at least one speech window corresponding to the second segment to obtain the feature corresponding to the second segment, and assign the at least one second segment to the categories obtained after clustering according to the feature corresponding to the at least one second segment.
  • In one or more embodiments of the present application, when clustering the plurality of first segments, the second processing module 1102 is specifically configured to: traverse from 2 to a preset number of categories, and cluster the plurality of first segments through a supervised clustering algorithm under each traversed number of categories to obtain the clustering result corresponding to that number of categories; and determine the number of roles and the clustering result corresponding to the speech to be processed according to the clustering results corresponding to the different numbers of categories.
  • In one or more embodiments of the present application, when determining the number of roles and the clustering result corresponding to the speech to be processed according to the clustering results corresponding to different numbers of categories, the second processing module 1102 is specifically configured to: set the current number of categories to the preset number of categories, and repeat the following steps until the final clustering result is obtained: compute the inter-class distance and the intra-class distance of the clustering result under the current number of categories; if the inter-class distance and the intra-class distance meet the requirement, the number of roles corresponding to the speech to be processed is the current number of categories, and the final clustering result is the clustering result under the current number of categories; if the inter-class distance and the intra-class distance do not meet the requirement, decrement the current number of categories by one.
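  • A sketch of this role-count selection follows: starting from the preset maximum, the first category count whose inter-class distance exceeds the intra-class distance is accepted. The acceptance rule mirrors the text; the concrete distance definitions (mean pairwise distance between centers, mean distance of points to their center) are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def choose_role_count(feats, max_roles):
    """feats: (n_first, d) features of the first (long) segments."""
    for k in range(max_roles, 1, -1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(feats)
        centers, labels = km.cluster_centers_, km.labels_
        inter = np.mean([np.linalg.norm(centers[i] - centers[j])
                         for i in range(k) for j in range(i + 1, k)])
        intra = np.mean([np.linalg.norm(f - centers[l])
                         for f, l in zip(feats, labels)])
        if inter > intra:  # requirement met: accept this category count
            return k, labels
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(feats)
    return 2, km.labels_   # fall back to two roles
```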
  • In one or more embodiments of the present application, when assigning the at least one second segment to the categories obtained after clustering, the second processing module 1102 is specifically configured to: assign each second segment to the corresponding category according to the similarity between the second segment and each category center in the clustering result of the speech to be processed.
  • In one or more embodiments of the present application, the second processing module 1102 is further configured to: if, among the multiple speech segments obtained by segmentation, there is a speech segment whose number of speech windows is less than a preset threshold, merge that speech segment with an adjacent speech segment, and distinguish the first segments and the second segments according to the speech segments obtained after the merging operation; and/or, after the role corresponding to each speech segment is determined, if there is a speech segment whose duration is less than a preset duration and the two speech segments adjacent to it correspond to the same role, merge that speech segment with the two adjacent speech segments.
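  • The post-processing smoothing can be sketched as below: a labelled segment shorter than 0.5 s (the example duration given in the embodiments) that is flanked by two segments of the same role is relabelled, and the run is merged. Representing segments as (start, end, role) tuples is an assumption for illustration.

```python
def smooth(labelled, min_dur=0.5):
    """labelled: time-ordered (start, end, role) tuples."""
    segs = list(labelled)
    if not segs:
        return []
    # Pass 1: relabel very short segments flanked by the same role.
    for i in range(1, len(segs) - 1):
        start, end, role = segs[i]
        if end - start < min_dur and segs[i - 1][2] == segs[i + 1][2]:
            segs[i] = (start, end, segs[i - 1][2])
    # Pass 2: merge adjacent segments that now share a role.
    merged = [segs[0]]
    for start, end, role in segs[1:]:
        if merged[-1][2] == role:
            merged[-1] = (merged[-1][0], end, role)
        else:
            merged.append((start, end, role))
    return merged
```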
  • the voice processing device provided in this embodiment can be used to implement the technical solutions provided in the embodiments shown in FIG. 4 to FIG. 8 , and its implementation principles and technical effects are similar, and will not be repeated here.
  • FIG. 12 is a schematic structural diagram of another speech processing device provided by an embodiment of the present application. As shown in Figure 12, the speech processing device may include:
  • A third segmentation module 1201, configured to segment the speech to be processed to obtain multiple speech segments; where the plurality of speech segments include a plurality of first segments and at least one second segment whose credibility is lower than that of the first segments;
  • the third processing module 1202 is configured to cluster the plurality of first segments, and assign the at least one second segment to a category obtained after clustering, to obtain a role separation result of the speech to be processed;
  • the credibility of the speech segment is used to characterize the credibility of a clustering result obtained by clustering based on the speech segment.
  • In one or more embodiments of the present application, the third segmentation module 1201 is further configured to: determine the credibility of a speech segment through at least one of the length of the speech segment, the position of the speech segment in the speech to be processed, and a deep learning model.
  • the speech processing device provided in this embodiment can be used to implement the technical solution provided in the embodiment shown in FIG. 9 , and its implementation principle and technical effect are similar, and will not be repeated here.
  • FIG. 13 is a schematic structural diagram of a speech processing device provided by an embodiment of the present application.
  • As shown in FIG. 13, the speech processing device of this embodiment may include: at least one processor 1301; and a memory 1302 communicatively connected to the at least one processor; where the memory 1302 stores instructions executable by the at least one processor 1301, and the instructions are executed by the at least one processor 1301 to enable the speech processing device to execute the method described in any one of the foregoing embodiments.
  • Optionally, the memory 1302 may be independent of, or integrated with, the processor 1301.
  • FIG. 14 is a schematic structural diagram of another speech processing device provided by an embodiment of the present application.
  • the voice processing device of this embodiment may include: a processing device 1402 and at least one of the following items communicatively connected to the processing device: a voice input device 1401, a display device 1403;
  • The voice input device 1401 is used to collect the speech to be analyzed and send it to the processing device 1402; the display device 1403 is used to display the role separation result determined by the processing device 1402 and/or the speech-to-text information determined through the role separation result; and the processing device 1402 is configured to execute the speech processing method described in any one of the foregoing embodiments.
  • Optionally, the voice input device 1401 may be a device capable of collecting speech, such as a microphone, and the display device 1403 may be a device with a display function, such as a display screen.
  • Optionally, the processing device 1402, the voice input device 1401, and the display device 1403 may be integrated together or provided separately.
  • the voice input device 1401 , the display device 1403 and the processing device 1402 may implement a communication connection in a wired or wireless manner.
  • The display device 1403 may display the role separation result determined by the processing device 1402, for example, showing which role is speaking from which second to which second; or it may display the speech-to-text information determined through the role separation result. The speech-to-text information may be text information that includes the role separation result, where the text information is the text corresponding to the speech to be processed; for example, the speech-to-text result may be the content shown on the right side of FIG. 3. Of course, the role separation result and the speech-to-text information may also be displayed simultaneously or in sequence, which makes it convenient for users to view meeting records and improves the user experience.
  • An embodiment of the present application further provides a computer-readable storage medium, where computer-executable instructions are stored in the computer-readable storage medium, and when a processor executes the computer-executable instructions, the method described in any one of the preceding embodiments is implemented.
  • An embodiment of the present application further provides a computer program product, including a computer program, and when the computer program is executed by a processor, the method described in any of the preceding embodiments is implemented.
  • In the several embodiments provided by the present application, it should be understood that the disclosed devices and methods may be implemented in other ways.
  • The device embodiments described above are only illustrative. For example, the division of the modules is only a division by logical function; in actual implementation there may be other division methods, for example, multiple modules may be combined or integrated into another system, or some features may be ignored or not executed.
  • the above-mentioned integrated modules implemented in the form of software function modules can be stored in a computer-readable storage medium.
  • The above-mentioned software function modules are stored in a storage medium and include several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor to execute some of the steps of the methods described in the various embodiments of the present application.
  • It should be understood that the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), and so on. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like. The steps of the methods disclosed in connection with the present application may be directly embodied as being executed by a hardware processor, or executed by a combination of hardware and software modules in the processor.
  • The memory may include high-speed RAM, and may also include non-volatile memory (NVM), such as at least one magnetic disk memory; it may also be a USB flash drive, a removable hard disk, a read-only memory, a magnetic disk, an optical disc, or the like.
  • The above-mentioned storage medium may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as a static random-access memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, a magnetic disk, or an optical disc. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.
  • An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be a component of the processor. The processor and the storage medium may be located in an application-specific integrated circuit (ASIC). Of course, the processor and the storage medium may also exist as discrete components in the electronic device or the main control device.
  • Through the description of the above implementations, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions to cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to execute the methods described in the various embodiments of the present application.

Abstract

A speech processing method, device, and storage medium, wherein the method includes: segmenting speech to be processed according to role change point information in the speech to be processed to obtain multiple speech segments (401), where the role change point information is used to indicate the positions in the speech to be processed at which the speaking role changes, the multiple speech segments include multiple first segments and at least one second segment, and the length of any first segment is greater than the length of any second segment; and clustering the multiple first segments and assigning the at least one second segment to the categories obtained after clustering, to obtain the role separation result of the speech to be processed (402). The accuracy and stability of role separation are effectively improved.


Claims (16)

  1. A speech processing method, characterized by comprising:
    acquiring single-channel speech corresponding to multiple participating roles collected by a conference system;
    segmenting the single-channel speech according to role change point information in the single-channel speech to obtain a plurality of speech segments; wherein the role change point information is used to indicate positions in the single-channel speech at which the speaking role changes; the plurality of speech segments comprise a plurality of first segments and at least one second segment, and the length of any first segment is greater than the length of any second segment;
    clustering the plurality of first segments, and assigning the at least one second segment to categories obtained after the clustering, to obtain a role separation result of the single-channel speech; and
    outputting, according to the role separation result and text information corresponding to the single-channel speech, speech text corresponding to each participating role.
  2. A speech processing method, characterized by comprising:
    segmenting speech to be processed according to role change point information in the speech to be processed to obtain a plurality of speech segments; wherein the role change point information is used to indicate positions in the speech to be processed at which the speaking role changes; the plurality of speech segments comprise a plurality of first segments and at least one second segment, and the length of any first segment is greater than the length of any second segment; and
    clustering the plurality of first segments, and assigning the at least one second segment to categories obtained after the clustering, to obtain a role separation result of the speech to be processed.
  3. The method according to claim 2, characterized in that the segmenting the speech to be processed according to the role change point information in the speech to be processed to obtain a plurality of speech segments comprises:
    determining at least one valid speech segment in the speech to be processed through voice activity detection; and
    performing role change point detection on the valid speech segment, and dividing the at least one valid speech segment into the plurality of speech segments according to the obtained role change point information;
    wherein each speech segment is speech corresponding to a single role.
  4. The method according to claim 3, characterized in that the performing role change point detection on the valid speech segment comprises:
    determining at least one speech window corresponding to the valid speech segment based on a preset window length and/or sliding duration, and extracting features of the speech window; and
    determining the role change point information according to a similarity of features of adjacent speech windows.
  5. The method according to claim 4, characterized in that the determining at least one speech window corresponding to the valid speech segment based on the preset window length and/or sliding duration, and extracting the features of the speech window, comprises:
    using multiple threads to process the valid speech segments in parallel, and, for each valid speech segment, determining at least one speech window corresponding to the valid speech segment based on the preset window length and/or sliding duration, and extracting the features of the speech window;
    correspondingly, the dividing the at least one valid speech segment into the plurality of speech segments according to the obtained role change point information comprises:
    splicing the features obtained after the parallel processing in time order, and dividing the at least one valid speech segment into the plurality of speech segments in combination with the role change point information.
  6. The method according to claim 4 or 5, characterized in that the clustering the plurality of first segments and assigning the at least one second segment to the categories obtained after the clustering comprises:
    for each first segment, averaging the features of the at least one speech window corresponding to the first segment to obtain a feature corresponding to the first segment, and clustering the plurality of first segments according to the features corresponding to the plurality of first segments; and
    for each second segment, averaging the features of the at least one speech window corresponding to the second segment to obtain a feature corresponding to the second segment, and assigning the at least one second segment to the categories obtained after the clustering according to the feature corresponding to the at least one second segment.
  7. The method according to any one of claims 2-6, characterized in that the clustering the plurality of first segments comprises:
    traversing from 2 to a preset number of categories, and clustering the plurality of first segments through a supervised clustering algorithm under each traversed number of categories to obtain a clustering result corresponding to that number of categories; and
    determining, according to the clustering results corresponding to different numbers of categories, the number of roles and the clustering result corresponding to the speech to be processed.
  8. The method according to claim 7, characterized in that the determining, according to the clustering results corresponding to different numbers of categories, the number of roles and the clustering result corresponding to the speech to be processed comprises:
    setting a current number of categories to the preset number of categories, and repeating the following steps until a final clustering result is obtained:
    computing an inter-class distance and an intra-class distance of the clustering result under the current number of categories;
    if the inter-class distance and the intra-class distance meet a requirement, the number of roles corresponding to the speech to be processed is the current number of categories, and the final clustering result is the clustering result under the current number of categories; and
    if the inter-class distance and the intra-class distance do not meet the requirement, decrementing the current number of categories by one.
  9. A speech processing method, characterized by comprising:
    segmenting speech to be processed to obtain a plurality of speech segments; wherein the plurality of speech segments comprise a plurality of first segments and at least one second segment whose credibility is lower than that of the first segments; and
    clustering the plurality of first segments, and assigning the at least one second segment to categories obtained after the clustering, to obtain a role separation result of the speech to be processed;
    wherein the credibility of a speech segment is used to characterize the credibility of a clustering result obtained by clustering based on the speech segment.
  10. A speech processing device, characterized by comprising:
    an acquisition module, configured to acquire single-channel speech corresponding to multiple participating roles collected by a conference system;
    a first segmentation module, configured to segment the single-channel speech according to role change point information in the single-channel speech to obtain a plurality of speech segments; wherein the role change point information is used to indicate positions in the single-channel speech at which the speaking role changes; the plurality of speech segments comprise a plurality of first segments and at least one second segment, and the length of any first segment is greater than the length of any second segment;
    a first processing module, configured to cluster the plurality of first segments and assign the at least one second segment to categories obtained after the clustering, to obtain a role separation result of the single-channel speech; and
    an output module, configured to output, according to the role separation result and text information corresponding to the single-channel speech, speech text corresponding to each participating role.
  11. A speech processing device, characterized by comprising:
    a second segmentation module, configured to segment speech to be processed according to role change point information in the speech to be processed to obtain a plurality of speech segments; wherein the role change point information is used to indicate positions in the speech to be processed at which the speaking role changes; the plurality of speech segments comprise a plurality of first segments and at least one second segment, and the length of any first segment is greater than the length of any second segment; and
    a second processing module, configured to cluster the plurality of first segments and assign the at least one second segment to categories obtained after the clustering, to obtain a role separation result of the speech to be processed.
  12. A speech processing device, characterized by comprising:
    a third segmentation module, configured to segment speech to be processed to obtain a plurality of speech segments; wherein the plurality of speech segments comprise a plurality of first segments and at least one second segment whose credibility is lower than that of the first segments; and
    a third processing module, configured to cluster the plurality of first segments and assign the at least one second segment to categories obtained after the clustering, to obtain a role separation result of the speech to be processed;
    wherein the credibility of a speech segment is used to characterize the credibility of a clustering result obtained by clustering based on the speech segment.
  13. A speech processing device, characterized by comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor;
    wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the speech processing device to execute the method according to any one of claims 1-9.
  14. A speech processing device, characterized by comprising: a processing device and at least one of the following communicatively connected to the processing device: a voice input device and a display device;
    wherein the voice input device is used to collect speech to be analyzed and send it to the processing device;
    the display device is used to display a role separation result determined by the processing device and/or speech-to-text information determined through the role separation result; and
    the processing device is configured to execute the method according to any one of claims 1-9.
  15. A computer-readable storage medium, characterized in that computer-executable instructions are stored in the computer-readable storage medium, and when a processor executes the computer-executable instructions, the method according to any one of claims 1-9 is implemented.
  16. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, the method according to any one of claims 1-9 is implemented.
PCT/CN2022/133015 2021-11-18 2022-11-18 语音处理方法、设备及存储介质 WO2023088448A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111365392.8 2021-11-18
CN202111365392.8A CN113808612B (zh) 2021-11-18 2021-11-18 语音处理方法、设备及存储介质

Publications (1)

Publication Number Publication Date
WO2023088448A1 true WO2023088448A1 (zh) 2023-05-25

Family

ID=78938323

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/133015 WO2023088448A1 (zh) 2021-11-18 2022-11-18 语音处理方法、设备及存储介质

Country Status (2)

Country Link
CN (1) CN113808612B (zh)
WO (1) WO2023088448A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113808612B (zh) * 2021-11-18 2022-02-11 阿里巴巴达摩院(杭州)科技有限公司 语音处理方法、设备及存储介质
CN114822511A (zh) * 2022-06-29 2022-07-29 阿里巴巴达摩院(杭州)科技有限公司 语音检测方法、电子设备及计算机存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7295970B1 (en) * 2002-08-29 2007-11-13 At&T Corp Unsupervised speaker segmentation of multi-speaker speech data
CN110517667A (zh) * 2019-09-03 2019-11-29 龙马智芯(珠海横琴)科技有限公司 一种语音处理方法、装置、电子设备和存储介质
CN110930984A (zh) * 2019-12-04 2020-03-27 北京搜狗科技发展有限公司 一种语音处理方法、装置和电子设备
CN111899755A (zh) * 2020-08-11 2020-11-06 华院数据技术(上海)有限公司 一种说话人语音分离方法及相关设备
CN113192516A (zh) * 2021-04-22 2021-07-30 平安科技(深圳)有限公司 语音角色分割方法、装置、计算机设备及存储介质
CN113808612A (zh) * 2021-11-18 2021-12-17 阿里巴巴达摩院(杭州)科技有限公司 语音处理方法、设备及存储介质

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102074236B (zh) * 2010-11-29 2012-06-06 清华大学 一种分布式麦克风的说话人聚类方法
US20210090561A1 (en) * 2019-09-24 2021-03-25 Amazon Technologies, Inc. Alexa roaming authentication techniques


Also Published As

Publication number Publication date
CN113808612A (zh) 2021-12-17
CN113808612B (zh) 2022-02-11

Similar Documents

Publication Publication Date Title
US11636860B2 (en) Word-level blind diarization of recorded calls with arbitrary number of speakers
WO2023088448A1 (zh) 语音处理方法、设备及存储介质
CN107305541B (zh) 语音识别文本分段方法及装置
CN105405439B (zh) 语音播放方法及装置
CN111128223B (zh) 一种基于文本信息的辅助说话人分离方法及相关装置
CN107562760B (zh) 一种语音数据处理方法及装置
WO2018108080A1 (zh) 一种基于声纹搜索的信息推荐方法及装置
US20160283185A1 (en) Semi-supervised speaker diarization
Sahoo et al. Emotion recognition from audio-visual data using rule based decision level fusion
CN106782615A (zh) 语音数据情感检测方法和装置及系统
CN111797632B (zh) 信息处理方法、装置及电子设备
WO2020147256A1 (zh) 会议内容区分方法、装置、计算机设备及存储介质
CN109036471B (zh) 语音端点检测方法及设备
US9251808B2 (en) Apparatus and method for clustering speakers, and a non-transitory computer readable medium thereof
CN108305618B (zh) 语音获取及搜索方法、智能笔、搜索终端及存储介质
CN111326139B (zh) 一种语种识别方法、装置、设备及存储介质
CN111128128B (zh) 一种基于互补模型评分融合的语音关键词检测方法
CN111462758A (zh) 智能会议角色分类的方法、装置、设备及存储介质
CN114141252A (zh) 声纹识别方法、装置、电子设备和存储介质
WO2021196390A1 (zh) 声纹数据生成方法、装置、计算机装置及存储介质
CN106710588B (zh) 语音数据句类识别方法和装置及系统
US11238289B1 (en) Automatic lie detection method and apparatus for interactive scenarios, device and medium
CN114049898A (zh) 一种音频提取方法、装置、设备和存储介质
CN112992175B (zh) 一种语音区分方法及其语音记录装置
CN114822557A (zh) 课堂中不同声音的区分方法、装置、设备以及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22894977

Country of ref document: EP

Kind code of ref document: A1