WO2020238209A1 - Audio processing method, system and related device
- Publication number: WO2020238209A1
- Application: PCT/CN2019/130550 (CN2019130550W)
- Authority: WIPO (PCT)
- Prior art keywords
- audio
- data
- audio data
- human voice
- monophonic
- Prior art date
Classifications
- G—PHYSICS
  - G10—MUSICAL INSTRUMENTS; ACOUSTICS
    - G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
      - G10L15/00—Speech recognition
        - G10L15/26—Speech to text systems
      - G10L17/00—Speaker identification or verification
        - G10L17/06—Decision making techniques; Pattern matching strategies
          - G10L17/14—Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
        - G10L17/20—Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
      - G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
        - G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
          - G10L21/0272—Voice signal separating
      - G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
        - G10L25/78—Detection of presence or absence of voice signals
          - G10L25/87—Detection of discrete points within a voice signal
Description
- This application relates to audio processing methods, systems and related equipment.
- Speaker diarization refers to the process of dividing the audio data of a multi-person conversation according to speaker and labeling each segment.
- Existing speaker separation systems reach a practical level of accuracy in clean near-field environments, but in relatively complex environments the accuracy of speaker-independent single-channel speech separation, which is the important case, remains low.
- To address this, an audio processing method, system, and related equipment are provided.
- An audio processing method includes: obtaining audio data, wherein the audio data includes noise and the human voices generated when n interlocutors conduct a conversation, and n is a positive integer; obtaining human voice audio data and switching points according to the audio data, wherein the human voice audio data is the human voice generated when the n interlocutors conduct a conversation, obtained after the noise is removed from the audio data, and a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor; converting the human voice audio data into m monophonic data according to the switching points, wherein each of the m monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; and clustering the m monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belongs to the same interlocutor.
- An audio processing method includes: acquiring audio data, wherein the audio data includes noise and the human voices generated when n interlocutors conduct a conversation, and n is a positive integer; converting the audio data into corresponding text data through a voice recognition method; obtaining human voice audio data and switching points according to the audio data and text data, wherein the human voice audio data is the audio data obtained by removing the noise from the audio data, and a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor; converting the human voice audio data into m monophonic data according to the switching points, wherein each of the m monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; clustering the m monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belongs to the same interlocutor; and confirming, according to the text data, the interlocutor to which each of the n audio groups belongs.
- An audio processing system includes: an acquisition unit configured to acquire audio data, wherein the audio data includes noise and the human voices generated when n interlocutors conduct a conversation, and n is a positive integer; an obtaining unit configured to obtain human voice audio data and switching points according to the audio data, wherein the human voice audio data is the audio data obtained by removing the noise from the audio data, and a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor; a conversion unit configured to convert the human voice audio data into m monophonic data according to the switching points, wherein each of the m monophonic data contains only the human voice of a single interlocutor, and m is a positive integer; and a clustering unit configured to cluster the m monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belongs to the same interlocutor.
- An audio processing system includes: an acquisition unit configured to acquire audio data, wherein the audio data includes noise and the human voices generated when n interlocutors conduct a conversation, and n is a positive integer; a first conversion unit configured to convert the audio data into audio text through a voice recognition method; an obtaining unit configured to obtain human voice audio data and switching points according to the audio data and audio text, wherein the human voice audio data is the audio data obtained by removing the noise from the audio data, and a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor; a second conversion unit configured to convert the human voice audio data into m monophonic data according to the human voice audio data and the switching points, wherein each of the m monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; a clustering unit configured to cluster the m monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belongs to the same interlocutor; and a confirmation unit configured to confirm, according to the audio text, the interlocutor to which each of the n audio groups belongs.
- A server includes a processor and a memory, the memory being used to store instructions and the processor being used to execute them; when executing the instructions, the processor performs the following steps: acquiring audio data, wherein the audio data includes noise and the human voices produced when n interlocutors conduct a conversation, and n is a positive integer; obtaining human voice audio data and switching points according to the audio data, wherein the human voice audio data is the human voice generated when the n interlocutors conduct a conversation, obtained after the noise is removed from the audio data, and a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor; converting the human voice audio data into m monophonic data according to the switching points, wherein each of the m monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; and clustering the m monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belongs to the same interlocutor.
- A server includes a processor and a memory, the memory being used to store instructions and the processor being used to execute them; when executing the instructions, the processor performs the following steps: acquiring audio data, wherein the audio data includes noise and the human voices produced when n interlocutors conduct a conversation, and n is a positive integer; converting the audio data into corresponding text data through a voice recognition method; obtaining human voice audio data and switching points according to the audio data and text data, wherein the human voice audio data is the audio data obtained by removing the noise from the audio data, and a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor; converting the human voice audio data into m monophonic data according to the switching points, wherein each of the m monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; and clustering the m monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belongs to the same interlocutor.
- A computer non-transitory storage medium stores a computer program which, when executed by a computing device, implements the following steps: acquiring audio data, wherein the audio data includes noise and the human voices generated when n interlocutors conduct a conversation, and n is a positive integer; obtaining human voice audio data and switching points according to the audio data, wherein the human voice audio data is the human voice generated when the n interlocutors conduct a conversation, obtained after the noise is removed from the audio data, and a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor; converting the human voice audio data into m monophonic data according to the switching points, wherein each of the m monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; and clustering the m monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belongs to the same interlocutor.
- A computer non-transitory storage medium stores a computer program which, when executed by a computing device, implements the following steps: acquiring audio data, wherein the audio data includes noise and the human voices generated when n interlocutors conduct a conversation, and n is a positive integer; converting the audio data into corresponding text data through a voice recognition method; obtaining human voice audio data and switching points according to the audio data and the text data, wherein the human voice audio data is the audio data obtained by removing the noise from the audio data, and a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor; converting the human voice audio data into m monophonic data according to the switching points, wherein each of the m monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; and clustering the m monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belongs to the same interlocutor.
- A computer program product, when read and executed by a computer, realizes the following steps: acquiring audio data, wherein the audio data includes noise and the human voices generated when n interlocutors conduct a conversation, and n is a positive integer; obtaining human voice audio data and switching points according to the audio data, wherein the human voice audio data is the human voice generated when the n interlocutors conduct a conversation, obtained after the noise is removed from the audio data, and a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor; converting the human voice audio data into m monophonic data according to the switching points, wherein each of the m monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; and clustering the m monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belongs to the same interlocutor.
- A computer program product, when read and executed by a computer, realizes the following steps: acquiring audio data, wherein the audio data includes noise and the human voices generated when n interlocutors conduct a conversation, and n is a positive integer; converting the audio data into corresponding text data through a voice recognition method; obtaining human voice audio data and switching points according to the audio data and text data, wherein the human voice audio data is the audio data obtained after the noise is removed, and a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor; converting the human voice audio data into m monophonic data according to the switching points, wherein each of the m monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; and clustering the m monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belongs to the same interlocutor.
- Fig. 1 is a schematic structural diagram of a speaker separation system provided by the present application.
- Fig. 2 is a schematic flowchart of an audio processing method provided by the present application.
- FIG. 3 is a schematic flowchart of clustering m monophonic data into n audio groups in an application scenario provided by the present application.
- Fig. 4 is a detailed flowchart of an audio processing method provided by the present application.
- Fig. 5 is a schematic flowchart of another audio processing method provided by the present application.
- Fig. 6 is a detailed flowchart of another audio processing method provided by the present application.
- Fig. 7 is a schematic structural diagram of an audio processing system provided by the present application.
- Fig. 8 is a schematic structural diagram of another audio processing system provided by the present application.
- Fig. 9 is a schematic structural diagram of a server provided by the present application.
- Speaker separation refers to the process of dividing and labeling audio data in a multi-person conversation according to the speakers.
- Conventionally, a speaker separation system generally uses the Bayesian Information Criterion (BIC) as the similarity measure for speaker separation.
- In that technology, the input audio data mainly passes in turn through the input module 101, the silence detection module 102, the speaker recognition module 103, and the subsequent modules shown in Fig. 1.
- the silence detection module 102 is used to remove the silence part of the input audio data to obtain the second audio data;
- The speaker recognition module 103 learns the voiceprint characteristics of speakers from a large amount of business-scenario data. For example, if the speaker separation system is used to separate the voices of customer service agents and users, the speaker recognition module learns a large number of customer service and user voiceprint characteristics, such as the agents' intonation and prosody features, so that, based on the second audio data, it can determine the identity of the speaker of each utterance in the current audio data and obtain the third audio data;
- the switching point detection module 104 is used to determine the dialogue switching point of the speaker according to the third audio data;
- the conversion module 105 is used to edit the third audio data into multiple pieces of audio data according to the switching points;
- the classification module 106 is used to classify the multiple pieces of audio data according to the identity of the speaker detected by the speaker recognition module 103 to obtain the speaker separation result.
- This system not only needs to train several models on business scene data (a silence detection model, a speaker recognition model, and a switching point model) in the training phase, but also has a long detection pipeline: the data must pass in turn through the seven modules shown in Figure 1 before a result is obtained, which takes a lot of time.
- Fig. 2 is an audio processing method provided by the present application. The method includes the following steps:
- the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer.
- the audio data may specifically be audio files such as telephone recording or video recording that require speaker separation, for example, a recording file of a communication between a user and a customer service phone, a video recording file of a meeting, and so on.
- the noise concept in this application refers to non-target human voices that do not need to be separated from the speaker, and specifically can be ambient sound, silence, equipment noise generated by recording equipment, and so on.
- Human voices may also appear in the ambient sound. For example, in the audio data generated when user A, in a noisy restaurant, communicates with customer service B by phone, the voices of other people in the restaurant are also human voices, but they are not target human voices that require speaker separation in the subsequent steps, and are therefore also classified as noise. It should be understood that the above examples are only for illustration and do not constitute a specific limitation.
- The human voice audio data is the human voice generated when the n interlocutors conduct a conversation, obtained after removing the noise from the audio data; a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor.
- For example, the audio data of a dialogue between A and B can be expressed in text as: (00:12) A: "I want to check the phone bill" (00:14) B: "Are you checking the phone bill of this phone" (00:18) A: "Yes" (00:20) B: "Please wait a moment" (00:25) B: "You still have 25 yuan in your balance of call charges".
- The numbers in front of A and B represent the playback time of the current audio data, that is, positions on the audio time axis. In this example there are 3 switching points: 00:14, 00:18, and 00:20.
- Since step S202 does not perform voice recognition, the switching points are obtained based only on the audio data of the dialogue between A and B, without knowing the content of the dialogue.
- the above examples are only for illustration and cannot constitute a specific limitation.
- Obtaining human voice audio data and switching points according to the audio data includes: inputting the audio data into a human voice separation model to obtain the human voice audio data, where the human voice separation model is a model obtained by training a neural network on known audio samples and corresponding known human voice data samples; and, at the same time, inputting the audio data into a speaker change detection (SCD) model to obtain the switching points, where the switching point detection model is a model obtained by training a neural network on known audio samples and corresponding known switching point samples.
- The human voice separation model may specifically be a voice activity detection (VAD) model, which adopts an event-detection-like scheme to learn the features that distinguish human voice from non-human voice. It is understandable that the VAD result largely removes interference factors such as environmental noise, non-target human voices, and equipment noise; compared with the silence/non-silence detection of the embodiment of Figure 1, the final separation result is therefore more accurate.
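- As a minimal illustration of VAD-style voice activity detection (a sketch only; the human voice separation model described in this application is a trained neural network, not this heuristic, and the frame length and threshold below are illustrative assumptions):

```python
# Energy-based VAD sketch: mark frames whose RMS energy exceeds a threshold
# as voiced, then merge consecutive voiced frames into (start, end) intervals.
import numpy as np

def energy_vad(samples, sample_rate, frame_ms=30, threshold=0.02):
    """Return (start_sec, end_sec) intervals judged to contain voice."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames.astype(np.float64) ** 2, axis=1))
    voiced = rms > threshold
    intervals, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i                      # a voiced run begins
        elif not v and start is not None:  # the run ends
            intervals.append((start * frame_ms / 1000.0, i * frame_ms / 1000.0))
            start = None
    if start is not None:
        intervals.append((start * frame_ms / 1000.0, n_frames * frame_ms / 1000.0))
    return intervals
```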
- In step S202, since the input of both the human voice separation model and the switching point detection model is the audio data, human voice separation can be performed in parallel with switching point detection without the two interfering with each other, thereby reducing the time required for speaker separation.
- The switching points are conversation time points on the time axis of the audio data, and the time axis of the noise-removed human voice audio data is the same as that of the audio data, so the next step can use the switching points on this shared time axis to convert the human voice audio data into m monophonic data.
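- A minimal sketch of this conversion, cutting the human voice audio at the switching points on the shared time axis (the sample rate and the switch-point list below are illustrative assumptions):

```python
# Cut human-voice audio into monophonic segments at the switching points
# (times in seconds on the time axis shared with the original audio data).
import numpy as np

def split_at_switch_points(audio, sample_rate, switch_points_sec):
    """Return the m monophonic segments, one per speaker turn."""
    bounds = [0] + [int(t * sample_rate) for t in switch_points_sec] + [len(audio)]
    return [audio[bounds[i]:bounds[i + 1]] for i in range(len(bounds) - 1)]

# For the dialogue example above (switch points 00:14, 00:18, 00:20),
# this would yield m = 4 monophonic segments:
# segments = split_at_switch_points(voice_audio, 16000, [14.0, 18.0, 20.0])
```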
- the switching point detection model and the human voice separation model can also make full use of labeled data related to the business scenario for training, thereby further improving the accuracy of switching point detection and human voice separation.
- each of the m monophonic data includes only the human voice of a single interlocutor, and m is a positive integer.
- Continuing the example above, the audio data can be converted into 4 monophonic data, as shown in FIG. 3.
- each monophonic data has only one interlocutor’s human voice.
- A piece of monophonic data can be the audio corresponding to the single sentence "I want to check the phone bill", or the audio corresponding to the two consecutive sentences "Please wait a moment" and "You still have 25 yuan in your balance of call charges".
- In other words, the human voice included in each piece of monophonic data is that of only user A or only customer service B.
- For ease of understanding, FIG. 3 shows the monophonic data in text form; the monophonic data itself is only audio data and contains no text information.
- The monophonic data of each of the n audio groups belongs to the same interlocutor. Still taking the above example, after the 4 monophonic data are obtained, they are clustered into 2 audio groups, as shown in Figure 3: one audio group consists of the 2 monophonic data corresponding to "I want to check the phone bill" and "Yes"; the other consists of the 2 monophonic data corresponding to "Are you checking the phone bill of this phone" and "Please wait a moment, you still have 25 yuan in your balance of call charges". In this way, the audio data containing the voices of n interlocutors is finally converted into n audio groups.
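- One possible realization of this clustering step (the application does not prescribe a particular embedding or clustering algorithm; `embed_segment` is a hypothetical function returning a fixed-length speaker embedding such as an i-vector or d-vector):

```python
# Cluster the m monophonic segments into n speaker groups by embedding each
# segment as a speaker vector and grouping the vectors agglomeratively.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_segments(segments, embed_segment, n_speakers=2):
    X = np.stack([embed_segment(seg) for seg in segments])   # (m, d) matrix
    labels = AgglomerativeClustering(n_clusters=n_speakers).fit_predict(X)
    groups = {}
    for seg, label in zip(segments, labels):
        groups.setdefault(label, []).append(seg)
    return groups  # n audio groups, each holding one interlocutor's segments
```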
- Optionally, the method further includes: converting the m monophonic data into corresponding m pieces of text information using an automatic speech recognition (ASR) method; and confirming, according to the m pieces of text information, the interlocutor to which each of the n audio groups belongs.
- The interlocutor of each audio group can be determined from keywords or key phrases in the text information.
- For example, based on keywords such as "Please wait", the interlocutor of category 2 can be determined to be the customer service, and the interlocutor of category 1 is then determined to be the user; alternatively, based on the keywords "I want" and "check call charges", the interlocutor of category 1 is determined to be the user and the interlocutor of category 2 the customer service, and so on.
- This application does not specifically limit the setting of keywords.
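- A minimal sketch of such keyword-based role assignment (the keyword lists mirror the example above and are illustrative assumptions, as is the tie-breaking rule):

```python
# Assign a role to an audio group from keywords found in its recognized text.
AGENT_KEYWORDS = ["Please wait", "Are you checking"]
USER_KEYWORDS = ["I want", "check the phone bill"]

def assign_role(group_texts):
    """group_texts: ASR transcripts of one audio group's segments."""
    joined = " ".join(group_texts)
    agent_hits = sum(kw in joined for kw in AGENT_KEYWORDS)
    user_hits = sum(kw in joined for kw in USER_KEYWORDS)
    # Ties (including no hits) default to "user" here; a deployed system
    # would need a better-calibrated rule or keyword set.
    return "customer service" if agent_hits > user_hits else "user"
```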
- The m monophonic data can be converted into the corresponding m pieces of text information through voice recognition methods such as dynamic time warping (DTW), hidden Markov model (HMM) theory, vector quantization (VQ) technology, and artificial neural networks (ANN), which is not specifically limited in this application.
- FIG. 4 shows a detailed flowchart of the audio processing method provided by the present application. As can be seen from Figure 4, compared with the speaker separation method of the embodiment of Figure 1, the audio processing method provided by the present application only needs to train the human voice separation model and the switching point detection model on business scene data during the training phase, greatly reducing the time and labor costs of training. The steps of the detection phase can be processed in parallel, which greatly reduces the time required for speaker separation, and the human voice separation model removes the interference of noise, which greatly improves accuracy.
- In summary, audio data is obtained; human voice audio data and switching points are obtained according to the audio data; the human voice audio data is converted into m monophonic data according to the switching points; and the m monophonic data are clustered to obtain n audio groups, wherein the monophonic data of each of the n audio groups belongs to the same interlocutor, so that the identity of the voice in each of the n audio groups can be confirmed. This largely removes noise and other interference factors from the audio data, improving the accuracy of speaker separation in complex environments; in addition, multiple steps of the detection phase can be processed in parallel, simplifying the speaker separation process and increasing the separation speed.
- FIG. 5 is another audio processing method provided by this application.
- The difference between the audio processing method shown in FIG. 5 and the audio processing method shown in FIG. 2 is that voice recognition is performed in advance to obtain the text features of the audio data, so that the switching point detection model can comprehensively consider both audio features and text features and detect the interlocutor switching points more accurately.
- the audio processing method shown in FIG. 5 includes the following steps:
- the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer.
- the audio data may specifically be audio files such as telephone recording or video recording that require speaker separation, for example, a recording file of a communication between a user and a customer service phone, a video recording file of a meeting, and so on.
- the noise concept in this application refers to non-target human voices that do not need to be separated from the speaker, and specifically can be ambient sound, silence, equipment noise generated by recording equipment, and so on.
- Human voices may also appear in the ambient sound. For example, in the audio data generated when user A, in a noisy restaurant, communicates with customer service B by phone, the voices of other people in the restaurant are also human voices, but they are not target human voices that require speaker separation in the subsequent steps, and are therefore also classified as noise. It should be understood that the above examples are only for illustration and do not constitute a specific limitation.
- Specifically, the audio data can be converted into the corresponding text data through speech recognition methods such as DTW, HMM theory, VQ technology, and ANN, which is not specifically limited in this application.
- The human voice audio data is the human voice generated when the n interlocutors conduct a conversation, obtained after removing the noise from the audio data; a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor.
- For example, the audio data of a dialogue between A and B can be expressed in text as: (00:12) A: "I want to check the phone bill" (00:14) B: "Are you checking the phone bill of this phone" (00:18) A: "Yes" (00:20) B: "Please wait a moment" (00:25) B: "You still have 25 yuan in your balance of call charges".
- The numbers in front of A and B represent the playback time of the current recording. In this example there are 3 switching points: 00:14, 00:18, and 00:20. It should be understood that, since the audio data has been voice-recognized in step S502, the switching point detection model can comprehensively consider audio features and text features to obtain more accurate switching points.
- the above examples are only for illustration and do not constitute a specific limitation.
- Obtaining human voice audio data and switching points according to the audio data and audio text includes: inputting the audio data into a human voice separation model to obtain the human voice audio data, where the human voice separation model is a model obtained by training a neural network on known audio samples and corresponding known human voice data samples; and, at the same time, inputting the audio data and audio text into the switching point detection model to obtain the switching points, where the switching point detection model is a model obtained by training a neural network on known audio samples, known audio text samples, and corresponding known switching point samples.
- The human voice separation model may specifically be a VAD model, which uses an event-detection-like scheme to learn the features that distinguish human voice from non-human voice. The VAD result largely removes interference factors such as environmental noise, non-target human voices, and equipment noise; compared with the silence/non-silence detection of the embodiment of Figure 1, the final separation result is therefore more accurate.
- In practice, the speech recognition step and the human voice separation step can be performed simultaneously, followed by the switching point detection step; alternatively, the speech recognition step is performed first, and then the human voice separation step and the switching point detection step are performed simultaneously.
- The switching points are conversation time points on the time axis of the audio data, and the time axis of the noise-removed human voice audio data is the same as that of the audio data, so the next step can use the switching points on this shared time axis to convert the human voice audio data into m monophonic data.
- the switching point detection model and the human voice separation model can also make full use of labeled data related to the business scenario for training, thereby further improving the accuracy of switching point detection and human voice separation.
- The switching point detection model in this embodiment of the application is a model obtained by training a neural network on known audio samples, known audio text samples, and corresponding known switching point samples. The switching point detection model can therefore comprehensively consider audio features and text features, detect the interlocutor switching points more accurately, and further improve the accuracy of speaker separation.
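- One way such a model could fuse the two feature streams (a sketch under assumed dimensions and architecture; the application only specifies the training data, not the network structure):

```python
# Switch-point detector that jointly considers audio and text features by
# concatenating frame-aligned feature streams before a recurrent layer.
import torch
import torch.nn as nn

class SwitchPointDetector(nn.Module):
    def __init__(self, audio_dim=40, text_dim=128, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(audio_dim + text_dim, hidden,
                           batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)   # per-frame switch score

    def forward(self, audio_feats, text_feats):
        # audio_feats: (batch, frames, audio_dim), e.g. filterbank features
        # text_feats:  (batch, frames, text_dim), frame-aligned text embeddings
        fused = torch.cat([audio_feats, text_feats], dim=-1)
        out, _ = self.rnn(fused)
        return torch.sigmoid(self.head(out)).squeeze(-1)  # (batch, frames)
```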
- each of the m monophonic data includes only the human voice of a single interlocutor, and m is a positive integer.
- Continuing the example above, the audio data can be converted into 4 monophonic data, as shown in FIG. 3.
- each monophonic data has only one interlocutor’s human voice.
- A piece of monophonic data can be the audio corresponding to the single sentence "I want to check the phone bill", or the audio corresponding to the two consecutive sentences "Please wait a moment" and "You still have 25 yuan in your balance of call charges".
- In other words, the human voice included in each piece of monophonic data is that of only user A or only customer service B.
- Correspondingly, the text data can also be converted into m monophonic texts, each corresponding to one piece of monophonic data, as shown in FIG. 3, for use in the subsequent clustering.
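- Assuming the recognizer emits word-level timestamps (an assumption; the application does not specify the alignment mechanism), the text can be cut at the same switching points as the audio:

```python
# Split ASR output into m monophonic texts aligned with the audio segments.
def split_text_at_switch_points(words, switch_points_sec):
    """words: list of (word, start_sec) pairs in time order."""
    boundaries = list(switch_points_sec) + [float("inf")]
    texts, current, b = [], [], 0
    for word, start in words:
        while start >= boundaries[b]:      # crossed a switching point
            texts.append(" ".join(current))
            current, b = [], b + 1
        current.append(word)
    texts.append(" ".join(current))        # close the final turn
    return texts
```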
- The monophonic data of each of the n audio groups belongs to the same interlocutor. Still taking the above example, after the 4 monophonic data are obtained, they are clustered into 2 audio groups, as shown in Figure 3: one audio group consists of the 2 monophonic data corresponding to "I want to check the phone bill" and "Yes"; the other consists of the 2 monophonic data corresponding to "Are you checking the phone bill of this phone" and "Please wait a moment, your call charge balance is 25 yuan". The audio data containing the voices of the n interlocutors is thus finally converted into n audio groups.
- The interlocutor of each audio group can be determined from the keywords or key phrases in the monophonic text corresponding to each piece of monophonic data. For example, based on keywords such as "Excuse me" and "Please wait", the interlocutor of category 2 shown in Figure 3 can be determined to be the customer service, and the interlocutor of category 1 is then determined to be the user; alternatively, based on the keywords "I want to" and "check call charges", the interlocutor of category 1 shown in Figure 3 is determined to be the user and the interlocutor of category 2 the customer service, and so on. This application does not specifically limit the setting of keywords.
- Fig. 6 shows a detailed flowchart of this audio processing method. As can be seen from Fig. 6, compared with the speaker separation method of the embodiment of Fig. 1, the audio processing method provided by this application only needs to train the human voice separation model and the switching point detection model on business scene data during the training phase, greatly reducing the time and labor costs of training. The steps of the detection phase can be processed in parallel, which greatly reduces the time required for speaker separation; the human voice separation model removes the interference of noise, and the switching point detection model integrates text features and audio features, which greatly improves the accuracy of switching point detection and thus the accuracy of speaker separation.
- In summary, audio data is acquired; the audio data is converted into corresponding text data through a voice recognition method; human voice audio data and switching points are obtained according to the audio data and text data; the human voice audio data is converted into m monophonic data according to the switching points; and the m monophonic data are clustered to obtain n audio groups, wherein the monophonic data of each of the n audio groups belongs to the same interlocutor. This largely removes noise and other interference factors from the audio data, improving the accuracy of speaker separation in complex environments; in addition, multiple steps of the detection phase can be processed in parallel, simplifying the speaker separation process and increasing the separation speed.
- FIG. 7 is an audio processing system provided by the present application.
- The audio processing system 700 includes an acquisition unit 710, an obtaining unit 720, a conversion unit 730, and a clustering unit 740, wherein:
- the acquiring unit 710 is configured to acquire audio data, wherein the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer;
- the obtaining unit 720 is configured to obtain human voice audio data and a switching point according to the audio data, where the human voice audio data is audio data obtained by removing the noise from the audio data, and the switching point is Any one of the n interlocutors switches to the conversation time point of another interlocutor;
- the conversion unit 730 is configured to convert the human voice audio data into m monophonic data according to the switching points, wherein each of the m monophonic data contains only the human voice of a single interlocutor, and m is a positive integer;
- the clustering unit 740 is configured to cluster the m monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belongs to the same interlocutor.
- the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer.
- the audio data may specifically be audio files such as telephone recording or video recording that require speaker separation, for example, a recording file of a communication between a user and a customer service phone, a video recording file of a meeting, and so on.
- the noise concept in this application refers to non-target human voices that do not need to be separated from the speaker, and specifically can be ambient sound, silence, equipment noise generated by recording equipment, and so on.
- Human voices may also appear in the ambient sound. For example, in the audio data generated when user A, in a noisy restaurant, communicates with customer service B by phone, the voices of other people in the restaurant are also human voices, but they are not target human voices that require speaker separation in the subsequent steps, and are therefore also classified as noise. It should be understood that the above examples are only for illustration and do not constitute a specific limitation.
- The human voice audio data is the human voice generated when the n interlocutors conduct a conversation, obtained after removing the noise from the audio data; a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor.
- For example, the audio data of a dialogue between A and B can be expressed in text as: (00:12) A: "I want to check the phone bill" (00:14) B: "Are you checking the phone bill of this phone" (00:18) A: "Yes" (00:20) B: "Please wait a moment" (00:25) B: "You still have 25 yuan in your balance of call charges".
- The numbers in front of A and B represent the playback time of the current recording. In this example there are 3 switching points: 00:14, 00:18, and 00:20.
- Since step S202 does not perform voice recognition, the switching points are obtained based only on the audio data of the dialogue between A and B, without knowing the content of the dialogue.
- the above examples are only for illustration and cannot constitute a specific limitation.
- In an embodiment, the obtaining unit 720 is configured to: input the audio data into a human voice separation model to obtain the human voice audio data, where the human voice separation model is a model obtained by training a neural network on known audio samples and corresponding known human voice data samples; and, at the same time, input the audio data into the switching point detection model to obtain the switching points, where the switching point detection model is a model obtained by training a neural network on known audio samples and corresponding known switching point samples.
- The human voice separation model may specifically be a voice activity detection (VAD) model, which adopts an event-detection-like scheme to learn the features that distinguish human voice from non-human voice. The VAD result largely removes interference factors such as environmental noise, non-target human voices, and equipment noise; compared with the silence/non-silence detection of the embodiment of Figure 1, the final separation result is therefore more accurate.
- Since the input of both the human voice separation model and the switching point detection model is the audio data, human voice separation can be performed in parallel with switching point detection without the two interfering with each other, thereby reducing the time required for speaker separation.
- The switching points are conversation time points on the time axis of the audio data, and the time axis of the noise-removed human voice audio data is the same as that of the audio data, so the next step can use the switching points on this shared time axis to convert the human voice audio data into m monophonic data.
- the switching point detection model and the human voice separation model can also make full use of labeled data related to the business scenario for training, thereby further improving the accuracy of switching point detection and human voice separation.
- each of the m monophonic data includes only the human voice of a single interlocutor, and m is a positive integer.
- Continuing the example above, the audio data can be converted into 4 monophonic data, as shown in FIG. 3.
- each monophonic data has only one interlocutor’s human voice.
- A piece of monophonic data can be the audio corresponding to the single sentence "I want to check the phone bill", or the audio corresponding to the two consecutive sentences "Please wait a moment" and "You still have 25 yuan in your balance of call charges".
- In other words, the human voice included in each piece of monophonic data is that of only user A or only customer service B.
- For ease of understanding, FIG. 3 shows the monophonic data in text form; the monophonic data itself is only audio data and contains no text information.
- The monophonic data of each of the n audio groups belongs to the same interlocutor. Still taking the above example, after the 4 monophonic data are obtained, they are clustered into 2 audio groups, as shown in Figure 3: one audio group consists of the 2 monophonic data corresponding to "I want to check the phone bill" and "Yes"; the other consists of the 2 monophonic data corresponding to "Are you checking the phone bill of this phone" and "Please wait a moment, you still have 25 yuan in your balance of call charges". In this way, the audio data containing the voices of n interlocutors is finally converted into n audio groups.
- Optionally, the system further includes a confirmation unit 750, which is configured to: after the m monophonic data are clustered into the n audio groups, convert the m monophonic data into corresponding m pieces of text information through a voice recognition method, and confirm, according to the m pieces of text information, the interlocutor to which each of the n audio groups belongs.
- The interlocutor of each audio group can be determined from keywords or key phrases in the text information.
- For example, based on keywords such as "Please wait", the interlocutor of category 2 can be determined to be the customer service, and the interlocutor of category 1 is then determined to be the user; alternatively, based on the keywords "I want" and "check call charges", the interlocutor of category 1 is determined to be the user and the interlocutor of category 2 the customer service, and so on.
- This application does not specifically limit the setting of keywords.
- The m monophonic data can be converted into the corresponding m pieces of text information through voice recognition methods such as dynamic time warping (DTW), hidden Markov model (HMM) theory, vector quantization (VQ) technology, and artificial neural networks (ANN), which is not specifically limited in this application.
- Since the input data of the clustering process is the m monophonic data, and the input data of the speech recognition process can also be the m monophonic data, the clustering and speech recognition processes can be processed in parallel, further speeding up the overall speaker separation process and improving the user experience.
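- A minimal sketch of this parallelism using Python threads (`cluster_segments` and `recognize` stand in for the clustering unit and the speech recognizer and are hypothetical here):

```python
# Run clustering and speech recognition over the same m monophonic
# segments concurrently, since neither depends on the other's output.
from concurrent.futures import ThreadPoolExecutor

def separate_speakers(segments, cluster_segments, recognize):
    with ThreadPoolExecutor(max_workers=2) as pool:
        groups_future = pool.submit(cluster_segments, segments)
        texts_future = pool.submit(lambda: [recognize(s) for s in segments])
        return groups_future.result(), texts_future.result()
```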
- As can be seen from FIG. 7, compared with the speaker separation system of the embodiment of FIG. 1, the audio processing system provided by this application only needs to train the human voice separation model and the switching point detection model on the business scene data in the training phase, greatly reducing time and labor costs; the steps of the detection stage can be processed in parallel, which greatly shortens the time required for speaker separation; and the human voice separation model removes the interference of noise, which greatly improves accuracy.
- In summary, audio data is obtained; human voice audio data and switching points are obtained according to the audio data; the human voice audio data is converted into m monophonic data according to the switching points; and the m monophonic data are clustered to obtain n audio groups, wherein the monophonic data of each of the n audio groups belongs to the same interlocutor, so that the identity of the voice in each of the n audio groups can be confirmed. This largely removes noise and other interference factors from the audio data, improving the accuracy of speaker separation in complex environments; in addition, multiple steps of the detection phase can be processed in parallel, simplifying the speaker separation process and increasing the separation speed.
- Fig. 8 is another audio processing system provided by the present application.
- The difference between the audio processing system shown in Fig. 8 and the audio processing system shown in Fig. 7 is that voice recognition is performed in advance to obtain the text features of the audio data, so that the switching point detection model can comprehensively consider both audio features and text features and detect the interlocutor switching points more accurately.
- The audio processing system 800 shown in FIG. 8 includes an acquisition unit 810, a first conversion unit 820, an obtaining unit 830, a second conversion unit 840, a clustering unit 850, and a confirmation unit 860, where
- the acquiring unit 810 is configured to acquire audio data, where the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer;
- the first conversion unit 820 is configured to convert the audio data into audio text through a voice recognition method;
- the obtaining unit 830 is configured to obtain human voice audio data and switching points according to the audio data and audio text, where the human voice audio data is the audio data obtained by removing the noise from the audio data, and a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor;
- the second conversion unit 840 is configured to convert the human voice audio data into m monophonic data according to the human voice audio data and the switching points, wherein each of the m monophonic data includes only the human voice of a single interlocutor, and m is a positive integer;
- the clustering unit 850 is configured to cluster the m monophonic data to obtain n audio groups, wherein the monophonic data of each audio group in the n audio groups belong to the same interlocutor;
- the confirmation unit 860 is configured to confirm the interlocutor to which each audio group of the n audio groups belongs according to the audio text.
- the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer.
- the audio data may specifically be audio files such as telephone recording or video recording that require speaker separation, for example, a recording file of a communication between a user and a customer service phone, a video recording file of a meeting, and so on.
- the noise concept in this application refers to non-target human voices that do not need to be separated from the speaker, and specifically can be ambient sound, silence, equipment noise generated by recording equipment, and so on.
- Human voices may also appear in the ambient sound. For example, in the audio data generated when user A, in a noisy restaurant, communicates with customer service B by phone, the voices of other people in the restaurant are also human voices, but they are not target human voices that require speaker separation in the subsequent steps, and are therefore also classified as noise. It should be understood that the above examples are only for illustration and do not constitute a specific limitation.
- Specifically, the audio data can be converted into the corresponding text data through speech recognition methods such as DTW, HMM theory, VQ technology, and ANN, which is not specifically limited in this application.
- The human voice audio data is the human voice generated when the n interlocutors conduct a conversation, obtained after removing the noise from the audio data; a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor.
- For example, the audio data of a dialogue between A and B can be expressed in text as: (00:12) A: "I want to check the phone bill" (00:14) B: "Are you checking the phone bill of this phone" (00:18) A: "Yes" (00:20) B: "Please wait a moment" (00:25) B: "You still have 25 yuan in your balance of call charges".
- The numbers in front of A and B represent the playback time of the current recording. In this example there are 3 switching points: 00:14, 00:18, and 00:20. It should be understood that, since the audio data has been voice-recognized in step S502, the switching point detection model can comprehensively consider audio features and text features to obtain more accurate switching points.
- the above examples are only for illustration and do not constitute specific limitations.
- In an embodiment, the obtaining unit 830 is configured to: input the audio data into a human voice separation model to obtain the human voice audio data, where the human voice separation model is a model obtained by training a neural network on known audio samples and corresponding known human voice data samples; and, at the same time, input the audio data and audio text into the switching point detection model to obtain the switching points, where the switching point detection model is a model obtained by training a neural network on known audio samples, known audio text samples, and corresponding known switching point samples.
- The human voice separation model may specifically be a VAD model, which uses an event-detection-like scheme to learn the features that distinguish human voice from non-human voice. The VAD result largely removes interference factors such as environmental noise, non-target human voices, and equipment noise; compared with the silence/non-silence detection of the embodiment of Figure 1, the final separation result is therefore more accurate.
- In practice, the speech recognition step and the human voice separation step can be performed simultaneously, followed by the switching point detection step; alternatively, the speech recognition step is performed first, and then the human voice separation step and the switching point detection step are performed simultaneously. In this way, tasks are executed in parallel as much as possible, minimizing the time required for the speaker separation process.
- The switching points are conversation time points on the time axis of the audio data, and the time axis of the noise-removed human voice audio data is the same as that of the audio data, so the next step can use the switching points on this shared time axis to convert the human voice audio data into m monophonic data.
- the switching point detection model and the human voice separation model can also make full use of labeled data related to the business scenario for training, thereby further improving the accuracy of switching point detection and human voice separation.
- The switching point detection model in this embodiment of the application is a model obtained by training a neural network on known audio samples, known audio text samples, and corresponding known switching point samples. The switching point detection model can therefore comprehensively consider audio features and text features, detect the interlocutor switching points more accurately, and further improve the accuracy of speaker separation.
- each of the m monophonic data includes only the human voice of a single interlocutor, and m is a positive integer.
- Continuing the example above, the audio data can be converted into 4 monophonic data, as shown in FIG. 3.
- each monophonic data has only one interlocutor’s human voice.
- A piece of monophonic data can be the audio corresponding to the single sentence "I want to check the phone bill", or the audio corresponding to the two consecutive sentences "Please wait a moment" and "You still have 25 yuan in your balance of call charges".
- In other words, the human voice included in each piece of monophonic data is that of only user A or only customer service B.
- Correspondingly, the text data can also be converted into m monophonic texts, each corresponding to one piece of monophonic data, as shown in FIG. 3, for use in the subsequent clustering.
- The monophonic data of each of the n audio groups belongs to the same interlocutor. Still taking the above example, after the 4 monophonic data are obtained, they are clustered into 2 audio groups, as shown in Figure 3: one audio group consists of the 2 monophonic data corresponding to "I want to check the phone bill" and "Yes"; the other consists of the 2 monophonic data corresponding to "Are you checking the phone bill of this phone" and "Please wait a moment, your call charge balance is 25 yuan". The audio data containing the voices of the n interlocutors is thus finally converted into n audio groups.
- The interlocutor of each audio group can be determined from the keywords or key phrases in the monophonic text corresponding to each piece of monophonic data. For example, based on keywords such as "Excuse me" and "Please wait", the interlocutor of category 2 shown in Figure 3 can be determined to be the customer service, and the interlocutor of category 1 is then determined to be the user; alternatively, based on the keywords "I want to" and "check call charges", the interlocutor of category 1 shown in Figure 3 is determined to be the user and the interlocutor of category 2 the customer service, and so on. This application does not specifically limit the setting of keywords.
- As with the system of Fig. 7, the audio processing system provided by this application only needs to train the human voice separation model and the switching point detection model on the business scene data in the training phase, greatly reducing the time and labor costs of training. The steps of the detection phase can be processed in parallel, which greatly reduces the time required for speaker separation; the human voice separation model removes the interference of noise, and the switching point detection model integrates text features and audio features, which greatly improves the accuracy of switching point detection and thus the accuracy of speaker separation.
- In summary, audio data is acquired; the audio data is converted into corresponding text data; human voice audio data and switching points are obtained according to the audio data and text data; the human voice audio data is converted into m monophonic data according to the switching points; and the m monophonic data are clustered to obtain n audio groups, wherein the monophonic data of each of the n audio groups belongs to the same interlocutor. This largely removes noise and other interference factors from the audio data, improving the accuracy of speaker separation in complex environments; in addition, multiple steps of the detection phase can be processed in parallel, simplifying the speaker separation process and increasing the separation speed.
- FIG. 9 is a schematic structural diagram of a server provided in this application.
- the server may implement the method in the embodiment in FIG. 2 or the embodiment in FIG. 5.
- The data processing method provided in this application can be implemented in a cloud service cluster as shown in FIG. 9, or in a single computing node and storage node, which is not specifically limited in this application.
- The cloud service cluster includes at least one computing node 910 and at least one storage node 920.
- the computing node 910 includes one or more processors 911, a communication interface 912, and a memory 913.
- the processor 911, the communication interface 912, and the memory 913 may be connected through a bus 914.
- the processor 911 includes one or more general-purpose processors.
- The general-purpose processor may be any type of device capable of processing electronic instructions, including a central processing unit (CPU), a microprocessor, a microcontroller, a main processor, a controller, an application-specific integrated circuit (ASIC), and so on. It may be a dedicated processor used only by the computing node 910, or it may be shared with other computing nodes 910.
- The processor 911 executes various types of digitally stored instructions, such as software or firmware programs stored in the memory 913, which enable the computing node 910 to provide a wide variety of services.
- The processor 911 can execute the code of modules such as the clustering module, the human voice separation module, and the switching point detection module, so as to perform at least a part of the methods discussed herein.
- the communication interface 912 may be a wired interface (for example, an Ethernet interface) for communicating with other computing nodes or users.
- The communication interface 912 may adopt a protocol family over TCP/IP, for example, the RAAS protocol, the Remote Function Call (RFC) protocol, the Simple Object Access Protocol (SOAP), the Simple Network Management Protocol (SNMP), the Common Object Request Broker Architecture (CORBA) protocol, distributed protocols, and so on.
- The memory 913 may include volatile memory, such as random access memory (RAM); the memory may also include non-volatile memory, such as read-only memory (ROM).
- The storage node 920 includes one or more storage controllers 921 and a storage array 922, where the storage controller 921 and the storage array 922 may be connected through a bus 924.
- The storage controller 921 includes one or more general-purpose processors, where the general-purpose processor may be any type of device capable of processing electronic instructions, including a CPU, a microprocessor, a microcontroller, a main processor, a controller, an ASIC, and so on.
- In some embodiments, each storage node includes its own storage controller; in other embodiments, multiple storage nodes may share one storage controller, which is not specifically limited here.
- The storage array 922 may include a plurality of memories 923.
- The memory 923 may be a non-volatile memory, such as ROM, flash memory, an HDD, or an SSD, or may be a combination of the foregoing types of memory.
- The storage array may be composed of multiple HDDs or multiple SSDs, or of a combination of HDDs and SSDs.
- The storage array 922 may include one or more data centers. Multiple data centers may be set up at the same location or at different locations, which is not specifically limited here.
- The memory 923 may store program code and program data.
- The program code includes voice recognition module code, semantic understanding module code, order generation module code, clustering module code, human voice separation module code, switching point detection module code, and so on. The program data includes the human voice separation model, the switching point detection model, the corresponding training sample set data, and so on; this application is not specifically limited in this respect.
- When implemented by software, the foregoing embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions.
- the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
- The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center.
- The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, integrated with one or more available media.
- The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid-state drive (SSD)).
- The program may be stored in a computer-readable storage medium; when executed, the program may carry out the procedures of the above-mentioned method embodiments.
- The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
Claims (19)
- 1. An audio processing method, characterized in that the method comprises: acquiring audio data, wherein the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer; obtaining human voice audio data and switching points according to the audio data, wherein the human voice audio data is the human voices generated when the n interlocutors conduct a conversation, obtained after the noise is removed from the audio data, and a switching point is a conversation time point at which any one of the n interlocutors switches to another interlocutor; converting the human voice audio data into m pieces of monophonic data according to the switching points, wherein each of the m pieces of monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; and clustering the m pieces of monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belong to the same interlocutor.
- 2. The method according to claim 1, characterized in that, after the clustering of the m pieces of monophonic data to obtain the n audio groups, the method further comprises: converting the m pieces of monophonic data into m corresponding pieces of text information through a voice recognition method; and confirming, according to the m pieces of text information, the interlocutor to which each of the n audio groups belongs.
- 3. The method according to claim 1, characterized in that the obtaining of the human voice audio data and the switching points according to the audio data comprises: inputting the audio data into a human voice separation model to obtain the human voice audio data, wherein the human voice separation model is a model obtained by training a neural network with known audio samples and corresponding known human voice data samples; and, at the same time, inputting the audio data into a switching point detection model to obtain the switching points, wherein the switching point detection model is a model obtained by training a neural network with known audio samples and corresponding known switching point samples.
- 4. An audio processing method, characterized in that the method comprises: acquiring audio data, wherein the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer; converting the audio data into corresponding text data through a voice recognition method; obtaining human voice audio data and switching points according to the audio data and the text data, wherein the human voice audio data is the audio data obtained after the noise is removed from the audio data, and a switching point is a conversation time point at which any one of the n interlocutors switches to another interlocutor; converting the human voice audio data into m pieces of monophonic data according to the switching points, wherein each of the m pieces of monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; clustering the m pieces of monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belong to the same interlocutor; and confirming, according to the text data, the interlocutor to which each of the n audio groups belongs.
- 5. The method according to claim 4, characterized in that the obtaining of the human voice audio data and the switching points according to the audio data and the text data comprises: inputting the audio data into a human voice separation model to obtain the human voice audio data, wherein the human voice separation model is a model obtained by training a neural network with known audio samples and corresponding known human voice data samples; and, at the same time, inputting the audio data and the text data into a switching point detection model to obtain the switching points, wherein the switching point detection model is a model obtained by training a neural network with known audio samples, known text data samples, and corresponding known switching point samples.
- 6. An audio processing system, characterized in that the system comprises: an acquiring unit, configured to acquire audio data, wherein the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer; an obtaining unit, configured to obtain human voice audio data and switching points according to the audio data, wherein the human voice audio data is the audio data obtained after the noise is removed from the audio data, and a switching point is a conversation time point at which any one of the n interlocutors switches to another interlocutor; a conversion unit, configured to convert the human voice audio data into m pieces of monophonic data according to the switching points, wherein each of the m pieces of monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; and a clustering unit, configured to cluster the m pieces of monophonic data to obtain n audio groups, wherein the monophonic data in each of the n audio groups belong to the same interlocutor.
- 7. The system according to claim 6, characterized in that the system further comprises: a confirmation unit, configured to convert the m pieces of monophonic data into m corresponding pieces of text information through a voice recognition method, and to confirm, according to the m pieces of text information, the interlocutor to which each of the n audio groups belongs.
- 8. The system according to claim 6, characterized in that the obtaining unit is further configured to: input the audio data into a human voice separation model to obtain the human voice audio data, wherein the human voice separation model is a model obtained by training a neural network with known audio samples and corresponding known human voice data samples; and, at the same time, input the audio data into a switching point detection model to obtain the switching points, wherein the switching point detection model is a model obtained by training a neural network with known audio samples and corresponding known switching point samples.
- 9. An audio processing system, characterized in that the system comprises: an acquiring unit, configured to acquire audio data, wherein the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer; a first conversion unit, configured to convert the audio data into audio text through a voice recognition method; an obtaining unit, configured to obtain human voice audio data and switching points according to the audio data and the audio text, wherein the human voice audio data is the audio data obtained after the noise is removed from the audio data, and a switching point is a conversation time point at which any one of the n interlocutors switches to another interlocutor; a second conversion unit, configured to convert the human voice audio data into m pieces of monophonic data according to the human voice audio data and the switching points, wherein each of the m pieces of monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; a clustering unit, configured to cluster the m pieces of monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belong to the same interlocutor; and a confirmation unit, configured to confirm, according to the audio text, the interlocutor to which each of the n audio groups belongs.
- 10. The system according to claim 9, characterized in that the obtaining unit is further configured to: input the audio data into a human voice separation model to obtain the human voice audio data, wherein the human voice separation model is a model obtained by training a neural network with known audio samples and corresponding known human voice data samples; and, at the same time, input the audio data and the audio text into a switching point detection model to obtain the switching points, wherein the switching point detection model is a model obtained by training a neural network with known audio samples, known audio text samples, and corresponding known switching point samples.
- 11. A server, characterized in that the server comprises a processor and a memory, wherein the memory is configured to store instructions and the processor is configured to execute the instructions; when executing the instructions, the processor performs the method according to any one of claims 1 to 3.
- 12. A server, characterized in that the server comprises a processor and a memory, wherein the memory is configured to store instructions and the processor is configured to execute the instructions; when executing the instructions, the processor performs the following steps: acquiring audio data, wherein the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer; converting the audio data into corresponding text data through a voice recognition method; obtaining human voice audio data and switching points according to the audio data and the text data, wherein the human voice audio data is the audio data obtained after the noise is removed from the audio data, and a switching point is a conversation time point at which any one of the n interlocutors switches to another interlocutor; converting the human voice audio data into m pieces of monophonic data according to the switching points, wherein each of the m pieces of monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; clustering the m pieces of monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belong to the same interlocutor; and confirming, according to the text data, the interlocutor to which each of the n audio groups belongs.
- 13. The server according to claim 12, characterized in that, when executing the instructions, the processor further performs the following steps: inputting the audio data into a human voice separation model to obtain the human voice audio data, wherein the human voice separation model is a model obtained by training a neural network with known audio samples and corresponding known human voice data samples; and, at the same time, inputting the audio data and the text data into a switching point detection model to obtain the switching points, wherein the switching point detection model is a model obtained by training a neural network with known audio samples, known text data samples, and corresponding known switching point samples.
- 14. A non-transitory computer storage medium, characterized in that the non-transitory computer storage medium stores a computer program, and when the computer program is executed by a computing device, the method according to any one of claims 1 to 3 is implemented.
- 15. A non-transitory computer storage medium, characterized in that the non-transitory computer storage medium stores a computer program, and when the computer program is executed by a computing device, the following steps are implemented: acquiring audio data, wherein the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer; converting the audio data into corresponding text data through a voice recognition method; obtaining human voice audio data and switching points according to the audio data and the text data, wherein the human voice audio data is the audio data obtained after the noise is removed from the audio data, and a switching point is a conversation time point at which any one of the n interlocutors switches to another interlocutor; converting the human voice audio data into m pieces of monophonic data according to the switching points, wherein each of the m pieces of monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; clustering the m pieces of monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belong to the same interlocutor; and confirming, according to the text data, the interlocutor to which each of the n audio groups belongs.
- 16. The non-transitory computer storage medium according to claim 15, characterized in that, when the computer program is executed by a computing device, the following steps are further implemented: inputting the audio data into a human voice separation model to obtain the human voice audio data, wherein the human voice separation model is a model obtained by training a neural network with known audio samples and corresponding known human voice data samples; and, at the same time, inputting the audio data and the text data into a switching point detection model to obtain the switching points, wherein the switching point detection model is a model obtained by training a neural network with known audio samples, known text data samples, and corresponding known switching point samples.
- 17. A computer program product, characterized in that, when the computer program product is read and executed by a computer, the method according to any one of claims 1 to 3 is implemented.
- 18. A computer program product, characterized in that, when the computer program product is read and executed by a computer, the following steps are implemented: acquiring audio data, wherein the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer; converting the audio data into corresponding text data through a voice recognition method; obtaining human voice audio data and switching points according to the audio data and the text data, wherein the human voice audio data is the audio data obtained after the noise is removed from the audio data, and a switching point is a conversation time point at which any one of the n interlocutors switches to another interlocutor; converting the human voice audio data into m pieces of monophonic data according to the switching points, wherein each of the m pieces of monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; clustering the m pieces of monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belong to the same interlocutor; and confirming, according to the text data, the interlocutor to which each of the n audio groups belongs.
- 19. The computer program product according to claim 18, characterized in that, when the computer program product is read and executed by a computer, the following steps are further implemented: inputting the audio data into a human voice separation model to obtain the human voice audio data, wherein the human voice separation model is a model obtained by training a neural network with known audio samples and corresponding known human voice data samples; and, at the same time, inputting the audio data and the text data into a switching point detection model to obtain the switching points, wherein the switching point detection model is a model obtained by training a neural network with known audio samples, known text data samples, and corresponding known switching point samples.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910453493.7A CN110335621A (en) | 2019-05-28 | 2019-05-28 | Method, system and the relevant device of audio processing |
CN201910453493.7 | 2019-05-28 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020238209A1 true WO2020238209A1 (en) | 2020-12-03 |
Family
ID=68140272
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/130550 WO2020238209A1 (en) | 2019-05-28 | 2019-12-31 | Audio processing method, system and related device |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110335621A (en) |
WO (1) | WO2020238209A1 (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110335621A (en) * | 2019-05-28 | 2019-10-15 | 深圳追一科技有限公司 | Method, system and the relevant device of audio processing |
CN110930989B (en) * | 2019-11-27 | 2021-04-06 | 深圳追一科技有限公司 | Speech intention recognition method and device, computer equipment and storage medium |
CN111243595B (en) * | 2019-12-31 | 2022-12-27 | 京东科技控股股份有限公司 | Information processing method and device |
CN111968679B (en) * | 2020-10-22 | 2021-01-29 | 深圳追一科技有限公司 | Emotion recognition method and device, electronic equipment and storage medium |
CN112562644A (en) * | 2020-12-03 | 2021-03-26 | 云知声智能科技股份有限公司 | Customer service quality inspection method, system, equipment and medium based on human voice separation |
CN112669855A (en) * | 2020-12-17 | 2021-04-16 | 北京沃东天骏信息技术有限公司 | Voice processing method and device |
CN112735384A (en) * | 2020-12-28 | 2021-04-30 | 科大讯飞股份有限公司 | Turning point detection method, device and equipment applied to speaker separation |
CN112802498B (en) * | 2020-12-29 | 2023-11-24 | 深圳追一科技有限公司 | Voice detection method, device, computer equipment and storage medium |
CN112289323B (en) * | 2020-12-29 | 2021-05-28 | 深圳追一科技有限公司 | Voice data processing method and device, computer equipment and storage medium |
CN112966082A (en) * | 2021-03-05 | 2021-06-15 | 北京百度网讯科技有限公司 | Audio quality inspection method, device, equipment and storage medium |
CN113593578A (en) * | 2021-09-03 | 2021-11-02 | 北京紫涓科技有限公司 | Conference voice data acquisition method and system |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8543402B1 (en) * | 2010-04-30 | 2013-09-24 | The Intellisis Corporation | Speaker segmentation in noisy conversational speech |
CN107578770A (en) * | 2017-08-31 | 2018-01-12 | 百度在线网络技术(北京)有限公司 | Networking telephone audio recognition method, device, computer equipment and storage medium |
US20180158451A1 (en) * | 2016-12-01 | 2018-06-07 | International Business Machines Corporation | Prefix methods for diarization in streaming mode |
CN108399923A (en) * | 2018-02-01 | 2018-08-14 | 深圳市鹰硕技术有限公司 | More human hairs call the turn spokesman's recognition methods and device |
US20180286409A1 (en) * | 2017-03-31 | 2018-10-04 | International Business Machines Corporation | Speaker diarization with cluster transfer |
CN108922538A (en) * | 2018-05-29 | 2018-11-30 | 平安科技(深圳)有限公司 | Conferencing information recording method, device, computer equipment and storage medium |
CN109300470A (en) * | 2018-09-17 | 2019-02-01 | 平安科技(深圳)有限公司 | Audio mixing separation method and audio mixing separator |
CN110335621A (en) * | 2019-05-28 | 2019-10-15 | 深圳追一科技有限公司 | Method, system and the relevant device of audio processing |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN100466671C (en) * | 2004-05-14 | 2009-03-04 | 华为技术有限公司 | Method and device for switching speeches |
JP5103907B2 (en) * | 2005-01-17 | 2012-12-19 | 日本電気株式会社 | Speech recognition system, speech recognition method, and speech recognition program |
CN101452704B (en) * | 2007-11-29 | 2011-05-11 | 中国科学院声学研究所 | Speaker clustering method based on information transfer |
CN105161093B (en) * | 2015-10-14 | 2019-07-09 | 科大讯飞股份有限公司 | A kind of method and system judging speaker's number |
CN105895078A (en) * | 2015-11-26 | 2016-08-24 | 乐视致新电子科技(天津)有限公司 | Speech recognition method used for dynamically selecting speech model and device |
US9942518B1 (en) * | 2017-02-28 | 2018-04-10 | Cisco Technology, Inc. | Group and conversational framing for speaker tracking in a video conference system |
CN108766459B (en) * | 2018-06-13 | 2020-07-17 | 北京联合大学 | Target speaker estimation method and system in multi-user voice mixing |
CN109634692A (en) * | 2018-10-23 | 2019-04-16 | 蔚来汽车有限公司 | Vehicle-mounted conversational system and processing method and system for it |
CN109741754A (en) * | 2018-12-10 | 2019-05-10 | 上海思创华信信息技术有限公司 | A kind of conference voice recognition methods and system, storage medium and terminal |
CN109545228A (en) * | 2018-12-14 | 2019-03-29 | 厦门快商通信息技术有限公司 | A kind of end-to-end speaker's dividing method and system |
- 2019-05-28 CN CN201910453493.7A patent/CN110335621A/en active Pending
- 2019-12-31 WO PCT/CN2019/130550 patent/WO2020238209A1/en active Application Filing
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8543402B1 (en) * | 2010-04-30 | 2013-09-24 | The Intellisis Corporation | Speaker segmentation in noisy conversational speech |
US20180158451A1 (en) * | 2016-12-01 | 2018-06-07 | International Business Machines Corporation | Prefix methods for diarization in streaming mode |
US20180286409A1 (en) * | 2017-03-31 | 2018-10-04 | International Business Machines Corporation | Speaker diarization with cluster transfer |
CN107578770A (en) * | 2017-08-31 | 2018-01-12 | 百度在线网络技术(北京)有限公司 | Networking telephone audio recognition method, device, computer equipment and storage medium |
CN108399923A (en) * | 2018-02-01 | 2018-08-14 | 深圳市鹰硕技术有限公司 | More human hairs call the turn spokesman's recognition methods and device |
CN108922538A (en) * | 2018-05-29 | 2018-11-30 | 平安科技(深圳)有限公司 | Conferencing information recording method, device, computer equipment and storage medium |
CN109300470A (en) * | 2018-09-17 | 2019-02-01 | 平安科技(深圳)有限公司 | Audio mixing separation method and audio mixing separator |
CN110335621A (en) * | 2019-05-28 | 2019-10-15 | 深圳追一科技有限公司 | Method, system and the relevant device of audio processing |
Also Published As
Publication number | Publication date |
---|---|
CN110335621A (en) | 2019-10-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020238209A1 (en) | Audio processing method, system and related device | |
US10902843B2 (en) | Using recurrent neural network for partitioning of audio data into segments that each correspond to a speech feature cluster identifier | |
US10249292B2 (en) | Using long short-term memory recurrent neural network for speaker diarization segmentation | |
US10748531B2 (en) | Management layer for multiple intelligent personal assistant services | |
US20230019978A1 (en) | Automatic speech recognition correction | |
US10516782B2 (en) | Conference searching and playback of search results | |
EP3254455B1 (en) | Selective conference digest | |
US9293133B2 (en) | Improving voice communication over a network | |
US20180336902A1 (en) | Conference segmentation based on conversational dynamics | |
US10811005B2 (en) | Adapting voice input processing based on voice input characteristics | |
US20180027351A1 (en) | Optimized virtual scene layout for spatial meeting playback | |
WO2016126813A2 (en) | Scheduling playback of audio in a virtual acoustic space | |
WO2022105861A1 (en) | Method and apparatus for recognizing voice, electronic device and medium | |
Triantafyllopoulos et al. | Deep speaker conditioning for speech emotion recognition | |
US20200004878A1 (en) | System and method for generating dialogue graphs | |
US11270691B2 (en) | Voice interaction system, its processing method, and program therefor | |
CN111798833A (en) | Voice test method, device, equipment and storage medium | |
US10762906B2 (en) | Automatically identifying speakers in real-time through media processing with dialog understanding supported by AI techniques | |
CN108877779B (en) | Method and device for detecting voice tail point | |
WO2023048746A1 (en) | Speaker-turn-based online speaker diarization with constrained spectral clustering | |
CN113779208A (en) | Method and device for man-machine conversation | |
CN113129866B (en) | Voice processing method, device, storage medium and computer equipment | |
WO2019155716A1 (en) | Information processing device, information processing system, information processing method, and program | |
CN111354350A (en) | Voice processing method and device, voice processing equipment and electronic equipment | |
US20220201121A1 (en) | System, method and apparatus for conversational guidance |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19931010; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 19931010; Country of ref document: EP; Kind code of ref document: A1 |
| 32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 29.04.2022) |