WO2020238209A1 - Audio processing method, system and related device - Google Patents

Audio processing method, system and related device Download PDF

Info

Publication number
WO2020238209A1
WO2020238209A1 (PCT/CN2019/130550)
Authority
WO
WIPO (PCT)
Prior art keywords
audio
data
audio data
human voice
monophonic
Prior art date
Application number
PCT/CN2019/130550
Other languages
French (fr)
Chinese (zh)
Inventor
周维聪
涂臻
Original Assignee
深圳追一科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳追一科技有限公司
Publication of WO2020238209A1 publication Critical patent/WO2020238209A1/en

Links

Images

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems
    • G10L17/00: Speaker identification or verification
    • G10L17/06: Decision making techniques; Pattern matching strategies
    • G10L17/14: Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L17/20: Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272: Voice signal separating
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/87: Detection of discrete points within a voice signal

Definitions

  • This application relates to audio processing methods, systems and related equipment.
  • Speaker diarization refers to the process of dividing the audio data of a multi-person conversation according to the speakers and labeling each part.
  • Existing speaker separation systems reach a practical level of accuracy in clean, near-field environments, but in more complex environments the accuracy of the important case of speaker-independent single-channel speech separation remains low.
  • an audio processing method, system and related equipment are provided.
  • An audio processing method includes: acquiring audio data, wherein the audio data includes noise and the human voices generated when n interlocutors conduct a conversation, and n is a positive integer; obtaining human voice audio data and switching points according to the audio data, wherein the human voice audio data is the human voice generated when the n interlocutors conduct a conversation, obtained after the noise is removed from the audio data, and a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor; converting the human voice audio data into m pieces of monophonic data according to the switching points, wherein each piece of monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; and clustering the m pieces of monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belong to the same interlocutor.
  • An audio processing method includes: acquiring audio data, wherein the audio data includes noise and the human voices generated when n interlocutors conduct a conversation, and n is a positive integer; converting the audio data into corresponding text data through a speech recognition method; obtaining human voice audio data and switching points according to the audio data and the text data, wherein the human voice audio data is the audio data obtained after the noise is removed, and a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor; converting the human voice audio data into m pieces of monophonic data according to the switching points, wherein each piece of monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; clustering the m pieces of monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belong to the same interlocutor; and confirming, according to the text data, the interlocutor to which each of the n audio groups belongs.
  • An audio processing system includes: an acquisition unit configured to acquire audio data, wherein the audio data includes noise and the human voices generated when n interlocutors conduct a conversation, and n is a positive integer; an obtaining unit configured to obtain human voice audio data and switching points according to the audio data, wherein the human voice audio data is the audio data obtained after the noise is removed, and a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor; a conversion unit configured to convert the human voice audio data into m pieces of monophonic data according to the switching points, wherein each piece of monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; and a clustering unit configured to cluster the m pieces of monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belong to the same interlocutor.
  • An audio processing system includes: an acquisition unit configured to acquire audio data, wherein the audio data includes noise and the human voices generated when n interlocutors conduct a conversation, and n is a positive integer; a first conversion unit configured to convert the audio data into audio text through a speech recognition method; an obtaining unit configured to obtain human voice audio data and switching points according to the audio data and the audio text, wherein the human voice audio data is the audio data obtained after the noise is removed, and a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor; a second conversion unit configured to convert the human voice audio data into m pieces of monophonic data according to the switching points, wherein each piece of monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; and a clustering unit configured to cluster the m pieces of monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belong to the same interlocutor.
  • A server includes a processor and a memory, wherein the memory is configured to store instructions and the processor is configured to execute the instructions; when executing the instructions, the processor performs the following steps: acquiring audio data, wherein the audio data includes noise and the human voices generated when n interlocutors conduct a conversation, and n is a positive integer; obtaining human voice audio data and switching points according to the audio data, wherein the human voice audio data is the human voice generated when the n interlocutors conduct a conversation, obtained after the noise is removed from the audio data, and a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor; converting the human voice audio data into m pieces of monophonic data according to the switching points, wherein each piece of monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; and clustering the m pieces of monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belong to the same interlocutor.
  • A server includes a processor and a memory, wherein the memory is configured to store instructions and the processor is configured to execute the instructions; when executing the instructions, the processor performs the following steps: acquiring audio data, wherein the audio data includes noise and the human voices generated when n interlocutors conduct a conversation, and n is a positive integer; converting the audio data into corresponding text data through a speech recognition method; obtaining human voice audio data and switching points according to the audio data and the text data, wherein the human voice audio data is the audio data obtained after the noise is removed, and a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor; converting the human voice audio data into m pieces of monophonic data according to the switching points, wherein each piece of monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; and clustering the m pieces of monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belong to the same interlocutor.
  • A non-transitory computer storage medium stores a computer program; when the computer program is executed by a computing device, the following steps are implemented: acquiring audio data, wherein the audio data includes noise and the human voices generated when n interlocutors conduct a conversation, and n is a positive integer; obtaining human voice audio data and switching points according to the audio data, wherein the human voice audio data is the human voice generated when the n interlocutors conduct a conversation, obtained after the noise is removed from the audio data, and a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor; converting the human voice audio data into m pieces of monophonic data according to the switching points, wherein each piece of monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; and clustering the m pieces of monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belong to the same interlocutor.
  • A non-transitory computer storage medium stores a computer program; when the computer program is executed by a computing device, the following steps are implemented: acquiring audio data, wherein the audio data includes noise and the human voices generated when n interlocutors conduct a conversation, and n is a positive integer; converting the audio data into corresponding text data through a speech recognition method; obtaining human voice audio data and switching points according to the audio data and the text data, wherein the human voice audio data is the audio data obtained after the noise is removed, and a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor; converting the human voice audio data into m pieces of monophonic data according to the switching points, wherein each piece of monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; and clustering the m pieces of monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belong to the same interlocutor.
  • A computer program product, when read and executed by a computer, implements the following steps: acquiring audio data, wherein the audio data includes noise and the human voices generated when n interlocutors conduct a conversation, and n is a positive integer; obtaining human voice audio data and switching points according to the audio data, wherein the human voice audio data is the human voice of the n interlocutors obtained by removing the noise from the audio data, and a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor; converting the human voice audio data into m pieces of monophonic data according to the switching points, wherein each piece of monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; and clustering the m pieces of monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belong to the same interlocutor.
  • A computer program product, when read and executed by a computer, implements the following steps: acquiring audio data, wherein the audio data includes noise and the human voices generated when n interlocutors conduct a conversation, and n is a positive integer; converting the audio data into corresponding text data through a speech recognition method; obtaining human voice audio data and switching points according to the audio data and the text data, wherein the human voice audio data is the audio data obtained after the noise is removed, and a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor; converting the human voice audio data into m pieces of monophonic data according to the switching points, wherein each piece of monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; and clustering the m pieces of monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belong to the same interlocutor.
  • Fig. 1 is a schematic structural diagram of a speaker separation system provided by the present application.
  • Fig. 2 is a schematic flowchart of an audio processing method provided by the present application.
  • FIG. 3 is a schematic flowchart of clustering m monophonic data into n audio groups in an application scenario provided by the present application.
  • Fig. 4 is a detailed flowchart of an audio processing method provided by the present application.
  • Fig. 5 is a schematic flowchart of another audio processing method provided by the present application.
  • Fig. 6 is a detailed flowchart of another audio processing method provided by the present application.
  • Fig. 7 is a schematic structural diagram of an audio processing system provided by the present application.
  • Fig. 8 is a schematic structural diagram of another audio processing system provided by the present application.
  • Fig. 9 is a schematic structural diagram of a server provided by the present application.
  • Speaker separation refers to the process of dividing and labeling audio data in a multi-person conversation according to the speakers.
  • In the prior art, the speaker separation system generally uses the Bayesian Information Criterion (BIC) as the similarity measure for speaker separation.
  • In this technology, the audio data mainly passes in turn through the input module 101, the silence detection module 102, the speaker recognition module 103, the switching point detection module 104, the conversion module 105 and the classification module 106.
  • the silence detection module 102 is used to remove the silence part of the input audio data to obtain the second audio data;
  • The speaker recognition module 103 learns the voiceprint characteristics of the speakers in a large number of business scenarios. For example, if the speaker separation system is used to separate the voices of the customer service and the user, the speaker recognition module learns a large number of customer service and user voiceprint characteristics, such as the intonation and prosody characteristics of the customer service, so that, based on the second audio data, it can determine the identity of the speaker of each utterance in the current audio data and obtain the third audio data.
  • the switching point detection module 104 is used to determine the dialogue switching point of the speaker according to the third audio data;
  • the conversion module 105 is used to cut the third audio data into multiple pieces of audio data according to the switching points;
  • the classification module 106 is used to classify the multiple pieces of audio data according to the identity of the speaker detected by the speaker recognition module 103 to obtain the speaker separation result.
  • Such a system not only needs to train several models on business scene data in the training phase (a silence detection model, a speaker recognition model and a switching point model), but also has a long detection pipeline: the audio must pass through each of the modules shown in Fig. 1 in turn before a result is obtained, which requires a lot of time.
  • Fig. 2 is an audio processing method provided by the present application. The method includes the following steps:
  • the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer.
  • the audio data may specifically be audio files such as telephone recording or video recording that require speaker separation, for example, a recording file of a communication between a user and a customer service phone, a video recording file of a meeting, and so on.
  • the noise concept in this application refers to non-target human voices that do not need to be separated from the speaker, and specifically can be ambient sound, silence, equipment noise generated by recording equipment, and so on.
  • It should be noted that human voices may also appear in the ambient sound. For example, in the audio data generated by a telephone call between user A, who is in a noisy restaurant, and customer service B, the voices of other people in the restaurant are also human voices, but they are not target human voices that need to be separated in the following steps, so they are likewise classified as noise. It should be understood that the above example is only an illustration and does not constitute a specific limitation.
  • the human voice audio data is the human voice generated when the n interlocutors conduct a conversation after removing the noise from the audio data
  • A switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor.
  • the audio data of the dialogue between A and B can be expressed in text as: (00:12)A: “I want to check the phone bill” (00:14) B: “Are you checking the phone bill of this phone” (00:18) A: “Yes” (00:20) B: “Please wait a moment” (00:25 ) B: “You still have 25 yuan in your balance of call charges”.
  • the numbers in front of A and B represent the playback time of the current audio data, or the time axis of the audio data. At this time, there are 3 switching points, which can be 00:14, 00:18 and 00:20.
  • It should be understood that, since step S202 does not perform speech recognition, the switching points are obtained only from the audio data of the dialogue between A and B, without knowing the content of that dialogue.
  • the above examples are only for illustration and cannot constitute a specific limitation.
  • In an embodiment, obtaining the human voice audio data and the switching points according to the audio data includes: inputting the audio data into a human voice separation model to obtain the human voice audio data, wherein the human voice separation model is a model obtained by training a neural network with known audio samples and corresponding known human voice data samples; and, at the same time, inputting the audio data into a switching point detection model (Speaker Change Detection, SCD) to obtain the switching points, wherein the switching point detection model is a model obtained by training a neural network with known audio samples and corresponding known switching point samples.
  • The human voice separation model may specifically be a voice activity detection (Voice Activity Detection, VAD) model, which uses an event-detection-like scheme to learn the features that distinguish human voice from non-human voice. It is understandable that the VAD result removes interference factors such as environmental noise, non-target human voice and equipment noise to a large extent, so that, compared with the silent/non-silent detection in the embodiment of Fig. 1, the final separation result is more accurate.
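  • The application does not disclose the internal structure of the human voice separation model; purely as a minimal, hypothetical sketch of the kind of frame-level decision such a model makes (not the trained VAD model described above), an energy-based voice activity detector could look like this, with all function and parameter names being assumptions:

```python
# Minimal, hypothetical energy-based VAD sketch; the application's human voice
# separation model is a trained neural network, not this heuristic.
import numpy as np

def simple_vad(samples: np.ndarray, sample_rate: int,
               frame_ms: int = 30, rms_threshold: float = 0.01):
    """Return a list of (start_s, end_s) intervals judged to contain speech."""
    frame_len = int(sample_rate * frame_ms / 1000)
    intervals = []
    for i in range(len(samples) // frame_len):
        frame = samples[i * frame_len:(i + 1) * frame_len].astype(np.float64)
        rms = np.sqrt(np.mean(frame ** 2))          # frame energy
        if rms > rms_threshold:                     # treat as a voiced frame
            start, end = i * frame_ms / 1000.0, (i + 1) * frame_ms / 1000.0
            if intervals and abs(intervals[-1][1] - start) < 1e-9:
                intervals[-1][1] = end              # extend the previous interval
            else:
                intervals.append([start, end])
    return [tuple(interval) for interval in intervals]
```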
  • In step S202, since the input of both the human voice separation model and the switching point detection model is the audio data, human voice separation can be performed in parallel with switching point detection without the two interfering with each other, thereby reducing the time required for speaker separation.
  • It should be noted that a switching point is a conversation time point on the time axis of the audio data, and the time axis of the human voice audio data after noise removal is the same as that of the audio data, so that in the next step the switching points can be used, on this common time axis, to convert the human voice audio data into m pieces of monophonic data.
  • the switching point detection model and the human voice separation model can also make full use of labeled data related to the business scenario for training, thereby further improving the accuracy of switching point detection and human voice separation.
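  • Because both models consume the same raw audio, the two detections can be launched concurrently; a minimal sketch is given below, where voice_separation_model and switch_point_model are hypothetical callables standing in for the trained models:

```python
# Sketch: run human voice separation and switching point detection in parallel,
# since both take the raw audio data as input (hypothetical model callables).
from concurrent.futures import ThreadPoolExecutor

def detect_in_parallel(audio_data, voice_separation_model, switch_point_model):
    with ThreadPoolExecutor(max_workers=2) as pool:
        voice_future = pool.submit(voice_separation_model, audio_data)
        switch_future = pool.submit(switch_point_model, audio_data)
        human_voice_audio = voice_future.result()   # noise removed, same time axis
        switching_points = switch_future.result()   # e.g. [14.0, 18.0, 20.0] seconds
    return human_voice_audio, switching_points
```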
  • each of the m monophonic data includes only the human voice of a single interlocutor, and m is a positive integer.
  • the audio data can be converted into 4 mono data, which can be shown in FIG. 3.
  • each monophonic data has only one interlocutor’s human voice.
  • A piece of monophonic data may be, for example, the audio corresponding to the single sentence "I want to check the phone bill", or the audio corresponding to the two consecutive sentences "Please wait a moment" and "You still have 25 yuan in your balance of call charges".
  • the human voice included in the monophonic data is only user A or customer service B.
  • It should be noted that, for ease of understanding, Fig. 3 shows the monophonic data in the form of text; at this point the monophonic data is still only audio data and does not contain text information.
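  • A minimal sketch of this conversion step is shown below: the denoised waveform is cut at the detected switching points, each cut producing one piece of monophonic data. Representing the audio as a sample array with a sample rate is an assumption made only for illustration.

```python
# Sketch: split the human voice audio at the switching points (in seconds)
# into m monophonic segments, each containing a single interlocutor's voice.
import numpy as np

def split_by_switch_points(human_voice: np.ndarray, sample_rate: int,
                           switch_points_s: list) -> list:
    duration_s = len(human_voice) / sample_rate
    boundaries = [0.0] + sorted(switch_points_s) + [duration_s]
    segments = []
    for start_s, end_s in zip(boundaries[:-1], boundaries[1:]):
        segments.append(human_voice[int(start_s * sample_rate):int(end_s * sample_rate)])
    return segments

# With switching points at 14 s, 18 s and 20 s, the dialogue above yields 4 segments.
```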
  • The monophonic data of each of the n audio groups belong to the same interlocutor. Still taking the above example, after the 4 pieces of monophonic data are obtained, they are clustered to obtain 2 audio groups, as shown in Fig. 3: one audio group contains the 2 pieces of monophonic data corresponding to "I want to check the phone bill" and "Yes", and the other audio group contains the 2 pieces of monophonic data corresponding to "Are you checking the phone bill of this phone" and "Please wait a moment, you still have 25 yuan in your balance of call charges". In this way, the audio data containing the voices of n interlocutors is finally converted into n audio groups.
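  • As an illustration of the clustering step, the sketch below groups the m segments into n audio groups with agglomerative clustering from scikit-learn; the per-segment voiceprint embedding is treated as a given input, since the application does not specify the clustering algorithm or features.

```python
# Sketch: cluster m monophonic segments into n audio groups.
# segment_embeddings is an (m, d) array of voiceprint features, one per segment;
# how those features are computed is left open here.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_segments(segment_embeddings: np.ndarray, n_speakers: int):
    labels = AgglomerativeClustering(n_clusters=n_speakers).fit_predict(segment_embeddings)
    groups = [[] for _ in range(n_speakers)]
    for segment_index, label in enumerate(labels):
        groups[label].append(segment_index)
    return groups   # each group lists the segment indices of one interlocutor
```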
  • In an embodiment, the method further includes: converting the m pieces of monophonic data into the corresponding m pieces of text information through a speech recognition (Automatic Speech Recognition, ASR) method; and confirming, according to the m pieces of text information, the interlocutor to which each of the n audio groups belongs.
  • The interlocutor to which each audio group belongs can be determined from keywords or key phrases in the text information. For example, the keyword "Please wait" can be used to determine that the interlocutor of category 2 is the customer service, and the interlocutor of category 1 is then determined to be the user; alternatively, the interlocutor of category 1 is determined to be the user according to the keywords "I want" and "check call charges", and the interlocutor of category 2 is then determined to be the customer service, and so on.
  • This application does not specifically limit the setting of keywords.
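  • A minimal sketch of keyword-based role assignment is given below; the keyword lists simply mirror the example dialogue above and are not prescribed by the application.

```python
# Sketch: decide which interlocutor (role) each audio group belongs to by
# counting role-specific keywords in the group's recognized text.
ROLE_KEYWORDS = {
    "customer service": ["please wait", "are you checking"],
    "user": ["i want", "check the phone bill"],
}

def assign_roles(group_texts: dict) -> dict:
    """group_texts maps a group id to the concatenated ASR text of its segments."""
    roles = {}
    for group_id, text in group_texts.items():
        lowered = text.lower()
        scores = {role: sum(keyword in lowered for keyword in keywords)
                  for role, keywords in ROLE_KEYWORDS.items()}
        roles[group_id] = max(scores, key=scores.get)
    return roles
```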
  • The m pieces of monophonic data may be converted into the corresponding m pieces of text information through speech recognition methods such as dynamic time warping (Dynamic Time Warping, DTW), hidden Markov model (HMM) theory, vector quantization (VQ) technology or artificial neural networks (Artificial Neural Network, ANN), which is not specifically limited in this application.
  • Fig. 4 shows a schematic flowchart of an audio processing method provided by the present application. As can be seen from Fig. 4, compared with the speaker separation method in the embodiment of Fig. 1, the audio processing method provided by the present application only needs to train the human voice separation model and the switching point detection model on the business scene data during the training phase, so the time and labor costs of the training phase are greatly reduced.
  • Moreover, the steps in the detection phase can be processed in parallel, which greatly reduces the time required for speaker separation, and the human voice separation model removes the interference of noise information, so the accuracy is greatly improved.
  • In summary, by obtaining audio data, obtaining human voice audio data and switching points according to the audio data, converting the human voice audio data into m pieces of monophonic data according to the switching points, and clustering the m pieces of monophonic data to obtain n audio groups in which the monophonic data of each group belong to the same interlocutor, the identity of the voice in each of the n audio groups can be confirmed.
  • This approach removes noise and other interference factors from the audio data to a large extent, thereby improving the accuracy of speaker separation in complex environments. In addition, multiple steps in the detection phase can be processed in parallel, which simplifies the speaker separation process and increases the separation speed.
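  • Putting the pieces together, the sketch below strings the hypothetical helpers from the previous examples into the overall flow of Fig. 2 and Fig. 4: parallel detection, segmentation, clustering, recognition and role confirmation. The model, embedding and ASR callables are all assumed to be supplied by the caller and are not defined by the application.

```python
# Sketch of the end-to-end flow, reusing the hypothetical helpers defined above:
# detect_in_parallel, split_by_switch_points, cluster_segments, assign_roles.
import numpy as np

def separate_speakers(audio, sample_rate, voice_model, switch_model,
                      embed_fn, asr_fn, n_speakers):
    human_voice, switch_points = detect_in_parallel(audio, voice_model, switch_model)
    segments = split_by_switch_points(human_voice, sample_rate, switch_points)
    embeddings = np.stack([embed_fn(segment) for segment in segments])
    groups = cluster_segments(embeddings, n_speakers)
    group_texts = {group_id: " ".join(asr_fn(segments[i]) for i in indices)
                   for group_id, indices in enumerate(groups)}
    return groups, assign_roles(group_texts)
```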
  • FIG. 5 is another audio processing method provided by this application.
  • The difference between the audio processing method shown in Fig. 5 and the audio processing method shown in Fig. 2 is that speech recognition is performed in advance to obtain the text features of the audio data, so that the switching point detection model can consider audio features and text features together and detect the interlocutor switching points more accurately.
  • the audio processing method shown in FIG. 5 includes the following steps:
  • the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer.
  • the audio data may specifically be audio files such as telephone recording or video recording that require speaker separation, for example, a recording file of a communication between a user and a customer service phone, a video recording file of a meeting, and so on.
  • the noise concept in this application refers to non-target human voices that do not need to be separated from the speaker, and specifically can be ambient sound, silence, equipment noise generated by recording equipment, and so on.
  • It should be noted that human voices may also appear in the ambient sound. For example, in the audio data generated by a telephone call between user A, who is in a noisy restaurant, and customer service B, the voices of other people in the restaurant are also human voices, but they are not target human voices that need to be separated in the following steps, so they are likewise classified as noise. It should be understood that the above example is only an illustration and does not constitute a specific limitation.
  • the audio data can be converted into corresponding text data through DTW, HMM theory, VQ technology, ANN and other speech recognition methods, which are not specifically limited in this application.
  • the human voice audio data is the human voice generated when the n interlocutors conduct a conversation after removing the noise from the audio data
  • A switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor.
  • the audio data of the dialogue between A and B can be expressed in text as: (00:12)A: "I want to check the phone bill” (00:14) B: “Are you checking the phone bill of this phone” (00:18) A: “Yes” (00:20) B: “Please wait a moment” (00:25 ) B: “You still have 25 yuan in your balance of call charges”.
  • the numbers in front of A and B represent the playback time of the current recording. At this time, there are 3 switching points, which can be 00:14, 00:18 and 00:20. It should be understood that since the audio data has been voice recognized in step S502, the switching point detection model can comprehensively consider audio features and text features to obtain more accurate switching points.
  • the above examples are only for illustration and do not constitute a specific limitation.
  • In an embodiment, obtaining the human voice audio data and the switching points according to the audio data and the audio text includes: inputting the audio data into a human voice separation model to obtain the human voice audio data, wherein the human voice separation model is a model obtained by training a neural network with known audio samples and corresponding known human voice data samples; and, at the same time, inputting the audio data and the audio text into a switching point detection model to obtain the switching points, wherein the switching point detection model is a model obtained by training a neural network with known audio samples, known audio text samples and corresponding known switching point samples.
  • the human voice separation model may specifically be a VAD model, which uses a similar event detection scheme to learn the distinguishing features between human voice and non-human voice, so as to obtain the human voice separation model.
  • It is understandable that the VAD result removes interference factors such as environmental noise, non-target human voice and equipment noise to a large extent, so that, compared with the silent/non-silent detection in the embodiment of Fig. 1, the final separation result is more accurate.
  • It should be noted that the human voice separation step and the speech recognition step can be performed simultaneously, with the switching point detection step performed afterwards; alternatively, the speech recognition step can be performed first, and then the human voice separation step and the switching point detection step are performed simultaneously. In this way, tasks are performed in parallel as much as possible, minimizing the time required for the speaker separation process.
  • It should be noted that a switching point is a conversation time point on the time axis of the audio data, and the time axis of the human voice audio data after noise removal is the same as that of the audio data, so that in the next step the switching points can be used, on this common time axis, to convert the human voice audio data into m pieces of monophonic data.
  • the switching point detection model and the human voice separation model can also make full use of labeled data related to the business scenario for training, thereby further improving the accuracy of switching point detection and human voice separation.
  • the switching point detection model in this embodiment of the application is a model obtained after training a neural network using known audio samples, known audio text samples, and corresponding known switching point samples.
  • In this way, the switching point detection model can take audio features and text features into account together, detect the interlocutor switching points more accurately, and further improve the accuracy of speaker separation.
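  • The application only specifies that this switching point detection model is a neural network trained on audio samples, audio text samples and switching point labels. Purely as a hypothetical illustration of combining the two feature types, the sketch below concatenates per-frame audio and text feature vectors and trains a simple classifier to flag switch frames; every name and the choice of classifier are assumptions.

```python
# Hypothetical sketch: a frame-level switch detector over concatenated
# audio + text features (the application uses a trained neural network instead).
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_switch_detector(audio_feats: np.ndarray, text_feats: np.ndarray,
                          switch_labels: np.ndarray) -> LogisticRegression:
    """audio_feats: (T, a); text_feats: (T, t); switch_labels: (T,) with 1 at switch frames."""
    joint = np.concatenate([audio_feats, text_feats], axis=1)
    return LogisticRegression(max_iter=1000).fit(joint, switch_labels)

def predict_switch_points(model, audio_feats, text_feats, frame_s=0.5):
    joint = np.concatenate([audio_feats, text_feats], axis=1)
    flags = model.predict(joint)
    return [i * frame_s for i, flag in enumerate(flags) if flag == 1]
```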
  • each of the m monophonic data includes only the human voice of a single interlocutor, and m is a positive integer.
  • the audio data can be converted into 4 mono data, which can be shown in FIG. 3.
  • each monophonic data has only one interlocutor’s human voice.
  • A piece of monophonic data may be, for example, the audio corresponding to the single sentence "I want to check the phone bill", or the audio corresponding to the two consecutive sentences "Please wait a moment" and "You still have 25 yuan in your balance of call charges".
  • the human voice included in the monophonic data is only user A or customer service B.
  • It should be noted that the text data can also be converted into m pieces of monophonic text, each corresponding to one piece of monophonic data, as shown in Fig. 3, for subsequent clustering.
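  • One simple way to obtain such monophonic texts, sketched below under the assumption that the speech recognition step provides word-level timestamps, is to assign each recognized word to the segment whose time interval contains it.

```python
# Sketch: build one monophonic text per segment by assigning each recognized
# word (with its start time) to the segment that covers that time.
def align_text_to_segments(words_with_times: list, switch_points_s: list,
                           total_duration_s: float) -> list:
    """words_with_times: [(word, start_time_s), ...] produced by the ASR step."""
    boundaries = [0.0] + sorted(switch_points_s) + [total_duration_s]
    texts = ["" for _ in range(len(boundaries) - 1)]
    for word, start_s in words_with_times:
        for i in range(len(boundaries) - 1):
            if boundaries[i] <= start_s < boundaries[i + 1]:
                texts[i] = (texts[i] + " " + word).strip()
                break
    return texts
```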
  • The monophonic data of each of the n audio groups belong to the same interlocutor. Still taking the above example, after the 4 pieces of monophonic data are obtained, they are clustered to obtain 2 audio groups, as shown in Fig. 3: one audio group contains the 2 pieces of monophonic data corresponding to "I want to check the phone bill" and "Yes", and the other audio group contains the 2 pieces of monophonic data corresponding to "Are you checking the phone bill of this phone" and "Please wait a moment, your phone bill balance is 25 yuan". In this way, the audio data containing the voices of n interlocutors is finally converted into n audio groups.
  • The interlocutor of each audio group can be determined from keywords or key phrases in the monophonic text corresponding to each piece of monophonic data. For example, taking the above example, the keywords "Excuse me" and "Please wait" can be used to determine that the interlocutor of category 2 shown in Fig. 3 is the customer service, and the interlocutor of category 1 is then determined to be the user; alternatively, according to the keywords "I want" and "check call charges", the interlocutor of category 1 shown in Fig. 3 is determined to be the user, and the interlocutor of category 2 is then determined to be the customer service, and so on.
  • the application does not specifically limit the setting of keywords.
  • Fig. 6 shows a schematic flowchart of an audio processing method provided by this application. As can be seen from Fig. 6, compared with the speaker separation method in the embodiment of Fig. 1, the audio processing method provided by this application only needs to train the human voice separation model and the switching point detection model on the business scene data during the training phase, so the time and labor costs of the training phase are greatly reduced.
  • Moreover, the steps in the detection phase can be processed in parallel, which greatly reduces the time required for speaker separation; the human voice separation model removes the interference of noise information, and the switching point detection model combines text features and audio features, which greatly improves the accuracy of switching point detection and thus the accuracy of speaker separation.
  • In summary, audio data is acquired; the audio data is converted into corresponding text data through a speech recognition method; human voice audio data and switching points are obtained according to the audio data and the text data; the human voice audio data is converted into m pieces of monophonic data according to the switching points; and the m pieces of monophonic data are clustered to obtain n audio groups, wherein the monophonic data of each of the n audio groups belong to the same interlocutor.
  • This approach removes noise and other interference factors from the audio data to a large extent, thereby improving the accuracy of speaker separation in complex environments. In addition, multiple steps in the detection phase can be processed in parallel, which simplifies the speaker separation process and increases the separation speed.
  • FIG. 7 is an audio processing system provided by the present application.
  • the audio processing system 700 includes an acquisition unit 710, an obtaining unit 720, a conversion unit 730, and a clustering unit 740, wherein:
  • the acquiring unit 710 is configured to acquire audio data, wherein the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer;
  • the obtaining unit 720 is configured to obtain human voice audio data and a switching point according to the audio data, where the human voice audio data is audio data obtained by removing the noise from the audio data, and the switching point is Any one of the n interlocutors switches to the conversation time point of another interlocutor;
  • the conversion unit 730 is configured to convert the human voice audio data into m pieces of monophonic data according to the switching points, wherein each piece of monophonic data in the m pieces of monophonic data contains only the human voice of a single interlocutor, and m is a positive integer;
  • the clustering unit 740 is configured to cluster the m monophonic data to obtain n audio groups, wherein the monophonic data in each audio group of the n audio groups belong to the same interlocutor .
  • the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer.
  • the audio data may specifically be audio files such as telephone recording or video recording that require speaker separation, for example, a recording file of a communication between a user and a customer service phone, a video recording file of a meeting, and so on.
  • the noise concept in this application refers to non-target human voices that do not need to be separated from the speaker, and specifically can be ambient sound, silence, equipment noise generated by recording equipment, and so on.
  • It should be noted that human voices may also appear in the ambient sound. For example, in the audio data generated by a telephone call between user A, who is in a noisy restaurant, and customer service B, the voices of other people in the restaurant are also human voices, but they are not target human voices that need to be separated in the following steps, so they are likewise classified as noise. It should be understood that the above example is only an illustration and does not constitute a specific limitation.
  • the human voice audio data is the human voice generated when the n interlocutors conduct a conversation after removing the noise from the audio data
  • A switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor.
  • the audio data of the dialogue between A and B can be expressed in text as: (00:12)A: “I want to check the phone bill” (00:14) B: “Are you checking the phone bill of this phone” (00:18) A: “Yes” (00:20) B: “Please wait a moment” (00:25 ) B: “You still have 25 yuan in your balance of call charges”.
  • the numbers in front of A and B represent the playback time of the current recording. At this time, there are 3 switching points, which can be 00:14, 00:18 and 00:20.
  • It should be understood that, since no speech recognition is performed at this point, the switching points are obtained only from the audio data of the dialogue between A and B, without knowing the content of that dialogue.
  • the above examples are only for illustration and cannot constitute a specific limitation.
  • In an embodiment, the obtaining unit 720 is configured to: input the audio data into a human voice separation model to obtain the human voice audio data, wherein the human voice separation model is a model obtained by training a neural network with known audio samples and corresponding known human voice data samples; and, at the same time, input the audio data into a switching point detection model to obtain the switching points, wherein the switching point detection model is a model obtained by training a neural network with known audio samples and corresponding known switching point samples.
  • The human voice separation model may specifically be a voice activity detection (Voice Activity Detection, VAD) model, which uses an event-detection-like scheme to learn the features that distinguish human voice from non-human voice. It is understandable that the VAD result removes interference factors such as environmental noise, non-target human voice and equipment noise to a large extent, so that, compared with the silent/non-silent detection in the embodiment of Fig. 1, the final separation result is more accurate.
  • Since the input of both the human voice separation model and the switching point detection model is the audio data, human voice separation can be performed in parallel with switching point detection without the two interfering with each other, thereby reducing the time required for speaker separation.
  • It should be noted that a switching point is a conversation time point on the time axis of the audio data, and the time axis of the human voice audio data after noise removal is the same as that of the audio data, so that in the next step the switching points can be used, on this common time axis, to convert the human voice audio data into m pieces of monophonic data.
  • the switching point detection model and the human voice separation model can also make full use of labeled data related to the business scenario for training, thereby further improving the accuracy of switching point detection and human voice separation.
  • each of the m monophonic data includes only the human voice of a single interlocutor, and m is a positive integer.
  • the audio data can be converted into 4 mono data, which can be shown in FIG. 3.
  • each monophonic data has only one interlocutor’s human voice.
  • A piece of monophonic data may be, for example, the audio corresponding to the single sentence "I want to check the phone bill", or the audio corresponding to the two consecutive sentences "Please wait a moment" and "You still have 25 yuan in your balance of call charges".
  • the human voice included in the monophonic data is only user A or customer service B.
  • It should be noted that, for ease of understanding, Fig. 3 shows the monophonic data in the form of text; at this point the monophonic data is still only audio data and does not contain text information.
  • The monophonic data of each of the n audio groups belong to the same interlocutor. Still taking the above example, after the 4 pieces of monophonic data are obtained, they are clustered to obtain 2 audio groups, as shown in Fig. 3: one audio group contains the 2 pieces of monophonic data corresponding to "I want to check the phone bill" and "Yes", and the other audio group contains the 2 pieces of monophonic data corresponding to "Are you checking the phone bill of this phone" and "Please wait a moment, you still have 25 yuan in your balance of call charges". In this way, the audio data containing the voices of n interlocutors is finally converted into n audio groups.
  • In an embodiment, the system further includes a confirmation unit 750, which is configured to: after the m pieces of monophonic data are clustered to obtain n audio groups, convert the m pieces of monophonic data into the corresponding m pieces of text information through a speech recognition method; and confirm, according to the m pieces of text information, the interlocutor to which each of the n audio groups belongs.
  • The interlocutor to which each audio group belongs can be determined from keywords or key phrases in the text information. For example, the keyword "Please wait" can be used to determine that the interlocutor of category 2 is the customer service, and the interlocutor of category 1 is then determined to be the user; alternatively, the interlocutor of category 1 is determined to be the user according to the keywords "I want" and "check call charges", and the interlocutor of category 2 is then determined to be the customer service, and so on.
  • This application does not specifically limit the setting of keywords.
  • The m pieces of monophonic data may be converted into the corresponding m pieces of text information through speech recognition methods such as dynamic time warping (Dynamic Time Warping, DTW), hidden Markov model (HMM) theory, vector quantization (VQ) technology or artificial neural networks (Artificial Neural Network, ANN), which is not specifically limited in this application.
  • Since the input of the clustering process is the m pieces of monophonic data, and the input of the speech recognition process can also be the m pieces of monophonic data, the clustering and speech recognition processes can likewise be processed in parallel, further speeding up the whole speaker separation process and improving the user experience.
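  • A minimal sketch of this additional parallelism is shown below; cluster_fn and asr_fn are hypothetical callables standing in for the clustering and speech recognition steps.

```python
# Sketch: clustering and speech recognition both take the m monophonic
# segments as input, so they too can be submitted concurrently.
from concurrent.futures import ThreadPoolExecutor

def cluster_and_transcribe(segments, cluster_fn, asr_fn):
    with ThreadPoolExecutor(max_workers=2) as pool:
        groups_future = pool.submit(cluster_fn, segments)
        texts_future = pool.submit(lambda: [asr_fn(segment) for segment in segments])
        return groups_future.result(), texts_future.result()
```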
  • As can be seen from Fig. 7, compared with the speaker separation system in the embodiment of Fig. 1, the audio processing system provided by this application only needs to train the human voice separation model and the switching point detection model on the business scene data in the training phase, so the time and labor costs are greatly reduced; the steps in the detection stage can be processed in parallel, which greatly shortens the time required for speaker separation; and the human voice separation model removes the interference of noise information, which greatly improves the accuracy.
  • In summary, by obtaining audio data, obtaining human voice audio data and switching points according to the audio data, converting the human voice audio data into m pieces of monophonic data according to the switching points, and clustering the m pieces of monophonic data to obtain n audio groups in which the monophonic data of each group belong to the same interlocutor, the identity of the voice in each of the n audio groups can be confirmed.
  • This approach removes noise and other interference factors from the audio data to a large extent, thereby improving the accuracy of speaker separation in complex environments. In addition, multiple steps in the detection phase can be processed in parallel, which simplifies the speaker separation process and increases the separation speed.
  • Fig. 8 is another audio processing system provided by the present application.
  • The difference between the audio processing system shown in Fig. 8 and the audio processing system shown in Fig. 7 is that speech recognition is performed in advance to obtain the text features of the audio data, so that the switching point detection model can consider audio features and text features together and detect the interlocutor switching points more accurately.
  • the audio processing system 800 shown in FIG. 8 includes an acquisition unit 810, a first conversion unit 820, an acquisition unit 830, a second conversion unit 840, a clustering unit 850, and a confirmation unit 860, where
  • the acquiring unit 810 is configured to acquire audio data, where the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer;
  • the first conversion unit 820 is configured to convert the audio data into audio text through a speech recognition method;
  • the obtaining unit 830 is configured to obtain human voice audio data and switching points according to the audio data and audio text, where the human voice audio data is audio data obtained by removing the noise from the audio data, and
  • the switching point is a conversation time point at which any one of the n interlocutors switches to another interlocutor;
  • the second conversion unit 840 is configured to convert the human voice audio data into m monophonic data according to the human voice audio data and the switching point, wherein each of the m monophonic data
  • the data only includes the human voice of a single interlocutor, and m is a positive integer;
  • the clustering unit 850 is configured to cluster the m monophonic data to obtain n audio groups, wherein the monophonic data of each audio group in the n audio groups belong to the same interlocutor;
  • the confirmation unit 860 is configured to confirm the interlocutor to which each audio group of the n audio groups belongs according to the audio text.
  • the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer.
  • the audio data may specifically be audio files such as telephone recording or video recording that require speaker separation, for example, a recording file of a communication between a user and a customer service phone, a video recording file of a meeting, and so on.
  • the noise concept in this application refers to non-target human voices that do not need to be separated from the speaker, and specifically can be ambient sound, silence, equipment noise generated by recording equipment, and so on.
  • It should be noted that human voices may also appear in the ambient sound. For example, in the audio data generated by a telephone call between user A, who is in a noisy restaurant, and customer service B, the voices of other people in the restaurant are also human voices, but they are not target human voices that need to be separated in the following steps, so they are likewise classified as noise. It should be understood that the above example is only an illustration and does not constitute a specific limitation.
  • the audio data can be converted into corresponding text data through DTW, HMM theory, VQ technology, ANN and other speech recognition methods, which are not specifically limited in this application.
  • the human voice audio data is the human voice generated when the n interlocutors conduct a conversation after removing the noise from the audio data
  • A switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor.
  • the audio data of the dialogue between A and B can be expressed in text as: (00:12)A: "I want to check the phone bill” (00:14) B: “Are you checking the phone bill of this phone” (00:18) A: “Yes” (00:20) B: “Please wait a moment” (00:25 ) B: “You still have 25 yuan in your balance of call charges”.
  • the numbers in front of A and B represent the playback time of the current recording. At this time, there are 3 switching points, which can be 00:14, 00:18 and 00:20. It should be understood that since the audio data has been voice recognized in step S502, the switching point detection model can comprehensively consider audio features and text features to obtain more accurate switching points.
  • the above examples are only for illustration and do not constitute specific limitations.
  • In an embodiment, the obtaining unit 830 is configured to: input the audio data into a human voice separation model to obtain the human voice audio data, wherein the human voice separation model is a model obtained by training a neural network with known audio samples and corresponding known human voice data samples; and, at the same time, input the audio data and the audio text into a switching point detection model to obtain the switching points, wherein the switching point detection model is a model obtained by training a neural network with known audio samples, known audio text samples and corresponding known switching point samples.
  • the human voice separation model can be a VAD model, which uses a similar event detection scheme to learn the distinguishing characteristics between human voice and non-human voice to obtain a human voice separation model.
  • It is understandable that the VAD result removes interference factors such as environmental noise, non-target human voice and equipment noise to a large extent, so that, compared with the silent/non-silent detection in the embodiment of Fig. 1, the final separation result is more accurate.
  • It should be noted that the human voice separation step and the speech recognition step can be performed simultaneously, with the switching point detection step performed afterwards; alternatively, the speech recognition step can be performed first, and then the human voice separation step and the switching point detection step are performed simultaneously. In this way, tasks are performed in parallel as much as possible, minimizing the time required for the speaker separation process.
  • It should be noted that a switching point is a conversation time point on the time axis of the audio data, and the time axis of the human voice audio data after noise removal is the same as that of the audio data, so that in the next step the switching points can be used, on this common time axis, to convert the human voice audio data into m pieces of monophonic data.
  • the switching point detection model and the human voice separation model can also make full use of labeled data related to the business scenario for training, thereby further improving the accuracy of switching point detection and human voice separation.
  • the switching point detection model in this embodiment of the application is a model obtained after training a neural network using known audio samples, known audio text samples, and corresponding known switching point samples.
  • In this way, the switching point detection model can take audio features and text features into account together, detect the interlocutor switching points more accurately, and further improve the accuracy of speaker separation.
  • each of the m monophonic data includes only the human voice of a single interlocutor, and m is a positive integer.
  • the audio data can be converted into 4 mono data, which can be shown in FIG. 3.
  • each monophonic data has only one interlocutor’s human voice.
  • A piece of monophonic data may be, for example, the audio corresponding to the single sentence "I want to check the phone bill", or the audio corresponding to the two consecutive sentences "Please wait a moment" and "You still have 25 yuan in your balance of call charges".
  • the human voice included in the monophonic data is only user A or customer service B.
  • It should be noted that the text data can also be converted into m pieces of monophonic text, each corresponding to one piece of monophonic data, as shown in Fig. 3, for subsequent clustering.
  • The monophonic data of each of the n audio groups belong to the same interlocutor. Still taking the above example, after the 4 pieces of monophonic data are obtained, they are clustered to obtain 2 audio groups, as shown in Fig. 3: one audio group contains the 2 pieces of monophonic data corresponding to "I want to check the phone bill" and "Yes", and the other audio group contains the 2 pieces of monophonic data corresponding to "Are you checking the phone bill of this phone" and "Please wait a moment, your phone bill balance is 25 yuan". In this way, the audio data containing the voices of n interlocutors is finally converted into n audio groups.
  • The interlocutor of each audio group can be determined from keywords or key phrases in the monophonic text corresponding to each piece of monophonic data. For example, taking the above example, the keywords "Excuse me" and "Please wait" can be used to determine that the interlocutor of category 2 shown in Fig. 3 is the customer service, and the interlocutor of category 1 is then determined to be the user; alternatively, according to the keywords "I want" and "check call charges", the interlocutor of category 1 shown in Fig. 3 is determined to be the user, and the interlocutor of category 2 is then determined to be the customer service, and so on. This application does not specifically limit the setting of keywords.
  • the audio processing system provided by this application only needs to train the human voice separation model and the switching point detection model for the business scene data in the training phase.
  • the time and labor costs of the training phase are greatly reduced.
  • the steps in the detection phase can be processed in parallel, which greatly reduces the time required for speaker separation.
  • the human voice separation model is used to remove the interference of noise information, and the switching point detection model combines text features and audio features, which greatly improves the accuracy of switching point detection and thus the accuracy of speaker separation.
  • In summary, audio data is acquired; the audio data is converted into corresponding text data; human voice audio data and switching points are obtained according to the audio data and the text data; the human voice audio data is converted into m pieces of monophonic data according to the switching points; and the m pieces of monophonic data are clustered to obtain n audio groups, wherein the monophonic data of each of the n audio groups belong to the same interlocutor.
  • the audio data can greatly remove noise and other interference factors, thereby improving the accuracy of speaker separation in complex environments.
  • multiple steps in the detection phase can be processed in parallel, which simplifies the speaker separation procedure and improves the separation speed.
  • FIG. 9 is a schematic structural diagram of a server provided in this application.
  • the server may implement the method in the embodiment in FIG. 2 or the embodiment in FIG. 5.
  • the data processing method provided in this application can be implemented in a cloud service cluster as shown in FIG. 9, or in a single computing node and storage node; this application is not specifically limited in this respect.
  • the cloud service cluster includes at least one computing node 910 and at least one storage node 920.
  • the computing node 910 includes one or more processors 911, a communication interface 912, and a memory 913.
  • the processor 911, the communication interface 912, and the memory 913 may be connected through a bus 914.
  • the processor 911 includes one or more general-purpose processors.
  • the general-purpose processor may be any type of device capable of processing electronic instructions, including a central processing unit (CPU), a microprocessor, a microcontroller, a main processor, a controller, an application-specific integrated circuit (ASIC), and so on. It may be a processor dedicated to the computing node 910 only, or it may be shared with other computing nodes 910.
  • the processor 911 executes various types of digital storage instructions, such as software or firmware programs stored in the memory 913, which enables the computing node 910 to provide a wide variety of services.
  • the processor 911 can execute codes of modules such as clustering, human voice separation, and switching point detection to execute at least a part of the methods discussed herein.
  • the communication interface 912 may be a wired interface (for example, an Ethernet interface) for communicating with other computing nodes or users.
  • the communication interface 912 may adopt a protocol family on top of TCP/IP, for example, the RAAS protocol, the Remote Function Call (RFC) protocol, the Simple Object Access Protocol (SOAP), the Simple Network Management Protocol (SNMP), the Common Object Request Broker Architecture (CORBA) protocol, a distributed protocol, and so on.
  • the memory 913 may include a volatile memory, such as a random access memory (Random Access Memory, RAM); the memory may also include a non-volatile memory, such as a read-only memory (Read-Only Memory, ROM).
  • the storage node 920 includes one or more storage controllers 921 and a storage array 922, where the storage controller 921 and the storage array 922 may be connected through a bus 924.
  • the storage controller 921 includes one or more general-purpose processors, where the general-purpose processor may be any type of device capable of processing electronic instructions, including a CPU, a microprocessor, a microcontroller, a main processor, a controller, an ASIC, and so on.
  • each storage node includes a storage controller. In other embodiments, multiple storage nodes may also share a storage controller, which is not specifically limited here.
  • the storage array 922 may include a plurality of memories 923.
  • the memory 923 may be a non-volatile memory, such as ROM, flash memory, HDD, or SSD memory, and may also include a combination of the foregoing types of memory.
  • the storage array may be composed of multiple HDDs, of multiple SSDs, or of a combination of HDDs and SSDs.
  • the storage array 922 may include one or more data centers. Multiple data centers may be set up at the same location or at different locations, and there is no specific limitation here.
  • the storage array 922 may store program code and program data.
  • the program code includes voice recognition module code, semantic understanding module code, order production module code, and so on.
  • the program data includes: data of the clustering module code, data of the human voice separation module code, data of the switching point detection module code, and so on, and may also include the human voice separation model, the switching point detection model, and the corresponding training sample set data; this application is not specifically limited in this respect.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or a data center integrated with one or more available media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).
  • the program may be stored in a computer-readable storage medium, and when the program is executed, the procedures of the above-mentioned method embodiments may be included.
  • the storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM), etc.

Abstract

An audio processing method, a system and a related device. The method comprises: acquiring audio data (S201); acquiring human voice audio data and switching points according to the audio data (S202); converting the human voice audio data into m pieces of monophonic data according to the switching points (S203); clustering the m pieces of monophonic data to obtain n audio groups (S204), wherein the monophonic data of each audio group in the n audio groups belongs to the same speaker.

Description

Audio processing method, system and related equipment
Cross-reference to related applications
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on May 28, 2019, with application number 2019104534937 and entitled "Audio processing method, system and related equipment", the entire contents of which are incorporated herein by reference.
Technical field
本申请涉及音频处理的方法、系统及相关设备。This application relates to audio processing methods, systems and related equipment.
Background
当前,语音技术融入了人们生活,带给人类更便捷的生活方式。随着音频处理技术的不断提高,从海量的音频数据中(如电话录音、新闻广播、会议录音等)获取感兴趣的特定人声已成为一大研究热点。其中,获取特定人声的方法之一就是通过话者分离(Speaker Diarization,SD)系统实现的,话者分离指的是从多人对话中的音频数据,依据说话人进行划分,并加以标记的过程。Currently, voice technology has been integrated into people's lives, bringing people a more convenient lifestyle. With the continuous improvement of audio processing technology, obtaining specific human voices of interest from massive audio data (such as telephone recordings, news broadcasts, conference recordings, etc.) has become a major research focus. Among them, one of the methods to obtain a specific human voice is through the Speaker Diarization (SD) system. Speaker separation refers to the audio data in a multi-person dialogue, divided according to the speaker, and labeled process.
现有的话者分离系统,在干净近场的环境下准确率达到了实用水平,但是在环境相对复杂时,重要说话人独立的单通道语音分离,分离精度较低。The existing speaker separation system achieves a practical level of accuracy in a clean near-field environment, but when the environment is relatively complex, the important speaker independent single-channel speech separation has low separation accuracy.
Summary of the invention
According to various embodiments disclosed in this application, an audio processing method, system and related equipment are provided.
一种音频处理的方法,包括:获取音频数据,其中,所述音频数据中包括噪声和n个对话者进行对话时产生的人声,n是正整数;根据所述音频数据,获得人声音频数据以及切换点,其中,所述人声音频数据为所述音频数据去掉所述噪声后得到的n个对话者进行对话时产生的人声,所述切换点为所述n个对话者中的任一个对话者切换到另一个对话者的对话时间点;根据所述切换点,将所述人声音频数据转换为m个单声数据,其中,所述m个单声数据中的每个单声数据只包括单个对话者的人声,m是正整数;将所述m个单声数据进行聚类,得到n个音频组,其中,所述n个音频组中的每个音频组的单声数据属于同一个对话者。An audio processing method, including: obtaining audio data, wherein the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer; and obtaining the human voice audio data according to the audio data And a switching point, wherein the human voice audio data is the human voice generated when the n interlocutors are engaged in a conversation after the noise is removed from the audio data, and the switching point is any of the n interlocutors The conversation time point of one interlocutor switching to another interlocutor; according to the switching point, the human voice audio data is converted into m monophonic data, wherein each of the m monophonic data The data includes only the human voice of a single interlocutor, and m is a positive integer; clustering the m monophonic data to obtain n audio groups, wherein the monophonic data of each audio group in the n audio groups Belong to the same interlocutor.
一种音频处理的方法,包括:获取音频数据,其中,所述音频数据中包括噪声和n个对话者进行对话时产生的人声,n是正整数;通过语音识别方法,将所述音频数据转换为相应的文本数据;根据所述音频数据以及文本数据,获得人声音频数据以及切换点,其中,所述人声音频数据为所述音频数据去掉所述噪声后得到的音频数据,所述切换点为所述n个对话者中的任一个对话者切换到另一个对话者的对话时间点;根据所述切换点,将所述 人声数据转换为m个单声数据,其中,所述m个单声数据中的每个单声数据只包括单个对话者的人声,m是正整数;将所述m个单声数据进行聚类,得到n个音频组,其中,所述n个音频组中的每个音频组的单声数据属于同一个对话者;根据所述音频文本确认所述n个音频组中每个音频组所属的对话者。An audio processing method includes: acquiring audio data, wherein the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer; and converting the audio data by a voice recognition method Is the corresponding text data; according to the audio data and text data, human voice audio data and switching points are obtained, wherein the human voice audio data is the audio data obtained by removing the noise from the audio data, and the switching The point is the conversation time point at which any one of the n interlocutors switches to another interlocutor; according to the switching point, the human voice data is converted into m monophonic data, wherein the m Each monophonic data in the monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; clustering the m monophonic data to obtain n audio groups, wherein the n audio groups The monophonic data of each audio group in the audio group belongs to the same interlocutor; the interlocutor to which each audio group in the n audio groups belongs is confirmed according to the audio text.
一种音频处理系统,包括:获取单元,所述获取单元用于获取音频数据,其中,所述音频数据中包含噪声和n个对话者进行对话时产生的人声,n是正整数;获得单元,所述获得单元用于根据所述音频数据,获得人声音频数据以及切换点,其中,所述人声音频数据为所述音频数据去掉所述噪声后得到的音频数据,所述切换点为所述n个对话者中的任一个对话者切换到另一个对话者的对话时间点;转换单元,所述转换单元用于切换点,将所述人声音频数据转换为m个单声数据,其中,所述m个单声数据中的每个单声数据只包含单个对话者的人声,m是正整数;聚类单元,所述聚类单元用于将所述m个单声数据进行聚类,得到n个音频组,其中,所述n个音频组中的每个音频组中的单声数据属于同一个对话者。An audio processing system includes: an acquisition unit configured to acquire audio data, wherein the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer; the acquisition unit, The obtaining unit is configured to obtain human voice audio data and a switching point according to the audio data, wherein the human voice audio data is audio data obtained by removing the noise from the audio data, and the switching point is the audio data. Any one of the n interlocutors switches to another interlocutor’s conversation time point; a conversion unit, the conversion unit is used for switching points, and converts the human voice audio data into m monophonic data, wherein , Each of the m monophonic data only contains the human voice of a single interlocutor, and m is a positive integer; a clustering unit, the clustering unit is used to cluster the m monophonic data , Obtain n audio groups, wherein the monophonic data in each audio group in the n audio groups belong to the same interlocutor.
一种音频处理系统,包括:获取单元,所述获取单元用于获取音频数据,其中,所述音频数据中包括噪声和n个对话者进行对话时产生的人声,n是正整数;第一转换单元,所述转换单元用于通过语音识别方法,将所述音频数据转换为音频文本;获得单元,所述获得单元用于根据所述音频数据以及音频文本,获得人声音频数据以及切换点,其中,所述人声音频数据为所述音频数据去掉所述噪声后得到的音频数据,所述切换点为所述n个对话者中的任一个对话者切换到另一个对话者的对话时间点;第二转换单元,所述第二转换单元用于根据所述人声音频数据以及切换点,将所述人声音频数据转换为m个单声数据,其中,所述m个单声数据中的每个单声数据只包括单个对话者的人声,m是正整数;聚类单元,所述聚类单元用于将所述m个单声数据进行聚类,得到n个音频组,其中,所述n个音频组中的每个音频组的单声数据属于同一个对话者;确认单元,所述确认单元用于根据所述音频文本确认所述n个音频组中每个音频组所属的对话者。An audio processing system includes: an acquisition unit configured to acquire audio data, wherein the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer; a first conversion Unit, the conversion unit is used to convert the audio data into audio text through a voice recognition method; the obtaining unit is used to obtain human voice audio data and switching points according to the audio data and audio text, Wherein, the human voice audio data is audio data obtained by removing the noise from the audio data, and the switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor A second conversion unit, the second conversion unit is used to convert the human voice audio data into m monophonic data according to the human voice audio data and switching points, wherein the m monophonic data Each monophonic data of only includes the human voice of a single interlocutor, m is a positive integer; clustering unit, the clustering unit is used to cluster the m monophonic data to obtain n audio groups, where, The monophonic data of each audio group in the n audio groups belong to the same interlocutor; a confirmation unit, the confirmation unit is used to confirm according to the audio text to which each audio group in the n audio groups belongs interlocutor.
一种服务器,所述服务器包括处理器以及存储器,所述存储器用于存储指令,所述处理器用于执行所述指令,所述处理器执行所述指令时执行以下步骤:获取音频数据,其中,所述音频数据中包括噪声和n个对话者进行对话时产生的人声,n是正整数;根据所述音频数据,获得人声音频数据以及切换点,其中,所述人声音频数据为所述音频数据去掉所述噪声后得到的n个对话者进行对话时产生的人声,所述切换点为所述n个对话者中的任一个对话者切换到另一个对话者的对话时间点;根据所述切换点,将所述人声音频数据转换为m个单声数据,其中,所述m个单声数据中的每个单声数据只包括单个对话者的人声,m是正整数;将所述m个单声数据进行聚类,得到n个音频组,其中,所述n个音频组中的每个音频组的单声数据属于同一个对话者。A server includes a processor and a memory, the memory is used to store instructions, the processor is used to execute the instructions, and the processor executes the following steps when executing the instructions: acquiring audio data, wherein: The audio data includes noise and the human voice produced by n interlocutors in a dialogue, where n is a positive integer; according to the audio data, the human voice audio data and the switching point are obtained, wherein the human voice audio data is the The audio data removes the noise and obtains the human voice generated by n interlocutors during a conversation, and the switching point is the conversation time point when any interlocutor of the n interlocutors switches to another interlocutor; The switching point converts the human voice audio data into m monophonic data, wherein each monophonic data in the m monophonic data only includes the human voice of a single interlocutor, and m is a positive integer; The m monophonic data are clustered to obtain n audio groups, wherein the monophonic data of each audio group in the n audio groups belong to the same interlocutor.
一种服务器,所述服务器包括处理器以及存储器,所述存储器用于存储指令,所述处理器用于执行所述指令,所述处理器执行所述指令时执行一下步骤:获取音频数据,其中, 所述音频数据中包括噪声和n个对话者进行对话时产生的人声,n是正整数;通过语音识别方法,将所述音频数据转换为相应的文本数据;根据所述音频数据以及文本数据,获得人声音频数据以及切换点,其中,所述人声音频数据为所述音频数据去掉所述噪声后得到的音频数据,所述切换点为所述n个对话者中的任一个对话者切换到另一个对话者的对话时间点;根据所述切换点,将所述人声数据转换为m个单声数据,其中,所述m个单声数据中的每个单声数据只包括单个对话者的人声,m是正整数;将所述m个单声数据进行聚类,得到n个音频组,其中,所述n个音频组中的每个音频组的单声数据属于同一个对话者;根据所述音频文本确认所述n个音频组中每个音频组所属的对话者。A server includes a processor and a memory, the memory is used to store instructions, the processor is used to execute the instructions, and the processor executes the following steps when executing the instructions: acquiring audio data, wherein: The audio data includes noise and human voices produced by n interlocutors during a conversation, where n is a positive integer; the audio data is converted into corresponding text data through a voice recognition method; according to the audio data and text data, Obtain human voice audio data and a switching point, wherein the human voice audio data is audio data obtained by removing the noise from the audio data, and the switching point is any one of the n interlocutors switching To another interlocutor’s dialogue time point; according to the switching point, the human voice data is converted into m monophonic data, wherein each of the m monophonic data includes only a single dialogue The human voice of the speaker, m is a positive integer; cluster the m monophonic data to obtain n audio groups, where the monophonic data of each audio group in the n audio groups belong to the same interlocutor ; Confirm the interlocutor to which each audio group in the n audio groups belongs according to the audio text.
一种计算机非瞬态存储介质,所述计算机非瞬态存储介质存储有计算机程序,其特征在于,所述计算机程序被计算设备执行时实现以下步骤:获取音频数据,其中,所述音频数据中包括噪声和n个对话者进行对话时产生的人声,n是正整数;根据所述音频数据,获得人声音频数据以及切换点,其中,所述人声音频数据为所述音频数据去掉所述噪声后得到的n个对话者进行对话时产生的人声,所述切换点为所述n个对话者中的任一个对话者切换到另一个对话者的对话时间点;根据所述切换点,将所述人声音频数据转换为m个单声数据,其中,所述m个单声数据中的每个单声数据只包括单个对话者的人声,m是正整数;将所述m个单声数据进行聚类,得到n个音频组,其中,所述n个音频组中的每个音频组的单声数据属于同一个对话者。A computer non-transitory storage medium, wherein the computer non-transitory storage medium stores a computer program, and the computer program is characterized in that the following steps are implemented when the computer program is executed by a computing device: acquiring audio data, wherein the audio data is Including noise and the human voice generated by n interlocutors in a dialogue, n is a positive integer; according to the audio data, the human voice audio data and the switching point are obtained, wherein the human voice audio data is the audio data minus the The human voice generated when n interlocutors are engaged in a conversation obtained after the noise, the switching point is the conversation time point when any one of the n interlocutors switches to another interlocutor; according to the switching point, Convert the human voice audio data into m monophonic data, wherein each monophonic data in the m monophonic data only includes the human voice of a single interlocutor, and m is a positive integer; The acoustic data is clustered to obtain n audio groups, where the monophonic data of each audio group in the n audio groups belong to the same interlocutor.
一种计算机非瞬态存储介质,所述计算机非瞬态存储介质存储有计算机程序,其特征在于,所述计算机程序被计算设备执行时实现以下步骤:获取音频数据,其中,所述音频数据中包括噪声和n个对话者进行对话时产生的人声,n是正整数;通过语音识别方法,将所述音频数据转换为相应的文本数据;根据所述音频数据以及文本数据,获得人声音频数据以及切换点,其中,所述人声音频数据为所述音频数据去掉所述噪声后得到的音频数据,所述切换点为所述n个对话者中的任一个对话者切换到另一个对话者的对话时间点;根据所述切换点,将所述人声数据转换为m个单声数据,其中,所述m个单声数据中的每个单声数据只包括单个对话者的人声,m是正整数;将所述m个单声数据进行聚类,得到n个音频组,其中,所述n个音频组中的每个音频组的单声数据属于同一个对话者;根据所述音频文本确认所述n个音频组中每个音频组所属的对话者。A computer non-transitory storage medium, wherein the computer non-transitory storage medium stores a computer program, and the computer program is characterized in that the following steps are implemented when the computer program is executed by a computing device: acquiring audio data, wherein the audio data is Including noise and the human voice generated by n interlocutors in a dialogue, n is a positive integer; the audio data is converted into corresponding text data through a voice recognition method; the human voice audio data is obtained according to the audio data and the text data And a switching point, wherein the human voice audio data is audio data obtained by removing the noise from the audio data, and the switching point is that any one of the n interlocutors switches to another interlocutor The dialogue time point; according to the switching point, the human voice data is converted into m monophonic data, wherein each monophonic data in the m monophonic data only includes the human voice of a single interlocutor, m is a positive integer; clustering the m monophonic data to obtain n audio groups, wherein the monophonic data of each audio group in the n audio groups belong to the same interlocutor; according to the audio The text identifies the interlocutor to which each audio group in the n audio groups belongs.
一种计算机程序产品,当所述计算机程序产品被计算机读取并执行时,实现以下步骤:获取音频数据,其中,所述音频数据中包括噪声和n个对话者进行对话时产生的人声,n是正整数;根据所述音频数据,获得人声音频数据以及切换点,其中,所述人声音频数据为所述音频数据去掉所述噪声后得到的n个对话者进行对话时产生的人声,所述切换点为所述n个对话者中的任一个对话者切换到另一个对话者的对话时间点;根据所述切换点,将所述人声音频数据转换为m个单声数据,其中,所述m个单声数据中的每个单声数据只包括单个对话者的人声,m是正整数;将所述m个单声数据进行聚类,得到n个音频组,其中,所述n个音频组中的每个音频组的单声数据属于同一个对话者。A computer program product, when the computer program product is read and executed by a computer, the following steps are realized: acquiring audio data, wherein the audio data includes noise and human voices generated when n interlocutors conduct a conversation, n is a positive integer; according to the audio data, the human voice audio data and the switching point are obtained, wherein the human voice audio data is the human voice generated by the n interlocutors obtained by removing the noise from the audio data , The switching point is a dialogue time point at which any one of the n interlocutors switches to another interlocutor; according to the switching point, the human voice audio data is converted into m monophonic data, Wherein, each monophonic data in the m monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; clustering the m monophonic data to obtain n audio groups, where all The monophonic data of each audio group in the n audio groups belong to the same interlocutor.
一种计算机程序产品,当所述计算机程序产品被计算机读取并执行时,实现以下步骤:获取音频数据,其中,所述音频数据中包括噪声和n个对话者进行对话时产生的人声,n是正整数;通过语音识别方法,将所述音频数据转换为相应的文本数据;根据所述音频数据以及文本数据,获得人声音频数据以及切换点,其中,所述人声音频数据为所述音频数据去掉所述噪声后得到的音频数据,所述切换点为所述n个对话者中的任一个对话者切换到另一个对话者的对话时间点;根据所述切换点,将所述人声数据转换为m个单声数据,其中,所述m个单声数据中的每个单声数据只包括单个对话者的人声,m是正整数;将所述m个单声数据进行聚类,得到n个音频组,其中,所述n个音频组中的每个音频组的单声数据属于同一个对话者;根据所述音频文本确认所述n个音频组中每个音频组所属的对话者。A computer program product, when the computer program product is read and executed by a computer, the following steps are realized: acquiring audio data, wherein the audio data includes noise and human voices generated when n interlocutors conduct a conversation, n is a positive integer; the audio data is converted into corresponding text data through the voice recognition method; the human voice audio data and the switching point are obtained according to the audio data and text data, wherein the human voice audio data is the The audio data is the audio data obtained after the noise is removed, and the switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor; according to the switching point, the person The sound data is converted into m monophonic data, wherein each monophonic data in the m monophonic data only includes the human voice of a single interlocutor, and m is a positive integer; the m monophonic data are clustered , Obtain n audio groups, wherein the monophonic data of each audio group in the n audio groups belong to the same interlocutor; according to the audio text, it is confirmed that each audio group in the n audio groups belongs to interlocutor.
The details of one or more embodiments of this application are set forth in the accompanying drawings and the description below. Other features and advantages of this application will become apparent from the description, the drawings, and the claims.
Description of the drawings
In order to describe the technical solutions in the embodiments of this application more clearly, the drawings used in the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of this application, and those of ordinary skill in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a schematic structural diagram of a speaker separation system provided by this application.
FIG. 2 is a schematic flowchart of an audio processing method provided by this application.
FIG. 3 is a schematic flowchart of clustering m pieces of monophonic data into n audio groups in an application scenario provided by this application.
FIG. 4 is a detailed flowchart of an audio processing method provided by this application.
FIG. 5 is a schematic flowchart of another audio processing method provided by this application.
FIG. 6 is a detailed flowchart of another audio processing method provided by this application.
FIG. 7 is a schematic structural diagram of an audio processing system provided by this application.
FIG. 8 is a schematic structural diagram of another audio processing system provided by this application.
FIG. 9 is a schematic structural diagram of a server provided by this application.
Detailed description
This application is further described in detail below through specific embodiments with reference to the drawings. In the following embodiments, many details are described so that this application can be better understood. However, those skilled in the art will readily appreciate that some of these features may be omitted in different cases or may be replaced by other methods. In some cases, some operations related to this application are not shown or described in the specification, in order to prevent the core of this application from being overwhelmed by excessive description. For those skilled in the art, a detailed description of these related operations is not necessary; they can fully understand the related operations based on the description in the specification and general technical knowledge in the field.
To facilitate a better understanding of this solution, the speaker separation system is briefly introduced below.
Speaker separation refers to the process of dividing the audio data of a multi-person conversation according to the speakers and labeling it. In a specific implementation, a speaker separation system generally performs speaker separation based on the Bayesian Information Criterion (BIC) as a similarity measure. As shown in FIG. 1, this technique mainly passes, in sequence, through an input module 101, a silence detection module 102, a speaker recognition module 103, a switching point detection module 104, a classification module 105, and an output module 106, so as to obtain the separated audio result. Among them, the silence detection module 102 is used to remove the silent parts of the input audio data to obtain second audio data. The speaker recognition module 103 has learned the voiceprint features of speakers in a large number of business scenarios; for example, if the speaker separation system is used to separate the voices of a customer service agent and a user, the speaker recognition module learns a large number of customer service and user voiceprint features, such as the agent's intonation and prosodic rhythm features, so that the speaker recognition module can determine, based on the second audio data, the identity of the speaker of each dialogue segment in the current audio data and obtain third audio data. The switching point detection module 104 is used to determine the speakers' dialogue switching points based on the third audio data. The conversion module 105 is used to cut the third audio data into multiple audio segments according to the dialogue switching points. The classification module 106 is used to classify the multiple audio segments according to the speaker identities detected by the speaker recognition module 103, so as to obtain the speaker separation result.
In summary, this system not only needs to train a relatively large number of models on business scenario data in the training phase (a silence detection model, a speaker recognition model, and a switching point model), but its detection procedure is also long: the detection result can only be obtained by passing through the seven modules shown in FIG. 1 in sequence, which incurs a large time overhead.
Therefore, this application provides an audio processing method with a short detection procedure, a small number of models to train, and a greatly reduced time overhead. FIG. 2 shows an audio processing method provided by this application, and the method includes the following steps:
S201: Acquire audio data.
In a specific implementation, the audio data includes noise and the human voices generated when n interlocutors conduct a conversation, where n is a positive integer. The audio data may specifically be an audio file that requires speaker separation, such as a telephone recording or a video recording, for example, a recording of a telephone conversation between a user and a customer service agent, a video recording of a meeting, and so on. In this application, noise refers to non-target human voices and other sounds that do not need to undergo the subsequent speaker separation, and may specifically be ambient sound, silence, device noise generated by the recording device, and so on. It should be noted that ambient sound may also contain human voices; for example, in the audio data generated when user A talks on the phone with customer service B in a noisy restaurant, the voices of other people in the restaurant are also human voices, but they are not the target voices for the subsequent speaker separation and are therefore also classified as noise. It should be understood that the above example is only for illustration and does not constitute a specific limitation.
S202: Obtain human voice audio data and switching points according to the audio data.
In a specific implementation, the human voice audio data is the human voice generated when the n interlocutors conduct a conversation, obtained after the noise is removed from the audio data, and a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor. For example, suppose n = 2, that is, there are two interlocutors, user A and customer service B. In one application scenario, the audio data of the conversation between A and B can be expressed in text as: (00:12) A: "I want to check the phone bill" (00:14) B: "Are you checking the phone bill of this phone?" (00:18) A: "Yes" (00:20) B: "Please wait" (00:25) B: "Your phone bill balance is 25 yuan". The numbers before A and B represent the playback time of the current audio data, that is, the time axis of the audio data. In this case there are 3 switching points, which may be 00:14, 00:18, and 00:20. It should be understood that, in order to express the main idea of the audio processing method provided by this application more clearly, the conversation content is shown directly in text form in the above example; in the actual processing, however, no speech recognition is performed in step S202, so the switching points are obtained only from the conversation audio data of A and B, without knowing the content of their conversation. The above example is only for illustration and does not constitute a specific limitation.
In a specific embodiment, obtaining the human voice audio data and the switching points according to the audio data includes: inputting the audio data into a human voice separation model to obtain the human voice audio data, where the human voice separation model is a model obtained by training a neural network with known audio samples and the corresponding known human voice data samples; and, at the same time, inputting the audio data into a switching point detection model (Speaker Change Detection, SCD) to obtain the switching points, where the switching point detection model is a model obtained by training a neural network with known audio samples and the corresponding known switching point samples. The human voice separation model may specifically be a voice activity detection (Voice Activity Detection, VAD) model, which uses an event-detection-like scheme to learn the distinguishing features between human voice and non-human voice. It can be understood that the VAD detection result removes, to a great extent, interference factors such as ambient noise, non-target human voices, and device noise; compared with the embodiment of FIG. 1, which only detects silence and non-silence, the separation result finally obtained is more accurate.
It should be noted that, in step S202, since the input data of both the human voice separation model and the switching point detection model is the audio data, human voice separation and switching point detection can be performed in parallel without interfering with each other, which reduces the time required for speaker separation. Moreover, the switching points are conversation time points obtained on the time axis of the audio data, so the time axis of the human voice audio data after noise removal is still the same as that of the audio data, and the human voice audio data can subsequently be converted into m pieces of monophonic data using the switching points on this shared time axis. In addition, the switching point detection model and the human voice separation model can also make full use of labeled data related to the business scenario for training, thereby further improving the accuracy of switching point detection and human voice separation.
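Since both models consume the same raw audio, the detection stage can be launched concurrently. The minimal sketch below illustrates the idea in Python; the vad_model and scd_model objects and their predict() interfaces are hypothetical placeholders for the trained human voice separation and switching point detection models, not an API defined by this application.

```python
# Minimal sketch: run human voice separation (VAD) and switching point
# detection (SCD) in parallel on the same audio data. The model objects and
# their predict() methods are hypothetical stand-ins for the trained models.
from concurrent.futures import ThreadPoolExecutor

def separate_human_voice(vad_model, audio):
    """Return only the portions that the VAD model marks as human voice."""
    return vad_model.predict(audio)          # hypothetical interface

def detect_switching_points(scd_model, audio):
    """Return switching time points (in seconds) on the audio time axis."""
    return scd_model.predict(audio)          # hypothetical interface

def run_detection_stage(vad_model, scd_model, audio):
    # Both tasks take the raw audio as input, so neither needs to wait for
    # the other; running them concurrently shortens the detection stage.
    with ThreadPoolExecutor(max_workers=2) as pool:
        voice_future = pool.submit(separate_human_voice, vad_model, audio)
        switch_future = pool.submit(detect_switching_points, scd_model, audio)
        return voice_future.result(), switch_future.result()
```

A thread pool is used here only to show that the two inferences are independent; any scheduler that runs them concurrently would serve the same purpose.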
S203: Convert the human voice audio data into m pieces of monophonic data according to the switching points.
In a specific implementation, each of the m pieces of monophonic data includes only the human voice of a single interlocutor, and m is a positive integer. Still taking the above example, after the 3 switching points are obtained, the audio data can be converted into 4 pieces of monophonic data, as shown in FIG. 3. Each piece of monophonic data contains the voice of only one interlocutor; for example, a piece of monophonic data may be the audio corresponding to the sentence "I want to check the phone bill", or the audio corresponding to the two sentences "Please wait" and "Your phone bill balance is 25 yuan". In other words, the human voice contained in each piece of monophonic data belongs only to user A or only to customer service B. It should be understood that the above example is only for illustration and does not constitute a specific limitation. In addition, to facilitate a better understanding of the definition of monophonic data, FIG. 3 shows the monophonic data in text form; in the actual situation, however, since no speech recognition is performed in step S203, a piece of monophonic data is only audio data and does not contain text information.
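The conversion in S203 amounts to cutting the human voice audio at the switching points on the shared time axis. The sketch below assumes the human voice audio is held as a one-dimensional NumPy array at a known sample rate; this representation is an assumption made here for illustration only.

```python
# Illustrative sketch: cut the human voice audio data into monophonic
# segments at the detected switching points (times in seconds).
import numpy as np

def split_by_switching_points(voice_audio, switching_points, sample_rate):
    """voice_audio: 1-D array of samples; returns a list of m segments."""
    boundaries = [0] + [int(t * sample_rate) for t in sorted(switching_points)]
    boundaries.append(len(voice_audio))
    # Each slice between two adjacent boundaries contains the voice of a
    # single interlocutor; with k switching points there are up to k + 1 segments.
    return [voice_audio[s:e] for s, e in zip(boundaries[:-1], boundaries[1:]) if e > s]

# With the three switching points from the example (00:14, 00:18, 00:20):
# segments = split_by_switching_points(voice_audio, [14.0, 18.0, 20.0], 16000)
# len(segments) == 4
```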
S204: Cluster the m pieces of monophonic data to obtain n audio groups.
In a specific implementation, the monophonic data of each of the n audio groups belong to the same interlocutor. Still taking the above example, after the 4 pieces of monophonic data are obtained, they are clustered into 2 audio groups, as shown in FIG. 3: one audio group contains the 2 pieces of monophonic data corresponding to "I want to check the phone bill" and "Yes", and the other audio group contains the 2 pieces of monophonic data corresponding to "Are you checking the phone bill of this phone?" and "Please wait, your phone bill balance is 25 yuan". In this way, the audio data containing the voices of n interlocutors is finally converted into n audio groups.
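The clustering in S204 can be realized with any speaker-similarity clustering. The rough sketch below uses scikit-learn's agglomerative clustering on per-segment speaker embeddings; the embed() function is a hypothetical placeholder for whatever embedding extractor is used, since the application does not prescribe one.

```python
# Rough sketch: group the m monophonic segments into n audio groups by
# clustering speaker embeddings. embed() is a hypothetical placeholder
# (e.g. an i-vector or neural speaker encoder).
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_segments(segments, embed, n_speakers):
    """Return n_speakers audio groups, each a list of segments of one interlocutor."""
    embeddings = np.stack([embed(seg) for seg in segments])   # shape (m, d)
    labels = AgglomerativeClustering(n_clusters=n_speakers).fit_predict(embeddings)
    groups = {}
    for seg, label in zip(segments, labels):
        groups.setdefault(label, []).append(seg)
    return list(groups.values())
```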
In a specific embodiment, after the m pieces of monophonic data are clustered to obtain the n audio groups, the method further includes: converting the m pieces of monophonic data into the corresponding m pieces of text information through a speech recognition method (Automatic Speech Recognition, ASR); and confirming, according to the m pieces of text information, the interlocutor to which each of the n audio groups belongs. In a specific implementation, the interlocutor of each audio group can be determined from keywords or key phrases in the text information. For example, still taking the above example, the keywords "Excuse me" and "Please wait" can be used to determine that the interlocutor of category 2 is the customer service, and therefore that the interlocutor of category 1 is the user; alternatively, the keywords "I want to" and "check the phone bill" can be used to determine that the interlocutor of category 1 is the user, and therefore that the interlocutor of category 2 is the customer service. This application does not specifically limit how the keywords are set.
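A hedged sketch of this keyword-based assignment follows; the keyword lists are illustrative assumptions only, since the application does not limit how the keywords are set.

```python
# Illustrative keyword-based role assignment for the audio groups.
# The keyword lists below are assumptions chosen to match the example above.
AGENT_KEYWORDS = ["Excuse me", "Please wait"]            # customer-service phrases
USER_KEYWORDS = ["I want to", "check the phone bill"]    # user phrases

def label_audio_groups(group_texts):
    """group_texts: one string per audio group (its concatenated ASR text)."""
    labels = []
    for text in group_texts:
        if any(k in text for k in AGENT_KEYWORDS):
            labels.append("customer service")
        elif any(k in text for k in USER_KEYWORDS):
            labels.append("user")
        else:
            labels.append("unknown")
    return labels

# label_audio_groups(["I want to check the phone bill Yes",
#                     "Are you checking the phone bill of this phone? Please wait ..."])
# -> ["user", "customer service"]
```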
In a specific implementation, the m pieces of monophonic data can be converted into the corresponding m pieces of text information by speech recognition methods such as Dynamic Time Warping (DTW), Hidden Markov Model (HMM) theory, Vector Quantization (VQ), or Artificial Neural Network (ANN) based methods, which is not specifically limited in this application.
It should be noted that, since the input data of the clustering process is the m pieces of monophonic data, and the input data of the speech recognition process can also be the m pieces of monophonic data, clustering and speech recognition can likewise be processed in parallel, further streamlining the overall speaker separation procedure and improving the user experience. FIG. 4 shows a schematic flowchart of an audio processing method provided by this application. As can be seen from FIG. 4, compared with the speaker separation method in the embodiment of FIG. 1, the audio processing method provided by this application only needs to train the human voice separation model and the switching point detection model on business scenario data in the training phase, so the time and labor costs of the training phase are greatly reduced; the steps in the detection phase can be processed in parallel, so the time required for speaker separation is greatly shortened; and the human voice separation model removes the interference of noise, so the accuracy is also greatly improved.
In the above method, audio data is acquired; human voice audio data and switching points are obtained according to the audio data; the human voice audio data is converted into m pieces of monophonic data according to the switching points; and the m pieces of monophonic data are clustered to obtain n audio groups, where the monophonic data of each of the n audio groups belong to the same interlocutor, so that the identity of the voice to which each of the n audio groups belongs can be confirmed. Passing the audio data through the human voice separation model removes noise and other interference factors to a great extent, thereby improving the accuracy of speaker separation in complex environments; moreover, multiple steps in the detection phase can be processed in parallel, which simplifies the speaker separation procedure and improves the separation speed.
FIG. 5 shows another audio processing method provided by this application. The audio processing method shown in FIG. 5 differs from the audio processing method shown in FIG. 2 in that the speech recognition process is performed in advance to obtain the text features of the audio data, so that the switching point detection model can jointly consider the audio features and the text features and detect the interlocutor switching points more accurately. The audio processing method shown in FIG. 5 includes the following steps:
S501: Acquire audio data.
In a specific implementation, the audio data includes noise and the human voices generated when n interlocutors conduct a conversation, where n is a positive integer. The audio data may specifically be an audio file that requires speaker separation, such as a telephone recording or a video recording, for example, a recording of a telephone conversation between a user and a customer service agent, a video recording of a meeting, and so on. In this application, noise refers to non-target human voices and other sounds that do not need to undergo the subsequent speaker separation, and may specifically be ambient sound, silence, device noise generated by the recording device, and so on. It should be noted that ambient sound may also contain human voices; for example, in the audio data generated when user A talks on the phone with customer service B in a noisy restaurant, the voices of other people in the restaurant are also human voices, but they are not the target voices for the subsequent speaker separation and are therefore also classified as noise. It should be understood that the above example is only for illustration and does not constitute a specific limitation.
S502: Convert the audio data into corresponding text data through a speech recognition method.
In a specific implementation, the audio data can be converted into the corresponding text data by speech recognition methods such as DTW, HMM theory, VQ, or ANN, which is not specifically limited in this application.
S503: Obtain human voice audio data and switching points according to the audio data and the text data.
In a specific implementation, the human voice audio data is the human voice generated when the n interlocutors conduct a conversation, obtained after the noise is removed from the audio data, and a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor. For example, suppose n = 2, that is, there are two interlocutors, user A and customer service B. In one application scenario, the audio data of the conversation between A and B can be expressed in text as: (00:12) A: "I want to check the phone bill" (00:14) B: "Are you checking the phone bill of this phone?" (00:18) A: "Yes" (00:20) B: "Please wait" (00:25) B: "Your phone bill balance is 25 yuan". The numbers before A and B represent the playback time of the current recording. In this case there are 3 switching points, which may be 00:14, 00:18, and 00:20. It should be understood that, since speech recognition has already been performed on the audio data in step S502, the switching point detection model can jointly consider audio features and text features to obtain more accurate switching points. The above example is only for illustration and does not constitute a specific limitation.
In a specific embodiment, obtaining the human voice audio data and the switching points according to the audio data and the audio text includes: inputting the audio data into a human voice separation model to obtain the human voice audio data, where the human voice separation model is a model obtained by training a neural network with known audio samples and the corresponding known human voice data samples; and, at the same time, inputting the audio data and the audio text into a switching point detection model to obtain the switching points, where the switching point detection model is a model obtained by training a neural network with known audio samples, known audio text samples, and the corresponding known switching point samples. The human voice separation model may specifically be a VAD model, which uses an event-detection-like scheme to learn the distinguishing features between human voice and non-human voice. It can be understood that the VAD detection result removes, to a great extent, interference factors such as ambient noise, non-target human voices, and device noise; compared with the embodiment of FIG. 1, which only detects silence and non-silence, the separation result finally obtained is more accurate.
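As a simplified illustration of jointly considering audio features and text features, the sketch below concatenates the two feature vectors of each analysis window and trains a plain logistic-regression classifier to predict whether a speaker switch occurs in that window. The feature extractors and the classifier are stand-ins introduced here for illustration; they are not the neural network model described by this application.

```python
# Simplified stand-in for a switching point detector over joint audio + text features.
import numpy as np
from sklearn.linear_model import LogisticRegression

def joint_features(audio_feats, text_feats):
    # audio_feats: shape (num_windows, d_audio), e.g. spectral statistics per window
    # text_feats:  shape (num_windows, d_text), e.g. averaged word features per window
    return np.concatenate([audio_feats, text_feats], axis=-1)

def train_switch_detector(audio_feats, text_feats, switch_labels):
    # switch_labels[i] is 1 if a speaker switch occurs in window i, else 0.
    X = joint_features(audio_feats, text_feats)
    return LogisticRegression(max_iter=1000).fit(X, switch_labels)

def detect_switch_windows(model, audio_feats, text_feats):
    # Windows predicted as 1 yield candidate switching time points.
    return model.predict(joint_features(audio_feats, text_feats))
```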
It should be noted that, since the input data of the speech recognition step is the audio data and the input data of the human voice separation step is also the audio data, these two steps can be performed simultaneously, followed by the switching point detection step; alternatively, the speech recognition step can be performed first, and then the human voice separation step and the switching point detection step can be performed simultaneously. In this way, tasks are executed in parallel as much as possible, and the time required for the speaker separation procedure is minimized. Moreover, the switching points are conversation time points obtained on the time axis of the audio data, so the time axis of the human voice audio data after noise removal is still the same as that of the audio data, and the human voice audio data can subsequently be converted into m pieces of monophonic data using the switching points on this shared time axis. In addition, the switching point detection model and the human voice separation model can also make full use of labeled data related to the business scenario for training, thereby further improving the accuracy of switching point detection and human voice separation.
It can be understood that the switching point detection model of this embodiment of the application is a model obtained by training a neural network with known audio samples, known audio text samples, and the corresponding known switching point samples. This switching point detection model can jointly consider audio features and text features, detect the interlocutor switching points more accurately, and further improve the accuracy of speaker separation.
S504: Convert the human voice data into m pieces of monophonic data according to the switching points.
In a specific implementation, each of the m pieces of monophonic data includes only the human voice of a single interlocutor, and m is a positive integer. Still taking the above example, after the 3 switching points are obtained, the audio data can be converted into 4 pieces of monophonic data, as shown in FIG. 3. Each piece of monophonic data contains the voice of only one interlocutor; for example, a piece of monophonic data may be the audio corresponding to the sentence "I want to check the phone bill", or the audio corresponding to the two sentences "Please wait" and "Your phone bill balance is 25 yuan". In other words, the human voice contained in each piece of monophonic data belongs only to user A or only to customer service B. It should be understood that the above example is only for illustration and does not constitute a specific limitation. In addition, since speech recognition has already been performed on the audio data in step S502, the text data can also be converted into m monophonic texts, each corresponding to a piece of monophonic data, as shown in FIG. 3, for the subsequent clustering.
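One hedged way to obtain the m monophonic texts aligned with the m monophonic segments is to cut the recognized text at the same switching points, assuming the recognizer returns word-level start times on the audio time axis; this word-timestamp format is an assumption made here for illustration.

```python
# Sketch: produce one text per monophonic segment by splitting the ASR output
# (a list of (word, start_time) pairs) at the same switching time points.
def split_text_by_switching_points(words_with_times, switching_points):
    boundaries = sorted(switching_points) + [float("inf")]
    texts, current, b = [], [], 0
    for word, start in sorted(words_with_times, key=lambda w: w[1]):
        while start >= boundaries[b]:
            # Close the current segment's text and move to the next boundary.
            texts.append(" ".join(current))
            current, b = [], b + 1
        current.append(word)
    texts.append(" ".join(current))
    return texts   # one (possibly empty) text per monophonic segment
```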
S505: Cluster the m pieces of monophonic data to obtain n audio groups.
In a specific implementation, the monophonic data of each of the n audio groups belong to the same interlocutor. Still taking the above example, after the 4 pieces of monophonic data are obtained, they are clustered into 2 audio groups, as shown in FIG. 3: one audio group contains the 2 pieces of monophonic data corresponding to "I want to check the phone bill" and "Yes", and the other audio group contains the 2 pieces of monophonic data corresponding to "Are you checking the phone bill of this phone?" and "Please wait, your phone bill balance is 25 yuan". In this way, the audio data containing the voices of n interlocutors is finally converted into n audio groups.
S506: Confirm, according to the audio text, the interlocutor to which each of the n audio groups belongs.
In a specific implementation, the interlocutor of each audio group can be determined from keywords or key phrases in the monophonic text corresponding to each piece of monophonic data. For example, still taking the above example, the keywords "Excuse me" and "Please wait" can be used to determine that the interlocutor of category 2 shown in FIG. 3 is the customer service, and therefore that the interlocutor of category 1 is the user; alternatively, the keywords "I want to" and "check the phone bill" can be used to determine that the interlocutor of category 1 shown in FIG. 3 is the user, and therefore that the interlocutor of category 2 is the customer service. This application does not specifically limit how the keywords are set.
FIG. 6 shows a schematic flowchart of an audio processing method provided by this application. As can be seen from FIG. 6, compared with the speaker separation method in the embodiment of FIG. 1, the audio processing method provided by this application only needs to train the human voice separation model and the switching point detection model on business scenario data in the training phase, so the time and labor costs of the training phase are greatly reduced; the steps in the detection phase can be processed in parallel, so the time required for speaker separation is greatly shortened; moreover, the human voice separation model removes the interference of noise, and the switching point detection model integrates text features and audio features, which greatly improves the accuracy of switching point detection and hence the accuracy of speaker separation.
In the above method, audio data is acquired; the audio data is converted into corresponding text data through a speech recognition method; human voice audio data and switching points are obtained according to the audio data and the text data; the human voice audio data is converted into m pieces of monophonic data according to the switching points; and the m pieces of monophonic data are clustered to obtain n audio groups, where the monophonic data of each of the n audio groups belong to the same interlocutor, so that the identity of the voice to which each of the n audio groups belongs can be confirmed. Passing the audio data through the human voice separation model removes noise and other interference factors to a great extent, thereby improving the accuracy of speaker separation in complex environments; moreover, multiple steps in the detection phase can be processed in parallel, which simplifies the speaker separation procedure and improves the separation speed.
图7是本申请提供的一种音频处理系统,所述音频处理系统700包括获取单元710、获得单元720、转换单元730以及聚类单元740,其中,FIG. 7 is an audio processing system provided by the present application. The audio processing system 700 includes an acquisition unit 710, an acquisition unit 720, a conversion unit 730, and a clustering unit 740, wherein:
所述获取单元710用于获取音频数据,其中,所述音频数据中包含噪声和n个对话者进行对话时产生的人声,n是正整数;The acquiring unit 710 is configured to acquire audio data, wherein the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer;
所述获得单元720用于根据所述音频数据,获得人声音频数据以及切换点,其中,所述人声音频数据为所述音频数据去掉所述噪声后得到的音频数据,所述切换点为所述n个对话者中的任一个对话者切换到另一个对话者的对话时间点;The obtaining unit 720 is configured to obtain human voice audio data and a switching point according to the audio data, where the human voice audio data is audio data obtained by removing the noise from the audio data, and the switching point is Any one of the n interlocutors switches to the conversation time point of another interlocutor;
所述转换单元730用于切换点,将所述人声音频数据转换为m个单声数据,其中,所述m个单声数据中的每个单声数据只包含单个对话者的人声,m是正整数;The conversion unit 730 is used for switching points to convert the human voice audio data into m monophonic data, wherein each monophonic data in the m monophonic data only contains the human voice of a single interlocutor, m is a positive integer;
所述聚类单元740用于将所述m个单声数据进行聚类,得到n个音频组,其中,所述n个音频组中的每个音频组中的单声数据属于同一个对话者。The clustering unit 740 is configured to cluster the m monophonic data to obtain n audio groups, wherein the monophonic data in each audio group of the n audio groups belong to the same interlocutor .
In specific implementation, the audio data includes noise and the human voices generated when n interlocutors hold a conversation, and n is a positive integer. The audio data may specifically be an audio file that requires speaker separation, such as a telephone recording or a video recording, for example the recording of a call between a user and a customer service agent, or the video recording of a meeting. In this application, noise refers to non-target human voices that do not need subsequent speaker separation, and may specifically be ambient sound, silence, device noise produced by the recording equipment, and so on. It should be noted that ambient sound may itself contain human voices: for example, in audio data generated when user A talks on the phone with customer service agent B in a noisy restaurant, the voices of other people in the restaurant are also human voices, but they are not target voices for the subsequent speaker separation and are therefore also classified as noise. It should be understood that the above example is for illustration only and does not constitute a specific limitation.

In specific implementation, the human voice audio data is the voices generated when the n interlocutors hold the conversation, obtained after removing the noise from the audio data, and a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor, that is, the time point at which the conversation passes from one interlocutor to the next. For example, suppose n = 2, i.e. two interlocutors, user A and customer service agent B. In a certain application scenario, the audio data of the conversation between A and B can be expressed in text as: (00:12) A: "I want to check my phone bill" (00:14) B: "Are you checking the bill for this number?" (00:18) A: "Yes" (00:20) B: "Please wait a moment" (00:25) B: "Your remaining balance is 25 yuan". The numbers before A and B represent the playback time of the recording. In this case there are 3 switching points, which may be 00:14, 00:18 and 00:20. It should be understood that, to express the main idea of the audio processing method provided by this application more clearly, the dialogue content is shown directly as text in this example; in actual processing, however, since no speech recognition has been performed in step S202, the switching points are obtained only from the audio data of the conversation between A and B, without knowledge of its content. Moreover, the above example is for illustration only and does not constitute a specific limitation.

In a specific implementation, the obtaining unit 720 is configured to: input the audio data into a human voice separation model to obtain the human voice audio data, where the human voice separation model is a model obtained by training a neural network with known audio samples and corresponding known human voice data samples; and, at the same time, input the audio data into a switching point detection model to obtain the switching points, where the switching point detection model is a model obtained by training a neural network with known audio samples and corresponding known switching point samples. The human voice separation model may specifically be a Voice Activity Detection (VAD) model, which uses an event-detection-like scheme to learn the features that distinguish human voice from non-human voice. It can be understood that the VAD detection result removes interference factors such as ambient noise, non-target human voices and device noise to a great extent; compared with the embodiment of Figure 1, which only distinguishes silence from non-silence, the final separation result is more accurate.
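The human voice separation model itself is trained on labeled audio; as a purely illustrative sketch of the idea of keeping only the frames classified as human voice, the following fragment uses a simple short-time-energy decision in place of the trained neural model. The frame length, the energy threshold and the float mono waveform input are assumptions made for the example only.

```python
# Illustrative stand-in for the trained VAD-style human voice separation model:
# keep only frames whose short-time energy exceeds a threshold.
import numpy as np

def simple_vad(waveform: np.ndarray, sample_rate: int,
               frame_ms: int = 30, energy_threshold: float = 1e-4):
    """Return (voiced_mask, frame_start_times_in_seconds), one entry per frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(waveform) // frame_len
    voiced, times = [], []
    for i in range(n_frames):
        frame = waveform[i * frame_len:(i + 1) * frame_len]
        energy = float(np.mean(frame ** 2))          # short-time energy of this frame
        voiced.append(energy > energy_threshold)
        times.append(i * frame_ms / 1000.0)
    return np.array(voiced), np.array(times)

# The human voice audio data is then the concatenation of the frames whose mask
# is True; a trained neural VAD would replace the energy test with a learned classifier.
```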
It should be noted that in step S202, since the input to both the human voice separation model and the switching point detection model is the audio data, human voice separation and switching point detection can be performed in parallel without interfering with each other, which reduces the time required for speaker separation. Moreover, the switching points are conversation time points obtained on the time axis of the audio data, so the time axis of the human voice audio data after noise removal remains the same as that of the original audio data, and the switching points can subsequently be used on this shared time axis to convert the human voice audio data into m pieces of mono data. In addition, the switching point detection model and the human voice separation model can make full use of labeled data related to the business scenario for training, which further improves the accuracy of switching point detection and human voice separation.
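Since both models consume the same audio data, the two inferences can be dispatched concurrently; the following minimal sketch uses a thread pool for illustration, and the vad_model and switch_model objects with a predict() method are assumed placeholders rather than interfaces defined by this application.

```python
# Illustrative sketch: run human voice separation and switching point detection
# in parallel, since both take the same audio data as input.
from concurrent.futures import ThreadPoolExecutor

def separate_and_detect(audio_data, vad_model, switch_model):
    """vad_model / switch_model are assumed placeholders exposing .predict()."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        voice_future = pool.submit(vad_model.predict, audio_data)      # human voice audio data
        switch_future = pool.submit(switch_model.predict, audio_data)  # switching points (seconds)
        return voice_future.result(), switch_future.result()
```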
In specific implementation, each of the m pieces of mono data includes the voice of only a single interlocutor, and m is a positive integer. Continuing the above example, after the 3 switching points are obtained, the audio data can be converted into 4 pieces of mono data, which may be as shown in Figure 3. Each piece of mono data contains the voice of only one interlocutor; for example, a piece of mono data may be the audio corresponding to the sentence "I want to check my phone bill", or the audio corresponding to the two consecutive sentences "Please wait a moment" and "Your remaining balance is 25 yuan". In other words, the voice contained in a piece of mono data belongs to either user A or customer service agent B alone. It should be understood that the above example is for illustration only and does not constitute a specific limitation. In addition, Figure 3 presents the mono data in text form only to make the definition of mono data easier to understand; in practice, since no speech recognition has been performed in step S203, a piece of mono data is merely audio and contains no text information.
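As an illustrative sketch of this cutting step, the fragment below slices the noise-free waveform at the switching points on the shared time axis; representing the audio as a NumPy waveform plus sample rate, with switching points given in seconds, is an assumption made for the example.

```python
# Illustrative sketch: cut the human voice waveform into m mono segments at the
# switching points (given in seconds on the same time axis as the waveform).
import numpy as np

def split_at_switch_points(waveform: np.ndarray, sample_rate: int,
                           switch_points: list[float]) -> list[np.ndarray]:
    boundaries = [0] + [int(t * sample_rate) for t in switch_points] + [len(waveform)]
    return [waveform[s:e] for s, e in zip(boundaries[:-1], boundaries[1:]) if e > s]

# With 3 switching points, as in the example above, the call returns 4 mono
# segments, each containing the voice of a single interlocutor.
```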
S204: Cluster the m pieces of mono data to obtain n audio groups.

In specific implementation, the mono data of each of the n audio groups belongs to the same interlocutor. Continuing the above example, after the 4 pieces of mono data are obtained, they are clustered into 2 audio groups, as shown in Figure 3: one audio group consists of the 2 pieces of mono data corresponding to "I want to check my phone bill" and "Yes", and the other consists of the 2 pieces corresponding to "Are you checking the bill for this number?" and "Please wait a moment, your remaining balance is 25 yuan". In this way the audio data containing the voices of n interlocutors is finally converted into n audio groups.
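A minimal sketch of this clustering step is given below: each mono segment is reduced to a feature vector and the vectors are grouped with agglomerative clustering. The embed_segment() feature extractor is an assumed placeholder (averaged MFCCs or a speaker embedding would be typical choices), and scikit-learn is used only for illustration.

```python
# Illustrative sketch: cluster m mono segments into n audio groups.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def embed_segment(segment: np.ndarray) -> np.ndarray:
    # Placeholder per-segment feature vector; a real system would use MFCC
    # statistics or a learned speaker embedding instead.
    return np.array([segment.mean(), segment.std(), np.abs(segment).max()])

def cluster_segments(segments: list[np.ndarray], n_speakers: int) -> list[int]:
    """Return one group label per mono segment; segments sharing a label form
    the audio group of one interlocutor."""
    features = np.stack([embed_segment(s) for s in segments])
    labels = AgglomerativeClustering(n_clusters=n_speakers).fit_predict(features)
    return labels.tolist()
```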
In a specific implementation, the system further includes a confirmation unit 750. The confirmation unit 750 is configured to: after the m pieces of mono data are clustered into the n audio groups, convert the m pieces of mono data into corresponding m pieces of text information by a speech recognition method; and confirm, according to the m pieces of text information, the interlocutor to which each of the n audio groups belongs. In specific implementation, the interlocutor of each audio group can be determined from keywords or key phrases in the text information. For example, continuing the above example, the keywords "Excuse me" and "Please wait a moment" can be used to determine that the interlocutor of category 2 is the customer service agent, and hence that the interlocutor of category 1 is the user; alternatively, the keywords "I want to" and "check my phone bill" can be used to determine that the interlocutor of category 1 is the user, and hence that the interlocutor of category 2 is the customer service agent, and so on. This application does not specifically limit how the keywords are set.

In specific implementation, the m pieces of mono data can be converted into the corresponding m pieces of text information by speech recognition methods such as Dynamic Time Warping (DTW), Hidden Markov Model (HMM) theory, Vector Quantization (VQ) techniques, or Artificial Neural Networks (ANN); this application does not specifically limit the choice.
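Of the recognition approaches listed above, DTW is the simplest to illustrate: the feature sequence of a mono segment is compared against labeled template sequences and the transcript of the nearest template is returned. In the sketch below, the per-frame feature sequences (for example MFCC vectors) and the template dictionary are assumptions made for the example.

```python
# Illustrative DTW-based recognition sketch: pick the labeled template whose
# dynamic-time-warping distance to the segment's feature sequence is smallest.
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """a: (Ta, d) and b: (Tb, d) feature sequences; returns the DTW alignment cost."""
    ta, tb = len(a), len(b)
    cost = np.full((ta + 1, tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, ta + 1):
        for j in range(1, tb + 1):
            d = float(np.linalg.norm(a[i - 1] - b[j - 1]))
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[ta, tb])

def recognize(segment_feats: np.ndarray, templates: dict[str, np.ndarray]) -> str:
    """Return the transcript of the template closest to the segment."""
    return min(templates, key=lambda text: dtw_distance(segment_feats, templates[text]))
```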
It should be noted that, since the input to the clustering process is the m pieces of mono data and the input to the speech recognition process can also be the m pieces of mono data, clustering and speech recognition can likewise be processed in parallel, further streamlining the whole speaker separation pipeline and improving the user experience. As can be seen from Figure 7, compared with the speaker separation system of the embodiment of Figure 1, the audio processing system provided by this application only needs to train the human voice separation model and the switching point detection model on business-scenario data during the training phase, which greatly reduces the time and labor cost of training; the steps of the detection phase can be processed in parallel, which greatly shortens the time required for speaker separation; moreover, the human voice separation model removes the interference of noise, which also greatly improves accuracy.

In the above system, audio data is acquired; human voice audio data and switching points are obtained according to the audio data; the human voice audio data is converted into m pieces of mono data according to the switching points; and the m pieces of mono data are clustered into n audio groups, where the mono data of each of the n audio groups belongs to the same interlocutor, so that the identity of the voice to which each of the n audio groups belongs can be confirmed. Passing the audio data through the human voice separation model removes noise and other interference to a great extent, improving the accuracy of speaker separation in complex environments; in addition, several steps of the detection phase can be processed in parallel, which simplifies the speaker separation pipeline and increases the separation speed.
Figure 8 shows another audio processing system provided by this application. The audio processing system of Figure 8 differs from that of Figure 7 in that the speech recognition process is performed in advance to obtain the text features of the audio data, so that the switching point detection model can consider audio features and text features together and detect the interlocutor switching points more accurately. The audio processing system 800 shown in Figure 8 includes an acquisition unit 810, a first conversion unit 820, an obtaining unit 830, a second conversion unit 840, a clustering unit 850 and a confirmation unit 860, wherein:

the acquisition unit 810 is configured to acquire audio data, where the audio data includes noise and the human voices generated when n interlocutors hold a conversation, and n is a positive integer;

the first conversion unit 820 is configured to convert the audio data into audio text by a speech recognition method;

the obtaining unit 830 is configured to obtain human voice audio data and switching points according to the audio data and the audio text, where the human voice audio data is the audio data obtained after removing the noise from the audio data, and a switching point is a conversation time point at which any one of the n interlocutors switches to another interlocutor;

the second conversion unit 840 is configured to convert the human voice audio data into m pieces of mono data according to the human voice audio data and the switching points, where each of the m pieces of mono data includes the voice of only a single interlocutor, and m is a positive integer;

the clustering unit 850 is configured to cluster the m pieces of mono data to obtain n audio groups, where the mono data of each of the n audio groups belongs to the same interlocutor;

the confirmation unit 860 is configured to confirm, according to the audio text, the interlocutor to which each of the n audio groups belongs.
In specific implementation, the audio data includes noise and the human voices generated when n interlocutors hold a conversation, and n is a positive integer. The audio data may specifically be an audio file that requires speaker separation, such as a telephone recording or a video recording, for example the recording of a call between a user and a customer service agent, or the video recording of a meeting. In this application, noise refers to non-target human voices that do not need subsequent speaker separation, and may specifically be ambient sound, silence, device noise produced by the recording equipment, and so on. It should be noted that ambient sound may itself contain human voices: for example, in audio data generated when user A talks on the phone with customer service agent B in a noisy restaurant, the voices of other people in the restaurant are also human voices, but they are not target voices for the subsequent speaker separation and are therefore also classified as noise. It should be understood that the above example is for illustration only and does not constitute a specific limitation.

In specific implementation, the audio data can be converted into the corresponding text data by speech recognition methods such as DTW, HMM theory, VQ techniques or ANN; this application does not specifically limit the choice.

In specific implementation, the human voice audio data is the voices generated when the n interlocutors hold the conversation, obtained after removing the noise from the audio data, and a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor, that is, the time point at which the conversation passes from one interlocutor to the next. For example, suppose n = 2, i.e. two interlocutors, user A and customer service agent B. In a certain application scenario, the audio data of the conversation between A and B can be expressed in text as: (00:12) A: "I want to check my phone bill" (00:14) B: "Are you checking the bill for this number?" (00:18) A: "Yes" (00:20) B: "Please wait a moment" (00:25) B: "Your remaining balance is 25 yuan". The numbers before A and B represent the playback time of the recording. In this case there are 3 switching points, which may be 00:14, 00:18 and 00:20. It should be understood that, since speech recognition has already been performed on the audio data in step S502, the switching point detection model can consider audio features and text features together and obtain more accurate switching points. The above example is for illustration only and does not constitute a specific limitation.
In a specific implementation, the obtaining unit 830 is configured to: input the audio data into a human voice separation model to obtain the human voice audio data, where the human voice separation model is a model obtained by training a neural network with known audio samples and corresponding known human voice data samples; and, at the same time, input the audio data and the audio text into a switching point detection model to obtain the switching points, where the switching point detection model is a model obtained by training a neural network with known audio samples, known audio text samples and corresponding known switching point samples. The human voice separation model may specifically be a VAD model, which uses an event-detection-like scheme to learn the features that distinguish human voice from non-human voice. It can be understood that the VAD detection result removes interference factors such as ambient noise, non-target human voices and device noise to a great extent; compared with the embodiment of Figure 1, which only distinguishes silence from non-silence, the final separation result is more accurate.

It should be noted that, since the input to the speech recognition step is the audio data and the input to the human voice separation step is also the audio data, these two steps can be performed simultaneously, followed by the switching point detection step. Alternatively, the speech recognition step can be performed first, followed by the human voice separation step and the switching point detection step performed simultaneously. Tasks are thus executed in parallel as far as possible, minimizing the time required by the speaker separation pipeline. Moreover, the switching points are conversation time points obtained on the time axis of the audio data, so the time axis of the human voice audio data after noise removal remains the same as that of the original audio data, and the switching points can subsequently be used on this shared time axis to convert the human voice audio data into m pieces of mono data. In addition, the switching point detection model and the human voice separation model can make full use of labeled data related to the business scenario for training, which further improves the accuracy of switching point detection and human voice separation.

It can be understood that the switching point detection model of this embodiment is a model obtained by training a neural network with known audio samples, known audio text samples and corresponding known switching point samples; it can therefore consider audio features and text features together, detect the interlocutor switching points more accurately, and further improve the accuracy of speaker separation.
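As a purely illustrative sketch of a detector that fuses the two feature streams, the fragment below concatenates per-frame audio features with aligned text features and scores each frame as switch / no-switch. The feature dimensions, the frame alignment of the text features and the network shape are assumptions made for the example, not the concrete architecture of the model described above.

```python
# Illustrative sketch of a switching point detector fusing audio and text features.
import torch
import torch.nn as nn

class SwitchPointDetector(nn.Module):
    def __init__(self, audio_dim: int = 40, text_dim: int = 128, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + text_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),   # one switch / no-switch logit per frame
        )

    def forward(self, audio_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (frames, audio_dim); text_feats: (frames, text_dim),
        # assumed to be aligned to the same frame grid as the audio.
        fused = torch.cat([audio_feats, text_feats], dim=-1)
        return self.net(fused).squeeze(-1)  # frames above a threshold become switching points

# Training on known audio samples, known audio text samples and known switching
# point samples would use a standard binary cross-entropy loss (not shown).
```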
In specific implementation, each of the m pieces of mono data includes the voice of only a single interlocutor, and m is a positive integer. Continuing the above example, after the 3 switching points are obtained, the audio data can be converted into 4 pieces of mono data, which may be as shown in Figure 3. Each piece of mono data contains the voice of only one interlocutor; for example, a piece of mono data may be the audio corresponding to the sentence "I want to check my phone bill", or the audio corresponding to the two consecutive sentences "Please wait a moment" and "Your remaining balance is 25 yuan". In other words, the voice contained in a piece of mono data belongs to either user A or customer service agent B alone. It should be understood that the above example is for illustration only and does not constitute a specific limitation. Furthermore, since speech recognition has already been performed on the audio data in step S502, the text data can also be divided into m pieces of mono text, each piece of mono text corresponding to a piece of mono data, as shown in Figure 3, for use in the subsequent clustering.

In specific implementation, the mono data of each of the n audio groups belongs to the same interlocutor. Continuing the above example, after the 4 pieces of mono data are obtained, they are clustered into 2 audio groups, as shown in Figure 3: one audio group consists of the 2 pieces of mono data corresponding to "I want to check my phone bill" and "Yes", and the other consists of the 2 pieces corresponding to "Are you checking the bill for this number?" and "Please wait a moment, your remaining balance is 25 yuan". In this way the audio data containing the voices of n interlocutors is finally converted into n audio groups.

In specific implementation, the interlocutor of each audio group can be determined from keywords or key phrases in the mono text corresponding to each piece of mono data. For example, continuing the above example, the keywords "Excuse me" and "Please wait a moment" can be used to determine that the interlocutor of category 2 in Figure 3 is the customer service agent, and hence that the interlocutor of category 1 is the user; alternatively, the keywords "I want to" and "check my phone bill" can be used to determine that the interlocutor of category 1 in Figure 3 is the user, and hence that the interlocutor of category 2 is the customer service agent, and so on. This application does not specifically limit how the keywords are set.

It can be understood from Figure 8 that, compared with the speaker separation system of the embodiment of Figure 1, the audio processing system provided by this application only needs to train the human voice separation model and the switching point detection model on business-scenario data during the training phase, which greatly reduces the time and labor cost of training; the steps of the detection phase can be processed in parallel, which greatly shortens the time required for speaker separation; moreover, the human voice separation model removes the interference of noise, and the switching point detection model combines text features with audio features, which greatly improves the accuracy of switching point detection and thus the accuracy of speaker separation.

In the above system, audio data is acquired; the audio data is converted into corresponding text data by a speech recognition method; human voice audio data and switching points are obtained according to the audio data and the text data; the human voice audio data is converted into m pieces of mono data according to the switching points; and the m pieces of mono data are clustered into n audio groups, where the mono data of each of the n audio groups belongs to the same interlocutor, so that the identity of the voice to which each of the n audio groups belongs can be confirmed. Passing the audio data through the human voice separation model removes noise and other interference to a great extent, improving the accuracy of speaker separation in complex environments; in addition, several steps of the detection phase can be processed in parallel, which simplifies the speaker separation pipeline and increases the separation speed.
Referring to Figure 9, Figure 9 is a schematic structural diagram of a server provided in this application. The server can implement the method of the embodiment of Figure 2 or the embodiment of Figure 5. It should be noted that the data processing method provided by this application may be implemented in a cloud service cluster as shown in Figure 9, or in a single computing node and storage node; this application does not specifically limit this. The cloud service cluster includes at least one computing node 910 and at least one storage node 920.

The computing node 910 includes one or more processors 911, a communication interface 912 and a memory 913. The processor 911, the communication interface 912 and the memory 913 may be connected through a bus 914.

The processor 911 includes one or more general-purpose processors, where a general-purpose processor may be any type of device capable of processing electronic instructions, including a central processing unit (CPU), a microprocessor, a microcontroller, a main processor, a controller, an application-specific integrated circuit (ASIC) and so on. It may be a dedicated processor used only by the computing node 910, or it may be shared with other computing nodes 910. The processor 911 executes various types of digitally stored instructions, such as software or firmware programs stored in the memory 913, which enable the computing node 910 to provide a wide variety of services. For example, the processor 911 can execute the code of modules such as clustering, human voice separation and switching point detection to perform at least part of the methods discussed herein. The communication interface 912 may be a wired interface (for example an Ethernet interface) for communicating with other computing nodes or users. When the communication interface 912 is a wired interface, it may adopt a protocol family over TCP/IP, for example the RAAS protocol, the Remote Function Call (RFC) protocol, the Simple Object Access Protocol (SOAP), the Simple Network Management Protocol (SNMP), the Common Object Request Broker Architecture (CORBA) protocol, distributed protocols, and so on. The memory 913 may include volatile memory, such as random access memory (RAM); it may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD) or a solid-state drive (SSD), and may also include a combination of the above kinds of memory.

The storage node 920 includes one or more storage controllers 921 and a storage array 922. The storage controller 921 and the storage array 922 may be connected through a bus 924. The storage controller 921 includes one or more general-purpose processors, where a general-purpose processor may be any type of device capable of processing electronic instructions, including a CPU, a microprocessor, a microcontroller, a main processor, a controller, an ASIC and so on. It may be a dedicated processor used only by a single storage node 920, or it may be shared with the computing node 910 or other storage nodes 920. It can be understood that in this embodiment each storage node includes one storage controller; in other embodiments, multiple storage nodes may also share one storage controller, which is not specifically limited here. The storage array 922 may include multiple memories 923. The memory 923 may be non-volatile memory, such as ROM, flash memory, HDD or SSD, and may also include a combination of the above kinds of memory. For example, the storage array may be composed of multiple HDDs or multiple SSDs, or of HDDs together with SSDs. With the assistance of the storage controller 921, the multiple memories are combined in different ways to form a memory group, thereby providing higher storage performance than a single memory as well as data backup capability. Optionally, the storage array may include one or more data centers. The multiple data centers may be located at the same site or at different sites, which is not specifically limited here. The storage array may store program code and program data. The program code includes speech recognition module code, semantic understanding module code, order production module code, and so on. The program data includes clustering module data, human voice separation module data, switching point detection module data and so on, and may also include the human voice separation model, the switching point detection model and the corresponding training sample set data; this application does not specifically limit this.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center in a wired manner (such as coaxial cable, optical fiber or digital subscriber line) or a wireless manner (such as infrared, radio or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium (for example a floppy disk, a storage disk or a magnetic tape), an optical medium (for example a DVD), or a semiconductor medium (for example a solid-state disk (SSD)), and so on.

A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The program may be stored in a computer-readable storage medium, and when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.

The above are only several embodiments of this application. Based on the disclosure of the application documents, those skilled in the art may make various changes or modifications to this application without departing from the spirit and scope of this application.

Claims (19)

  1. An audio processing method, characterized in that the method comprises:
    acquiring audio data, wherein the audio data includes noise and human voices generated when n interlocutors hold a conversation, and n is a positive integer;
    obtaining human voice audio data and switching points according to the audio data, wherein the human voice audio data is the voices generated when the n interlocutors hold the conversation, obtained after removing the noise from the audio data, and a switching point is a conversation time point at which any one of the n interlocutors switches to another interlocutor;
    converting the human voice audio data into m pieces of mono data according to the switching points, wherein each of the m pieces of mono data includes the voice of only a single interlocutor, and m is a positive integer;
    clustering the m pieces of mono data to obtain n audio groups, wherein the mono data of each of the n audio groups belongs to the same interlocutor.

  2. The method according to claim 1, characterized in that, after the clustering of the m pieces of mono data to obtain the n audio groups, the method further comprises:
    converting the m pieces of mono data into corresponding m pieces of text information by a speech recognition method;
    confirming, according to the m pieces of text information, the interlocutor to which each of the n audio groups belongs.

  3. The method according to claim 1, characterized in that the obtaining of the human voice audio data and the switching points according to the audio data comprises:
    inputting the audio data into a human voice separation model to obtain the human voice audio data, wherein the human voice separation model is a model obtained by training a neural network with known audio samples and corresponding known human voice data samples; and, at the same time,
    inputting the audio data into a switching point detection model to obtain the switching points, wherein the switching point detection model is a model obtained by training a neural network with known audio samples and corresponding known switching point samples.
  4. An audio processing method, characterized in that the method comprises:
    acquiring audio data, wherein the audio data includes noise and human voices generated when n interlocutors hold a conversation, and n is a positive integer;
    converting the audio data into corresponding text data by a speech recognition method;
    obtaining human voice audio data and switching points according to the audio data and the text data, wherein the human voice audio data is the audio data obtained after removing the noise from the audio data, and a switching point is a conversation time point at which any one of the n interlocutors switches to another interlocutor;
    converting the human voice data into m pieces of mono data according to the switching points, wherein each of the m pieces of mono data includes the voice of only a single interlocutor, and m is a positive integer;
    clustering the m pieces of mono data to obtain n audio groups, wherein the mono data of each of the n audio groups belongs to the same interlocutor;
    confirming, according to the audio text, the interlocutor to which each of the n audio groups belongs.

  5. The method according to claim 4, characterized in that the obtaining of the human voice audio data and the switching points according to the audio data and the audio text comprises:
    inputting the audio data into a human voice separation model to obtain the human voice audio data, wherein the human voice separation model is a model obtained by training a neural network with known audio samples and corresponding known human voice data samples; and, at the same time,
    inputting the audio data and the audio text into a switching point detection model to obtain the switching points, wherein the switching point detection model is a model obtained by training a neural network with known audio samples, known audio text samples and corresponding known switching point samples.
  6. An audio processing system, characterized in that the system comprises:
    an acquisition unit, configured to acquire audio data, wherein the audio data includes noise and human voices generated when n interlocutors hold a conversation, and n is a positive integer;
    an obtaining unit, configured to obtain human voice audio data and switching points according to the audio data, wherein the human voice audio data is the audio data obtained after removing the noise from the audio data, and a switching point is a conversation time point at which any one of the n interlocutors switches to another interlocutor;
    a conversion unit, configured to convert the human voice audio data into m pieces of mono data according to the switching points, wherein each of the m pieces of mono data contains the voice of only a single interlocutor, and m is a positive integer;
    a clustering unit, configured to cluster the m pieces of mono data to obtain n audio groups, wherein the mono data in each of the n audio groups belongs to the same interlocutor.

  7. The system according to claim 6, characterized in that the system further comprises:
    a confirmation unit, configured to convert the m pieces of mono data into corresponding m pieces of text information by a speech recognition method,
    and to confirm, according to the m pieces of text information, the interlocutor to which each of the n audio groups belongs.

  8. The system according to claim 6, characterized in that the obtaining unit is further configured to input the audio data into a human voice separation model to obtain the human voice audio data, wherein the human voice separation model is a model obtained by training a neural network with known audio samples and corresponding known human voice data samples; and, at the same time,
    to input the audio data into a switching point detection model to obtain the switching points, wherein the switching point detection model is a model obtained by training a neural network with known audio samples and corresponding known switching point samples.
  9. An audio processing system, characterized in that the system comprises:
    an acquisition unit, configured to acquire audio data, wherein the audio data includes noise and human voices generated when n interlocutors hold a conversation, and n is a positive integer;
    a first conversion unit, configured to convert the audio data into audio text by a speech recognition method;
    an obtaining unit, configured to obtain human voice audio data and switching points according to the audio data and the audio text, wherein the human voice audio data is the audio data obtained after removing the noise from the audio data, and a switching point is a conversation time point at which any one of the n interlocutors switches to another interlocutor;
    a second conversion unit, configured to convert the human voice audio data into m pieces of mono data according to the human voice audio data and the switching points, wherein each of the m pieces of mono data includes the voice of only a single interlocutor, and m is a positive integer;
    a clustering unit, configured to cluster the m pieces of mono data to obtain n audio groups, wherein the mono data of each of the n audio groups belongs to the same interlocutor;
    a confirmation unit, configured to confirm, according to the audio text, the interlocutor to which each of the n audio groups belongs.

  10. The system according to claim 9, characterized in that the obtaining unit is further configured to input the audio data into a human voice separation model to obtain the human voice audio data, wherein the human voice separation model is a model obtained by training a neural network with known audio samples and corresponding known human voice data samples; and, at the same time,
    to input the audio data and the audio text into a switching point detection model to obtain the switching points, wherein the switching point detection model is a model obtained by training a neural network with known audio samples, known audio text samples and corresponding known switching point samples.
  11. A server, characterized by comprising a processor and a memory, wherein the memory is configured to store instructions, the processor is configured to execute the instructions, and when executing the instructions the processor performs the method according to any one of claims 1 to 3.

  12. A server, characterized by comprising a processor and a memory, wherein the memory is configured to store instructions, the processor is configured to execute the instructions, and when executing the instructions the processor performs the following steps:
    acquiring audio data, wherein the audio data includes noise and human voices generated when n interlocutors hold a conversation, and n is a positive integer;
    converting the audio data into corresponding text data by a speech recognition method;
    obtaining human voice audio data and switching points according to the audio data and the text data, wherein the human voice audio data is the audio data obtained after removing the noise from the audio data, and a switching point is a conversation time point at which any one of the n interlocutors switches to another interlocutor;
    converting the human voice data into m pieces of mono data according to the switching points, wherein each of the m pieces of mono data includes the voice of only a single interlocutor, and m is a positive integer;
    clustering the m pieces of mono data to obtain n audio groups, wherein the mono data of each of the n audio groups belongs to the same interlocutor;
    confirming, according to the audio text, the interlocutor to which each of the n audio groups belongs.

  13. The server according to claim 12, characterized in that when executing the instructions the processor further performs the following steps:
    inputting the audio data into a human voice separation model to obtain the human voice audio data, wherein the human voice separation model is a model obtained by training a neural network with known audio samples and corresponding known human voice data samples; and, at the same time,
    inputting the audio data and the audio text into a switching point detection model to obtain the switching points, wherein the switching point detection model is a model obtained by training a neural network with known audio samples, known audio text samples and corresponding known switching point samples.
  14. A non-transitory computer storage medium, characterized in that the non-transitory computer storage medium stores a computer program, and when the computer program is executed by a computing device, the method according to any one of claims 1 to 3 is implemented.

  15. A non-transitory computer storage medium, characterized in that the non-transitory computer storage medium stores a computer program, and when the computer program is executed by a computing device, the following steps are implemented:
    acquiring audio data, wherein the audio data includes noise and human voices generated when n interlocutors hold a conversation, and n is a positive integer;
    converting the audio data into corresponding text data by a speech recognition method;
    obtaining human voice audio data and switching points according to the audio data and the text data, wherein the human voice audio data is the audio data obtained after removing the noise from the audio data, and a switching point is a conversation time point at which any one of the n interlocutors switches to another interlocutor;
    converting the human voice data into m pieces of mono data according to the switching points, wherein each of the m pieces of mono data includes the voice of only a single interlocutor, and m is a positive integer;
    clustering the m pieces of mono data to obtain n audio groups, wherein the mono data of each of the n audio groups belongs to the same interlocutor;
    confirming, according to the audio text, the interlocutor to which each of the n audio groups belongs.

  16. The non-transitory computer storage medium according to claim 15, characterized in that when the computer program is executed by the computing device, the following steps are further implemented:
    inputting the audio data into a human voice separation model to obtain the human voice audio data, wherein the human voice separation model is a model obtained by training a neural network with known audio samples and corresponding known human voice data samples; and, at the same time,
    inputting the audio data and the audio text into a switching point detection model to obtain the switching points, wherein the switching point detection model is a model obtained by training a neural network with known audio samples, known audio text samples and corresponding known switching point samples.
  17. 一种计算机程序产品,其特征在于,当所述计算机程序产品被计算机读取并执行时,实现如权利要求1至3中任一项所述的方法。A computer program product, characterized in that, when the computer program product is read and executed by a computer, the method according to any one of claims 1 to 3 is implemented.
  18. A computer program product, wherein when the computer program product is read and executed by a computer, the following steps are implemented:
    acquiring audio data, wherein the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer;
    converting the audio data into corresponding text data through a speech recognition method;
    obtaining human voice audio data and switching points according to the audio data and the text data, wherein the human voice audio data is the audio data obtained after the noise is removed from the audio data, and a switching point is a conversation time point at which any one of the n interlocutors switches to another interlocutor;
    converting the human voice audio data into m pieces of monophonic data according to the switching points, wherein each of the m pieces of monophonic data includes only the human voice of a single interlocutor, and m is a positive integer;
    clustering the m pieces of monophonic data to obtain n audio groups, wherein the monophonic data in each of the n audio groups belongs to the same interlocutor; and
    determining, according to the text data, the interlocutor to which each of the n audio groups belongs.
  19. The computer program product according to claim 18, wherein when the computer program product is read and executed by a computer, the following steps are further implemented:
    inputting the audio data into a human voice separation model to obtain the human voice audio data, wherein the human voice separation model is a model obtained by training a neural network using known audio samples and corresponding known human voice data samples; and, at the same time,
    inputting the audio data and the text data into a switching point detection model to obtain the switching points, wherein the switching point detection model is a model obtained by training a neural network using known audio samples, known audio text samples and corresponding known switching point samples.
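The storage-medium and program-product claims above restate one processing pipeline: acquire noisy multi-speaker audio, transcribe it, remove the noise and detect speaker switching points, cut the de-noised audio at those points into single-speaker segments, cluster the segments into one group per interlocutor, and attribute each group using the text. The Python sketch below is only an illustration of that flow, not part of the disclosure; asr_transcribe, separate_voice, detect_switch_points and embed_segment are hypothetical placeholders standing in for the trained models referred to in claims 16 and 19, and agglomerative clustering is assumed merely as one conventional way to group speaker embeddings.

    # Illustrative sketch only; the callables passed in are hypothetical placeholders.
    from dataclasses import dataclass
    from typing import Callable, List, Sequence

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering


    @dataclass
    class MonoSegment:
        """A stretch of de-noised audio containing a single interlocutor's voice."""
        start: float          # seconds from the start of the recording
        end: float
        samples: np.ndarray   # mono waveform covering [start, end)


    def diarize(audio: np.ndarray,
                sample_rate: int,
                n_speakers: int,
                asr_transcribe: Callable[[np.ndarray, int], str],
                separate_voice: Callable[[np.ndarray, int], np.ndarray],
                detect_switch_points: Callable[[np.ndarray, int, str], Sequence[float]],
                embed_segment: Callable[[np.ndarray, int], np.ndarray]) -> List[List[MonoSegment]]:
        """Follow the claimed steps: ASR, de-noising plus switch-point detection,
        splitting into single-speaker segments, and clustering into n groups."""
        # Steps 1-2: obtain the text data for the whole recording.
        text = asr_transcribe(audio, sample_rate)

        # Step 3: human voice audio data (noise removed) and the switching points,
        # the latter conditioned on both the audio and the text.
        voice = separate_voice(audio, sample_rate)
        switch_times = sorted(detect_switch_points(audio, sample_rate, text))

        # Step 4: cut the de-noised audio at every switching point, giving m
        # single-speaker ("monophonic") segments.
        boundaries = [0.0] + list(switch_times) + [len(voice) / sample_rate]
        segments = [
            MonoSegment(s, e, voice[int(s * sample_rate):int(e * sample_rate)])
            for s, e in zip(boundaries[:-1], boundaries[1:]) if e > s
        ]

        # Step 5: cluster the m segments into n audio groups using speaker embeddings.
        embeddings = np.stack([embed_segment(seg.samples, sample_rate) for seg in segments])
        labels = AgglomerativeClustering(n_clusters=n_speakers).fit_predict(embeddings)

        groups: List[List[MonoSegment]] = [[] for _ in range(n_speakers)]
        for seg, label in zip(segments, labels):
            groups[label].append(seg)

        # Step 6 (attributing each group to a named interlocutor from the text data)
        # would inspect `text` for role cues and is omitted from this sketch.
        return groups
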
PCT/CN2019/130550 2019-05-28 2019-12-31 Audio processing method, system and related device WO2020238209A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910453493.7A CN110335621A (en) 2019-05-28 2019-05-28 Audio processing method, system and related device
CN201910453493.7 2019-05-28

Publications (1)

Publication Number Publication Date
WO2020238209A1 true WO2020238209A1 (en) 2020-12-03

Family

ID=68140272

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/130550 WO2020238209A1 (en) 2019-05-28 2019-12-31 Audio processing method, system and related device

Country Status (2)

Country Link
CN (1) CN110335621A (en)
WO (1) WO2020238209A1 (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110335621A (en) * 2019-05-28 2019-10-15 深圳追一科技有限公司 Audio processing method, system and related device
CN110930989B (en) * 2019-11-27 2021-04-06 深圳追一科技有限公司 Speech intention recognition method and device, computer equipment and storage medium
CN111243595B (en) * 2019-12-31 2022-12-27 京东科技控股股份有限公司 Information processing method and device
CN111968679B (en) * 2020-10-22 2021-01-29 深圳追一科技有限公司 Emotion recognition method and device, electronic equipment and storage medium
CN112562644A (en) * 2020-12-03 2021-03-26 云知声智能科技股份有限公司 Customer service quality inspection method, system, equipment and medium based on human voice separation
CN112669855A (en) * 2020-12-17 2021-04-16 北京沃东天骏信息技术有限公司 Voice processing method and device
CN112735384A (en) * 2020-12-28 2021-04-30 科大讯飞股份有限公司 Turning point detection method, device and equipment applied to speaker separation
CN112802498B (en) * 2020-12-29 2023-11-24 深圳追一科技有限公司 Voice detection method, device, computer equipment and storage medium
CN112289323B (en) * 2020-12-29 2021-05-28 深圳追一科技有限公司 Voice data processing method and device, computer equipment and storage medium
CN112966082A (en) * 2021-03-05 2021-06-15 北京百度网讯科技有限公司 Audio quality inspection method, device, equipment and storage medium
CN113593578A (en) * 2021-09-03 2021-11-02 北京紫涓科技有限公司 Conference voice data acquisition method and system

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8543402B1 (en) * 2010-04-30 2013-09-24 The Intellisis Corporation Speaker segmentation in noisy conversational speech
CN107578770A (en) * 2017-08-31 2018-01-12 百度在线网络技术(北京)有限公司 Internet telephony speech recognition method, device, computer equipment and storage medium
US20180158451A1 (en) * 2016-12-01 2018-06-07 International Business Machines Corporation Prefix methods for diarization in streaming mode
CN108399923A (en) * 2018-02-01 2018-08-14 深圳市鹰硕技术有限公司 Method and device for recognizing speakers in multi-person speech
US20180286409A1 (en) * 2017-03-31 2018-10-04 International Business Machines Corporation Speaker diarization with cluster transfer
CN108922538A (en) * 2018-05-29 2018-11-30 平安科技(深圳)有限公司 Conferencing information recording method, device, computer equipment and storage medium
CN109300470A (en) * 2018-09-17 2019-02-01 平安科技(深圳)有限公司 Mixed audio separation method and mixed audio separation device
CN110335621A (en) * 2019-05-28 2019-10-15 深圳追一科技有限公司 Audio processing method, system and related device

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100466671C (en) * 2004-05-14 2009-03-04 华为技术有限公司 Method and device for switching speeches
JP5103907B2 (en) * 2005-01-17 2012-12-19 日本電気株式会社 Speech recognition system, speech recognition method, and speech recognition program
CN101452704B (en) * 2007-11-29 2011-05-11 中国科学院声学研究所 Speaker clustering method based on information transfer
CN105161093B (en) * 2015-10-14 2019-07-09 科大讯飞股份有限公司 Method and system for determining the number of speakers
CN105895078A (en) * 2015-11-26 2016-08-24 乐视致新电子科技(天津)有限公司 Speech recognition method and device for dynamically selecting a speech model
US9942518B1 (en) * 2017-02-28 2018-04-10 Cisco Technology, Inc. Group and conversational framing for speaker tracking in a video conference system
CN108766459B (en) * 2018-06-13 2020-07-17 北京联合大学 Target speaker estimation method and system in multi-user voice mixing
CN109634692A (en) * 2018-10-23 2019-04-16 蔚来汽车有限公司 Vehicle-mounted dialogue system and processing method and system therefor
CN109741754A (en) * 2018-12-10 2019-05-10 上海思创华信信息技术有限公司 Conference speech recognition method and system, storage medium and terminal
CN109545228A (en) * 2018-12-14 2019-03-29 厦门快商通信息技术有限公司 End-to-end speaker segmentation method and system


Also Published As

Publication number Publication date
CN110335621A (en) 2019-10-15

Similar Documents

Publication Publication Date Title
WO2020238209A1 (en) Audio processing method, system and related device
US10902843B2 (en) Using recurrent neural network for partitioning of audio data into segments that each correspond to a speech feature cluster identifier
US10249292B2 (en) Using long short-term memory recurrent neural network for speaker diarization segmentation
US10748531B2 (en) Management layer for multiple intelligent personal assistant services
US20230019978A1 (en) Automatic speech recognition correction
US10516782B2 (en) Conference searching and playback of search results
EP3254455B1 (en) Selective conference digest
US9293133B2 (en) Improving voice communication over a network
US20180336902A1 (en) Conference segmentation based on conversational dynamics
US10811005B2 (en) Adapting voice input processing based on voice input characteristics
US20180027351A1 (en) Optimized virtual scene layout for spatial meeting playback
WO2016126813A2 (en) Scheduling playback of audio in a virtual acoustic space
WO2022105861A1 (en) Method and apparatus for recognizing voice, electronic device and medium
Triantafyllopoulos et al. Deep speaker conditioning for speech emotion recognition
US20200004878A1 (en) System and method for generating dialogue graphs
US11270691B2 (en) Voice interaction system, its processing method, and program therefor
CN111798833A (en) Voice test method, device, equipment and storage medium
US10762906B2 (en) Automatically identifying speakers in real-time through media processing with dialog understanding supported by AI techniques
CN108877779B (en) Method and device for detecting voice tail point
WO2023048746A1 (en) Speaker-turn-based online speaker diarization with constrained spectral clustering
CN113779208A (en) Method and device for man-machine conversation
CN113129866B (en) Voice processing method, device, storage medium and computer equipment
WO2019155716A1 (en) Information processing device, information processing system, information processing method, and program
CN111354350A (en) Voice processing method and device, voice processing equipment and electronic equipment
US20220201121A1 (en) System, method and apparatus for conversational guidance

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19931010

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19931010

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 29.04.2022)

122 Ep: pct application non-entry in european phase

Ref document number: 19931010

Country of ref document: EP

Kind code of ref document: A1