WO2020238209A1 - Audio processing method, system and related device
- Publication number: WO2020238209A1
- Application: PCT/CN2019/130550 (CN2019130550W)
- Authority: WIPO (PCT)
- Prior art keywords
- audio
- data
- audio data
- human voice
- monophonic
- Prior art date
Classifications
- G—PHYSICS
  - G10—MUSICAL INSTRUMENTS; ACOUSTICS
    - G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
      - G10L15/00—Speech recognition
        - G10L15/26—Speech to text systems
      - G10L17/00—Speaker identification or verification
        - G10L17/06—Decision making techniques; Pattern matching strategies
          - G10L17/14—Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
        - G10L17/20—Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
      - G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
        - G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
          - G10L21/0272—Voice signal separating
      - G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
        - G10L25/78—Detection of presence or absence of voice signals
          - G10L25/87—Detection of discrete points within a voice signal
Description
- This application relates to audio processing methods, systems and related equipment.
- Speaker diarization refers to the process of dividing the audio data of a multi-person conversation according to speaker and labeling each segment.
- Existing speaker separation systems reach a practical level of accuracy in clean near-field environments, but in relatively complex environments the accuracy of speaker-independent single-channel speech separation, which is the important case, remains low.
- To address this, an audio processing method, system, and related equipment are provided.
- An audio processing method includes: obtaining audio data, wherein the audio data includes noise and the human voices generated when n interlocutors conduct a conversation, and n is a positive integer; obtaining human voice audio data and switching points according to the audio data, wherein the human voice audio data is the human voice generated when the n interlocutors conduct a conversation, obtained after the noise is removed from the audio data, and a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor; converting the human voice audio data into m monophonic data according to the switching points, wherein each of the m monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; and clustering the m monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belongs to the same interlocutor.
- An audio processing method includes: acquiring audio data, wherein the audio data includes noise and the human voices generated when n interlocutors conduct a conversation, and n is a positive integer; converting the audio data into corresponding text data through a voice recognition method; obtaining human voice audio data and switching points according to the audio data and text data, wherein the human voice audio data is the audio data obtained by removing the noise from the audio data, and a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor; converting the human voice audio data into m monophonic data according to the switching points, wherein each of the m monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; clustering the m monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belongs to the same interlocutor; and confirming, according to the text data, the interlocutor to which each of the n audio groups belongs.
- An audio processing system includes: an acquisition unit configured to acquire audio data, wherein the audio data includes noise and the human voices generated when n interlocutors conduct a conversation, and n is a positive integer; an obtaining unit configured to obtain human voice audio data and switching points according to the audio data, wherein the human voice audio data is the audio data obtained by removing the noise from the audio data, and a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor; a conversion unit configured to convert the human voice audio data into m monophonic data according to the switching points, wherein each of the m monophonic data contains only the human voice of a single interlocutor, and m is a positive integer; and a clustering unit configured to cluster the m monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belongs to the same interlocutor.
- An audio processing system includes: an acquisition unit configured to acquire audio data, wherein the audio data includes noise and the human voices generated when n interlocutors conduct a conversation, and n is a positive integer; a first conversion unit configured to convert the audio data into audio text through a voice recognition method; an obtaining unit configured to obtain human voice audio data and switching points according to the audio data and audio text, wherein the human voice audio data is the audio data obtained by removing the noise from the audio data, and a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor; a second conversion unit configured to convert the human voice audio data into m monophonic data according to the human voice audio data and the switching points, wherein each of the m monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; a clustering unit configured to cluster the m monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belongs to the same interlocutor; and a confirmation unit configured to confirm, according to the audio text, the interlocutor to which each of the n audio groups belongs.
- A server includes a processor and a memory, the memory being used to store instructions and the processor being used to execute them; when executing the instructions, the processor performs the following steps: acquiring audio data, wherein the audio data includes noise and the human voices produced when n interlocutors conduct a conversation, and n is a positive integer; obtaining human voice audio data and switching points according to the audio data, wherein the human voice audio data is the human voice generated when the n interlocutors conduct a conversation, obtained after the noise is removed from the audio data, and a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor; converting the human voice audio data into m monophonic data according to the switching points, wherein each of the m monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; and clustering the m monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belongs to the same interlocutor.
- A server includes a processor and a memory, the memory being used to store instructions and the processor being used to execute them; when executing the instructions, the processor performs the following steps: acquiring audio data, wherein the audio data includes noise and the human voices produced when n interlocutors conduct a conversation, and n is a positive integer; converting the audio data into corresponding text data through a voice recognition method; obtaining human voice audio data and switching points according to the audio data and text data, wherein the human voice audio data is the audio data obtained by removing the noise from the audio data, and a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor; converting the human voice audio data into m monophonic data according to the switching points, wherein each of the m monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; and clustering the m monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belongs to the same interlocutor.
- A computer non-transitory storage medium stores a computer program which, when executed by a computing device, implements the following steps: acquiring audio data, wherein the audio data includes noise and the human voices generated when n interlocutors conduct a conversation, and n is a positive integer; obtaining human voice audio data and switching points according to the audio data, wherein the human voice audio data is the human voice generated when the n interlocutors conduct a conversation, obtained after the noise is removed from the audio data, and a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor; converting the human voice audio data into m monophonic data according to the switching points, wherein each of the m monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; and clustering the m monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belongs to the same interlocutor.
- A computer non-transitory storage medium stores a computer program which, when executed by a computing device, implements the following steps: acquiring audio data, wherein the audio data includes noise and the human voices generated when n interlocutors conduct a conversation, and n is a positive integer; converting the audio data into corresponding text data through a voice recognition method; obtaining human voice audio data and switching points according to the audio data and the text data, wherein the human voice audio data is the audio data obtained by removing the noise from the audio data, and a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor; converting the human voice audio data into m monophonic data according to the switching points, wherein each of the m monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; and clustering the m monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belongs to the same interlocutor.
- A computer program product, when read and executed by a computer, realizes the following steps: acquiring audio data, wherein the audio data includes noise and the human voices generated when n interlocutors conduct a conversation, and n is a positive integer; obtaining human voice audio data and switching points according to the audio data, wherein the human voice audio data is the human voice generated when the n interlocutors conduct a conversation, obtained after the noise is removed from the audio data, and a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor; converting the human voice audio data into m monophonic data according to the switching points, wherein each of the m monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; and clustering the m monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belongs to the same interlocutor.
- A computer program product, when read and executed by a computer, realizes the following steps: acquiring audio data, wherein the audio data includes noise and the human voices generated when n interlocutors conduct a conversation, and n is a positive integer; converting the audio data into corresponding text data through a voice recognition method; obtaining human voice audio data and switching points according to the audio data and text data, wherein the human voice audio data is the audio data obtained after the noise is removed, and a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor; converting the human voice audio data into m monophonic data according to the switching points, wherein each of the m monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; and clustering the m monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belongs to the same interlocutor.
- Fig. 1 is a schematic structural diagram of a speaker separation system provided by the present application.
- Fig. 2 is a schematic flowchart of an audio processing method provided by the present application.
- FIG. 3 is a schematic flowchart of clustering m monophonic data into n audio groups in an application scenario provided by the present application.
- Fig. 4 is a detailed flowchart of an audio processing method provided by the present application.
- Fig. 5 is a schematic flowchart of another audio processing method provided by the present application.
- Fig. 6 is a detailed flowchart of another audio processing method provided by the present application.
- Fig. 7 is a schematic structural diagram of an audio processing system provided by the present application.
- Fig. 8 is a schematic structural diagram of another audio processing system provided by the present application.
- Fig. 9 is a schematic structural diagram of a server provided by the present application.
- Speaker separation refers to the process of dividing and labeling audio data in a multi-person conversation according to the speakers.
- Conventionally, a speaker separation system generally uses the Bayesian Information Criterion (BIC) as the similarity measure for speaker separation.
- In that technology, the input audio data mainly passes in turn through the input module 101, the silence detection module 102, the speaker recognition module 103, and the subsequent modules shown in Fig. 1.
- the silence detection module 102 is used to remove the silence part of the input audio data to obtain the second audio data;
- The speaker recognition module 103 learns the voiceprint characteristics of speakers from a large amount of business-scenario data. For example, if the speaker separation system is used to separate the voices of customer service agents and users, the speaker recognition module learns a large number of customer service and user voiceprint characteristics, such as the agents' intonation and prosody features, so that, based on the second audio data, it can determine the identity of the speaker of each utterance in the current audio data and obtain the third audio data;
- the switching point detection module 104 is used to determine the dialogue switching point of the speaker according to the third audio data;
- the conversion module 105 is used to edit the third audio data into multiple pieces of audio data according to the switching points;
- the classification module 106 is used to classify the multiple pieces of audio data according to the identity of the speaker detected by the speaker recognition module 103 to obtain the speaker separation result.
- This system not only needs to train several models on business scene data (a silence detection model, a speaker recognition model, and a switching point model) in the training phase, but also has a long detection pipeline: the data must pass in turn through the seven modules shown in Figure 1 before a result is obtained, which takes a lot of time.
- Fig. 2 is an audio processing method provided by the present application. The method includes the following steps:
- the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer.
- the audio data may specifically be audio files such as telephone recording or video recording that require speaker separation, for example, a recording file of a communication between a user and a customer service phone, a video recording file of a meeting, and so on.
- the noise concept in this application refers to non-target human voices that do not need to be separated from the speaker, and specifically can be ambient sound, silence, equipment noise generated by recording equipment, and so on.
- Human voices may also appear in the ambient sound. For example, in the audio data generated when user A, in a noisy restaurant, communicates with customer service B by phone, the voices of other people in the restaurant are also human voices, but they are not target human voices that require speaker separation in the subsequent steps, and are therefore also classified as noise. It should be understood that the above examples are only for illustration and do not constitute a specific limitation.
- The human voice audio data is the human voice generated when the n interlocutors conduct a conversation, obtained after removing the noise from the audio data; a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor.
- For example, the audio data of a dialogue between A and B can be expressed in text as: (00:12) A: "I want to check the phone bill" (00:14) B: "Are you checking the phone bill of this phone" (00:18) A: "Yes" (00:20) B: "Please wait a moment" (00:25) B: "You still have 25 yuan in your balance of call charges".
- The numbers in front of A and B represent the playback time of the current audio data, that is, positions on the audio time axis. In this example there are 3 switching points: 00:14, 00:18, and 00:20.
- Since step S202 does not perform voice recognition, the switching points are obtained based only on the audio data of the dialogue between A and B, without knowing the content of the dialogue.
- the above examples are only for illustration and cannot constitute a specific limitation.
- Obtaining human voice audio data and switching points according to the audio data includes: inputting the audio data into a human voice separation model to obtain the human voice audio data, where the human voice separation model is a model obtained by training a neural network on known audio samples and corresponding known human voice data samples; and, at the same time, inputting the audio data into a speaker change detection (SCD) model to obtain the switching points, where the switching point detection model is a model obtained by training a neural network on known audio samples and corresponding known switching point samples.
- The human voice separation model may specifically be a voice activity detection (VAD) model, which adopts an event-detection-like scheme to learn the features that distinguish human voice from non-human voice. It is understandable that the VAD result largely removes interference factors such as environmental noise, non-target human voices, and equipment noise; compared with the silence/non-silence detection of the embodiment of Figure 1, the final separation result is therefore more accurate.
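- As a minimal illustration of VAD-style voice activity detection (a sketch only; the human voice separation model described in this application is a trained neural network, not this heuristic, and the frame length and threshold below are illustrative assumptions):

```python
# Energy-based VAD sketch: mark frames whose RMS energy exceeds a threshold
# as voiced, then merge consecutive voiced frames into (start, end) intervals.
import numpy as np

def energy_vad(samples, sample_rate, frame_ms=30, threshold=0.02):
    """Return (start_sec, end_sec) intervals judged to contain voice."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames.astype(np.float64) ** 2, axis=1))
    voiced = rms > threshold
    intervals, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i                      # a voiced run begins
        elif not v and start is not None:  # the run ends
            intervals.append((start * frame_ms / 1000.0, i * frame_ms / 1000.0))
            start = None
    if start is not None:
        intervals.append((start * frame_ms / 1000.0, n_frames * frame_ms / 1000.0))
    return intervals
```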
- In step S202, since the input of both the human voice separation model and the switching point detection model is the audio data, human voice separation can be performed in parallel with switching point detection without the two interfering with each other, thereby reducing the time required for speaker separation.
- The switching points are conversation time points on the time axis of the audio data, and the time axis of the noise-removed human voice audio data is the same as that of the audio data, so the next step can use the switching points on this shared time axis to convert the human voice audio data into m monophonic data.
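- A minimal sketch of this conversion, cutting the human voice audio at the switching points on the shared time axis (the sample rate and the switch-point list below are illustrative assumptions):

```python
# Cut human-voice audio into monophonic segments at the switching points
# (times in seconds on the time axis shared with the original audio data).
import numpy as np

def split_at_switch_points(audio, sample_rate, switch_points_sec):
    """Return the m monophonic segments, one per speaker turn."""
    bounds = [0] + [int(t * sample_rate) for t in switch_points_sec] + [len(audio)]
    return [audio[bounds[i]:bounds[i + 1]] for i in range(len(bounds) - 1)]

# For the dialogue example above (switch points 00:14, 00:18, 00:20),
# this would yield m = 4 monophonic segments:
# segments = split_at_switch_points(voice_audio, 16000, [14.0, 18.0, 20.0])
```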
- the switching point detection model and the human voice separation model can also make full use of labeled data related to the business scenario for training, thereby further improving the accuracy of switching point detection and human voice separation.
- each of the m monophonic data includes only the human voice of a single interlocutor, and m is a positive integer.
- Continuing the example above, the audio data can be converted into 4 monophonic data, as shown in FIG. 3.
- each monophonic data has only one interlocutor’s human voice.
- A piece of monophonic data can be the audio corresponding to the single sentence "I want to check the phone bill", or the audio corresponding to the two consecutive sentences "Please wait a moment" and "You still have 25 yuan in your balance of call charges".
- In other words, the human voice included in each piece of monophonic data is that of only user A or only customer service B.
- For ease of understanding, FIG. 3 shows the monophonic data in text form; the monophonic data itself is only audio data and contains no text information.
- The monophonic data of each of the n audio groups belongs to the same interlocutor. Still taking the above example, after the 4 monophonic data are obtained, they are clustered into 2 audio groups, as shown in Figure 3: one audio group consists of the 2 monophonic data corresponding to "I want to check the phone bill" and "Yes"; the other consists of the 2 monophonic data corresponding to "Are you checking the phone bill of this phone" and "Please wait a moment, you still have 25 yuan in your balance of call charges". In this way, the audio data containing the voices of n interlocutors is finally converted into n audio groups.
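- One possible realization of this clustering step (the application does not prescribe a particular embedding or clustering algorithm; `embed_segment` is a hypothetical function returning a fixed-length speaker embedding such as an i-vector or d-vector):

```python
# Cluster the m monophonic segments into n speaker groups by embedding each
# segment as a speaker vector and grouping the vectors agglomeratively.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_segments(segments, embed_segment, n_speakers=2):
    X = np.stack([embed_segment(seg) for seg in segments])   # (m, d) matrix
    labels = AgglomerativeClustering(n_clusters=n_speakers).fit_predict(X)
    groups = {}
    for seg, label in zip(segments, labels):
        groups.setdefault(label, []).append(seg)
    return groups  # n audio groups, each holding one interlocutor's segments
```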
- Optionally, the method further includes: converting the m monophonic data into corresponding m pieces of text information using an automatic speech recognition (ASR) method; and confirming, according to the m pieces of text information, the interlocutor to which each of the n audio groups belongs.
- The interlocutor of each audio group can be determined from keywords or key phrases in the text information.
- For example, based on keywords such as "Please wait", the interlocutor of category 2 can be determined to be the customer service, and the interlocutor of category 1 is then determined to be the user; alternatively, based on the keywords "I want" and "check call charges", the interlocutor of category 1 is determined to be the user and the interlocutor of category 2 the customer service, and so on.
- This application does not specifically limit the setting of keywords.
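- A minimal sketch of such keyword-based role assignment (the keyword lists mirror the example above and are illustrative assumptions, as is the tie-breaking rule):

```python
# Assign a role to an audio group from keywords found in its recognized text.
AGENT_KEYWORDS = ["Please wait", "Are you checking"]
USER_KEYWORDS = ["I want", "check the phone bill"]

def assign_role(group_texts):
    """group_texts: ASR transcripts of one audio group's segments."""
    joined = " ".join(group_texts)
    agent_hits = sum(kw in joined for kw in AGENT_KEYWORDS)
    user_hits = sum(kw in joined for kw in USER_KEYWORDS)
    # Ties (including no hits) default to "user" here; a deployed system
    # would need a better-calibrated rule or keyword set.
    return "customer service" if agent_hits > user_hits else "user"
```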
- The m monophonic data can be converted into the corresponding m pieces of text information through voice recognition methods such as dynamic time warping (DTW), hidden Markov model (HMM) theory, vector quantization (VQ) technology, and artificial neural networks (ANN), which is not specifically limited in this application.
- FIG. 4 shows a detailed flowchart of the audio processing method provided by the present application. As can be seen from Figure 4, compared with the speaker separation method of the embodiment of Figure 1, the audio processing method provided by the present application only needs to train the human voice separation model and the switching point detection model on business scene data during the training phase, greatly reducing the time and labor costs of training. The steps of the detection phase can be processed in parallel, which greatly reduces the time required for speaker separation, and the human voice separation model removes the interference of noise, which greatly improves accuracy.
- In summary, audio data is obtained; human voice audio data and switching points are obtained according to the audio data; the human voice audio data is converted into m monophonic data according to the switching points; and the m monophonic data are clustered to obtain n audio groups, wherein the monophonic data of each of the n audio groups belongs to the same interlocutor, so that the identity of the voice in each of the n audio groups can be confirmed. This largely removes noise and other interference factors from the audio data, improving the accuracy of speaker separation in complex environments; in addition, multiple steps of the detection phase can be processed in parallel, simplifying the speaker separation process and increasing the separation speed.
- FIG. 5 is another audio processing method provided by this application.
- The difference between the audio processing method shown in FIG. 5 and the audio processing method shown in FIG. 2 is that voice recognition is performed in advance to obtain the text features of the audio data, so that the switching point detection model can comprehensively consider both audio features and text features and detect the interlocutor switching points more accurately.
- the audio processing method shown in FIG. 5 includes the following steps:
- the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer.
- the audio data may specifically be audio files such as telephone recording or video recording that require speaker separation, for example, a recording file of a communication between a user and a customer service phone, a video recording file of a meeting, and so on.
- the noise concept in this application refers to non-target human voices that do not need to be separated from the speaker, and specifically can be ambient sound, silence, equipment noise generated by recording equipment, and so on.
- Human voices may also appear in the ambient sound. For example, in the audio data generated when user A, in a noisy restaurant, communicates with customer service B by phone, the voices of other people in the restaurant are also human voices, but they are not target human voices that require speaker separation in the subsequent steps, and are therefore also classified as noise. It should be understood that the above examples are only for illustration and do not constitute a specific limitation.
- Specifically, the audio data can be converted into the corresponding text data through speech recognition methods such as DTW, HMM theory, VQ technology, and ANN, which is not specifically limited in this application.
- The human voice audio data is the human voice generated when the n interlocutors conduct a conversation, obtained after removing the noise from the audio data; a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor.
- For example, the audio data of a dialogue between A and B can be expressed in text as: (00:12) A: "I want to check the phone bill" (00:14) B: "Are you checking the phone bill of this phone" (00:18) A: "Yes" (00:20) B: "Please wait a moment" (00:25) B: "You still have 25 yuan in your balance of call charges".
- The numbers in front of A and B represent the playback time of the current recording. In this example there are 3 switching points: 00:14, 00:18, and 00:20. It should be understood that, since the audio data has been voice-recognized in step S502, the switching point detection model can comprehensively consider audio features and text features to obtain more accurate switching points.
- the above examples are only for illustration and do not constitute a specific limitation.
- Obtaining human voice audio data and switching points according to the audio data and audio text includes: inputting the audio data into a human voice separation model to obtain the human voice audio data, where the human voice separation model is a model obtained by training a neural network on known audio samples and corresponding known human voice data samples; and, at the same time, inputting the audio data and audio text into the switching point detection model to obtain the switching points, where the switching point detection model is a model obtained by training a neural network on known audio samples, known audio text samples, and corresponding known switching point samples.
- The human voice separation model may specifically be a VAD model, which uses an event-detection-like scheme to learn the features that distinguish human voice from non-human voice. The VAD result largely removes interference factors such as environmental noise, non-target human voices, and equipment noise; compared with the silence/non-silence detection of the embodiment of Figure 1, the final separation result is therefore more accurate.
- In practice, the speech recognition step and the human voice separation step can be performed simultaneously, followed by the switching point detection step; alternatively, the speech recognition step is performed first, and then the human voice separation step and the switching point detection step are performed simultaneously.
- The switching points are conversation time points on the time axis of the audio data, and the time axis of the noise-removed human voice audio data is the same as that of the audio data, so the next step can use the switching points on this shared time axis to convert the human voice audio data into m monophonic data.
- the switching point detection model and the human voice separation model can also make full use of labeled data related to the business scenario for training, thereby further improving the accuracy of switching point detection and human voice separation.
- The switching point detection model in this embodiment of the application is a model obtained by training a neural network on known audio samples, known audio text samples, and corresponding known switching point samples. The switching point detection model can therefore comprehensively consider audio features and text features, detect the interlocutor switching points more accurately, and further improve the accuracy of speaker separation.
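- One way such a model could fuse the two feature streams (a sketch under assumed dimensions and architecture; the application only specifies the training data, not the network structure):

```python
# Switch-point detector that jointly considers audio and text features by
# concatenating frame-aligned feature streams before a recurrent layer.
import torch
import torch.nn as nn

class SwitchPointDetector(nn.Module):
    def __init__(self, audio_dim=40, text_dim=128, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(audio_dim + text_dim, hidden,
                           batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)   # per-frame switch score

    def forward(self, audio_feats, text_feats):
        # audio_feats: (batch, frames, audio_dim), e.g. filterbank features
        # text_feats:  (batch, frames, text_dim), frame-aligned text embeddings
        fused = torch.cat([audio_feats, text_feats], dim=-1)
        out, _ = self.rnn(fused)
        return torch.sigmoid(self.head(out)).squeeze(-1)  # (batch, frames)
```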
- each of the m monophonic data includes only the human voice of a single interlocutor, and m is a positive integer.
- Continuing the example above, the audio data can be converted into 4 monophonic data, as shown in FIG. 3.
- each monophonic data has only one interlocutor’s human voice.
- A piece of monophonic data can be the audio corresponding to the single sentence "I want to check the phone bill", or the audio corresponding to the two consecutive sentences "Please wait a moment" and "You still have 25 yuan in your balance of call charges".
- In other words, the human voice included in each piece of monophonic data is that of only user A or only customer service B.
- Correspondingly, the text data can also be converted into m monophonic texts, each corresponding to one piece of monophonic data, as shown in FIG. 3, for use in the subsequent clustering.
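- Assuming the recognizer emits word-level timestamps (an assumption; the application does not specify the alignment mechanism), the text can be cut at the same switching points as the audio:

```python
# Split ASR output into m monophonic texts aligned with the audio segments.
def split_text_at_switch_points(words, switch_points_sec):
    """words: list of (word, start_sec) pairs in time order."""
    boundaries = list(switch_points_sec) + [float("inf")]
    texts, current, b = [], [], 0
    for word, start in words:
        while start >= boundaries[b]:      # crossed a switching point
            texts.append(" ".join(current))
            current, b = [], b + 1
        current.append(word)
    texts.append(" ".join(current))        # close the final turn
    return texts
```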
- The monophonic data of each of the n audio groups belongs to the same interlocutor. Still taking the above example, after the 4 monophonic data are obtained, they are clustered into 2 audio groups, as shown in Figure 3: one audio group consists of the 2 monophonic data corresponding to "I want to check the phone bill" and "Yes"; the other consists of the 2 monophonic data corresponding to "Are you checking the phone bill of this phone" and "Please wait a moment, your call charge balance is 25 yuan". The audio data containing the voices of the n interlocutors is thus finally converted into n audio groups.
- The interlocutor of each audio group can be determined from the keywords or key phrases in the monophonic text corresponding to each piece of monophonic data. For example, based on keywords such as "Excuse me" and "Please wait", the interlocutor of category 2 shown in Figure 3 can be determined to be the customer service, and the interlocutor of category 1 is then determined to be the user; alternatively, based on the keywords "I want to" and "check call charges", the interlocutor of category 1 shown in Figure 3 is determined to be the user and the interlocutor of category 2 the customer service, and so on. This application does not specifically limit the setting of keywords.
- Fig. 6 shows a detailed flowchart of this audio processing method. As can be seen from Fig. 6, compared with the speaker separation method of the embodiment of Fig. 1, the audio processing method provided by this application only needs to train the human voice separation model and the switching point detection model on business scene data during the training phase, greatly reducing the time and labor costs of training. The steps of the detection phase can be processed in parallel, which greatly reduces the time required for speaker separation; the human voice separation model removes the interference of noise, and the switching point detection model integrates text features and audio features, which greatly improves the accuracy of switching point detection and thus the accuracy of speaker separation.
- In summary, audio data is acquired; the audio data is converted into corresponding text data through a voice recognition method; human voice audio data and switching points are obtained according to the audio data and text data; the human voice audio data is converted into m monophonic data according to the switching points; and the m monophonic data are clustered to obtain n audio groups, wherein the monophonic data of each of the n audio groups belongs to the same interlocutor. This largely removes noise and other interference factors from the audio data, improving the accuracy of speaker separation in complex environments; in addition, multiple steps of the detection phase can be processed in parallel, simplifying the speaker separation process and increasing the separation speed.
- FIG. 7 is an audio processing system provided by the present application.
- The audio processing system 700 includes an acquisition unit 710, an obtaining unit 720, a conversion unit 730, and a clustering unit 740, wherein:
- the acquiring unit 710 is configured to acquire audio data, wherein the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer;
- the obtaining unit 720 is configured to obtain human voice audio data and a switching point according to the audio data, where the human voice audio data is audio data obtained by removing the noise from the audio data, and the switching point is Any one of the n interlocutors switches to the conversation time point of another interlocutor;
- the conversion unit 730 is configured to convert the human voice audio data into m monophonic data according to the switching points, wherein each of the m monophonic data contains only the human voice of a single interlocutor, and m is a positive integer;
- the clustering unit 740 is configured to cluster the m monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belongs to the same interlocutor.
- the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer.
- the audio data may specifically be audio files such as telephone recording or video recording that require speaker separation, for example, a recording file of a communication between a user and a customer service phone, a video recording file of a meeting, and so on.
- the noise concept in this application refers to non-target human voices that do not need to be separated from the speaker, and specifically can be ambient sound, silence, equipment noise generated by recording equipment, and so on.
- Human voices may also appear in the ambient sound. For example, in the audio data generated when user A, in a noisy restaurant, communicates with customer service B by phone, the voices of other people in the restaurant are also human voices, but they are not target human voices that require speaker separation in the subsequent steps, and are therefore also classified as noise. It should be understood that the above examples are only for illustration and do not constitute a specific limitation.
- The human voice audio data is the human voice generated when the n interlocutors conduct a conversation, obtained after removing the noise from the audio data; a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor.
- For example, the audio data of a dialogue between A and B can be expressed in text as: (00:12) A: "I want to check the phone bill" (00:14) B: "Are you checking the phone bill of this phone" (00:18) A: "Yes" (00:20) B: "Please wait a moment" (00:25) B: "You still have 25 yuan in your balance of call charges".
- The numbers in front of A and B represent the playback time of the current recording. In this example there are 3 switching points: 00:14, 00:18, and 00:20.
- Since step S202 does not perform voice recognition, the switching points are obtained based only on the audio data of the dialogue between A and B, without knowing the content of the dialogue.
- the above examples are only for illustration and cannot constitute a specific limitation.
- In an embodiment, the obtaining unit 720 is configured to: input the audio data into a human voice separation model to obtain the human voice audio data, where the human voice separation model is a model obtained by training a neural network on known audio samples and corresponding known human voice data samples; and, at the same time, input the audio data into the switching point detection model to obtain the switching points, where the switching point detection model is a model obtained by training a neural network on known audio samples and corresponding known switching point samples.
- The human voice separation model may specifically be a voice activity detection (VAD) model, which adopts an event-detection-like scheme to learn the features that distinguish human voice from non-human voice. The VAD result largely removes interference factors such as environmental noise, non-target human voices, and equipment noise; compared with the silence/non-silence detection of the embodiment of Figure 1, the final separation result is therefore more accurate.
- Since the input of both the human voice separation model and the switching point detection model is the audio data, human voice separation can be performed in parallel with switching point detection without the two interfering with each other, thereby reducing the time required for speaker separation.
- The switching points are conversation time points on the time axis of the audio data, and the time axis of the noise-removed human voice audio data is the same as that of the audio data, so the next step can use the switching points on this shared time axis to convert the human voice audio data into m monophonic data.
- the switching point detection model and the human voice separation model can also make full use of labeled data related to the business scenario for training, thereby further improving the accuracy of switching point detection and human voice separation.
- each of the m monophonic data includes only the human voice of a single interlocutor, and m is a positive integer.
- Continuing the example above, the audio data can be converted into 4 monophonic data, as shown in FIG. 3.
- each monophonic data has only one interlocutor’s human voice.
- A piece of monophonic data can be the audio corresponding to the single sentence "I want to check the phone bill", or the audio corresponding to the two consecutive sentences "Please wait a moment" and "You still have 25 yuan in your balance of call charges".
- In other words, the human voice included in each piece of monophonic data is that of only user A or only customer service B.
- For ease of understanding, FIG. 3 shows the monophonic data in text form; the monophonic data itself is only audio data and contains no text information.
- The monophonic data of each of the n audio groups belongs to the same interlocutor. Still taking the above example, after the 4 monophonic data are obtained, they are clustered into 2 audio groups, as shown in Figure 3: one audio group consists of the 2 monophonic data corresponding to "I want to check the phone bill" and "Yes"; the other consists of the 2 monophonic data corresponding to "Are you checking the phone bill of this phone" and "Please wait a moment, you still have 25 yuan in your balance of call charges". In this way, the audio data containing the voices of n interlocutors is finally converted into n audio groups.
- Optionally, the system further includes a confirmation unit 750, which is configured to: after the m monophonic data are clustered into the n audio groups, convert the m monophonic data into corresponding m pieces of text information through a voice recognition method, and confirm, according to the m pieces of text information, the interlocutor to which each of the n audio groups belongs.
- The interlocutor of each audio group can be determined from keywords or key phrases in the text information.
- For example, based on keywords such as "Please wait", the interlocutor of category 2 can be determined to be the customer service, and the interlocutor of category 1 is then determined to be the user; alternatively, based on the keywords "I want" and "check call charges", the interlocutor of category 1 is determined to be the user and the interlocutor of category 2 the customer service, and so on.
- This application does not specifically limit the setting of keywords.
- The m monophonic data can be converted into the corresponding m pieces of text information through voice recognition methods such as dynamic time warping (DTW), hidden Markov model (HMM) theory, vector quantization (VQ) technology, and artificial neural networks (ANN), which is not specifically limited in this application.
- Since the input data of the clustering process is the m monophonic data, and the input data of the speech recognition process can also be the m monophonic data, the clustering and speech recognition processes can be processed in parallel, further speeding up the overall speaker separation process and improving the user experience.
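- A minimal sketch of this parallelism using Python threads (`cluster_segments` and `recognize` stand in for the clustering unit and the speech recognizer and are hypothetical here):

```python
# Run clustering and speech recognition over the same m monophonic
# segments concurrently, since neither depends on the other's output.
from concurrent.futures import ThreadPoolExecutor

def separate_speakers(segments, cluster_segments, recognize):
    with ThreadPoolExecutor(max_workers=2) as pool:
        groups_future = pool.submit(cluster_segments, segments)
        texts_future = pool.submit(lambda: [recognize(s) for s in segments])
        return groups_future.result(), texts_future.result()
```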
- As can be seen from FIG. 7, compared with the speaker separation system of the embodiment of FIG. 1, the audio processing system provided by this application only needs to train the human voice separation model and the switching point detection model on the business scene data in the training phase, greatly reducing time and labor costs; the steps of the detection stage can be processed in parallel, which greatly shortens the time required for speaker separation; and the human voice separation model removes the interference of noise, which greatly improves accuracy.
- In summary, audio data is obtained; human voice audio data and switching points are obtained according to the audio data; the human voice audio data is converted into m monophonic data according to the switching points; and the m monophonic data are clustered to obtain n audio groups, wherein the monophonic data of each of the n audio groups belongs to the same interlocutor, so that the identity of the voice in each of the n audio groups can be confirmed. This largely removes noise and other interference factors from the audio data, improving the accuracy of speaker separation in complex environments; in addition, multiple steps of the detection phase can be processed in parallel, simplifying the speaker separation process and increasing the separation speed.
- Fig. 8 is another audio processing system provided by the present application.
- The difference between the audio processing system shown in Fig. 8 and the audio processing system shown in Fig. 7 is that voice recognition is performed in advance to obtain the text features of the audio data, so that the switching point detection model can comprehensively consider both audio features and text features and detect the interlocutor switching points more accurately.
- The audio processing system 800 shown in FIG. 8 includes an acquisition unit 810, a first conversion unit 820, an obtaining unit 830, a second conversion unit 840, a clustering unit 850, and a confirmation unit 860, where
- the acquiring unit 810 is configured to acquire audio data, where the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer;
- the first conversion unit 820 is configured to convert the audio data into audio text through a voice recognition method;
- the obtaining unit 830 is configured to obtain human voice audio data and switching points according to the audio data and audio text, where the human voice audio data is the audio data obtained by removing the noise from the audio data, and a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor;
- the second conversion unit 840 is configured to convert the human voice audio data into m monophonic data according to the human voice audio data and the switching points, wherein each of the m monophonic data includes only the human voice of a single interlocutor, and m is a positive integer;
- the clustering unit 850 is configured to cluster the m monophonic data to obtain n audio groups, wherein the monophonic data of each audio group in the n audio groups belong to the same interlocutor;
- the confirmation unit 860 is configured to confirm the interlocutor to which each audio group of the n audio groups belongs according to the audio text.
- the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer.
- the audio data may specifically be audio files such as telephone recording or video recording that require speaker separation, for example, a recording file of a communication between a user and a customer service phone, a video recording file of a meeting, and so on.
- the noise concept in this application refers to non-target human voices that do not need to be separated from the speaker, and specifically can be ambient sound, silence, equipment noise generated by recording equipment, and so on.
- Human voices may also appear in the ambient sound. For example, in the audio data generated when user A, in a noisy restaurant, communicates with customer service B by phone, the voices of other people in the restaurant are also human voices, but they are not target human voices that require speaker separation in the subsequent steps, and are therefore also classified as noise. It should be understood that the above examples are only for illustration and do not constitute a specific limitation.
- Specifically, the audio data can be converted into the corresponding text data through speech recognition methods such as DTW, HMM theory, VQ technology, and ANN, which is not specifically limited in this application.
- The human voice audio data is the human voice generated when the n interlocutors conduct a conversation, obtained after removing the noise from the audio data; a switching point is the conversation time point at which any one of the n interlocutors switches to another interlocutor.
- For example, the audio data of a dialogue between A and B can be expressed in text as: (00:12) A: "I want to check the phone bill" (00:14) B: "Are you checking the phone bill of this phone" (00:18) A: "Yes" (00:20) B: "Please wait a moment" (00:25) B: "You still have 25 yuan in your balance of call charges".
- The numbers in front of A and B represent the playback time of the current recording. In this example there are 3 switching points: 00:14, 00:18, and 00:20. It should be understood that, since the audio data has been voice-recognized in step S502, the switching point detection model can comprehensively consider audio features and text features to obtain more accurate switching points.
- the above examples are only for illustration and do not constitute specific limitations.
- In an embodiment, the obtaining unit 830 is configured to: input the audio data into a human voice separation model to obtain the human voice audio data, where the human voice separation model is a model obtained by training a neural network on known audio samples and corresponding known human voice data samples; and, at the same time, input the audio data and audio text into the switching point detection model to obtain the switching points, where the switching point detection model is a model obtained by training a neural network on known audio samples, known audio text samples, and corresponding known switching point samples.
- The human voice separation model may specifically be a VAD model, which uses an event-detection-like scheme to learn the features that distinguish human voice from non-human voice. The VAD result largely removes interference factors such as environmental noise, non-target human voices, and equipment noise; compared with the silence/non-silence detection of the embodiment of Figure 1, the final separation result is therefore more accurate.
- In practice, the speech recognition step and the human voice separation step can be performed simultaneously, followed by the switching point detection step; alternatively, the speech recognition step is performed first, and then the human voice separation step and the switching point detection step are performed simultaneously. In this way, tasks are executed in parallel as much as possible, minimizing the time required for the speaker separation process.
- The switching points are conversation time points on the time axis of the audio data, and the time axis of the noise-removed human voice audio data is the same as that of the audio data, so the next step can use the switching points on this shared time axis to convert the human voice audio data into m monophonic data.
- the switching point detection model and the human voice separation model can also make full use of labeled data related to the business scenario for training, thereby further improving the accuracy of switching point detection and human voice separation.
- The switching point detection model in this embodiment of the application is a model obtained by training a neural network on known audio samples, known audio text samples, and corresponding known switching point samples. The switching point detection model can therefore comprehensively consider audio features and text features, detect the interlocutor switching points more accurately, and further improve the accuracy of speaker separation.
- each of the m monophonic data includes only the human voice of a single interlocutor, and m is a positive integer.
- Continuing the example above, the audio data can be converted into 4 monophonic data, as shown in FIG. 3.
- each monophonic data has only one interlocutor’s human voice.
- A piece of monophonic data can be the audio corresponding to the single sentence "I want to check the phone bill", or the audio corresponding to the two consecutive sentences "Please wait a moment" and "You still have 25 yuan in your balance of call charges".
- In other words, the human voice included in each piece of monophonic data is that of only user A or only customer service B.
- Correspondingly, the text data can also be converted into m monophonic texts, each corresponding to one piece of monophonic data, as shown in FIG. 3, for use in the subsequent clustering.
- The monophonic data of each of the n audio groups belongs to the same interlocutor. Still taking the above example, after the 4 monophonic data are obtained, they are clustered into 2 audio groups, as shown in Figure 3: one audio group consists of the 2 monophonic data corresponding to "I want to check the phone bill" and "Yes"; the other consists of the 2 monophonic data corresponding to "Are you checking the phone bill of this phone" and "Please wait a moment, your call charge balance is 25 yuan". The audio data containing the voices of the n interlocutors is thus finally converted into n audio groups.
- The interlocutor of each audio group can be determined from the keywords or key phrases in the monophonic text corresponding to each piece of monophonic data. For example, based on keywords such as "Excuse me" and "Please wait", the interlocutor of category 2 shown in Figure 3 can be determined to be the customer service, and the interlocutor of category 1 is then determined to be the user; alternatively, based on the keywords "I want to" and "check call charges", the interlocutor of category 1 shown in Figure 3 is determined to be the user and the interlocutor of category 2 the customer service, and so on. This application does not specifically limit the setting of keywords.
- As with the system of Fig. 7, the audio processing system provided by this application only needs to train the human voice separation model and the switching point detection model on the business scene data in the training phase, greatly reducing the time and labor costs of training. The steps of the detection phase can be processed in parallel, which greatly reduces the time required for speaker separation; the human voice separation model removes the interference of noise, and the switching point detection model integrates text features and audio features, which greatly improves the accuracy of switching point detection and thus the accuracy of speaker separation.
- In summary, audio data is acquired; the audio data is converted into corresponding text data; human voice audio data and switching points are obtained according to the audio data and text data; the human voice audio data is converted into m monophonic data according to the switching points; and the m monophonic data are clustered to obtain n audio groups, wherein the monophonic data of each of the n audio groups belongs to the same interlocutor. This largely removes noise and other interference factors from the audio data, improving the accuracy of speaker separation in complex environments; in addition, multiple steps of the detection phase can be processed in parallel, simplifying the speaker separation process and increasing the separation speed.
- FIG. 9 is a schematic structural diagram of a server provided in this application.
- the server may implement the method in the embodiment in FIG. 2 or the embodiment in FIG. 5.
- The data processing method provided in this application can be implemented in a cloud service cluster as shown in FIG. 9, or in a single computing node and storage node, which is not specifically limited in this application.
- The cloud service cluster includes at least one computing node 910 and at least one storage node 920.
- the computing node 910 includes one or more processors 911, a communication interface 912, and a memory 913.
- the processor 911, the communication interface 912, and the memory 913 may be connected through a bus 914.
- the processor 911 includes one or more general-purpose processors.
- The general-purpose processor may be any type of device capable of processing electronic instructions, including a central processing unit (CPU), a microprocessor, a microcontroller, a main processor, a controller, an application-specific integrated circuit (ASIC), and so on. It may be a dedicated processor used only by the computing node 910, or it may be shared with other computing nodes 910.
- The processor 911 executes various types of digitally stored instructions, such as software or firmware programs stored in the memory 913, which enable the computing node 910 to provide a wide variety of services.
- The processor 911 can execute the code of modules such as the clustering module, the human voice separation module, and the switching point detection module, so as to perform at least a part of the methods discussed herein.
- the communication interface 912 may be a wired interface (for example, an Ethernet interface) for communicating with other computing nodes or users.
- The communication interface 912 may adopt a protocol family over TCP/IP, for example, the RAAS protocol, the Remote Function Call (RFC) protocol, the Simple Object Access Protocol (SOAP), the Simple Network Management Protocol (SNMP), the Common Object Request Broker Architecture (CORBA) protocol, distributed protocols, and so on.
- The memory 913 may include volatile memory, such as random access memory (RAM); the memory may also include non-volatile memory, such as read-only memory (ROM).
- The storage node 920 includes one or more storage controllers 921 and a storage array 922, where the storage controller 921 and the storage array 922 may be connected through a bus 924.
- The storage controller 921 includes one or more general-purpose processors, where the general-purpose processor may be any type of device capable of processing electronic instructions, including a CPU, a microprocessor, a microcontroller, a main processor, a controller, an ASIC, and so on.
- In some embodiments, each storage node includes its own storage controller; in other embodiments, multiple storage nodes may share one storage controller, which is not specifically limited here.
- The storage array 922 may include a plurality of memories 923.
- The memory 923 may be a non-volatile memory, such as ROM, flash memory, an HDD, or an SSD, or may be a combination of the foregoing types of memory.
- The storage array may be composed of multiple HDDs or multiple SSDs, or of a combination of HDDs and SSDs.
- The storage array 922 may include one or more data centers. Multiple data centers may be set up at the same location or at different locations, which is not specifically limited here.
- The memory 923 may store program code and program data.
- The program code includes voice recognition module code, semantic understanding module code, order generation module code, clustering module code, human voice separation module code, switching point detection module code, and so on. The program data includes the human voice separation model, the switching point detection model, the corresponding training sample set data, and so on; this application is not specifically limited in this respect.
- When implemented by software, the foregoing embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions.
- the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
- The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center.
- The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, integrated with one or more available media.
- The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid-state drive (SSD)).
- The program may be stored in a computer-readable storage medium; when executed, the program may carry out the procedures of the above-mentioned method embodiments.
- The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
Claims (19)
- 1. An audio processing method, characterized in that the method comprises: acquiring audio data, wherein the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer; obtaining human voice audio data and switching points according to the audio data, wherein the human voice audio data is the human voices generated when the n interlocutors conduct a conversation, obtained after the noise is removed from the audio data, and a switching point is a conversation time point at which any one of the n interlocutors switches to another interlocutor; converting the human voice audio data into m pieces of monophonic data according to the switching points, wherein each of the m pieces of monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; and clustering the m pieces of monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belong to the same interlocutor.
- 2. The method according to claim 1, characterized in that, after the clustering of the m pieces of monophonic data to obtain the n audio groups, the method further comprises: converting the m pieces of monophonic data into m corresponding pieces of text information through a voice recognition method; and confirming, according to the m pieces of text information, the interlocutor to which each of the n audio groups belongs.
- 3. The method according to claim 1, characterized in that the obtaining of the human voice audio data and the switching points according to the audio data comprises: inputting the audio data into a human voice separation model to obtain the human voice audio data, wherein the human voice separation model is a model obtained by training a neural network with known audio samples and corresponding known human voice data samples; and, at the same time, inputting the audio data into a switching point detection model to obtain the switching points, wherein the switching point detection model is a model obtained by training a neural network with known audio samples and corresponding known switching point samples.
- 4. An audio processing method, characterized in that the method comprises: acquiring audio data, wherein the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer; converting the audio data into corresponding text data through a voice recognition method; obtaining human voice audio data and switching points according to the audio data and the text data, wherein the human voice audio data is the audio data obtained after the noise is removed from the audio data, and a switching point is a conversation time point at which any one of the n interlocutors switches to another interlocutor; converting the human voice audio data into m pieces of monophonic data according to the switching points, wherein each of the m pieces of monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; clustering the m pieces of monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belong to the same interlocutor; and confirming, according to the text data, the interlocutor to which each of the n audio groups belongs.
- 5. The method according to claim 4, characterized in that the obtaining of the human voice audio data and the switching points according to the audio data and the text data comprises: inputting the audio data into a human voice separation model to obtain the human voice audio data, wherein the human voice separation model is a model obtained by training a neural network with known audio samples and corresponding known human voice data samples; and, at the same time, inputting the audio data and the text data into a switching point detection model to obtain the switching points, wherein the switching point detection model is a model obtained by training a neural network with known audio samples, known text data samples, and corresponding known switching point samples.
- 6. An audio processing system, characterized in that the system comprises: an acquiring unit, configured to acquire audio data, wherein the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer; an obtaining unit, configured to obtain human voice audio data and switching points according to the audio data, wherein the human voice audio data is the audio data obtained after the noise is removed from the audio data, and a switching point is a conversation time point at which any one of the n interlocutors switches to another interlocutor; a conversion unit, configured to convert the human voice audio data into m pieces of monophonic data according to the switching points, wherein each of the m pieces of monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; and a clustering unit, configured to cluster the m pieces of monophonic data to obtain n audio groups, wherein the monophonic data in each of the n audio groups belong to the same interlocutor.
- 7. The system according to claim 6, characterized in that the system further comprises: a confirmation unit, configured to convert the m pieces of monophonic data into m corresponding pieces of text information through a voice recognition method, and to confirm, according to the m pieces of text information, the interlocutor to which each of the n audio groups belongs.
- 8. The system according to claim 6, characterized in that the obtaining unit is further configured to: input the audio data into a human voice separation model to obtain the human voice audio data, wherein the human voice separation model is a model obtained by training a neural network with known audio samples and corresponding known human voice data samples; and, at the same time, input the audio data into a switching point detection model to obtain the switching points, wherein the switching point detection model is a model obtained by training a neural network with known audio samples and corresponding known switching point samples.
- 9. An audio processing system, characterized in that the system comprises: an acquiring unit, configured to acquire audio data, wherein the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer; a first conversion unit, configured to convert the audio data into audio text through a voice recognition method; an obtaining unit, configured to obtain human voice audio data and switching points according to the audio data and the audio text, wherein the human voice audio data is the audio data obtained after the noise is removed from the audio data, and a switching point is a conversation time point at which any one of the n interlocutors switches to another interlocutor; a second conversion unit, configured to convert the human voice audio data into m pieces of monophonic data according to the human voice audio data and the switching points, wherein each of the m pieces of monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; a clustering unit, configured to cluster the m pieces of monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belong to the same interlocutor; and a confirmation unit, configured to confirm, according to the audio text, the interlocutor to which each of the n audio groups belongs.
- 10. The system according to claim 9, characterized in that the obtaining unit is further configured to: input the audio data into a human voice separation model to obtain the human voice audio data, wherein the human voice separation model is a model obtained by training a neural network with known audio samples and corresponding known human voice data samples; and, at the same time, input the audio data and the audio text into a switching point detection model to obtain the switching points, wherein the switching point detection model is a model obtained by training a neural network with known audio samples, known audio text samples, and corresponding known switching point samples.
- 11. A server, characterized in that the server comprises a processor and a memory, wherein the memory is configured to store instructions and the processor is configured to execute the instructions; when executing the instructions, the processor performs the method according to any one of claims 1 to 3.
- 12. A server, characterized in that the server comprises a processor and a memory, wherein the memory is configured to store instructions and the processor is configured to execute the instructions; when executing the instructions, the processor performs the following steps: acquiring audio data, wherein the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer; converting the audio data into corresponding text data through a voice recognition method; obtaining human voice audio data and switching points according to the audio data and the text data, wherein the human voice audio data is the audio data obtained after the noise is removed from the audio data, and a switching point is a conversation time point at which any one of the n interlocutors switches to another interlocutor; converting the human voice audio data into m pieces of monophonic data according to the switching points, wherein each of the m pieces of monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; clustering the m pieces of monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belong to the same interlocutor; and confirming, according to the text data, the interlocutor to which each of the n audio groups belongs.
- 13. The server according to claim 12, characterized in that, when executing the instructions, the processor further performs the following steps: inputting the audio data into a human voice separation model to obtain the human voice audio data, wherein the human voice separation model is a model obtained by training a neural network with known audio samples and corresponding known human voice data samples; and, at the same time, inputting the audio data and the text data into a switching point detection model to obtain the switching points, wherein the switching point detection model is a model obtained by training a neural network with known audio samples, known text data samples, and corresponding known switching point samples.
- 14. A non-transitory computer storage medium, characterized in that the non-transitory computer storage medium stores a computer program, and when the computer program is executed by a computing device, the method according to any one of claims 1 to 3 is implemented.
- 15. A non-transitory computer storage medium, characterized in that the non-transitory computer storage medium stores a computer program, and when the computer program is executed by a computing device, the following steps are implemented: acquiring audio data, wherein the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer; converting the audio data into corresponding text data through a voice recognition method; obtaining human voice audio data and switching points according to the audio data and the text data, wherein the human voice audio data is the audio data obtained after the noise is removed from the audio data, and a switching point is a conversation time point at which any one of the n interlocutors switches to another interlocutor; converting the human voice audio data into m pieces of monophonic data according to the switching points, wherein each of the m pieces of monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; clustering the m pieces of monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belong to the same interlocutor; and confirming, according to the text data, the interlocutor to which each of the n audio groups belongs.
- 16. The non-transitory computer storage medium according to claim 15, characterized in that, when the computer program is executed by a computing device, the following steps are further implemented: inputting the audio data into a human voice separation model to obtain the human voice audio data, wherein the human voice separation model is a model obtained by training a neural network with known audio samples and corresponding known human voice data samples; and, at the same time, inputting the audio data and the text data into a switching point detection model to obtain the switching points, wherein the switching point detection model is a model obtained by training a neural network with known audio samples, known text data samples, and corresponding known switching point samples.
- 17. A computer program product, characterized in that, when the computer program product is read and executed by a computer, the method according to any one of claims 1 to 3 is implemented.
- 18. A computer program product, characterized in that, when the computer program product is read and executed by a computer, the following steps are implemented: acquiring audio data, wherein the audio data includes noise and human voices generated when n interlocutors conduct a conversation, and n is a positive integer; converting the audio data into corresponding text data through a voice recognition method; obtaining human voice audio data and switching points according to the audio data and the text data, wherein the human voice audio data is the audio data obtained after the noise is removed from the audio data, and a switching point is a conversation time point at which any one of the n interlocutors switches to another interlocutor; converting the human voice audio data into m pieces of monophonic data according to the switching points, wherein each of the m pieces of monophonic data includes only the human voice of a single interlocutor, and m is a positive integer; clustering the m pieces of monophonic data to obtain n audio groups, wherein the monophonic data of each of the n audio groups belong to the same interlocutor; and confirming, according to the text data, the interlocutor to which each of the n audio groups belongs.
- 19. The computer program product according to claim 18, characterized in that, when the computer program product is read and executed by a computer, the following steps are further implemented: inputting the audio data into a human voice separation model to obtain the human voice audio data, wherein the human voice separation model is a model obtained by training a neural network with known audio samples and corresponding known human voice data samples; and, at the same time, inputting the audio data and the text data into a switching point detection model to obtain the switching points, wherein the switching point detection model is a model obtained by training a neural network with known audio samples, known text data samples, and corresponding known switching point samples.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910453493.7A CN110335621A (en) | 2019-05-28 | 2019-05-28 | Method, system and the relevant device of audio processing |
CN201910453493.7 | 2019-05-28 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020238209A1 true WO2020238209A1 (en) | 2020-12-03 |
Family
ID=68140272
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/130550 WO2020238209A1 (en) | 2019-05-28 | 2019-12-31 | Audio processing method, system and related device |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110335621A (en) |
WO (1) | WO2020238209A1 (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110335621A (en) * | 2019-05-28 | 2019-10-15 | 深圳追一科技有限公司 | Method, system and the relevant device of audio processing |
CN110930989B (en) * | 2019-11-27 | 2021-04-06 | 深圳追一科技有限公司 | Speech intention recognition method and device, computer equipment and storage medium |
CN111243595B (en) * | 2019-12-31 | 2022-12-27 | 京东科技控股股份有限公司 | Information processing method and device |
CN111968679B (en) * | 2020-10-22 | 2021-01-29 | 深圳追一科技有限公司 | Emotion recognition method and device, electronic equipment and storage medium |
CN112562644A (en) * | 2020-12-03 | 2021-03-26 | 云知声智能科技股份有限公司 | Customer service quality inspection method, system, equipment and medium based on human voice separation |
CN112669855A (en) * | 2020-12-17 | 2021-04-16 | 北京沃东天骏信息技术有限公司 | Voice processing method and device |
CN112735384A (en) * | 2020-12-28 | 2021-04-30 | 科大讯飞股份有限公司 | Turning point detection method, device and equipment applied to speaker separation |
CN112802498B (en) * | 2020-12-29 | 2023-11-24 | 深圳追一科技有限公司 | Voice detection method, device, computer equipment and storage medium |
CN112289323B (en) * | 2020-12-29 | 2021-05-28 | 深圳追一科技有限公司 | Voice data processing method and device, computer equipment and storage medium |
CN112966082A (en) * | 2021-03-05 | 2021-06-15 | 北京百度网讯科技有限公司 | Audio quality inspection method, device, equipment and storage medium |
CN113593578A (en) * | 2021-09-03 | 2021-11-02 | 北京紫涓科技有限公司 | Conference voice data acquisition method and system |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8543402B1 (en) * | 2010-04-30 | 2013-09-24 | The Intellisis Corporation | Speaker segmentation in noisy conversational speech |
CN107578770A (en) * | 2017-08-31 | 2018-01-12 | 百度在线网络技术(北京)有限公司 | Networking telephone audio recognition method, device, computer equipment and storage medium |
US20180158451A1 (en) * | 2016-12-01 | 2018-06-07 | International Business Machines Corporation | Prefix methods for diarization in streaming mode |
CN108399923A (en) * | 2018-02-01 | 2018-08-14 | 深圳市鹰硕技术有限公司 | More human hairs call the turn spokesman's recognition methods and device |
US20180286409A1 (en) * | 2017-03-31 | 2018-10-04 | International Business Machines Corporation | Speaker diarization with cluster transfer |
CN108922538A (en) * | 2018-05-29 | 2018-11-30 | 平安科技(深圳)有限公司 | Conferencing information recording method, device, computer equipment and storage medium |
CN109300470A (en) * | 2018-09-17 | 2019-02-01 | 平安科技(深圳)有限公司 | Audio mixing separation method and audio mixing separator |
CN110335621A (en) * | 2019-05-28 | 2019-10-15 | 深圳追一科技有限公司 | Method, system and the relevant device of audio processing |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN100466671C (en) * | 2004-05-14 | 2009-03-04 | 华为技术有限公司 | Method and device for switching speeches |
JP5103907B2 (en) * | 2005-01-17 | 2012-12-19 | 日本電気株式会社 | Speech recognition system, speech recognition method, and speech recognition program |
CN101452704B (en) * | 2007-11-29 | 2011-05-11 | 中国科学院声学研究所 | Speaker clustering method based on information transfer |
CN105161093B (en) * | 2015-10-14 | 2019-07-09 | 科大讯飞股份有限公司 | A kind of method and system judging speaker's number |
CN105895078A (en) * | 2015-11-26 | 2016-08-24 | 乐视致新电子科技(天津)有限公司 | Speech recognition method used for dynamically selecting speech model and device |
US9942518B1 (en) * | 2017-02-28 | 2018-04-10 | Cisco Technology, Inc. | Group and conversational framing for speaker tracking in a video conference system |
CN108766459B (en) * | 2018-06-13 | 2020-07-17 | 北京联合大学 | Target speaker estimation method and system in multi-user voice mixing |
CN109634692A (en) * | 2018-10-23 | 2019-04-16 | 蔚来汽车有限公司 | Vehicle-mounted conversational system and processing method and system for it |
CN109741754A (en) * | 2018-12-10 | 2019-05-10 | 上海思创华信信息技术有限公司 | A kind of conference voice recognition methods and system, storage medium and terminal |
CN109545228A (en) * | 2018-12-14 | 2019-03-29 | 厦门快商通信息技术有限公司 | A kind of end-to-end speaker's dividing method and system |
- 2019-05-28 CN CN201910453493.7A patent/CN110335621A/en active Pending
- 2019-12-31 WO PCT/CN2019/130550 patent/WO2020238209A1/en active Application Filing
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8543402B1 (en) * | 2010-04-30 | 2013-09-24 | The Intellisis Corporation | Speaker segmentation in noisy conversational speech |
US20180158451A1 (en) * | 2016-12-01 | 2018-06-07 | International Business Machines Corporation | Prefix methods for diarization in streaming mode |
US20180286409A1 (en) * | 2017-03-31 | 2018-10-04 | International Business Machines Corporation | Speaker diarization with cluster transfer |
CN107578770A (en) * | 2017-08-31 | 2018-01-12 | 百度在线网络技术(北京)有限公司 | Networking telephone audio recognition method, device, computer equipment and storage medium |
CN108399923A (en) * | 2018-02-01 | 2018-08-14 | 深圳市鹰硕技术有限公司 | More human hairs call the turn spokesman's recognition methods and device |
CN108922538A (en) * | 2018-05-29 | 2018-11-30 | 平安科技(深圳)有限公司 | Conferencing information recording method, device, computer equipment and storage medium |
CN109300470A (en) * | 2018-09-17 | 2019-02-01 | 平安科技(深圳)有限公司 | Audio mixing separation method and audio mixing separator |
CN110335621A (en) * | 2019-05-28 | 2019-10-15 | 深圳追一科技有限公司 | Method, system and the relevant device of audio processing |
Also Published As
Publication number | Publication date |
---|---|
CN110335621A (en) | 2019-10-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020238209A1 (en) | Audio processing method, system and related device | |
US10902843B2 (en) | Using recurrent neural network for partitioning of audio data into segments that each correspond to a speech feature cluster identifier | |
US10249292B2 (en) | Using long short-term memory recurrent neural network for speaker diarization segmentation | |
US10748531B2 (en) | Management layer for multiple intelligent personal assistant services | |
US20230019978A1 (en) | Automatic speech recognition correction | |
US10516782B2 (en) | Conference searching and playback of search results | |
EP3254455B1 (en) | Selective conference digest | |
US9293133B2 (en) | Improving voice communication over a network | |
US20180336902A1 (en) | Conference segmentation based on conversational dynamics | |
US10811005B2 (en) | Adapting voice input processing based on voice input characteristics | |
US20180027351A1 (en) | Optimized virtual scene layout for spatial meeting playback | |
WO2016126813A2 (en) | Scheduling playback of audio in a virtual acoustic space | |
WO2022105861A1 (en) | Method and apparatus for recognizing voice, electronic device and medium | |
Triantafyllopoulos et al. | Deep speaker conditioning for speech emotion recognition | |
US20200004878A1 (en) | System and method for generating dialogue graphs | |
US11270691B2 (en) | Voice interaction system, its processing method, and program therefor | |
CN111798833A (en) | Voice test method, device, equipment and storage medium | |
US10762906B2 (en) | Automatically identifying speakers in real-time through media processing with dialog understanding supported by AI techniques | |
CN108877779B (en) | Method and device for detecting voice tail point | |
WO2023048746A1 (en) | Speaker-turn-based online speaker diarization with constrained spectral clustering | |
CN113779208A (en) | Method and device for man-machine conversation | |
CN113129866B (en) | Voice processing method, device, storage medium and computer equipment | |
WO2019155716A1 (en) | Information processing device, information processing system, information processing method, and program | |
CN111354350A (en) | Voice processing method and device, voice processing equipment and electronic equipment | |
US20220201121A1 (en) | System, method and apparatus for conversational guidance |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19931010; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 19931010; Country of ref document: EP; Kind code of ref document: A1 |
| 32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 29.04.2022) |