CN110335621A - Method, system and related device for audio processing - Google Patents

Method, system and related device for audio processing

Info

Publication number
CN110335621A
CN110335621A (application CN201910453493.7A)
Authority
CN
China
Prior art keywords
audio
data
voice
audio data
monophone
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910453493.7A
Other languages
Chinese (zh)
Inventor
周维聪
涂臻
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Chase Technology Co Ltd
Shenzhen Zhuiyi Technology Co Ltd
Original Assignee
Shenzhen Chase Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Chase Technology Co Ltd filed Critical Shenzhen Chase Technology Co Ltd
Priority to CN201910453493.7A priority Critical patent/CN110335621A/en
Publication of CN110335621A publication Critical patent/CN110335621A/en
Priority to PCT/CN2019/130550 priority patent/WO2020238209A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G10L 17/06 Decision making techniques; Pattern matching strategies
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G10L 17/06 Decision making techniques; Pattern matching strategies
    • G10L 17/14 Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G10L 17/20 Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0272 Voice signal separating
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/78 Detection of presence or absence of voice signals
    • G10L 25/87 Detection of discrete points within a voice signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Quality & Reliability (AREA)
  • Telephonic Communication Services (AREA)

Abstract

This application provides an audio processing method, system and related devices. The method includes: obtaining audio data; obtaining voice audio data and switching points from the audio data; converting the voice audio data into m single-speaker segments according to the switching points; and clustering the m segments into n audio groups, where the segments in each of the n audio groups belong to the same interlocutor.

Description

Method, system and related device for audio processing
Technical field
This application relates to the field of artificial intelligence, and in particular to an audio processing method, system and related devices.
Background art
Voice technology has become part of everyday life and has made daily routines more convenient. With the continuous improvement of audio signal processing, extracting specific voices of interest from massive audio data (such as telephone recordings, news broadcasts and meeting recordings) has become a major research topic. One way to obtain a specific voice is a speaker diarization (Speaker Diarization, SD) system: speaker diarization is the process of partitioning the audio data of a multi-person conversation by speaker and labeling each part.
Existing speaker diarization systems reach practical accuracy in clean, near-field environments, but when the environment is relatively complex, separating a single-channel recording in which several speakers overlap yields low accuracy.
Summary of the invention
This application provides an audio processing method, system and related devices to address the low separation accuracy of current speaker diarization systems.
In a first aspect, a method of audio processing is provided. The method comprises:
obtaining audio data, wherein the audio data contains noise and the voices produced when n interlocutors converse, n being a positive integer;
obtaining voice audio data and switching points from the audio data, wherein the voice audio data is the conversational voice of the n interlocutors that remains after the noise is removed from the audio data, and a switching point is a moment at which one of the n interlocutors hands the conversation over to another;
converting the voice audio data into m single-speaker segments according to the switching points, wherein each of the m segments contains the voice of only one interlocutor, m being a positive integer; and
clustering the m segments into n audio groups, wherein the segments in each of the n audio groups belong to the same interlocutor.
Optionally, after the m segments are clustered into n audio groups, the method further includes: converting the m segments into m corresponding pieces of text by a speech recognition method; and identifying, from the m pieces of text, the interlocutor to whom each of the n audio groups belongs.
Optionally, obtaining the voice audio data and the switching points from the audio data includes: feeding the audio data into a voice separation model to obtain the voice audio data, the voice separation model being a neural network trained on known audio samples and the corresponding known voice data samples; and, in parallel, feeding the audio data into a switching point detection model to obtain the switching points, the switching point detection model being a neural network trained on known audio samples and the corresponding known switching point samples.
Second aspect provides the method for another audio processing, which comprises
Obtain audio data, wherein include the people generated when noise and n interlocutor engage in the dialogue in the audio data Sound, n are positive integers;
By audio recognition method, the audio data is converted into corresponding text data;
According to the audio data and text data, voice audio data and switching point are obtained, wherein the voice Audio data is that the audio data removes the audio data obtained after the noise, and the switching point is the n interlocutor In any one interlocutor be switched to the talk time point of another interlocutor;
According to the switching point, the voice data are converted into m monophone data, wherein in the m monophone data Each monophone data only include the voice of single interlocutor, m is positive integer;
The m monophone data are clustered, n audio group is obtained, wherein each sound in the n audio group The monophone data of frequency group belong to the same interlocutor;
According to the audio text confirm each audio group in the n audio group belonging to interlocutor.
Optionally, obtaining the voice audio data and the switching points from the audio data and the text data includes: feeding the audio data into a voice separation model to obtain the voice audio data, the voice separation model being a neural network trained on known audio samples and the corresponding known voice data samples; and, in parallel, feeding the audio data and the text data into a switching point detection model to obtain the switching points, the switching point detection model being a neural network trained on known audio samples, known text samples and the corresponding known switching point samples.
In a third aspect, an audio processing system is provided. The system comprises:
an acquiring unit for obtaining audio data, wherein the audio data contains noise and the voices produced when n interlocutors converse, n being a positive integer;
an obtaining unit for obtaining voice audio data and switching points from the audio data, wherein the voice audio data is the audio that remains after the noise is removed from the audio data, and a switching point is a moment at which one of the n interlocutors hands the conversation over to another;
a converting unit for converting the voice audio data into m single-speaker segments according to the switching points, wherein each of the m segments contains the voice of only one interlocutor, m being a positive integer; and
a clustering unit for clustering the m segments into n audio groups, wherein the segments in each of the n audio groups belong to the same interlocutor.
Optionally, the system further includes a confirmation unit for: after the m segments are clustered into n audio groups, converting the m segments into m corresponding pieces of text by a speech recognition method; and identifying, from the m pieces of text, the interlocutor to whom each of the n audio groups belongs.
Optionally, the obtaining unit is configured to: feed the audio data into a voice separation model to obtain the voice audio data, the voice separation model being a neural network trained on known audio samples and the corresponding known voice data samples; and, in parallel, feed the audio data into a switching point detection model to obtain the switching points, the switching point detection model being a neural network trained on known audio samples and the corresponding known switching point samples.
In a fourth aspect, this application provides an audio processing system. The system comprises:
an acquiring unit for obtaining audio data, wherein the audio data contains noise and the voices produced when n interlocutors converse, n being a positive integer;
a first converting unit for converting the audio data into text data by a speech recognition method;
an obtaining unit for obtaining voice audio data and switching points from the audio data and the text data, wherein the voice audio data is the audio that remains after the noise is removed from the audio data, and a switching point is a moment at which one of the n interlocutors hands the conversation over to another;
a second converting unit for converting the voice audio data into m single-speaker segments according to the voice audio data and the switching points, wherein each of the m segments contains the voice of only one interlocutor, m being a positive integer;
a clustering unit for clustering the m segments into n audio groups, wherein the segments in each of the n audio groups belong to the same interlocutor; and
a confirmation unit for identifying, from the text data, the interlocutor to whom each of the n audio groups belongs.
Optionally, the obtaining unit is configured to: feed the audio data into a voice separation model to obtain the voice audio data, the voice separation model being a neural network trained on known audio samples and the corresponding known voice data samples; and, in parallel, feed the audio data and the text data into a switching point detection model to obtain the switching points, the switching point detection model being a neural network trained on known audio samples, known text samples and the corresponding known switching point samples.
In a fifth aspect, a server is provided. The server includes a processor and a memory; the memory stores instructions, and the processor, when executing the instructions, performs the method described in the first aspect.
In a sixth aspect, a server is provided. The server includes a processor and a memory; the memory stores instructions, and the processor, when executing the instructions, performs the method described in the second aspect.
In a seventh aspect, a non-transitory computer storage medium is provided. The medium stores a computer program which, when executed by a computing device, implements the method described in the first aspect.
In an eighth aspect, a non-transitory computer storage medium is provided. The medium stores a computer program which, when executed by a computing device, implements the method described in the second aspect.
In a ninth aspect, a computer program product is provided which, when read and executed by a computer, implements the method described in the first aspect.
In a tenth aspect, a computer program product is provided which, when read and executed by a computer, implements the method described in the second aspect.
With the audio processing method, system and related devices provided by this application, audio data is obtained; voice audio data and switching points are obtained from the audio data; the voice audio data is converted into m single-speaker segments according to the switching points; and the m segments are clustered into n audio groups, where the segments in each of the n audio groups belong to the same interlocutor, so that the identity of the voice in each of the n audio groups can be confirmed. Because the audio data passes through a voice separation model, interfering factors such as noise are largely eliminated, which improves diarization accuracy in complex environments; moreover, several steps of the detection stage can run in parallel, which simplifies the diarization pipeline and increases its speed.
Brief description of the drawings
To describe the technical solutions of the embodiments of this application, or of the prior art, more clearly, the accompanying drawings needed for that description are briefly introduced below. The drawings described below are obviously only some embodiments of this application; a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic structural diagram of a speaker diarization system provided by this application;
Fig. 2 is a schematic flowchart of an audio processing method provided by this application;
Fig. 3 is a schematic flowchart of clustering m single-speaker segments into n audio groups in one application scenario provided by this application;
Fig. 4 is a detailed schematic flowchart of an audio processing method provided by this application;
Fig. 5 is a schematic flowchart of another audio processing method provided by this application;
Fig. 6 is a detailed schematic flowchart of another audio processing method provided by this application;
Fig. 7 is a schematic structural diagram of an audio processing system provided by this application;
Fig. 8 is a schematic structural diagram of another audio processing system provided by this application;
Fig. 9 is a schematic structural diagram of a server provided by this application.
Specific embodiments
This application is described in further detail below through specific embodiments in combination with the accompanying drawings. In the following embodiments, many details are given so that this application can be better understood. However, those skilled in the art will readily recognize that some of these features may be omitted in various situations, or replaced by other methods. In some cases, operations related to this application are not shown or described in the specification, to avoid burying the core of this application in excessive description. For those skilled in the art a detailed description of these operations is not necessary; they can fully understand them from the specification and the general technical knowledge of the field.
To make this solution easier to understand, the speaker diarization system is briefly introduced first.
Speaker diarization is the process of partitioning the audio data of a multi-person conversation by speaker and labeling each part. In practice, speaker diarization systems are generally based on the Bayesian Information Criterion (BIC) as the similarity measure. As shown in Fig. 1, such a system passes the audio sequentially through an input module 101, a silence detection module 102, a speaker recognition module 103, a switching point detection module 104, a conversion module 105, a classification module 106 and an output module 107 to produce the diarized audio. The silence detection module 102 removes the silent parts of the input audio data to obtain second audio data. The speaker recognition module 103 has learned speakers' voiceprint features in a large number of business scenarios; for example, if the system is meant to separate a customer-service agent from a user, the module learns many agent and user voiceprint features, such as intonation and rhythm, so that it can determine the identity of the speaker of each piece of dialogue in the second audio data, yielding third audio data. The switching point detection module 104 determines the speakers' dialogue switching points from the third audio data. The conversion module 105 clips the third audio data into multiple pieces according to the switching points. The classification module 106 classifies these pieces by the speaker identities detected by the speaker recognition module 103, producing the diarization result. A minimal sketch of the BIC change test is given below.
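As an illustration of the conventional BIC criterion mentioned above (not this application's method), the following sketch computes the delta-BIC between two feature segments; the lambda value and the small ridge term are illustrative assumptions:

    import numpy as np

    def delta_bic(x: np.ndarray, y: np.ndarray, lam: float = 1.0) -> float:
        """Delta-BIC between two feature segments of shape (frames, dims).

        Positive values suggest the two segments come from different speakers.
        """
        z = np.vstack([x, y])
        n, d = z.shape
        n1, n2 = len(x), len(y)

        def logdet(a):
            # log-determinant of the sample covariance, with a small ridge
            # so near-singular covariances do not blow up
            cov = np.cov(a, rowvar=False) + 1e-6 * np.eye(a.shape[1])
            return np.linalg.slogdet(cov)[1]

        penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
        return 0.5 * (n * logdet(z) - n1 * logdet(x) - n2 * logdet(y)) - penalty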
In summary, such a system not only needs to train several models on business-scenario data in the training stage (a silence detection model, a speaker recognition model and a switching point model); its detection pipeline is also long, passing sequentially through the seven modules shown in Fig. 1 before a result is obtained, so its time overhead is large.
This application therefore provides an audio processing method with a short detection pipeline and few models to train, which greatly reduces the time overhead. Fig. 2 shows an audio processing method provided by this application; the method comprises the following steps:
S201: Obtain audio data.
In a specific implementation, the audio data contains noise and the voices produced when n interlocutors converse, n being a positive integer. The audio data may be an audio file that needs speaker diarization, such as a telephone recording or a video soundtrack, for example a recording of a user's call with customer service or a recording of a meeting. The notion of noise in this application covers any non-target voice that does not need to be diarized afterwards: ambient sound, silence, noise produced by the recording equipment, and so on. Note that ambient sound may itself contain speech; for example, user A may call customer-service agent B from a noisy restaurant, and although other people's speech in the restaurant is voice, it is not a target voice for the subsequent diarization and is therefore also counted as noise. The above example is for illustration only and does not constitute a specific limitation.
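For concreteness, a minimal sketch of step S201 under the assumption that the recording is a mono WAV file; soundfile is one common way to read it into an array (the file name is hypothetical):

    import soundfile as sf

    audio, sample_rate = sf.read("call_recording.wav")  # hypothetical file name
    print(f"{len(audio) / sample_rate:.1f} s of audio at {sample_rate} Hz")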
S202: Obtain voice audio data and switching points from the audio data.
In a specific implementation, the voice audio data is the conversational voice of the n interlocutors that remains after the noise is removed from the audio data, and a switching point is a moment at which one of the n interlocutors hands the conversation over to another. For example, assume n = 2, i.e. two interlocutors, user A and customer-service agent B. In one application scenario, their dialogue audio could be rendered in text as:
(00:12) A: " I wants to look into telephone expenses "
(00:14) B: " may I ask you is to look into the machine telephone expenses "
(00:18) A: " yes "
(00:20) B: " just a moment,please "
(00:25) B: " there are also 25 yuan for your credit balance "
The number before A or B is the playback time of the audio data, i.e. a point on its time axis. Here there are 3 switching points: 00:14, 00:18 and 00:20. Note that the dialogue is shown in text form above only to express the idea of the audio processing method more clearly; in the actual process, step S202 performs no speech recognition, so the switching points are obtained purely from the dialogue audio of A and B, without knowing its content. Again, the example is for illustration only and does not constitute a specific limitation.
In a specific embodiment, obtaining the voice audio data and the switching points from the audio data includes: feeding the audio data into a voice separation model to obtain the voice audio data, the voice separation model being a neural network trained on known audio samples and the corresponding known voice data samples; and, in parallel, feeding the audio data into a switching point detection model (Speaker Change Detection, SCD) to obtain the switching points, the switching point detection model being a neural network trained on known audio samples and the corresponding known switching point samples. The voice separation model may in particular be a Voice Activity Detection (VAD) model that, following an event detection approach, learns the distinguishing features between voice and non-voice. The VAD result largely eliminates interfering factors such as ambient noise, non-target voices and equipment noise, so the separation is more accurate than the silent/non-silent detection alone in the embodiment of Fig. 1.
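As an illustration of voice/non-voice separation, the sketch below uses WebRTC's off-the-shelf VAD as a stand-in for the trained voice separation model described above (an assumption: this application trains its own neural model; webrtcvad is only a proxy). It expects 16-bit mono PCM and a frame length of 10, 20 or 30 ms:

    import webrtcvad

    def speech_frames(pcm16: bytes, sample_rate: int = 16000, frame_ms: int = 30):
        """Yield (offset_seconds, frame_bytes) for frames classified as speech."""
        vad = webrtcvad.Vad(2)  # aggressiveness 0-3
        frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per sample
        for i in range(0, len(pcm16) - frame_bytes + 1, frame_bytes):
            frame = pcm16[i : i + frame_bytes]
            if vad.is_speech(frame, sample_rate):
                yield i / (2 * sample_rate), frame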
Note that in step S202 both the voice separation model and the switching point detection model take the audio data as input, so voice separation and switching point detection can run in parallel without interfering with each other, which reduces the time needed for diarization. Moreover, the switching points are moments on the time axis of the audio data, and the voice audio data obtained after noise removal keeps that same time axis; the next step can therefore use the switching points to convert the voice audio data into m single-speaker segments on the shared time axis. In addition, the switching point detection model and the voice separation model can make full use of labeled data from the relevant business scenario for training, further improving the accuracy of switching point detection and voice separation.
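A minimal sketch of this parallelism, with the two trained models passed in as assumed callables (run_vad and run_scd are hypothetical wrappers, not names from this application):

    from concurrent.futures import ThreadPoolExecutor

    def detect(audio_data, run_vad, run_scd):
        """Run voice separation and switching point detection concurrently."""
        with ThreadPoolExecutor(max_workers=2) as pool:
            vad_future = pool.submit(run_vad, audio_data)  # -> voice audio data
            scd_future = pool.submit(run_scd, audio_data)  # -> switching points
            return vad_future.result(), scd_future.result()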
S203: Convert the voice audio data into m single-speaker segments according to the switching points.
In a specific implementation, each of the m segments contains the voice of only one interlocutor, m being a positive integer. Continuing the example above, once the 3 switching points are obtained, the audio data can be converted into 4 single-speaker segments, as shown in Fig. 3. Each segment holds the voice of exactly one interlocutor; for instance, a segment may be the audio of the single sentence "I'd like to check my phone bill", or the audio of the two sentences "One moment, please. Your account balance is 25 yuan." In other words, the voice in a segment is only user A's or only agent B's. The example is for illustration only and does not constitute a specific limitation. Fig. 3 displays the segments in text form purely to make the definition of a single-speaker segment easier to grasp; in reality, step S203 performs no speech recognition, so a segment is only audio data and contains no text.
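A minimal sketch of step S203, assuming the switching points are given in seconds on the shared time axis (variable names are illustrative):

    import numpy as np

    def split_at_switch_points(voice_audio: np.ndarray, sample_rate: int,
                               switch_points: list[float]) -> list[np.ndarray]:
        bounds = ([0] + [int(t * sample_rate) for t in sorted(switch_points)]
                  + [len(voice_audio)])
        return [voice_audio[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]

    # e.g. 3 switching points at 00:14, 00:18, 00:20 yield m = 4 segments:
    # segments = split_at_switch_points(voice_audio, 16000, [14.0, 18.0, 20.0])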
S204: Cluster the m single-speaker segments into n audio groups.
In a specific implementation, the segments in each of the n audio groups belong to the same interlocutor. Continuing the example, after the 4 segments are obtained, clustering them yields 2 audio groups, as shown in Fig. 3: one group is the 2 segments corresponding to "I'd like to check my phone bill" and "Yes"; the other is the 2 segments corresponding to "May I ask, is it the bill for this number?" and "One moment, please. Your account balance is 25 yuan." In this way the audio data containing the voices of n interlocutors is finally converted into n audio groups.
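A minimal sketch of step S204, under the assumption that each segment is first mapped to a fixed-length speaker embedding (embed is a hypothetical embedding function, e.g. an i-vector or x-vector extractor, which this application does not specify); agglomerative clustering then groups the m segments into n clusters:

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    def group_by_speaker(segments, embed, n_speakers=2):
        x = np.stack([embed(s) for s in segments])  # (m, embedding_dim)
        labels = AgglomerativeClustering(n_clusters=n_speakers).fit_predict(x)
        groups = [[] for _ in range(n_speakers)]
        for seg, label in zip(segments, labels):
            groups[label].append(seg)
        return groups  # n audio groups, one per interlocutor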
In a specific embodiment, after the m segments are clustered into n audio groups, the method further includes: converting the m segments into m corresponding pieces of text by a speech recognition method (Automatic Speech Recognition, ASR); and identifying, from the m pieces of text, the interlocutor to whom each of the n audio groups belongs. In a specific implementation, each group's interlocutor can be determined by keywords or key phrases in the text. Continuing the example, the keywords "may I ask" and "one moment, please" identify the interlocutor of class 2 as the agent, and hence the interlocutor of class 1 as the user; alternatively, the keywords "I'd like" and "check my phone bill" identify the interlocutor of class 1 as the user, and hence the interlocutor of class 2 as the agent. This application does not specifically limit the choice of keywords.
In a specific implementation, the m segments may be converted into the m corresponding pieces of text by speech recognition methods such as Dynamic Time Warping (DTW), Hidden Markov Models (HMM), Vector Quantization (VQ) or Artificial Neural Networks (ANN); this application does not specifically limit the method.
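Once the segments have been transcribed by one of the ASR methods above, the keyword-based identity step can look like the sketch below; the keyword lists are illustrative assumptions, since this application does not fix them:

    AGENT_KEYWORDS = ("may i ask", "one moment, please")
    USER_KEYWORDS = ("i'd like", "check my phone bill")

    def assign_roles(group_texts: list[str]) -> list[str]:
        """Assign a role to each audio group from its concatenated transcript."""
        roles = []
        for text in group_texts:
            lowered = text.lower()
            if any(k in lowered for k in AGENT_KEYWORDS):
                roles.append("customer service")
            elif any(k in lowered for k in USER_KEYWORDS):
                roles.append("user")
            else:
                roles.append("unknown")
        return roles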
Note that the input to the clustering process is the m segments, and the input to the speech recognition process can also be the m segments, so clustering and speech recognition can likewise run in parallel, further speeding up the whole diarization process and improving the user experience. Fig. 4 shows a flowchart of the audio processing method provided by this application. Compared with the diarization method of the Fig. 1 embodiment, the training stage only needs to train a voice separation model and a switching point detection model on business-scenario data, which greatly reduces training time and labor; the steps of the detection stage can run in parallel, so the time needed for diarization is significantly shortened; and using the voice separation model to remove noise interference also greatly improves accuracy.
In the method above, audio data is obtained; voice audio data and switching points are obtained from the audio data; the voice audio data is converted into m single-speaker segments according to the switching points; and the m segments are clustered into n audio groups, where the segments in each group belong to the same interlocutor, so that the identity of the voice in each of the n audio groups can be confirmed. Because the audio data passes through the voice separation model, interfering factors such as noise are largely eliminated, improving diarization accuracy in complex environments; and since several steps of the detection stage run in parallel, the diarization pipeline is simplified and faster.
Fig. 5 shows another audio processing method provided by this application. It differs from the method of Fig. 2 in that speech recognition is performed first, yielding textual features of the audio data, so that the switching point detection model can consider audio features and text features together and detect the interlocutors' switching points more accurately. The method of Fig. 5 comprises the following steps:
S501: Obtain audio data.
In a specific implementation, the audio data contains noise and the voices produced when n interlocutors converse, n being a positive integer. As in step S201, the audio data may be a telephone recording or video soundtrack that needs speaker diarization, such as a recording of a user's call with customer service or of a meeting, and noise covers any non-target voice that does not need to be diarized afterwards, including ambient sound, silence, equipment noise and even background speech (for example, when user A calls agent B from a noisy restaurant). The example is for illustration only and does not constitute a specific limitation.
S502: Convert the audio data into corresponding text data by a speech recognition method.
In a specific implementation, the audio data may be converted into the corresponding text data by speech recognition methods such as DTW, HMM, VQ or ANN; this application does not specifically limit the method.
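As one concrete possibility (an assumption, not this application's prescribed recognizer), openai-whisper can serve as the ASR backend for step S502; the file name is hypothetical:

    import whisper

    model = whisper.load_model("base")
    result = model.transcribe("call_recording.wav")
    text_data = result["text"]  # transcript of the whole conversation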
S503: Obtain voice audio data and switching points from the audio data and the text data.
In a specific implementation, the voice audio data is the conversational voice of the n interlocutors that remains after the noise is removed from the audio data, and a switching point is a moment at which one of the n interlocutors hands the conversation over to another. For example, assume n = 2, i.e. user A and customer-service agent B; in one application scenario their dialogue audio could be rendered in text as:
(00:12) A: " I wants to look into telephone expenses "
(00:14) B: " may I ask you is to look into the machine telephone expenses "
(00:18) A: " yes "
(00:20) B: " just a moment,please "
(00:25) B: " there are also 25 yuan for your credit balance "
The number before A or B is the playback time of the recording. Here there are 3 switching points: 00:14, 00:18 and 00:20. Since step S502 has already performed speech recognition on the audio data, the switching point detection model can consider audio features and text features together and obtain more accurate switching points. The example is for illustration only and does not constitute a specific limitation.
In a specific embodiment, obtaining the voice audio data and the switching points from the audio data and the text data includes: feeding the audio data into a voice separation model to obtain the voice audio data, the voice separation model being a neural network trained on known audio samples and the corresponding known voice data samples; and, in parallel, feeding the audio data and the text data into a switching point detection model to obtain the switching points, the switching point detection model being a neural network trained on known audio samples, known text samples and the corresponding known switching point samples. The voice separation model may in particular be a VAD model that, following an event detection approach, learns the distinguishing features between voice and non-voice. The VAD result largely eliminates interfering factors such as ambient noise, non-target voices and equipment noise, so the separation result is more accurate than the silent/non-silent detection alone in the embodiment of Fig. 1.
Note that the speech recognition step and the voice separation step both take the audio data as input, so the two can run simultaneously, followed by the switching point detection step; alternatively, the speech recognition step can run first, and then the voice separation step and the switching point detection step run simultaneously. Tasks are parallelized as much as possible to minimize the time needed for the diarization process. The switching points are moments on the time axis of the audio data, and the voice audio data after noise removal keeps that same time axis, so the switching points can subsequently split the voice audio data into m single-speaker segments on the shared time axis. The switching point detection model and the voice separation model can also make full use of labeled business-scenario data for training, further improving the accuracy of switching point detection and voice separation.
It should be understood that the switching point detection model of this embodiment is a neural network trained on known audio samples, known text samples and the corresponding known switching point samples; it can consider audio features and text features together and detect the interlocutors' switching points more accurately, further improving diarization accuracy.
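A minimal sketch of the fusion idea, assuming hypothetical audio and text feature extractors (e.g. MFCC statistics around each candidate time point, and an embedding of the neighbouring transcript words) and substituting a simple logistic regression for the neural network this application trains:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def train_switch_point_detector(audio_feats: np.ndarray,
                                    text_feats: np.ndarray,
                                    is_switch: np.ndarray) -> LogisticRegression:
        """audio_feats: (N, Da), text_feats: (N, Dt), is_switch: (N,) in {0, 1}."""
        x = np.hstack([audio_feats, text_feats])  # fuse audio and text features
        return LogisticRegression(max_iter=1000).fit(x, is_switch)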
S504: Convert the voice audio data into m single-speaker segments according to the switching points.
In a specific implementation, each of the m segments contains the voice of only one interlocutor, m being a positive integer. Continuing the example, once the 3 switching points are obtained, the audio data can be converted into 4 segments, as shown in Fig. 3; each holds the voice of exactly one interlocutor, e.g. the audio of "I'd like to check my phone bill" or of "One moment, please. Your account balance is 25 yuan." In other words, the voice in a segment is only user A's or only agent B's. Because step S502 has already performed speech recognition on the audio data, the text data can also be split into m pieces of text, each corresponding to one segment, as shown in Fig. 3, ready for the subsequent clustering.
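A minimal sketch of splitting the transcript on the same switching points, assuming the ASR step yields word-level start times (an assumption; this application does not specify the alignment mechanism):

    def split_transcript(words, switch_points):
        """words: list of (start_sec, word); returns one text string per segment."""
        bounds = sorted(switch_points) + [float("inf")]
        texts, current, i = [], [], 0
        for start, word in words:
            while start >= bounds[i]:  # crossed a switching point: close segment
                texts.append(" ".join(current))
                current, i = [], i + 1
            current.append(word)
        texts.append(" ".join(current))
        return texts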
S505: Cluster the m single-speaker segments into n audio groups.
In a specific implementation, the segments in each of the n audio groups belong to the same interlocutor. Continuing the example, clustering the 4 segments yields 2 audio groups, as shown in Fig. 3: one group is the 2 segments corresponding to "I'd like to check my phone bill" and "Yes"; the other is the 2 segments corresponding to "May I ask, is it the bill for this number?" and "One moment, please. Your account balance is 25 yuan." The audio data containing the voices of n interlocutors is finally converted into n audio groups.
S506: Identify, from the text data, the interlocutor to whom each of the n audio groups belongs.
In a specific implementation, each group's interlocutor can be determined by keywords or key phrases in the piece of text corresponding to each segment. Continuing the example, the keywords "may I ask" and "one moment, please" identify the interlocutor of class 2 in Fig. 3 as the agent, and hence the interlocutor of class 1 as the user; alternatively, the keywords "I'd like" and "check my phone bill" identify the interlocutor of class 1 in Fig. 3 as the user, and hence the interlocutor of class 2 as the agent. This application does not specifically limit the choice of keywords.
Fig. 6 shows a flowchart of this audio processing method. As Fig. 6 illustrates, compared with the diarization method of the Fig. 1 embodiment, the training stage only needs to train a voice separation model and a switching point detection model on business-scenario data, greatly reducing training time and labor; the steps of the detection stage can run in parallel, so the time needed for diarization is greatly shortened; the voice separation model removes noise interference; and the switching point detection model combines text features with audio features, so the accuracy of switching point detection, and in turn of diarization, is significantly improved.
In the method above, audio data is obtained; the audio data is converted into corresponding text data by a speech recognition method; voice audio data and switching points are obtained from the audio data and the text data; the voice audio data is converted into m single-speaker segments according to the switching points; and the m segments are clustered into n audio groups, where the segments in each group belong to the same interlocutor, so that the identity of the voice in each of the n audio groups can be confirmed. The voice separation model largely eliminates interfering factors such as noise, improving diarization accuracy in complex environments, and the parallel steps of the detection stage simplify and speed up the diarization pipeline.
Fig. 7 shows an audio processing system provided by this application. The audio processing system 700 includes an acquiring unit 710, an obtaining unit 720, a converting unit 730 and a clustering unit 740, wherein:
the acquiring unit 710 is configured to obtain audio data, wherein the audio data contains noise and the voices produced when n interlocutors converse, n being a positive integer;
the obtaining unit 720 is configured to obtain voice audio data and switching points from the audio data, wherein the voice audio data is the audio that remains after the noise is removed from the audio data, and a switching point is a moment at which one of the n interlocutors hands the conversation over to another;
the converting unit 730 is configured to convert the voice audio data into m single-speaker segments according to the switching points, wherein each of the m segments contains the voice of only one interlocutor, m being a positive integer; and
the clustering unit 740 is configured to cluster the m segments into n audio groups, wherein the segments in each of the n audio groups belong to the same interlocutor.
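A minimal sketch of the four units of system 700 wired together as a plain class; the unit implementations are assumed callables, e.g. the sketches given earlier in this description:

    class AudioProcessingSystem:
        """Plain-Python sketch of the Fig. 7 system (illustrative, not the patent's code)."""

        def __init__(self, separate, detect_switches, split, cluster):
            self.separate = separate              # voice separation model
            self.detect_switches = detect_switches  # switching point detection model
            self.split = split                    # converting unit 730
            self.cluster = cluster                # clustering unit 740

        def process(self, audio_data):            # acquiring unit 710 supplies audio_data
            voice_audio = self.separate(audio_data)
            switch_points = self.detect_switches(audio_data)
            segments = self.split(voice_audio, switch_points)
            return self.cluster(segments)          # n audio groups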
In a specific implementation, the audio data contains noise and the voices produced when n interlocutors converse, n being a positive integer; as described for step S201, it may be a telephone recording or video soundtrack needing diarization, and noise covers any non-target voice, including ambient sound, silence, equipment noise and background speech.
In a specific implementation, the voice audio data is the conversational voice of the n interlocutors that remains after the noise is removed, and a switching point is a moment at which one interlocutor hands the conversation over to another. For example, with n = 2 (user A and agent B), their dialogue audio could be rendered in text as:
(00:12) A: "I'd like to check my phone bill."
(00:14) B: "May I ask, is it the bill for this number?"
(00:18) A: "Yes."
(00:20) B: "One moment, please."
(00:25) B: "Your account balance is 25 yuan."
The number before A or B is the playback time of the audio. Here there are 3 switching points: 00:14, 00:18 and 00:20. The dialogue is shown in text form only for clarity; the obtaining unit 720 performs no speech recognition, so the switching points are obtained purely from the dialogue audio of A and B, without knowing its content. The example is for illustration only and does not constitute a specific limitation.
In a specific embodiment, the obtaining unit 720 is configured to: feed the audio data into a voice separation model to obtain the voice audio data, the voice separation model being a neural network trained on known audio samples and the corresponding known voice data samples; and, in parallel, feed the audio data into a switching point detection model to obtain the switching points, the switching point detection model being a neural network trained on known audio samples and the corresponding known switching point samples. The voice separation model may in particular be a Voice Activity Detection (VAD) model that learns the distinguishing features between voice and non-voice following an event detection approach; its result largely eliminates interfering factors such as ambient noise, non-target voices and equipment noise, so separation is more accurate than the silent/non-silent detection alone in the Fig. 1 embodiment.
Since both models take the audio data as input, voice separation and switching point detection can run in parallel without interfering with each other, reducing diarization time. The switching points are moments on the audio's time axis, and the voice audio data after noise removal keeps that same time axis, so the switching points can subsequently split the voice audio data into m single-speaker segments. Both models can also be trained on labeled business-scenario data, further improving their accuracy.
In a specific implementation, each of the m segments contains the voice of only one interlocutor, m being a positive integer. Continuing the example, once the 3 switching points are obtained, the audio data can be converted into 4 segments, as shown in Fig. 3; each holds the voice of exactly one interlocutor, i.e. only user A's or only agent B's. Fig. 3 displays the segments in text form only to make their definition easier to grasp; in reality no speech recognition has been performed at this point, so a segment is only audio data and contains no text.
In a specific implementation, the segments in each of the n audio groups belong to the same interlocutor. Continuing the example, clustering the 4 segments yields the 2 audio groups shown in Fig. 3: one group is the 2 segments corresponding to "I'd like to check my phone bill" and "Yes"; the other is the 2 segments corresponding to "May I ask, is it the bill for this number?" and "One moment, please. Your account balance is 25 yuan." The audio data containing the voices of n interlocutors is finally converted into n audio groups.
In a specific embodiment, the system further includes a confirmation unit 750 configured to: after the m segments are clustered into n audio groups, convert the m segments into m corresponding pieces of text by a speech recognition method; and identify, from the m pieces of text, the interlocutor to whom each of the n audio groups belongs. Each group's interlocutor can be determined by keywords or key phrases in the text; for example, the keywords "may I ask" and "one moment, please" identify the interlocutor of class 2 as the agent and hence that of class 1 as the user, or the keywords "I'd like" and "check my phone bill" identify the interlocutor of class 1 as the user and hence that of class 2 as the agent. This application does not specifically limit the choice of keywords.
In a specific implementation, the m segments may be converted into the m pieces of text by speech recognition methods such as DTW, HMM, VQ or ANN; this application does not specifically limit the method.
Since the input to clustering is the m segments and the input to speech recognition can also be the m segments, clustering and speech recognition can likewise run in parallel, further speeding up diarization and improving the user experience. As Fig. 7 suggests, compared with the diarization system of the Fig. 1 embodiment, the audio processing system provided by this application only needs to train a voice separation model and a switching point detection model on business-scenario data, greatly reducing training time and labor; the detection-stage steps can run in parallel, greatly shortening diarization time; and the voice separation model removes noise interference, greatly improving accuracy.
In the system above, audio data is obtained; voice audio data and switching points are obtained from the audio data; the voice audio data is converted into m single-speaker segments according to the switching points; and the m segments are clustered into n audio groups, where the segments in each group belong to the same interlocutor, so that the identity of the voice in each of the n audio groups can be confirmed. The voice separation model largely eliminates interfering factors such as noise, improving diarization accuracy in complex environments, and the parallel detection-stage steps simplify and speed up the diarization pipeline.
Fig. 8 is another audio processing system provided by the present application, wherein audio processing system described in Fig. 8 and Fig. 7 institute The audio processing system shown the difference is that, carry out the process of speech recognition in advance, obtain the text feature of audio data, Switching point detection model is allowed to comprehensively consider audio frequency characteristics and text feature, the more accurate switching for detecting interlocutor Point.Audio processing system 800 shown in Fig. 8 includes acquiring unit 810, the first converting unit 820,830, second turns of obtaining unit Change unit 840, cluster cell 850, confirmation unit 860, wherein
The acquiring unit 810 is configured to obtain audio data, wherein the audio data contains noise and the voices generated when n speakers converse, n being a positive integer;
The first converting unit 820 is configured to convert the audio data into audio text by a speech recognition method;
The obtaining unit 830 is configured to obtain voice audio data and switching points according to the audio data and the audio text, wherein the voice audio data is the audio data remaining after the noise is removed, and a switching point is a conversation time point at which any one of the n speakers switches to another speaker;
The second converting unit 840 is configured to convert the voice audio data into m pieces of single-speaker audio data according to the voice audio data and the switching points, wherein each piece of single-speaker audio data contains the voice of only a single speaker, m being a positive integer;
The cluster unit 850 is configured to cluster the m pieces of single-speaker audio data to obtain n audio groups, wherein the single-speaker audio data in each of the n audio groups belongs to the same speaker;
The confirmation unit 860 is configured to confirm, according to the audio text, the speaker to which each of the n audio groups belongs.
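A minimal sketch of how the units of system 800 could be chained is given below. Every field is a pluggable callable standing in for a model or method described above; all names are illustrative assumptions, not part of the disclosure.

from dataclasses import dataclass
from typing import Callable

@dataclass
class AudioProcessingSystem800:
    """Wiring sketch for the units 810-860 of Fig. 8."""
    recognize: Callable  # first converting unit 820: audio -> audio text
    separate: Callable   # obtaining unit 830 (part 1): audio -> voice audio
    detect: Callable     # obtaining unit 830 (part 2): (audio, text) -> switching points
    split: Callable      # second converting unit 840: (voice, points) -> m segments
    cluster: Callable    # cluster unit 850: segments -> n groups
    identify: Callable   # confirmation unit 860: (groups, text) -> labelled groups

    def process(self, audio):
        text = self.recognize(audio)
        voice = self.separate(audio)
        points = self.detect(audio, text)
        segments = self.split(voice, points)
        groups = self.cluster(segments)
        return self.identify(groups, text)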
In a specific implementation, the audio data contains noise and the voices generated when n speakers converse, n being a positive integer. The audio data may specifically be an audio file that requires speaker separation, such as a telephone recording or a video recording, for example a recording of a call between a user and customer service, or a video recording of a meeting. Noise, as the concept is used in this application, refers to non-target voice that does not need to be separated in the subsequent speaker separation, and may specifically be ambient sound, silence, equipment noise produced by the recording device, and the like. It should be noted that ambient sound may itself contain speech: for example, in the audio data generated by a phone call between user A and customer service B in a noisy restaurant, the speech of other people in the restaurant, although it is voice, does not belong to the target voice that subsequently needs speaker separation and is therefore also counted as noise. It should be understood that the above example is for illustration only and does not constitute a specific limitation.
In a specific implementation, speech recognition methods such as DTW, HMM theory, VQ and ANN may be used to convert the audio data into corresponding text data; this application does not specifically limit the method.
In a specific implementation, the voice audio data is the voice generated when the n speakers converse, obtained after the noise is removed from the audio data, and a switching point is a conversation time point at which any one of the n speakers switches to another speaker. As an example, assume n = 2, that is, two speakers, user A and customer service B. In one application scenario, the dialogue audio data of A and B may be represented in text as:
(00:12) A: "I want to check my phone bill"
(00:14) B: "May I ask whether you are checking the bill for this phone?"
(00:18) A: "Yes"
(00:20) B: "Just a moment, please"
(00:25) B: "Your account balance is 25 yuan"
Here, the number before A or B is the playback time of the recording. In this case there are 3 switching points: 00:14, 00:18 and 00:20. It should be understood that, because step S502 has already performed speech recognition on the audio data, the switching point detection model can consider audio features and text features together and obtain more accurate switching points. The above example is for illustration only and does not constitute a specific limitation.
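The following sketch illustrates, for this example only, how switching points correspond to speaker changes in a time-stamped transcript; in the system itself they are produced by the switching point detection model rather than from ground-truth labels.

def switch_points(utterances):
    """Given (start_seconds, speaker) pairs ordered in time, return the
    times at which the speaker changes -- the switching points."""
    points = []
    for (_, prev), (t, cur) in zip(utterances, utterances[1:]):
        if cur != prev:
            points.append(t)
    return points

dialogue = [(12, "A"), (14, "B"), (18, "A"), (20, "B"), (25, "B")]
print(switch_points(dialogue))  # [14, 18, 20] -- the 3 points above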
In a specific embodiment, the obtaining unit 830 is configured to: input the audio data into a voice separation model to obtain the voice audio data, wherein the voice separation model is a model obtained by training a neural network with known audio samples and corresponding known voice data samples; and, at the same time, input the audio data and the audio text into a switching point detection model to obtain the switching points, wherein the switching point detection model is a model obtained by training a neural network with known audio samples, known audio text samples and corresponding known switching point samples. The voice separation model may specifically be a VAD (voice activity detection) model which, using an approach similar to event detection, learns the distinguishing features between voice and non-voice. It can be understood that the VAD result largely eliminates interference factors such as ambient noise, non-target voice and equipment noise; compared with the Fig. 1 embodiment, which only distinguishes silence from non-silence, the accuracy of the final separation result is higher.
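For illustration, the sketch below uses a toy energy-threshold VAD as a stand-in for the trained voice separation model; the patent's model is a trained neural network, and the frame length and threshold here are arbitrary placeholders. Non-speech frames are zeroed rather than deleted so that the result keeps the original time axis, the property the later steps rely on.

import numpy as np

def energy_vad(samples, sample_rate, frame_ms=30, threshold=0.02):
    """Mark each frame as speech (True) or non-speech (False) by RMS energy."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    return rms > threshold

def suppress_noise(samples, sample_rate, frame_ms=30):
    """Zero out non-speech frames, yielding 'voice audio data' whose
    time axis is identical to that of the input audio."""
    mask = energy_vad(samples, sample_rate, frame_ms)
    frame_len = int(sample_rate * frame_ms / 1000)
    out = samples.astype(float).copy()
    for i, speech in enumerate(mask):
        if not speech:
            out[i * frame_len:(i + 1) * frame_len] = 0.0
    return out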
It should be noted that, since the input data of the speech recognition step is the audio data and the input data of the voice separation step is also the audio data, the two steps can be carried out simultaneously, followed by the switching point detection step. Alternatively, the speech recognition step may be performed first, and then the voice separation step and the switching point detection step performed simultaneously. Tasks are executed in parallel as far as possible, minimizing the time required for the speaker separation process. Moreover, the switching points are conversation time points obtained on the time axis of the audio data, and the time axis of the voice audio data after noise removal is still identical to that of the audio data, so the switching points can subsequently be used, on that same time axis, to convert the voice audio data into m pieces of single-speaker audio data. In addition, the switching point detection model and the voice separation model can make full use of labelled data relevant to the business scenario for training, further improving the accuracy of switching point detection and voice separation.
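A sketch of the first scheduling described above, using Python's standard thread pool; asr, separate and detect_points are placeholders for the models in the text.

from concurrent.futures import ThreadPoolExecutor

def detect_stage(audio, asr, separate, detect_points):
    """Run speech recognition and voice separation concurrently (both
    consume the raw audio), then detect switching points from the results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(asr, audio)
        voice_future = pool.submit(separate, audio)
        text, voice = text_future.result(), voice_future.result()
    points = detect_points(audio, text)
    return text, voice, points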
It should be understood that the switching point detection model of this embodiment of the application is a model obtained by training a neural network with known audio samples, known audio text samples and corresponding known switching point samples. The switching point detection model can consider audio features and text features together and detect speaker switching points more accurately, further improving the accuracy of speaker separation.
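For illustration only, a detector consuming both feature streams might look like the following PyTorch sketch, which outputs a per-frame probability that a speaker switch occurs at that frame. The architecture and dimensions are assumptions; the application does not fix a network structure.

import torch
import torch.nn as nn

class SwitchPointDetector(nn.Module):
    """Illustrative switching-point scorer over concatenated audio and
    text features (dimensions are placeholder assumptions)."""

    def __init__(self, audio_dim=40, text_dim=128, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(audio_dim + text_dim, hidden,
                          batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, audio_feats, text_feats):
        # audio_feats: (batch, frames, audio_dim), e.g. filterbank features
        # text_feats:  (batch, frames, text_dim), e.g. embeddings of the
        #              recognized words, repeated over the frames they span
        x = torch.cat([audio_feats, text_feats], dim=-1)
        h, _ = self.rnn(x)
        return torch.sigmoid(self.head(h)).squeeze(-1)  # per-frame switch prob.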
In a specific implementation, each piece of single-speaker audio data contains the voice of only a single speaker, m being a positive integer. Continuing the above example, after the 3 switching points are obtained, the audio data can be converted into 4 pieces of single-speaker audio data, which may be as shown in Fig. 3. Each piece contains the voice of only one speaker: for example, a piece may be the audio data corresponding to the single sentence "I want to check my phone bill", or the audio data corresponding to the two sentences "Just a moment, please" and "Your account balance is 25 yuan". That is, the voice contained in a piece of single-speaker audio data is that of user A only or of customer service B only. It should be understood that the above example is for illustration only and does not constitute a specific limitation. Furthermore, since step S502 performed speech recognition on the audio data, the text data can also be converted into m pieces of single-speaker text, each corresponding to a piece of single-speaker audio data, as shown in Fig. 3, which facilitates the subsequent clustering.
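A minimal sketch of this conversion, assuming the switching points are given as second offsets on the shared time axis:

def split_by_switch_points(voice, sample_rate, switch_seconds):
    """Cut the voice audio (a sample array) at the switching points,
    giving len(switch_seconds) + 1 single-speaker segments; 3 points
    yield 4 segments, as in the example above."""
    bounds = [0] + [int(t * sample_rate) for t in switch_seconds] + [len(voice)]
    return [voice[a:b] for a, b in zip(bounds, bounds[1:])]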
In a specific implementation, the single-speaker audio data in each of the n audio groups belongs to the same speaker. Continuing the above example, after the 4 pieces of single-speaker audio data are obtained, they are clustered into 2 audio groups, which may specifically be as shown in Fig. 3: one audio group is the 2 pieces corresponding to "I want to check my phone bill" and "Yes", and the other audio group is the 2 pieces corresponding to "May I ask whether you are checking the bill for this phone?" and "Just a moment, please. Your account balance is 25 yuan". The audio data containing the voices of n speakers is thus finally converted into n audio groups.
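As an illustrative sketch (the application does not mandate a clustering algorithm), the segments could be grouped by agglomerative clustering over per-segment speaker embeddings; the embedding step, e.g. i-vectors or a neural speaker encoder, is assumed to exist upstream.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def group_segments(embeddings, n_speakers):
    """Cluster m per-segment speaker embeddings into n audio groups,
    returning the segment indices belonging to each group."""
    labels = AgglomerativeClustering(n_clusters=n_speakers).fit_predict(
        np.asarray(embeddings))
    groups = [[] for _ in range(n_speakers)]
    for idx, lab in enumerate(labels):
        groups[lab].append(idx)
    return groups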
In a specific implementation, the speaker of each audio group can be determined from the keywords or key phrases in the single-speaker text corresponding to each piece of single-speaker audio data. For example, continuing the above example, the speaker of category 2 shown in Fig. 3 may be determined to be customer service from the keywords "May I ask" and "Just a moment", from which the speaker of category 1 is determined to be the user; alternatively, the speaker of category 1 may be determined to be the user from the keywords "I want" and "check my bill", from which the speaker of category 2 is determined to be customer service. This application does not specifically limit how the keywords are set.
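A minimal sketch of keyword-based role confirmation, with illustrative keyword sets drawn from the example above (their choice is left open by the application):

AGENT_KEYWORDS = {"may i ask", "just a moment"}   # illustrative only
USER_KEYWORDS = {"i want", "check my"}            # illustrative only

def label_group(texts):
    """Assign a role to one audio group from the keywords found in the
    single-speaker texts of its segments."""
    joined = " ".join(texts).lower()
    if any(k in joined for k in AGENT_KEYWORDS):
        return "customer service"
    if any(k in joined for k in USER_KEYWORDS):
        return "user"
    return "unknown"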
It should be understood that, as shown in Fig. 8, compared with the speaker separation system of the Fig. 1 embodiment, the audio processing system provided by this application only needs to train a voice separation model and a switching point detection model on business-scenario data in the training stage, so the time, labor and other costs of the training stage are substantially reduced; the steps of the detection stage can be processed in parallel, so the time required for speaker separation is greatly shortened; moreover, the voice separation model removes the interference of noise, and the switching point detection model combines text features with audio features, so the accuracy of switching point detection improves significantly, which in turn improves the accuracy of speaker separation.
In the above system, audio data is obtained; the audio data is converted into corresponding text data by a speech recognition method; voice audio data and switching points are obtained according to the audio data and the text data; the voice audio data is converted into m pieces of single-speaker audio data according to the switching points; and the m pieces of single-speaker audio data are clustered into n audio groups, where the single-speaker audio data in each of the n audio groups belongs to the same speaker, so that the identity of the voice belonging to each audio group can be confirmed. Because the audio data passes through the voice separation model, interference factors such as noise are largely eliminated, improving the accuracy of speaker separation in complex environments; furthermore, multiple steps of the detection stage can be processed in parallel, which simplifies the speaker separation process and increases the separation speed.
Referring to Fig. 9, Fig. 9 is a schematic structural diagram of a server provided by this application. The server may implement the method of the Fig. 2 embodiment or the Fig. 5 embodiment. It should be noted that the data processing method provided by this application may be implemented in a cloud service cluster as shown in Fig. 9, or in a single compute node and storage node; this application does not specifically limit this. The cloud service cluster includes at least one compute node 910 and at least one storage node 920.
The compute node 910 includes one or more processors 911, a communication interface 912 and a memory 913, which may be connected to one another by a bus 914.
The processor 911 includes one or more general-purpose processors. A general-purpose processor may be any kind of device capable of processing electronic instructions, including a central processing unit (Central Processing Unit, CPU), a microprocessor, a microcontroller, a main processor, a controller, an ASIC (Application Specific Integrated Circuit), and so on. It may be a dedicated processor used only by the compute node 910, or may be shared with other compute nodes 910. The processor 911 executes various types of digitally stored instructions, such as software or firmware programs stored in the memory 913, which enable the compute node 910 to provide a wide variety of services. For example, the processor 911 can execute the code of modules such as clustering, voice separation and switching point detection, so as to perform at least part of the methods discussed herein.
The communication interface 912 may be a wired interface (for example, an Ethernet interface) for communicating with other compute nodes or users. When the communication interface 912 is a wired interface, it may use a protocol family over TCP/IP, for example the RAAS protocol, the Remote Function Call (RFC) protocol, the Simple Object Access Protocol (SOAP), the Simple Network Management Protocol (SNMP), the Common Object Request Broker Architecture (CORBA) protocol, distributed protocols, and so on.
The memory 913 may include volatile memory (Volatile Memory), such as random access memory (Random Access Memory, RAM); it may also include non-volatile memory (Non-Volatile Memory), such as read-only memory (Read-Only Memory, ROM), flash memory (Flash Memory), a hard disk drive (Hard Disk Drive, HDD) or a solid-state drive (Solid-State Drive, SSD); and it may also include a combination of the above kinds of memory.
The storage node 920 includes one or more storage controllers 921 and a storage array 922, which may be connected by a bus 924.
The storage controller 921 includes one or more general-purpose processors. A general-purpose processor may be any kind of device capable of processing electronic instructions, including a CPU, a microprocessor, a microcontroller, a main processor, a controller, an ASIC, and so on. It may be a dedicated processor used only by a single storage node 920, or may be shared with the compute node 910 or other storage nodes 920. It can be appreciated that, in this embodiment, each storage node includes one storage controller; in other embodiments, one storage controller may be shared by multiple storage nodes, which is not specifically limited here.
The storage array 922 may include multiple memories 923. A memory 923 may be non-volatile memory, such as ROM, flash memory, HDD or SSD, and may also be a combination of the above kinds of memory. For example, the storage array may be composed of multiple HDDs or multiple SSDs, or of HDDs and SSDs together. Under the coordination of the storage controller 921, the multiple memories are combined in different ways to form memory groups, providing higher storage performance than a single memory as well as data-redundancy technology. Optionally, the storage array 922 may include one or more data centers; multiple data centers may be located at the same site or at different sites, which is not specifically limited here. The storage array 922 may store program code and program data. The program code includes speech recognition module code, semantic understanding module code, command generation module code, and so on. The program data includes cluster module code data, voice separation module code data, switching point detection module code data, and so on, and may also include the voice separation model, the switching point detection model and the corresponding training sample set data; this application does not specifically limit this.
The above embodiments may be implemented wholly or partly by software, hardware, firmware or any combination thereof. When implemented in software, they may be implemented wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of this application are generated wholly or partly. The computer may be a general-purpose computer, a dedicated computer, a computer network or another programmable device. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired means (such as coaxial cable, optical fiber or digital subscriber line) or wireless means (such as infrared, radio or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, storage disk or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid-state drive, Solid State Disk (SSD)), and so on.
A person of ordinary skill in the art can understand that all or part of the processes of the above method embodiments may be accomplished by a computer program instructing related hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.
The foregoing is merely several embodiments of this application. A person skilled in the art may make various changes or modifications according to the disclosure of the application documents without departing from the spirit and scope of this application.

Claims (10)

1. A method of audio processing, characterized in that the method comprises:
obtaining audio data, wherein the audio data contains noise and voices generated when n speakers converse, n being a positive integer;
obtaining voice audio data and switching points according to the audio data, wherein the voice audio data is the voice generated when the n speakers converse, obtained after the noise is removed from the audio data, and a switching point is a conversation time point at which any one of the n speakers switches to another speaker;
converting the voice audio data into m pieces of single-speaker audio data according to the switching points, wherein each piece of single-speaker audio data contains the voice of only a single speaker, m being a positive integer; and
clustering the m pieces of single-speaker audio data to obtain n audio groups, wherein the single-speaker audio data in each of the n audio groups belongs to the same speaker.
2. The method according to claim 1, characterized in that, after the clustering of the m pieces of single-speaker audio data to obtain the n audio groups, the method further comprises:
converting, by a speech recognition method, the m pieces of single-speaker audio data into m corresponding pieces of text information; and
confirming, according to the m pieces of text information, the speaker to which each of the n audio groups belongs.
3. The method according to claim 1, characterized in that the obtaining of voice audio data and switching points according to the audio data comprises:
inputting the audio data into a voice separation model to obtain the voice audio data, wherein the voice separation model is a model obtained by training a neural network with known audio samples and corresponding known voice data samples; and, at the same time,
inputting the audio data into a switching point detection model to obtain the switching points, wherein the switching point detection model is a model obtained by training a neural network with known audio samples and corresponding known switching point samples.
4. A method of audio processing, characterized in that the method comprises:
obtaining audio data, wherein the audio data contains noise and voices generated when n speakers converse, n being a positive integer;
converting, by a speech recognition method, the audio data into corresponding text data;
obtaining voice audio data and switching points according to the audio data and the text data, wherein the voice audio data is the audio data remaining after the noise is removed, and a switching point is a conversation time point at which any one of the n speakers switches to another speaker;
converting the voice audio data into m pieces of single-speaker audio data according to the switching points, wherein each piece of single-speaker audio data contains the voice of only a single speaker, m being a positive integer;
clustering the m pieces of single-speaker audio data to obtain n audio groups, wherein the single-speaker audio data in each of the n audio groups belongs to the same speaker; and
confirming, according to the text data, the speaker to which each of the n audio groups belongs.
5. The method according to claim 4, characterized in that the obtaining of voice audio data and switching points according to the audio data and the text data comprises:
inputting the audio data into a voice separation model to obtain the voice audio data, wherein the voice separation model is a model obtained by training a neural network with known audio samples and corresponding known voice data samples; and, at the same time,
inputting the audio data and the text data into a switching point detection model to obtain the switching points, wherein the switching point detection model is a model obtained by training a neural network with known audio samples, known text samples and corresponding known switching point samples.
6. An audio processing system, characterized in that the system comprises:
an acquiring unit, configured to obtain audio data, wherein the audio data contains noise and voices generated when n speakers converse, n being a positive integer;
an obtaining unit, configured to obtain voice audio data and switching points according to the audio data, wherein the voice audio data is the audio data remaining after the noise is removed, and a switching point is a conversation time point at which any one of the n speakers switches to another speaker;
a converting unit, configured to convert the voice audio data into m pieces of single-speaker audio data according to the switching points, wherein each piece of single-speaker audio data contains the voice of only a single speaker, m being a positive integer; and
a cluster unit, configured to cluster the m pieces of single-speaker audio data to obtain n audio groups, wherein the single-speaker audio data in each of the n audio groups belongs to the same speaker.
7. An audio processing system, characterized in that the system comprises:
an acquiring unit, configured to obtain audio data, wherein the audio data contains noise and voices generated when n speakers converse, n being a positive integer;
a first converting unit, configured to convert the audio data into audio text by a speech recognition method;
an obtaining unit, configured to obtain voice audio data and switching points according to the audio data and the audio text, wherein the voice audio data is the audio data remaining after the noise is removed, and a switching point is a conversation time point at which any one of the n speakers switches to another speaker;
a second converting unit, configured to convert the voice audio data into m pieces of single-speaker audio data according to the voice audio data and the switching points, wherein each piece of single-speaker audio data contains the voice of only a single speaker, m being a positive integer;
a cluster unit, configured to cluster the m pieces of single-speaker audio data to obtain n audio groups, wherein the single-speaker audio data in each of the n audio groups belongs to the same speaker; and
a confirmation unit, configured to confirm, according to the audio text, the speaker to which each of the n audio groups belongs.
8. A server, characterized by comprising a processor and a memory, wherein the memory is configured to store instructions and the processor is configured to execute the instructions, and wherein the processor, when executing the instructions, performs the method according to any one of claims 1 to 3, 4 or 5.
9. A non-transitory computer storage medium, characterized in that the non-transitory computer storage medium stores a computer program, and when the computer program is executed by a computing device, the method according to any one of claims 1 to 3, 4 or 5 is implemented.
10. A computer program product, characterized in that, when the computer program product is read and executed by a computer, the method according to any one of claims 1 to 3, 4 or 5 is implemented.
CN201910453493.7A 2019-05-28 2019-05-28 Method, system and the relevant device of audio processing Pending CN110335621A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910453493.7A CN110335621A (en) 2019-05-28 2019-05-28 Method, system and the relevant device of audio processing
PCT/CN2019/130550 WO2020238209A1 (en) 2019-05-28 2019-12-31 Audio processing method, system and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910453493.7A CN110335621A (en) 2019-05-28 2019-05-28 Method, system and the relevant device of audio processing

Publications (1)

Publication Number Publication Date
CN110335621A true CN110335621A (en) 2019-10-15

Family

ID=68140272

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910453493.7A Pending CN110335621A (en) 2019-05-28 2019-05-28 Method, system and the relevant device of audio processing

Country Status (2)

Country Link
CN (1) CN110335621A (en)
WO (1) WO2020238209A1 (en)

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8543402B1 (en) * 2010-04-30 2013-09-24 The Intellisis Corporation Speaker segmentation in noisy conversational speech
US10614797B2 (en) * 2016-12-01 2020-04-07 International Business Machines Corporation Prefix methods for diarization in streaming mode
US10559311B2 (en) * 2017-03-31 2020-02-11 International Business Machines Corporation Speaker diarization with cluster transfer
CN107578770B (en) * 2017-08-31 2020-11-10 百度在线网络技术(北京)有限公司 Voice recognition method and device for network telephone, computer equipment and storage medium
CN108399923B (en) * 2018-02-01 2019-06-28 深圳市鹰硕技术有限公司 More human hairs call the turn spokesman's recognition methods and device
CN108922538B (en) * 2018-05-29 2023-04-07 平安科技(深圳)有限公司 Conference information recording method, conference information recording device, computer equipment and storage medium
CN110335621A (en) * 2019-05-28 2019-10-15 深圳追一科技有限公司 Method, system and the relevant device of audio processing

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1697472A (en) * 2004-05-14 2005-11-16 华为技术有限公司 Method and device for switching speeches
CN101120397A (en) * 2005-01-17 2008-02-06 日本电气株式会社 Speech recognition system, speech recognition method, and speech recognition program
CN101452704A (en) * 2007-11-29 2009-06-10 中国科学院声学研究所 Speaker clustering method based on information transfer
CN105161093A (en) * 2015-10-14 2015-12-16 科大讯飞股份有限公司 Method and system for determining the number of speakers
CN105895078A (en) * 2015-11-26 2016-08-24 乐视致新电子科技(天津)有限公司 Speech recognition method used for dynamically selecting speech model and device
US20180249124A1 (en) * 2017-02-28 2018-08-30 Cisco Technology, Inc. Group and conversational framing for speaker tracking in a video conference system
CN108766459A (en) * 2018-06-13 2018-11-06 北京联合大学 Target speaker method of estimation and system in a kind of mixing of multi-person speech
CN109300470A (en) * 2018-09-17 2019-02-01 平安科技(深圳)有限公司 Audio mixing separation method and audio mixing separator
CN109634692A (en) * 2018-10-23 2019-04-16 蔚来汽车有限公司 Vehicle-mounted conversational system and processing method and system for it
CN109741754A (en) * 2018-12-10 2019-05-10 上海思创华信信息技术有限公司 A kind of conference voice recognition methods and system, storage medium and terminal
CN109545228A (en) * 2018-12-14 2019-03-29 厦门快商通信息技术有限公司 A kind of end-to-end speaker's dividing method and system

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020238209A1 (en) * 2019-05-28 2020-12-03 深圳追一科技有限公司 Audio processing method, system and related device
CN110930989A (en) * 2019-11-27 2020-03-27 深圳追一科技有限公司 Speech intention recognition method and device, computer equipment and storage medium
CN110930989B (en) * 2019-11-27 2021-04-06 深圳追一科技有限公司 Speech intention recognition method and device, computer equipment and storage medium
CN111243595A (en) * 2019-12-31 2020-06-05 京东数字科技控股有限公司 Information processing method and device
CN112242137A (en) * 2020-10-15 2021-01-19 上海依图网络科技有限公司 Training of human voice separation model and human voice separation method and device
CN111968679A (en) * 2020-10-22 2020-11-20 深圳追一科技有限公司 Emotion recognition method and device, electronic equipment and storage medium
CN112562644A (en) * 2020-12-03 2021-03-26 云知声智能科技股份有限公司 Customer service quality inspection method, system, equipment and medium based on human voice separation
CN112669855A (en) * 2020-12-17 2021-04-16 北京沃东天骏信息技术有限公司 Voice processing method and device
CN112735384A (en) * 2020-12-28 2021-04-30 科大讯飞股份有限公司 Turning point detection method, device and equipment applied to speaker separation
CN112289323A (en) * 2020-12-29 2021-01-29 深圳追一科技有限公司 Voice data processing method and device, computer equipment and storage medium
CN112802498A (en) * 2020-12-29 2021-05-14 深圳追一科技有限公司 Voice detection method and device, computer equipment and storage medium
CN112289323B (en) * 2020-12-29 2021-05-28 深圳追一科技有限公司 Voice data processing method and device, computer equipment and storage medium
CN112802498B (en) * 2020-12-29 2023-11-24 深圳追一科技有限公司 Voice detection method, device, computer equipment and storage medium
CN112966082A (en) * 2021-03-05 2021-06-15 北京百度网讯科技有限公司 Audio quality inspection method, device, equipment and storage medium
CN113593578A (en) * 2021-09-03 2021-11-02 北京紫涓科技有限公司 Conference voice data acquisition method and system

Also Published As

Publication number Publication date
WO2020238209A1 (en) 2020-12-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20191015)