CN110491392A - Audio data cleaning method, apparatus and device based on speaker identity - Google Patents
Audio data cleaning method, apparatus and device based on speaker identity
- Publication number: CN110491392A
- Application number: CN201910809574.6A
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G — PHYSICS
- G10 — MUSICAL INSTRUMENTS; ACOUSTICS
- G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00 — Speaker identification or verification techniques
- G10L17/02 — Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
- G10L17/06 — Decision making techniques; Pattern matching strategies
- G10L17/14 — Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
- G10L17/22 — Interactive procedures; Man-machine interfaces
Abstract
The embodiments of the present application disclose an audio data cleaning method, apparatus and device based on speaker identity, comprising: decoding acquired original audio data; separating out the effective audio in the decoded audio data, the effective audio being the segments that contain human speech; segmenting the effective audio to obtain several segmented audio sections, each of which corresponds to a single speaker; performing speaker separation on the segmented audio sections to separate out first audio data belonging to the same person; and performing voiceprint clustering on the corresponding first audio data from multiple pieces of original audio to obtain second audio data, which is then labeled. By segmenting the audio data so that each resulting audio segment corresponds to a single speaker, and then clustering the segments so that those belonging to the same person are merged, the present application separates the speech of different speakers out of the audio at a fine-grained level.
Description
Technical field
This application relates to the field of audio data processing technology, and in particular to an audio data cleaning method, apparatus and device based on speaker identity.
Background art
At the current stage of information technology development, large volumes of voice data are generated every day, playing a very important role in service applications across many fields such as telephone recording, network telephony and WeChat calls. However, because the ways in which these data are acquired vary widely, the recordings often contain noise, long silences, mixed foreign languages, or multiple speakers. Such data cannot be used directly and must first go through a data cleaning process.
Benefiting from the rapid development of information technologies such as cloud computing, big data and deep learning in recent years, voiceprint technology has matured considerably: recognition accuracy is relatively high and the range of applications is broad. Besides the most common voiceprint recognition technology, new techniques such as language identification, speaker separation and voiceprint clustering have emerged one after another, providing further technical conditions for fine-grained data cleaning. However, the existing technology still lacks a way to separate out the speech of different speakers in an audio recording at a fine-grained level.
Summary of the invention
The embodiments of the present application provide an audio data cleaning method, apparatus and device based on speaker identity. By segmenting the audio data so that each resulting audio segment corresponds to a single speaker, and then clustering the segments so that those belonging to the same person are merged, the speech of different speakers is separated out of the audio at a fine-grained level.
In view of this, a first aspect of the present application provides an audio data cleaning method based on speaker identity, the method comprising:
decoding acquired original audio data;
separating out the effective audio in the decoded audio data, the effective audio being the segments that contain human speech;
segmenting the effective audio to obtain several segmented audio sections, each section corresponding to a single speaker;
performing speaker separation on the segmented audio sections to separate out first audio data belonging to the same person; and
performing voiceprint clustering on the corresponding first audio data from multiple pieces of original audio to obtain second audio data, and labeling the second audio data, the second audio data comprising multiple pieces of first audio data belonging to the same person.
Preferably, after separating out the effective audio in the decoded audio data, the method further comprises: performing language identification on the effective audio and separating out the effective audio of the required language.
Preferably, performing language identification on the effective audio and extracting the audio of the required language comprises: inputting the effective audio into a language identification model, identifying the required language, and extracting the audio of the corresponding language.
Preferably, segmenting the effective audio to obtain several segmented audio sections, each section corresponding to a single speaker, comprises: inputting the separated effective audio into a speaker-separation model for segmentation, obtaining several segmented audio sections, each of which corresponds to a single speaker.
Preferably, performing speaker separation on the segmented audio sections to separate out first audio data belonging to the same person comprises: performing speaker separation on the segmented audio sections with a clustering algorithm and separating out first audio data belonging to the same person, the audio segments belonging to the same person being grouped into one class.
Preferably, performing voiceprint clustering on the corresponding first audio data from multiple pieces of original audio to obtain second audio data, and labeling the second audio data, the second audio data comprising multiple pieces of first audio data belonging to the same person, specifically comprises: performing voiceprint clustering on the corresponding first audio data from multiple pieces of original audio according to a hierarchical clustering algorithm, obtaining second audio data, and labeling the second audio data, the second audio data comprising multiple pieces of first audio data belonging to the same person.
A second aspect of the present application provides an audio data cleaning apparatus based on speaker identity, the apparatus comprising:
a decoding module for decoding acquired original audio data;
an audio separation module for separating out the effective audio in the decoded audio data, the effective audio being the segments that contain human speech;
a segmentation module for segmenting the effective audio to obtain several segmented audio sections, each section corresponding to a single speaker;
a speaker separation module for performing speaker separation on the segmented audio sections and separating out first audio data belonging to the same person; and
a voiceprint clustering module for performing voiceprint clustering on the corresponding first audio data from multiple pieces of original audio, obtaining second audio data, and labeling the second audio data, the second audio data comprising multiple pieces of first audio data belonging to the same person.
Preferably, the apparatus further comprises a language separation module for performing language identification on the effective audio and separating out the effective audio of the required language.
A third aspect of the present application provides an audio data cleaning device based on speaker identity, the device comprising a processor and a memory: the memory is configured to store program code and transfer the program code to the processor; and the processor is configured to execute, according to instructions in the program code, the steps of the audio data cleaning method based on speaker identity described in the first aspect above.
A fourth aspect of the present application provides a computer-readable storage medium for storing program code, the program code being configured to execute the method described in the first aspect above.
As can be seen from the above technical solutions, the embodiments of the present application have the following advantages:
The present application provides an audio data cleaning method based on speaker identity. Acquired original audio data is first decoded; the effective audio in the decoded audio data is separated out, the effective audio being the segments that contain human speech; the effective audio is segmented to obtain several segmented audio sections, each of which corresponds to a single speaker; speaker separation is performed on the segmented audio sections to separate out first audio data belonging to the same person; and voiceprint clustering is performed on the corresponding first audio data from multiple pieces of original audio to obtain second audio data, which is then labeled. By segmenting the audio data so that each resulting audio segment corresponds to a single speaker, and then clustering the segments so that those belonging to the same person are merged, the present application separates the speech of different speakers out of the audio at a fine-grained level.
Brief description of the drawings
Fig. 1 is a method flowchart of one embodiment of an audio data cleaning method based on speaker identity of the present application;
Fig. 2 is a method flowchart of another embodiment of the audio data cleaning method based on speaker identity of the present application;
Fig. 3 is a schematic diagram of speaker separation in one embodiment of the audio data cleaning method based on speaker identity of the present application;
Fig. 4 is a schematic diagram of the voiceprint clustering process in one embodiment of the audio data cleaning method based on speaker identity of the present application;
Fig. 5 is an apparatus structure diagram of one embodiment of an audio data cleaning apparatus based on speaker identity of the present application.
Detailed description of embodiments
The embodiments of the present application provide an audio data cleaning method, apparatus and device based on speaker identity. By segmenting the audio data so that each resulting audio segment corresponds to a single speaker, and then clustering the segments so that those belonging to the same person are merged, the speech of different speakers is separated out of the audio at a fine-grained level.
To help those skilled in the art better understand the solution of the present application, the technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of them. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
For ease of understanding, reference is made to Fig. 1, which is a method flowchart of one embodiment of an audio data cleaning method based on speaker identity of the present application. As shown in Fig. 1, the method comprises:
101. Decode the acquired original audio data.
It should be noted that audio is usually encoded (compressed) during transmission in order to reduce its size, which makes it easier to store, transmit and exchange. Audio data obtained in real business scenarios may arrive in pcap format, which cannot be processed directly; the pcap file must first be decoded into a processable audio format, such as wav.
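The decoding step is not tied to any particular toolchain, and the patent does not specify a decoder. As an illustrative sketch only, the following snippet wraps raw 16-bit PCM — such as might be extracted from a pcap capture's voice payloads — in a wav container using Python's standard library; the 8 kHz sample rate and the `decoded.wav` filename are assumptions for the example.

```python
import wave

def pcm_to_wav(pcm_bytes, wav_path, sample_rate=8000, channels=1, sample_width=2):
    """Wrap raw 16-bit PCM (e.g. extracted from a capture) in a WAV container."""
    with wave.open(wav_path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)   # 2 bytes = 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(pcm_bytes)

# example: 0.1 s of silence at 8 kHz, 16-bit mono (800 frames of 2 bytes each)
pcm_to_wav(b"\x00\x00" * 800, "decoded.wav")
```

Actual pcap handling would additionally require parsing the capture and reassembling the audio stream, which is outside the scope of this sketch.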
102. Separate out the effective audio in the decoded audio data, the effective audio being the segments that contain human speech.
It should be noted that the decoded audio data may contain both segments in which a speaker is talking and non-speech segments. The non-speech segments may be background noise, music, or other interfering fragments. To provide a more reliable signal for subsequent processing stages, the background noise, music or interfering fragments can be filtered out, or removed by other means. The speech segments can also be extracted directly from the audio data, for example by detecting the start and end points of the segments that contain speech and extracting the effective speech; the extracted effective speech can further be filtered to remove any residual noise, yielding a cleaner effective audio signal.
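The patent does not commit to a particular method for detecting the start and end points of speech. As a minimal sketch, a simple frame-energy threshold can stand in for that detection: frames whose energy exceeds a threshold are kept as effective audio. The 160-sample frame length and the threshold value here are illustrative assumptions, not the patent's parameters.

```python
def frame_energies(samples, frame_len=160):
    """Mean squared energy of each non-overlapping frame of a PCM sample list."""
    return [sum(s * s for s in samples[i:i + frame_len]) / frame_len
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def voiced_frames(samples, frame_len=160, threshold=1e4):
    """Indices of frames whose energy exceeds the threshold (kept as effective audio)."""
    return [i for i, e in enumerate(frame_energies(samples, frame_len)) if e > threshold]

# toy signal: two silent frames, one loud frame, one silent frame
sig = [0] * 320 + [2000, -2000] * 80 + [0] * 160
print(voiced_frames(sig))  # → [2]
```

A production system would more likely use a trained voice-activity-detection model, but the thresholding structure is the same.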
103. Segment the effective audio to obtain several segmented audio sections, each of which corresponds to a single speaker.
It should be noted that, in order to separate the speech of different people, the effective speech can first be segmented: based on the characteristics of speech transitions, the turning points at which the speaker changes are located, and the audio is cut at those turning points.
104. Perform speaker separation on the segmented audio sections and separate out the first audio data belonging to the same person.
It should be noted that speaker separation is performed on the segmented audio sections to separate out the first audio data belonging to the same person. The algorithm used for the separation can be the HAC (hierarchical agglomerative clustering) algorithm, though other clustering algorithms can also be used; the audio segments belonging to the same speaker are thereby gathered together from the segmented sections, separating the audio of different speakers.
105. Perform voiceprint clustering on the corresponding first audio data from multiple pieces of original audio to obtain second audio data, and label the second audio data, the second audio data comprising multiple pieces of first audio data belonging to the same person.
It should be noted that processing multiple pieces of original audio through the above steps yields multiple pieces of first audio data, which may belong to the same person or to different people. The pieces of first audio data therefore need to be clustered by voiceprint so that the audio data belonging to the same person is merged, yielding the second audio data; the second audio data is then labeled, for example as speaker a or voiceprint a.
By segmenting the audio data so that each resulting audio segment corresponds to a single speaker, and then clustering the segments so that those belonging to the same person are merged, the embodiments of the present application separate the speech of different speakers out of the audio at a fine-grained level.
For ease of understanding, reference is made to Fig. 2, which is a method flowchart of another embodiment of the audio data cleaning method based on speaker identity of the present application. As shown in Fig. 2, the method specifically comprises:
201. Decode the acquired original audio data.
It should be noted that audio is usually encoded (compressed) during transmission in order to reduce its size, which makes it easier to store, transmit and exchange. Audio data obtained in real business scenarios may arrive in pcap format, which cannot be processed directly; the pcap file must first be decoded into a processable audio format, such as wav.
202. Separate out the effective audio in the decoded audio data, the effective audio being the segments that contain human speech.
It should be noted that the decoded audio data may contain both segments in which a speaker is talking and non-speech segments. The non-speech segments may be background noise, music, or other interfering fragments. To provide a more reliable signal for subsequent processing stages, the background noise, music or interfering fragments can be filtered out, or removed by other means. The speech segments can also be extracted directly from the audio data, for example by detecting the start and end points of the segments that contain speech and extracting the effective speech; the extracted effective speech can further be filtered to remove any residual noise, yielding a cleaner effective audio signal.
203. Perform language identification on the effective audio and separate out the effective audio of the required language.
It should be noted that the effective audio may be Chinese speech or non-Chinese speech, so it can be screened for the required language and the needed language selected.
In a preferred embodiment, the effective audio is input into a language identification model to identify the required language, and the audio of the corresponding language is then extracted.
It should be noted that the language identification model can be a model that includes a neural network. Such a model can be trained by extracting the speech features of Chinese and non-Chinese speech and feeding speech samples of both Chinese and non-Chinese into the model. For example, when language separation is needed, the effective audio to be tested is input into the language identification model, its features are extracted, and the model scores the effective audio; the effective audio is then classified by score, determining whether it is Chinese audio or non-Chinese audio.
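The model itself is not specified beyond "a model that includes a neural network", so the following sketch assumes such a model has already produced a per-segment Chinese-probability score, and shows only the score-based classification and filtering step described above. The segment names, the scores, and the 0.5 threshold are hypothetical.

```python
def classify_language(scores, threshold=0.5):
    """Label each segment 'zh' or 'non-zh' from a model's Chinese-probability score."""
    return ["zh" if s >= threshold else "non-zh" for s in scores]

def keep_language(segments, scores, wanted="zh", threshold=0.5):
    """Retain only the segments whose predicted language matches the required one."""
    labels = classify_language(scores, threshold)
    return [seg for seg, lab in zip(segments, labels) if lab == wanted]

segs = ["seg0", "seg1", "seg2"]
print(keep_language(segs, [0.9, 0.2, 0.7]))  # → ['seg0', 'seg2']
```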
204. Segment the effective audio to obtain several segmented audio sections, each of which corresponds to a single speaker.
It should be noted that, in order to separate the speech of different people, the effective speech can first be segmented: based on the characteristics of speech transitions, the turning points at which the speaker changes are located, and the audio is cut at those turning points.
In a preferred embodiment, the separated effective audio is input into a speaker-separation model for segmentation, obtaining several segmented audio sections, each of which corresponds to a single speaker.
It should be noted that the speaker-separation model can be a model that includes a neural network, such as a deep neural network (DNN) model. The features of speaker-change turning points can be extracted from a large amount of audio data and fed into the DNN model for training, yielding the speaker-separation model. The effective audio to be tested is input into the DNN model, the turning points at which the speaker changes are extracted, and the effective audio is cut at those turning points, producing several segmented audio sections, each of which corresponds to a single speaker.
205. Perform speaker separation on the segmented audio sections and separate out the first audio data belonging to the same person.
It should be noted that the segmented audio sections are clustered by voiceprint. The clustering algorithms that can be used include the HAC clustering algorithm, though other clustering algorithms can also be used; the audio segments belonging to the same speaker are thereby gathered together from the segmented sections, separating the audio of different speakers.
In a specific embodiment, Fig. 3 shows a schematic diagram of the clustering process in one embodiment of the audio data cleaning method based on speaker identity. The audio sample used in the figure contains the speech of two speakers, a deep-learning DNN model serves as the speaker-separation model, and the clustering algorithm used is the HAC clustering algorithm.
The original audio is first input into the DNN-based speaker-separation model, i.e., the audio sample to be tested is input into the DNN model; the turning points at which the speaker changes are extracted, and the effective audio is cut at those turning points, producing several segmented audio sections, each of which corresponds to a single speaker. The HAC clustering algorithm then merges the segmented audio sections in temporal order, yielding the speech samples belonging to the two speakers.
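The HAC step above can be sketched as average-linkage agglomerative clustering over per-segment voiceprint embeddings. This is an illustrative toy implementation under stated assumptions, not the patent's own code: the 2-D embeddings and the distance threshold are invented so that segments 0 and 2 belong to one voice and segments 1 and 3 to another.

```python
def euclid(a, b):
    """Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def hac(embeddings, max_dist):
    """Average-linkage agglomerative clustering; merging stops once the closest
    pair of clusters is farther apart than max_dist."""
    clusters = [[i] for i in range(len(embeddings))]

    def avg_dist(c1, c2):
        return sum(euclid(embeddings[i], embeddings[j])
                   for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        # find the closest pair of clusters
        d, a, b = min((avg_dist(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > max_dist:
            break
        clusters[a].extend(clusters[b])
        del clusters[b]
    return [sorted(c) for c in clusters]

# two speakers: segments 0 and 2 near (0, 0); segments 1 and 3 near (10, 10)
embs = [(0.0, 0.0), (10.0, 10.0), (0.5, 0.0), (10.0, 10.5)]
print(hac(embs, max_dist=2.0))  # → [[0, 2], [1, 3]]
```

In practice the embeddings would come from a voiceprint model, and a library implementation (e.g. an off-the-shelf hierarchical clustering routine) would replace this O(n³) loop.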
206. Perform voiceprint clustering on the corresponding first audio data from multiple pieces of original audio to obtain second audio data, and label the second audio data, the second audio data comprising multiple pieces of first audio data belonging to the same person.
It should be noted that processing multiple pieces of original audio through the above steps yields multiple pieces of first audio data, which may belong to the same person or to different people. The pieces of first audio data therefore need to be clustered by voiceprint so that the audio data belonging to the same person is merged, yielding the second audio data; the second audio data is then labeled, for example as speaker a or voiceprint a.
In a preferred implementation, voiceprint clustering is performed on the corresponding first audio data from multiple pieces of original audio according to a hierarchical clustering algorithm, obtaining second audio data, which is then labeled; the second audio data comprises multiple pieces of first audio data belonging to the same person.
It should be noted that the hierarchical clustering algorithm can be CURE (Clustering Using REpresentatives), or another hierarchical clustering algorithm such as AGNES (AGglomerative NESting), BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), ROCK (RObust Clustering using linKs) or CHAMELEON.
In a specific embodiment, Fig. 4 shows a schematic diagram of the voiceprint clustering process in one embodiment of the audio data cleaning method based on speaker identity, using four pieces of first audio data and the CURE hierarchical clustering algorithm. In practice there may of course be more pieces of first audio data, and other hierarchical clustering algorithms may also be used.
The four pieces of audio data are first clustered with the CURE hierarchical clustering algorithm: the similarity between the different audios is computed, audios whose similarity exceeds a preset threshold are grouped into one class, and each class is labeled, for example as speaker A and speaker B.
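The threshold-and-label step just described (similarity above a preset threshold → same class → label as speaker A, speaker B, ...) can be sketched with a union-find pass over a pairwise similarity matrix. This is an illustration of the grouping logic only, not CURE itself; the matrix values and the 0.5 threshold are invented for the example.

```python
def group_by_similarity(n, sim, threshold):
    """Union-find grouping: join any two items whose pairwise similarity
    exceeds the threshold, then label each group Speaker A, Speaker B, ..."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] > threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    # label groups in order of their first member
    return {f"Speaker {chr(ord('A') + k)}": members
            for k, members in enumerate(sorted(groups.values()))}

# four pieces of first audio data: 0 & 2 and 1 & 3 share a voice
sim = [[1.0, 0.1, 0.9, 0.2],
       [0.1, 1.0, 0.2, 0.8],
       [0.9, 0.2, 1.0, 0.1],
       [0.2, 0.8, 0.1, 1.0]]
print(group_by_similarity(4, sim, 0.5))
# → {'Speaker A': [0, 2], 'Speaker B': [1, 3]}
```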
For ease of understanding, reference is made to Fig. 5, which is a structure diagram of one embodiment of an audio data cleaning apparatus based on speaker identity of the present application. As shown in Fig. 5, the apparatus specifically comprises:
A decoding module 501 for decoding acquired original audio data.
It should be noted that audio is usually encoded (compressed) during transmission in order to reduce its size, which makes it easier to store, transmit and exchange. Audio data obtained in real business scenarios may arrive in pcap format, which cannot be processed directly; the pcap file must first be decoded into a processable audio format, such as wav.
An audio separation module 502 for separating out the effective audio in the decoded audio data, the effective audio being the segments that contain human speech.
It should be noted that the decoded audio data may contain both segments in which a speaker is talking and non-speech segments. The non-speech segments may be background noise, music, or other interfering fragments. To provide a more reliable signal for subsequent processing stages, the background noise, music or interfering fragments can be filtered out, or removed by other means. The speech segments can also be extracted directly from the audio data, for example by detecting the start and end points of the segments that contain speech and extracting the effective speech; the extracted effective speech can further be filtered to remove any residual noise, yielding a cleaner effective audio signal.
A segmentation module 503 for segmenting the effective audio to obtain several segmented audio sections, each of which corresponds to a single speaker.
It should be noted that, in order to separate the speech of different people, the effective speech can first be segmented: based on the characteristics of speech transitions, the turning points at which the speaker changes are located, and the audio is cut at those turning points.
A speaker separation module 504 for performing speaker separation on the segmented audio sections and separating out the first audio data belonging to the same person.
It should be noted that speaker separation is performed on the segmented audio sections to separate out the first audio data belonging to the same person. The algorithm used for the separation can be the HAC clustering algorithm, though other clustering algorithms can also be used; the audio segments belonging to the same speaker are thereby gathered together from the segmented sections, separating the audio of different speakers.
A voiceprint clustering module 505 for performing voiceprint clustering on the corresponding first audio data from multiple pieces of original audio, obtaining second audio data, and labeling the second audio data, the second audio data comprising multiple pieces of first audio data belonging to the same person.
It should be noted that processing multiple pieces of original audio through the above steps yields multiple pieces of first audio data, which may belong to the same person or to different people. The pieces of first audio data therefore need to be clustered by voiceprint so that the audio data belonging to the same person is merged, yielding the second audio data; the second audio data is then labeled, for example as speaker a or voiceprint a.
By segmenting the audio data so that each resulting audio segment corresponds to a single speaker, and then clustering the segments so that those belonging to the same person are merged, the embodiments of the present application separate the speech of different speakers out of the audio at a fine-grained level.
The embodiments of the present application further provide another embodiment of the audio data cleaning apparatus based on speaker identity, which specifically includes:
a decoding module, configured to decode the acquired original audio data;
an audio separation module, configured to separate out effective audio from the decoded audio data, the effective audio being segments that contain human voice;
a language separation module, configured to perform language identification on the effective audio and separate out effective audio of the required language;
a segmentation module, configured to segment the effective audio to obtain several segmented audio pieces, each segment in the several segmented audio pieces corresponding to a single person;
a speaker separation module, configured to perform speaker separation on the several segmented audio pieces and separate out first audio data belonging to the same person; and
a voiceprint clustering module, configured to perform voiceprint clustering on the corresponding first audio data from multiple pieces of original audio to obtain second audio data, and to label the second audio data, where the second audio data includes multiple pieces of first audio data belonging to the same person.
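The modules listed above form a linear chain: decode, separate effective audio, identify the language, segment, separate speakers, then cluster voiceprints. As a sketch, each module can be modeled as a callable stage in a pipeline; the class and stage names are illustrative assumptions, and the concrete models behind each stage are not shown:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class CleaningPipeline:
    """Chains the embodiment's modules in order. Each stage is a
    callable taking the previous stage's output, so concrete models
    (decoder, VAD, language ID, segmentation, clustering) can be
    swapped in without changing the pipeline itself."""
    stages: List[Callable] = field(default_factory=list)

    def add(self, stage: Callable) -> "CleaningPipeline":
        self.stages.append(stage)
        return self  # allow fluent chaining

    def run(self, audio):
        for stage in self.stages:
            audio = stage(audio)
        return audio
```

For example, `CleaningPipeline().add(decode).add(separate_effective)...` would mirror the module order of the apparatus, with each placeholder replaced by a real model.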
The embodiments of the present application further provide an audio data cleaning device based on speaker identity. The device includes a processor and a memory: the memory is configured to store program code and transfer the program code to the processor; the processor is configured to execute, according to instructions in the program code, the steps of the audio data cleaning method based on speaker identity of the above embodiments.
The embodiments of the present application further provide a computer-readable storage medium configured to store program code, the program code being used to execute any implementation of the audio data cleaning method based on speaker identity of the foregoing embodiments.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may refer to the corresponding processes in the foregoing method embodiments; details are not repeated here.
The terms "first", "second", and the like in the description of the present application and the above drawings are used to distinguish between similar objects, not to describe a particular order or sequence. It should be understood that data so termed may be interchanged where appropriate, so that the embodiments of the present application described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "comprising" and "having", and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to the process, method, product, or device.
It should be understood that in the present application, "at least one (item)" means one or more, and "multiple" means two or more.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into modules is only a division by logical function, and other divisions are possible in actual implementation; for instance, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or of other forms.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and is sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part thereof contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.
Claims (10)
1. An audio data cleaning method based on speaker identity, characterized by comprising:
decoding acquired original audio data;
separating out effective audio from the decoded audio data, the effective audio being segments that contain human voice;
segmenting the effective audio to obtain several segmented audio pieces, each segment in the several segmented audio pieces corresponding to a single person;
performing speaker separation on the several segmented audio pieces to separate out first audio data belonging to the same person; and
performing voiceprint clustering on the corresponding first audio data from multiple pieces of original audio to obtain second audio data, and labeling the second audio data, wherein the second audio data comprises multiple pieces of first audio data belonging to the same person.
2. The audio data cleaning method based on speaker identity according to claim 1, characterized in that, after separating out the effective audio from the decoded audio data, the method further comprises:
performing language identification on the effective audio, and separating out effective audio of a required language.
3. The method according to claim 2, characterized in that performing language identification on the effective audio and extracting the audio of the required language comprises:
inputting the effective audio into a language identification model, identifying the required language, and extracting the audio of the corresponding language.
4. The audio data cleaning method based on speaker identity according to claim 1, characterized in that segmenting the effective audio to obtain several segmented audio pieces, each segment corresponding to a single person, comprises:
inputting the separated effective audio into a speaker separation model for segmentation to obtain several segmented audio pieces, each segment in the several audio pieces corresponding to a single person.
5. The audio data cleaning method based on speaker identity according to claim 1, characterized in that performing speaker separation on the several segmented audio pieces to separate out the first audio data belonging to the same person comprises:
performing speaker separation on the several segmented audio pieces using a clustering algorithm to separate out the first audio data belonging to the same person, the audio fragments belonging to the same person being grouped into one class.
6. The audio data cleaning method based on speaker identity according to claim 1, characterized in that performing voiceprint clustering on the corresponding first audio data from multiple pieces of original audio to obtain second audio data and labeling the second audio data, the second audio data comprising multiple pieces of first audio data belonging to the same person, specifically comprises:
performing voiceprint clustering on the corresponding first audio data from the multiple pieces of original audio according to a hierarchical clustering algorithm to obtain the second audio data, and labeling the second audio data, wherein the second audio data comprises multiple pieces of first audio data belonging to the same person.
7. An audio data cleaning apparatus based on speaker identity, characterized by comprising:
a decoding module, configured to decode acquired original audio data;
an audio separation module, configured to separate out effective audio from the decoded audio data, the effective audio being segments that contain human voice;
a segmentation module, configured to segment the effective audio to obtain several segmented audio pieces, each segment in the several segmented audio pieces corresponding to a single person;
a speaker separation module, configured to perform speaker separation on the several segmented audio pieces to separate out first audio data belonging to the same person; and
a voiceprint clustering module, configured to perform voiceprint clustering on the corresponding first audio data from multiple pieces of original audio to obtain second audio data, and to label the second audio data, wherein the second audio data comprises multiple pieces of first audio data belonging to the same person.
8. The audio data cleaning apparatus based on speaker identity according to claim 7, characterized by further comprising:
a language separation module, configured to perform language identification on the effective audio and separate out effective audio of a required language.
9. An audio data cleaning device based on speaker identity, characterized in that the device comprises a processor and a memory:
the memory is configured to store program code and transfer the program code to the processor; and
the processor is configured to execute, according to instructions in the program code, the audio data cleaning method based on speaker identity according to any one of claims 1-6.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium is configured to store program code, the program code being used to execute the audio data cleaning method based on speaker identity according to any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910809574.6A CN110491392A (en) | 2019-08-29 | 2019-08-29 | A kind of audio data cleaning method, device and equipment based on speaker's identity |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110491392A true CN110491392A (en) | 2019-11-22 |
Family
ID=68553683
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910809574.6A Pending CN110491392A (en) | 2019-08-29 | 2019-08-29 | A kind of audio data cleaning method, device and equipment based on speaker's identity |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110491392A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102543080A (en) * | 2010-12-24 | 2012-07-04 | 索尼公司 | Audio editing system and audio editing method |
CN103400580A (en) * | 2013-07-23 | 2013-11-20 | 华南理工大学 | Method for estimating importance degree of speaker in multiuser session voice |
CN105161093A (en) * | 2015-10-14 | 2015-12-16 | 科大讯飞股份有限公司 | Method and system for determining the number of speakers |
CN106251874A (en) * | 2016-07-27 | 2016-12-21 | 深圳市鹰硕音频科技有限公司 | A kind of voice gate inhibition and quiet environment monitoring method and system |
CN106683662A (en) * | 2015-11-10 | 2017-05-17 | 中国电信股份有限公司 | Speech recognition method and device |
CN106887231A (en) * | 2015-12-16 | 2017-06-23 | 芋头科技(杭州)有限公司 | A kind of identification model update method and system and intelligent terminal |
- 2019-08-29: CN CN201910809574.6A patent/CN110491392A/en, status: Pending
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110992989A (en) * | 2019-12-06 | 2020-04-10 | 广州国音智能科技有限公司 | Voice acquisition method and device and computer readable storage medium |
CN110992989B (en) * | 2019-12-06 | 2022-05-27 | 广州国音智能科技有限公司 | Voice acquisition method and device and computer readable storage medium |
CN111354346B (en) * | 2020-03-30 | 2023-04-07 | 上海依图信息技术有限公司 | Voice recognition data expansion method and system |
CN111354346A (en) * | 2020-03-30 | 2020-06-30 | 上海依图信息技术有限公司 | Voice recognition data expansion method and system |
CN111524527A (en) * | 2020-04-30 | 2020-08-11 | 合肥讯飞数码科技有限公司 | Speaker separation method, device, electronic equipment and storage medium |
CN111524527B (en) * | 2020-04-30 | 2023-08-22 | 合肥讯飞数码科技有限公司 | Speaker separation method, speaker separation device, electronic device and storage medium |
CN111508503A (en) * | 2020-06-16 | 2020-08-07 | 北京爱数智慧科技有限公司 | Method and device for identifying same speaker |
CN111785291A (en) * | 2020-07-02 | 2020-10-16 | 北京捷通华声科技股份有限公司 | Voice separation method and voice separation device |
CN111899743A (en) * | 2020-07-31 | 2020-11-06 | 斑马网络技术有限公司 | Method and device for acquiring target sound, electronic equipment and storage medium |
CN112165599A (en) * | 2020-10-10 | 2021-01-01 | 广州科天视畅信息科技有限公司 | Automatic conference summary generation method for video conference |
CN112837690A (en) * | 2020-12-30 | 2021-05-25 | 科大讯飞股份有限公司 | Audio data generation method, audio data transcription method and device |
CN112837690B (en) * | 2020-12-30 | 2024-04-16 | 科大讯飞股份有限公司 | Audio data generation method, audio data transfer method and device |
CN112966090A (en) * | 2021-03-30 | 2021-06-15 | 思必驰科技股份有限公司 | Dialogue audio data processing method, electronic device, and computer-readable storage medium |
CN112966090B (en) * | 2021-03-30 | 2022-07-12 | 思必驰科技股份有限公司 | Dialogue audio data processing method, electronic device, and computer-readable storage medium |
CN113891177A (en) * | 2021-05-31 | 2022-01-04 | 多益网络有限公司 | Method, device, equipment and storage medium for generating abstract of audio and video data |
CN113891177B (en) * | 2021-05-31 | 2024-01-05 | 多益网络有限公司 | Abstract generation method, device, equipment and storage medium of audio and video data |
CN113593578A (en) * | 2021-09-03 | 2021-11-02 | 北京紫涓科技有限公司 | Conference voice data acquisition method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110491392A (en) | A kind of audio data cleaning method, device and equipment based on speaker's identity | |
CN109522556B (en) | Intention recognition method and device | |
US10666792B1 (en) | Apparatus and method for detecting new calls from a known robocaller and identifying relationships among telephone calls | |
CN101447185B (en) | Audio frequency rapid classification method based on content | |
CN103700370A (en) | Broadcast television voice recognition method and system | |
CN106503805A (en) | A kind of bimodal based on machine learning everybody talk with sentiment analysis system and method | |
CN106547789B (en) | Lyric generation method and device | |
CN110517667A (en) | A kind of method of speech processing, device, electronic equipment and storage medium | |
CN111369981B (en) | Dialect region identification method and device, electronic equipment and storage medium | |
WO2022134798A1 (en) | Segmentation method, apparatus and device based on natural language, and storage medium | |
CN102779510A (en) | Speech emotion recognition method based on feature space self-adaptive projection | |
CN109192225A (en) | The method and device of speech emotion recognition and mark | |
CN109871686A (en) | Rogue program recognition methods and device based on icon representation and software action consistency analysis | |
CN102915729A (en) | Speech keyword spotting system and system and method of creating dictionary for the speech keyword spotting system | |
CN108520752A (en) | A kind of method for recognizing sound-groove and device | |
CN109800418A (en) | Text handling method, device and storage medium | |
CN112861984A (en) | Speech emotion classification method based on feature fusion and ensemble learning | |
CN102201237A (en) | Emotional speaker identification method based on reliability detection of fuzzy support vector machine | |
Van Leeuwen | Speaker linking in large data sets | |
CN111091809B (en) | Regional accent recognition method and device based on depth feature fusion | |
Anders et al. | Compensating class imbalance for acoustic chimpanzee detection with convolutional recurrent neural networks | |
CN106227720B (en) | A kind of APP software users comment mode identification method | |
KR101478146B1 (en) | Apparatus and method for recognizing speech based on speaker group | |
CN102541935A (en) | Novel Chinese Web document representing method based on characteristic vectors | |
KR101092352B1 (en) | Method and apparatus for automatic classification of sentence corpus |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20191122 |