CN110491392A - Audio data cleaning method, apparatus and device based on speaker identity - Google Patents
Audio data cleaning method, apparatus and device based on speaker identity
- Publication number: CN110491392A
- Application number: CN201910809574.6A
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G — PHYSICS
- G10 — MUSICAL INSTRUMENTS; ACOUSTICS
- G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00 — Speaker identification or verification techniques
- G10L17/02 — Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
- G10L17/06 — Decision making techniques; Pattern matching strategies
- G10L17/14 — Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
- G10L17/22 — Interactive procedures; Man-machine interfaces
Abstract
The embodiments of the present application disclose an audio data cleaning method, apparatus and device based on speaker identity, comprising: decoding acquired original audio data; separating out the effective audio in the decoded audio data, the effective audio being the segments that contain human speech; segmenting the effective audio to obtain several segmented audio sections, each of which corresponds to a single speaker; performing speaker separation on the segmented audio sections to separate out first audio data belonging to the same person; and performing voiceprint clustering on the corresponding first audio data from multiple pieces of original audio to obtain second audio data, which is then labeled. By segmenting the audio data so that each resulting audio segment corresponds to a single speaker, and then clustering the segments so that those belonging to the same person are merged, the present application separates the speech of different speakers out of the audio at a fine-grained level.
Description
Technical field
This application relates to the field of audio data processing technology, and in particular to an audio data cleaning method, apparatus and device based on speaker identity.
Background art
At the current stage of information technology development, large volumes of voice data are generated every day, playing a very important role in service applications across many fields such as telephone recording, network telephony and WeChat calls. However, because the ways in which these data are acquired vary widely, the recordings often contain noise, long silences, mixed foreign languages, or multiple speakers. Such data cannot be used directly and must first go through a data cleaning process.
Benefiting from the rapid development of information technologies such as cloud computing, big data and deep learning in recent years, voiceprint technology has matured considerably: recognition accuracy is relatively high and the range of applications is broad. Besides the most common voiceprint recognition technology, new techniques such as language identification, speaker separation and voiceprint clustering have emerged one after another, providing further technical conditions for fine-grained data cleaning. However, the existing technology still lacks a way to separate out the speech of different speakers in an audio recording at a fine-grained level.
Summary of the invention
The embodiments of the present application provide an audio data cleaning method, apparatus and device based on speaker identity. By segmenting the audio data so that each resulting audio segment corresponds to a single speaker, and then clustering the segments so that those belonging to the same person are merged, the speech of different speakers is separated out of the audio at a fine-grained level.
In view of this, a first aspect of the present application provides an audio data cleaning method based on speaker identity, the method comprising:
decoding acquired original audio data;
separating out the effective audio in the decoded audio data, the effective audio being the segments that contain human speech;
segmenting the effective audio to obtain several segmented audio sections, each section corresponding to a single speaker;
performing speaker separation on the segmented audio sections to separate out first audio data belonging to the same person; and
performing voiceprint clustering on the corresponding first audio data from multiple pieces of original audio to obtain second audio data, and labeling the second audio data, the second audio data comprising multiple pieces of first audio data belonging to the same person.
Preferably, after separating out the effective audio in the decoded audio data, the method further comprises: performing language identification on the effective audio and separating out the effective audio of the required language.
Preferably, performing language identification on the effective audio and extracting the audio of the required language comprises: inputting the effective audio into a language identification model, identifying the required language, and extracting the audio of the corresponding language.
Preferably, segmenting the effective audio to obtain several segmented audio sections, each section corresponding to a single speaker, comprises: inputting the separated effective audio into a speaker-separation model for segmentation, obtaining several segmented audio sections, each of which corresponds to a single speaker.
Preferably, performing speaker separation on the segmented audio sections to separate out first audio data belonging to the same person comprises: performing speaker separation on the segmented audio sections with a clustering algorithm and separating out first audio data belonging to the same person, the audio segments belonging to the same person being grouped into one class.
Preferably, performing voiceprint clustering on the corresponding first audio data from multiple pieces of original audio to obtain second audio data, and labeling the second audio data, the second audio data comprising multiple pieces of first audio data belonging to the same person, specifically comprises: performing voiceprint clustering on the corresponding first audio data from multiple pieces of original audio according to a hierarchical clustering algorithm, obtaining second audio data, and labeling the second audio data, the second audio data comprising multiple pieces of first audio data belonging to the same person.
A second aspect of the present application provides an audio data cleaning apparatus based on speaker identity, the apparatus comprising:
a decoding module for decoding acquired original audio data;
an audio separation module for separating out the effective audio in the decoded audio data, the effective audio being the segments that contain human speech;
a segmentation module for segmenting the effective audio to obtain several segmented audio sections, each section corresponding to a single speaker;
a speaker separation module for performing speaker separation on the segmented audio sections and separating out first audio data belonging to the same person; and
a voiceprint clustering module for performing voiceprint clustering on the corresponding first audio data from multiple pieces of original audio, obtaining second audio data, and labeling the second audio data, the second audio data comprising multiple pieces of first audio data belonging to the same person.
Preferably, the apparatus further comprises a language separation module for performing language identification on the effective audio and separating out the effective audio of the required language.
A third aspect of the present application provides an audio data cleaning device based on speaker identity, the device comprising a processor and a memory: the memory is configured to store program code and transfer the program code to the processor; and the processor is configured to execute, according to instructions in the program code, the steps of the audio data cleaning method based on speaker identity described in the first aspect above.
A fourth aspect of the present application provides a computer-readable storage medium for storing program code, the program code being configured to execute the method described in the first aspect above.
As can be seen from the above technical solutions, the embodiments of the present application have the following advantages:
The present application provides an audio data cleaning method based on speaker identity. Acquired original audio data is first decoded; the effective audio in the decoded audio data is separated out, the effective audio being the segments that contain human speech; the effective audio is segmented to obtain several segmented audio sections, each of which corresponds to a single speaker; speaker separation is performed on the segmented audio sections to separate out first audio data belonging to the same person; and voiceprint clustering is performed on the corresponding first audio data from multiple pieces of original audio to obtain second audio data, which is then labeled. By segmenting the audio data so that each resulting audio segment corresponds to a single speaker, and then clustering the segments so that those belonging to the same person are merged, the present application separates the speech of different speakers out of the audio at a fine-grained level.
Brief description of the drawings
Fig. 1 is a method flowchart of one embodiment of an audio data cleaning method based on speaker identity of the present application;
Fig. 2 is a method flowchart of another embodiment of the audio data cleaning method based on speaker identity of the present application;
Fig. 3 is a schematic diagram of speaker separation in one embodiment of the audio data cleaning method based on speaker identity of the present application;
Fig. 4 is a schematic diagram of the voiceprint clustering process in one embodiment of the audio data cleaning method based on speaker identity of the present application;
Fig. 5 is an apparatus structure diagram of one embodiment of an audio data cleaning apparatus based on speaker identity of the present application.
Detailed description of embodiments
The embodiments of the present application provide an audio data cleaning method, apparatus and device based on speaker identity. By segmenting the audio data so that each resulting audio segment corresponds to a single speaker, and then clustering the segments so that those belonging to the same person are merged, the speech of different speakers is separated out of the audio at a fine-grained level.
To help those skilled in the art better understand the solution of the present application, the technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of them. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
For ease of understanding, reference is made to Fig. 1, which is a method flowchart of one embodiment of an audio data cleaning method based on speaker identity of the present application. As shown in Fig. 1, the method comprises:
101. Decode the acquired original audio data.
It should be noted that audio is usually encoded (compressed) during transmission in order to reduce its size, which makes it easier to store, transmit and exchange. Audio data obtained in real business scenarios may arrive in pcap format, which cannot be processed directly; the pcap file must first be decoded into a processable audio format, such as wav.
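The decoding step is not tied to any particular toolchain, and the patent does not specify a decoder. As an illustrative sketch only, the following snippet wraps raw 16-bit PCM — such as might be extracted from a pcap capture's voice payloads — in a wav container using Python's standard library; the 8 kHz sample rate and the `decoded.wav` filename are assumptions for the example.

```python
import wave

def pcm_to_wav(pcm_bytes, wav_path, sample_rate=8000, channels=1, sample_width=2):
    """Wrap raw 16-bit PCM (e.g. extracted from a capture) in a WAV container."""
    with wave.open(wav_path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)   # 2 bytes = 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(pcm_bytes)

# example: 0.1 s of silence at 8 kHz, 16-bit mono (800 frames of 2 bytes each)
pcm_to_wav(b"\x00\x00" * 800, "decoded.wav")
```

Actual pcap handling would additionally require parsing the capture and reassembling the audio stream, which is outside the scope of this sketch.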
102. Separate out the effective audio in the decoded audio data, the effective audio being the segments that contain human speech.
It should be noted that the decoded audio data may contain both segments in which a speaker is talking and non-speech segments. The non-speech segments may be background noise, music, or other interfering fragments. To provide a more reliable signal for subsequent processing stages, the background noise, music or interfering fragments can be filtered out, or removed by other means. The speech segments can also be extracted directly from the audio data, for example by detecting the start and end points of the segments that contain speech and extracting the effective speech; the extracted effective speech can further be filtered to remove any residual noise, yielding a cleaner effective audio signal.
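The patent does not commit to a particular method for detecting the start and end points of speech. As a minimal sketch, a simple frame-energy threshold can stand in for that detection: frames whose energy exceeds a threshold are kept as effective audio. The 160-sample frame length and the threshold value here are illustrative assumptions, not the patent's parameters.

```python
def frame_energies(samples, frame_len=160):
    """Mean squared energy of each non-overlapping frame of a PCM sample list."""
    return [sum(s * s for s in samples[i:i + frame_len]) / frame_len
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def voiced_frames(samples, frame_len=160, threshold=1e4):
    """Indices of frames whose energy exceeds the threshold (kept as effective audio)."""
    return [i for i, e in enumerate(frame_energies(samples, frame_len)) if e > threshold]

# toy signal: two silent frames, one loud frame, one silent frame
sig = [0] * 320 + [2000, -2000] * 80 + [0] * 160
print(voiced_frames(sig))  # → [2]
```

A production system would more likely use a trained voice-activity-detection model, but the thresholding structure is the same.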
103. Segment the effective audio to obtain several segmented audio sections, each of which corresponds to a single speaker.
It should be noted that, in order to separate the speech of different people, the effective speech can first be segmented: based on the characteristics of speech transitions, the turning points at which the speaker changes are located, and the audio is cut at those turning points.
104. Perform speaker separation on the segmented audio sections and separate out the first audio data belonging to the same person.
It should be noted that speaker separation is performed on the segmented audio sections to separate out the first audio data belonging to the same person. The algorithm used for the separation can be the HAC (hierarchical agglomerative clustering) algorithm, though other clustering algorithms can also be used; the audio segments belonging to the same speaker are thereby gathered together from the segmented sections, separating the audio of different speakers.
105. Perform voiceprint clustering on the corresponding first audio data from multiple pieces of original audio to obtain second audio data, and label the second audio data, the second audio data comprising multiple pieces of first audio data belonging to the same person.
It should be noted that processing multiple pieces of original audio through the above steps yields multiple pieces of first audio data, which may belong to the same person or to different people. The pieces of first audio data therefore need to be clustered by voiceprint so that the audio data belonging to the same person is merged, yielding the second audio data; the second audio data is then labeled, for example as speaker a or voiceprint a.
By segmenting the audio data so that each resulting audio segment corresponds to a single speaker, and then clustering the segments so that those belonging to the same person are merged, the embodiments of the present application separate the speech of different speakers out of the audio at a fine-grained level.
For ease of understanding, reference is made to Fig. 2, which is a method flowchart of another embodiment of the audio data cleaning method based on speaker identity of the present application. As shown in Fig. 2, the method specifically comprises:
201. Decode the acquired original audio data.
It should be noted that audio is usually encoded (compressed) during transmission in order to reduce its size, which makes it easier to store, transmit and exchange. Audio data obtained in real business scenarios may arrive in pcap format, which cannot be processed directly; the pcap file must first be decoded into a processable audio format, such as wav.
202. Separate out the effective audio in the decoded audio data, the effective audio being the segments that contain human speech.
It should be noted that the decoded audio data may contain both segments in which a speaker is talking and non-speech segments. The non-speech segments may be background noise, music, or other interfering fragments. To provide a more reliable signal for subsequent processing stages, the background noise, music or interfering fragments can be filtered out, or removed by other means. The speech segments can also be extracted directly from the audio data, for example by detecting the start and end points of the segments that contain speech and extracting the effective speech; the extracted effective speech can further be filtered to remove any residual noise, yielding a cleaner effective audio signal.
203. Perform language identification on the effective audio and separate out the effective audio of the required language.
It should be noted that the effective audio may be Chinese speech or non-Chinese speech, so it can be screened for the required language and the needed language selected.
In a preferred embodiment, the effective audio is input into a language identification model to identify the required language, and the audio of the corresponding language is then extracted.
It should be noted that the language identification model can be a model that includes a neural network. Such a model can be trained by extracting the speech features of Chinese and non-Chinese speech and feeding speech samples of both Chinese and non-Chinese into the model. For example, when language separation is needed, the effective audio to be tested is input into the language identification model, its features are extracted, and the model scores the effective audio; the effective audio is then classified by score, determining whether it is Chinese audio or non-Chinese audio.
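The model itself is not specified beyond "a model that includes a neural network", so the following sketch assumes such a model has already produced a per-segment Chinese-probability score, and shows only the score-based classification and filtering step described above. The segment names, the scores, and the 0.5 threshold are hypothetical.

```python
def classify_language(scores, threshold=0.5):
    """Label each segment 'zh' or 'non-zh' from a model's Chinese-probability score."""
    return ["zh" if s >= threshold else "non-zh" for s in scores]

def keep_language(segments, scores, wanted="zh", threshold=0.5):
    """Retain only the segments whose predicted language matches the required one."""
    labels = classify_language(scores, threshold)
    return [seg for seg, lab in zip(segments, labels) if lab == wanted]

segs = ["seg0", "seg1", "seg2"]
print(keep_language(segs, [0.9, 0.2, 0.7]))  # → ['seg0', 'seg2']
```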
204. Segment the effective audio to obtain several segmented audio sections, each of which corresponds to a single speaker.
It should be noted that, in order to separate the speech of different people, the effective speech can first be segmented: based on the characteristics of speech transitions, the turning points at which the speaker changes are located, and the audio is cut at those turning points.
In a preferred embodiment, the separated effective audio is input into a speaker-separation model for segmentation, obtaining several segmented audio sections, each of which corresponds to a single speaker.
It should be noted that the speaker-separation model can be a model that includes a neural network, such as a deep neural network (DNN) model. The features of speaker-change turning points can be extracted from a large amount of audio data and fed into the DNN model for training, yielding the speaker-separation model. The effective audio to be tested is input into the DNN model, the turning points at which the speaker changes are extracted, and the effective audio is cut at those turning points, producing several segmented audio sections, each of which corresponds to a single speaker.
205. Perform speaker separation on the segmented audio sections and separate out the first audio data belonging to the same person.
It should be noted that the segmented audio sections are clustered by voiceprint. The clustering algorithms that can be used include the HAC clustering algorithm, though other clustering algorithms can also be used; the audio segments belonging to the same speaker are thereby gathered together from the segmented sections, separating the audio of different speakers.
In a specific embodiment, Fig. 3 shows a schematic diagram of the clustering process in one embodiment of the audio data cleaning method based on speaker identity. The audio sample used in the figure contains the speech of two speakers, a deep-learning DNN model serves as the speaker-separation model, and the clustering algorithm used is the HAC clustering algorithm.
The original audio is first input into the DNN-based speaker-separation model, i.e., the audio sample to be tested is input into the DNN model; the turning points at which the speaker changes are extracted, and the effective audio is cut at those turning points, producing several segmented audio sections, each of which corresponds to a single speaker. The HAC clustering algorithm then merges the segmented audio sections in temporal order, yielding the speech samples belonging to the two speakers.
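The HAC step above can be sketched as average-linkage agglomerative clustering over per-segment voiceprint embeddings. This is an illustrative toy implementation under stated assumptions, not the patent's own code: the 2-D embeddings and the distance threshold are invented so that segments 0 and 2 belong to one voice and segments 1 and 3 to another.

```python
def euclid(a, b):
    """Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def hac(embeddings, max_dist):
    """Average-linkage agglomerative clustering; merging stops once the closest
    pair of clusters is farther apart than max_dist."""
    clusters = [[i] for i in range(len(embeddings))]

    def avg_dist(c1, c2):
        return sum(euclid(embeddings[i], embeddings[j])
                   for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        # find the closest pair of clusters
        d, a, b = min((avg_dist(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > max_dist:
            break
        clusters[a].extend(clusters[b])
        del clusters[b]
    return [sorted(c) for c in clusters]

# two speakers: segments 0 and 2 near (0, 0); segments 1 and 3 near (10, 10)
embs = [(0.0, 0.0), (10.0, 10.0), (0.5, 0.0), (10.0, 10.5)]
print(hac(embs, max_dist=2.0))  # → [[0, 2], [1, 3]]
```

In practice the embeddings would come from a voiceprint model, and a library implementation (e.g. an off-the-shelf hierarchical clustering routine) would replace this O(n³) loop.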
206. Perform voiceprint clustering on the corresponding first audio data from multiple pieces of original audio to obtain second audio data, and label the second audio data, the second audio data comprising multiple pieces of first audio data belonging to the same person.
It should be noted that processing multiple pieces of original audio through the above steps yields multiple pieces of first audio data, which may belong to the same person or to different people. The pieces of first audio data therefore need to be clustered by voiceprint so that the audio data belonging to the same person is merged, yielding the second audio data; the second audio data is then labeled, for example as speaker a or voiceprint a.
In a preferred implementation, voiceprint clustering is performed on the corresponding first audio data from multiple pieces of original audio according to a hierarchical clustering algorithm, obtaining second audio data, which is then labeled; the second audio data comprises multiple pieces of first audio data belonging to the same person.
It should be noted that the hierarchical clustering algorithm can be CURE (Clustering Using REpresentatives), or another hierarchical clustering algorithm such as AGNES (AGglomerative NESting), BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), ROCK (RObust Clustering using linKs) or CHAMELEON.
In a specific embodiment, Fig. 4 shows a schematic diagram of the voiceprint clustering process in one embodiment of the audio data cleaning method based on speaker identity, using four pieces of first audio data and the CURE hierarchical clustering algorithm. In practice there may of course be more pieces of first audio data, and other hierarchical clustering algorithms may also be used.
The four pieces of audio data are first clustered with the CURE hierarchical clustering algorithm: the similarity between the different audios is computed, audios whose similarity exceeds a preset threshold are grouped into one class, and each class is labeled, for example as speaker A and speaker B.
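The threshold-and-label step just described (similarity above a preset threshold → same class → label as speaker A, speaker B, ...) can be sketched with a union-find pass over a pairwise similarity matrix. This is an illustration of the grouping logic only, not CURE itself; the matrix values and the 0.5 threshold are invented for the example.

```python
def group_by_similarity(n, sim, threshold):
    """Union-find grouping: join any two items whose pairwise similarity
    exceeds the threshold, then label each group Speaker A, Speaker B, ..."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] > threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    # label groups in order of their first member
    return {f"Speaker {chr(ord('A') + k)}": members
            for k, members in enumerate(sorted(groups.values()))}

# four pieces of first audio data: 0 & 2 and 1 & 3 share a voice
sim = [[1.0, 0.1, 0.9, 0.2],
       [0.1, 1.0, 0.2, 0.8],
       [0.9, 0.2, 1.0, 0.1],
       [0.2, 0.8, 0.1, 1.0]]
print(group_by_similarity(4, sim, 0.5))
# → {'Speaker A': [0, 2], 'Speaker B': [1, 3]}
```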
For ease of understanding, reference is made to Fig. 5, which is a structure diagram of one embodiment of an audio data cleaning apparatus based on speaker identity of the present application. As shown in Fig. 5, the apparatus specifically comprises:
A decoding module 501 for decoding acquired original audio data.
It should be noted that audio is usually encoded (compressed) during transmission in order to reduce its size, which makes it easier to store, transmit and exchange. Audio data obtained in real business scenarios may arrive in pcap format, which cannot be processed directly; the pcap file must first be decoded into a processable audio format, such as wav.
An audio separation module 502 for separating out the effective audio in the decoded audio data, the effective audio being the segments that contain human speech.
It should be noted that the decoded audio data may contain both segments in which a speaker is talking and non-speech segments. The non-speech segments may be background noise, music, or other interfering fragments. To provide a more reliable signal for subsequent processing stages, the background noise, music or interfering fragments can be filtered out, or removed by other means. The speech segments can also be extracted directly from the audio data, for example by detecting the start and end points of the segments that contain speech and extracting the effective speech; the extracted effective speech can further be filtered to remove any residual noise, yielding a cleaner effective audio signal.
A segmentation module 503 for segmenting the effective audio to obtain several segmented audio sections, each of which corresponds to a single speaker.
It should be noted that, in order to separate the speech of different people, the effective speech can first be segmented: based on the characteristics of speech transitions, the turning points at which the speaker changes are located, and the audio is cut at those turning points.
A speaker separation module 504 for performing speaker separation on the segmented audio sections and separating out the first audio data belonging to the same person.
It should be noted that speaker separation is performed on the segmented audio sections to separate out the first audio data belonging to the same person. The algorithm used for the separation can be the HAC clustering algorithm, though other clustering algorithms can also be used; the audio segments belonging to the same speaker are thereby gathered together from the segmented sections, separating the audio of different speakers.
A voiceprint clustering module 505 for performing voiceprint clustering on the corresponding first audio data from multiple pieces of original audio, obtaining second audio data, and labeling the second audio data, the second audio data comprising multiple pieces of first audio data belonging to the same person.
It should be noted that processing multiple pieces of original audio through the above steps yields multiple pieces of first audio data, which may belong to the same person or to different people. The pieces of first audio data therefore need to be clustered by voiceprint so that the audio data belonging to the same person is merged, yielding the second audio data; the second audio data is then labeled, for example as speaker a or voiceprint a.
By segmenting the audio data so that each resulting audio segment corresponds to a single speaker, and then clustering the segments so that those belonging to the same person are merged, the embodiments of the present application separate the speech of different speakers out of the audio at a fine-grained level.
The embodiments of the present application further provide another embodiment of the audio data cleaning apparatus based on speaker identity, which specifically includes:
a decoding module, configured to decode the acquired original audio data;
an audio separation module, configured to separate out effective audio from the decoded audio data, the effective audio being segments that contain human voice;
a language separation module, configured to perform language identification on the effective audio and separate out effective audio of the required language;
a segmentation module, configured to segment the effective audio to obtain several segmented audio pieces, each segment in the several segmented audio pieces corresponding to a single person;
a speaker separation module, configured to perform speaker separation on the several segmented audio pieces and separate out first audio data belonging to the same person; and
a voiceprint clustering module, configured to perform voiceprint clustering on the corresponding first audio data from multiple pieces of original audio to obtain second audio data, and to label the second audio data, where the second audio data includes multiple pieces of first audio data belonging to the same person.
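The modules listed above form a linear chain: decode, separate effective audio, identify the language, segment, separate speakers, then cluster voiceprints. As a sketch, each module can be modeled as a callable stage in a pipeline; the class and stage names are illustrative assumptions, and the concrete models behind each stage are not shown:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class CleaningPipeline:
    """Chains the embodiment's modules in order. Each stage is a
    callable taking the previous stage's output, so concrete models
    (decoder, VAD, language ID, segmentation, clustering) can be
    swapped in without changing the pipeline itself."""
    stages: List[Callable] = field(default_factory=list)

    def add(self, stage: Callable) -> "CleaningPipeline":
        self.stages.append(stage)
        return self  # allow fluent chaining

    def run(self, audio):
        for stage in self.stages:
            audio = stage(audio)
        return audio
```

For example, `CleaningPipeline().add(decode).add(separate_effective)...` would mirror the module order of the apparatus, with each placeholder replaced by a real model.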
The embodiments of the present application further provide an audio data cleaning device based on speaker identity. The device includes a processor and a memory: the memory is configured to store program code and transfer the program code to the processor; the processor is configured to execute, according to instructions in the program code, the steps of the audio data cleaning method based on speaker identity of the above embodiments.
The embodiments of the present application further provide a computer-readable storage medium configured to store program code, the program code being used to execute any implementation of the audio data cleaning method based on speaker identity of the foregoing embodiments.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may refer to the corresponding processes in the foregoing method embodiments; details are not repeated here.
The terms "first", "second", and the like in the description of the present application and the above drawings are used to distinguish between similar objects, not to describe a particular order or sequence. It should be understood that data so termed may be interchanged where appropriate, so that the embodiments of the present application described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "comprising" and "having", and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to the process, method, product, or device.
It should be understood that in the present application, "at least one (item)" means one or more, and "multiple" means two or more.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into modules is only a division by logical function, and other divisions are possible in actual implementation; for instance, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or of other forms.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and is sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part thereof contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.
Claims (10)
1. An audio data cleaning method based on speaker identity, characterized by comprising:
decoding acquired original audio data;
separating out effective audio from the decoded audio data, the effective audio being segments that contain human voice;
segmenting the effective audio to obtain several segmented audio pieces, each segment in the several segmented audio pieces corresponding to a single person;
performing speaker separation on the several segmented audio pieces to separate out first audio data belonging to the same person; and
performing voiceprint clustering on the corresponding first audio data from multiple pieces of original audio to obtain second audio data, and labeling the second audio data, wherein the second audio data comprises multiple pieces of first audio data belonging to the same person.
2. The audio data cleaning method based on speaker identity according to claim 1, characterized in that, after separating out the effective audio from the decoded audio data, the method further comprises:
performing language identification on the effective audio, and separating out effective audio of a required language.
3. The method according to claim 2, characterized in that performing language identification on the effective audio and extracting the audio of the required language comprises:
inputting the effective audio into a language identification model, identifying the required language, and extracting the audio of the corresponding language.
4. The audio data cleaning method based on speaker identity according to claim 1, characterized in that segmenting the effective audio to obtain several segmented audio pieces, each segment corresponding to a single person, comprises:
inputting the separated effective audio into a speaker separation model for segmentation to obtain several segmented audio pieces, each segment in the several audio pieces corresponding to a single person.
5. The audio data cleaning method based on speaker identity according to claim 1, characterized in that performing speaker separation on the several segmented audio pieces to separate out the first audio data belonging to the same person comprises:
performing speaker separation on the several segmented audio pieces using a clustering algorithm to separate out the first audio data belonging to the same person, the audio fragments belonging to the same person being grouped into one class.
6. The audio data cleaning method based on speaker identity according to claim 1, characterized in that performing voiceprint clustering on the corresponding first audio data from multiple pieces of original audio to obtain second audio data and labeling the second audio data, the second audio data comprising multiple pieces of first audio data belonging to the same person, specifically comprises:
performing voiceprint clustering on the corresponding first audio data from the multiple pieces of original audio according to a hierarchical clustering algorithm to obtain the second audio data, and labeling the second audio data, wherein the second audio data comprises multiple pieces of first audio data belonging to the same person.
7. An audio data cleaning apparatus based on speaker identity, characterized by comprising:
a decoding module, configured to decode acquired original audio data;
an audio separation module, configured to separate out effective audio from the decoded audio data, the effective audio being segments that contain human voice;
a segmentation module, configured to segment the effective audio to obtain several segmented audio pieces, each segment in the several segmented audio pieces corresponding to a single person;
a speaker separation module, configured to perform speaker separation on the several segmented audio pieces to separate out first audio data belonging to the same person; and
a voiceprint clustering module, configured to perform voiceprint clustering on the corresponding first audio data from multiple pieces of original audio to obtain second audio data, and to label the second audio data, wherein the second audio data comprises multiple pieces of first audio data belonging to the same person.
8. The audio data cleaning apparatus based on speaker identity according to claim 7, characterized by further comprising:
a language separation module, configured to perform language identification on the effective audio and separate out effective audio of a required language.
9. An audio data cleaning device based on speaker identity, characterized in that the device comprises a processor and a memory:
the memory is configured to store program code and transfer the program code to the processor; and
the processor is configured to execute, according to instructions in the program code, the audio data cleaning method based on speaker identity according to any one of claims 1-6.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium is configured to store program code, the program code being used to execute the audio data cleaning method based on speaker identity according to any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910809574.6A CN110491392A (en) | 2019-08-29 | 2019-08-29 | A kind of audio data cleaning method, device and equipment based on speaker's identity |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110491392A true CN110491392A (en) | 2019-11-22 |
Family
ID=68553683
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910809574.6A Pending CN110491392A (en) | 2019-08-29 | 2019-08-29 | A kind of audio data cleaning method, device and equipment based on speaker's identity |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110491392A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102543080A (en) * | 2010-12-24 | 2012-07-04 | 索尼公司 | Audio editing system and audio editing method |
CN103400580A (en) * | 2013-07-23 | 2013-11-20 | 华南理工大学 | Method for estimating importance degree of speaker in multiuser session voice |
CN105161093A (en) * | 2015-10-14 | 2015-12-16 | 科大讯飞股份有限公司 | Method and system for determining the number of speakers |
CN106251874A (en) * | 2016-07-27 | 2016-12-21 | 深圳市鹰硕音频科技有限公司 | A kind of voice gate inhibition and quiet environment monitoring method and system |
CN106683662A (en) * | 2015-11-10 | 2017-05-17 | 中国电信股份有限公司 | Speech recognition method and device |
CN106887231A (en) * | 2015-12-16 | 2017-06-23 | 芋头科技(杭州)有限公司 | A kind of identification model update method and system and intelligent terminal |
- 2019-08-29: CN CN201910809574.6A patent/CN110491392A/en, status: Pending
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110992989A (en) * | 2019-12-06 | 2020-04-10 | 广州国音智能科技有限公司 | Voice acquisition method and device and computer readable storage medium |
CN110992989B (en) * | 2019-12-06 | 2022-05-27 | 广州国音智能科技有限公司 | Voice acquisition method and device and computer readable storage medium |
CN111354346B (en) * | 2020-03-30 | 2023-04-07 | 上海依图信息技术有限公司 | Voice recognition data expansion method and system |
CN111354346A (en) * | 2020-03-30 | 2020-06-30 | 上海依图信息技术有限公司 | Voice recognition data expansion method and system |
CN111524527A (en) * | 2020-04-30 | 2020-08-11 | 合肥讯飞数码科技有限公司 | Speaker separation method, device, electronic equipment and storage medium |
CN111524527B (en) * | 2020-04-30 | 2023-08-22 | 合肥讯飞数码科技有限公司 | Speaker separation method, speaker separation device, electronic device and storage medium |
CN111508503A (en) * | 2020-06-16 | 2020-08-07 | 北京爱数智慧科技有限公司 | Method and device for identifying same speaker |
CN111785291A (en) * | 2020-07-02 | 2020-10-16 | 北京捷通华声科技股份有限公司 | Voice separation method and voice separation device |
CN111899743A (en) * | 2020-07-31 | 2020-11-06 | 斑马网络技术有限公司 | Method and device for acquiring target sound, electronic equipment and storage medium |
CN112165599A (en) * | 2020-10-10 | 2021-01-01 | 广州科天视畅信息科技有限公司 | Automatic conference summary generation method for video conference |
CN112837690A (en) * | 2020-12-30 | 2021-05-25 | 科大讯飞股份有限公司 | Audio data generation method, audio data transcription method and device |
CN112837690B (en) * | 2020-12-30 | 2024-04-16 | 科大讯飞股份有限公司 | Audio data generation method, audio data transfer method and device |
CN112966090A (en) * | 2021-03-30 | 2021-06-15 | 思必驰科技股份有限公司 | Dialogue audio data processing method, electronic device, and computer-readable storage medium |
CN112966090B (en) * | 2021-03-30 | 2022-07-12 | 思必驰科技股份有限公司 | Dialogue audio data processing method, electronic device, and computer-readable storage medium |
CN113891177A (en) * | 2021-05-31 | 2022-01-04 | 多益网络有限公司 | Method, device, equipment and storage medium for generating abstract of audio and video data |
CN113891177B (en) * | 2021-05-31 | 2024-01-05 | 多益网络有限公司 | Abstract generation method, device, equipment and storage medium of audio and video data |
CN113593578A (en) * | 2021-09-03 | 2021-11-02 | 北京紫涓科技有限公司 | Conference voice data acquisition method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110491392A (en) | A kind of audio data cleaning method, device and equipment based on speaker's identity | |
CN109522556B (en) | Intention recognition method and device | |
US10666792B1 (en) | Apparatus and method for detecting new calls from a known robocaller and identifying relationships among telephone calls | |
CN101447185B (en) | Audio frequency rapid classification method based on content | |
CN103700370A (en) | Broadcast television voice recognition method and system | |
CN106503805A (en) | A kind of bimodal based on machine learning everybody talk with sentiment analysis system and method | |
CN106547789B (en) | Lyric generation method and device | |
CN110517667A (en) | A kind of method of speech processing, device, electronic equipment and storage medium | |
CN111369981B (en) | Dialect region identification method and device, electronic equipment and storage medium | |
WO2022134798A1 (en) | Segmentation method, apparatus and device based on natural language, and storage medium | |
CN102779510A (en) | Speech emotion recognition method based on feature space self-adaptive projection | |
CN109192225A (en) | The method and device of speech emotion recognition and mark | |
CN109871686A (en) | Rogue program recognition methods and device based on icon representation and software action consistency analysis | |
CN102915729A (en) | Speech keyword spotting system and system and method of creating dictionary for the speech keyword spotting system | |
CN108520752A (en) | A kind of method for recognizing sound-groove and device | |
CN109800418A (en) | Text handling method, device and storage medium | |
CN112861984A (en) | Speech emotion classification method based on feature fusion and ensemble learning | |
CN102201237A (en) | Emotional speaker identification method based on reliability detection of fuzzy support vector machine | |
Van Leeuwen | Speaker linking in large data sets | |
CN111091809B (en) | Regional accent recognition method and device based on depth feature fusion | |
Anders et al. | Compensating class imbalance for acoustic chimpanzee detection with convolutional recurrent neural networks | |
CN106227720B (en) | A kind of APP software users comment mode identification method | |
KR101478146B1 (en) | Apparatus and method for recognizing speech based on speaker group | |
CN102541935A (en) | Novel Chinese Web document representing method based on characteristic vectors | |
KR101092352B1 (en) | Method and apparatus for automatic classification of sentence corpus |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20191122 |