CN110491392A - Audio data cleaning method, apparatus and device based on speaker identity - Google Patents

Audio data cleaning method, apparatus and device based on speaker identity

Info

Publication number
CN110491392A
CN110491392A (application number CN201910809574.6A)
Authority
CN
China
Prior art keywords
audio
audio data
effective
speaker
identity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910809574.6A
Other languages
Chinese (zh)
Inventor
许敏强
杨世清
刘敏
蒋敬
王泽龙
张露露
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou National Acoustic Intelligent Technology Co Ltd
Original Assignee
Guangzhou National Acoustic Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou National Acoustic Intelligent Technology Co Ltd filed Critical Guangzhou National Acoustic Intelligent Technology Co Ltd
Priority to CN201910809574.6A priority Critical patent/CN110491392A/en
Publication of CN110491392A publication Critical patent/CN110491392A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/14 Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/22 Interactive procedures; Man-machine interfaces

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiments of the present application disclose an audio data cleaning method, apparatus and device based on speaker identity, comprising: decoding acquired original audio data; separating out the valid audio in the decoded audio data, the valid audio being the segments that contain speech; segmenting the valid audio to obtain several audio segments, each of which corresponds to a single person; performing speaker separation on the several audio segments to separate out first audio data belonging to the same person; performing voiceprint clustering on the corresponding first audio data from multiple pieces of original audio to obtain second audio data; and labeling the second audio data. By segmenting the audio data so that every resulting audio segment corresponds to a single person, and then clustering the audio segments so that segments belonging to the same person are merged, the application separates out the speech of different people in the audio in a fine-grained manner.

Description

Audio data cleaning method, apparatus and device based on speaker identity
Technical field
This application relates to the field of audio data processing technology, and in particular to an audio data cleaning method, apparatus and device based on speaker identity.
Background technique
With the current development of information technology, a large amount of voice data is generated every day, and it plays a very important role in service applications in many fields such as telephone recording, network telephony and WeChat calls. However, because the data is acquired in many different ways, the recordings often contain noise, long silences or foreign-language speech, or involve multiple speakers. Such data cannot be used directly and must first go through data cleaning.
Benefiting from the rapid development of information technologies such as cloud computing, big data and deep learning in recent years, voiceprint technology has matured considerably; its recognition accuracy is relatively high and its range of applications is broad. In addition to the most common voiceprint recognition technology, new techniques such as language identification, speaker separation and voiceprint clustering have appeared one after another, providing more technical conditions for fine-grained data cleaning. However, existing technology still cannot separate out the speech of different people in audio in a fine-grained manner.
Summary of the invention
The embodiments of the present application provide an audio data cleaning method, apparatus and device based on speaker identity. By segmenting the audio data so that every resulting audio segment corresponds to a single person, and then clustering the audio segments so that segments belonging to the same person are merged, the speech of different people in the audio is separated out in a fine-grained manner.
In view of this, a first aspect of the application provides an audio data cleaning method based on speaker identity, the method comprising:
decoding acquired original audio data;
separating out the valid audio in the decoded audio data, the valid audio being the segments that contain speech;
segmenting the valid audio to obtain several audio segments, each segment in the several audio segments corresponding to a single person;
performing speaker separation on the several audio segments, and separating out first audio data belonging to the same person;
performing voiceprint clustering on the corresponding first audio data from multiple pieces of original audio to obtain second audio data, and labeling the second audio data, the second audio data comprising multiple pieces of first audio data belonging to the same person.
Preferably, after separating out the valid audio in the decoded audio data, the method further includes: performing language identification on the valid audio, and separating out the valid audio of the required language.
Preferably, performing language identification on the valid audio and extracting the audio of the required language includes: inputting the valid audio into a language identification model, identifying the required language, and extracting the audio of the corresponding language.
Preferably, segmenting the valid audio to obtain several audio segments, each segment in the several audio segments corresponding to a single person, includes:
inputting the separated valid audio into a speaker separation model for segmentation, obtaining several audio segments, each of which corresponds to a single person.
Preferably, performing speaker separation on the several audio segments and separating out first audio data belonging to the same person includes:
performing speaker separation on the several audio segments using a clustering algorithm, separating out the first audio data belonging to the same person, and grouping the audio segments belonging to the same person into one class.
Preferably, performing voiceprint clustering on the corresponding first audio data from multiple pieces of original audio to obtain second audio data and labeling the second audio data, the second audio data comprising multiple pieces of first audio data belonging to the same person, is specifically:
performing voiceprint clustering on the corresponding first audio data from multiple pieces of original audio according to a hierarchical clustering algorithm to obtain second audio data, and labeling the second audio data, the second audio data comprising multiple pieces of first audio data belonging to the same person.
A second aspect of the application provides an audio data cleaning apparatus based on speaker identity, the apparatus comprising:
a decoding module, configured to decode the acquired original audio data;
an audio separation module, configured to separate out the valid audio in the decoded audio data, the valid audio being the segments that contain speech;
a segmentation module, configured to segment the valid audio to obtain several audio segments, each segment in the several audio segments corresponding to a single person;
a speaker separation module, configured to perform speaker separation on the several audio segments and separate out the first audio data belonging to the same person;
a voiceprint clustering module, configured to perform voiceprint clustering on the corresponding first audio data from multiple pieces of original audio to obtain second audio data, and to label the second audio data, the second audio data comprising multiple pieces of first audio data belonging to the same person.
Preferably, the apparatus further includes: a language separation module, configured to perform language identification on the valid audio and separate out the valid audio of the required language.
A third aspect of the application provides an audio data cleaning device based on speaker identity, the device comprising a processor and a memory:
the memory is configured to store program code and transmit the program code to the processor;
the processor is configured to execute, according to instructions in the program code, the steps of the audio data cleaning method based on speaker identity described in the first aspect above.
A fourth aspect of the application provides a computer-readable storage medium, the computer-readable storage medium being configured to store program code, the program code being used to execute the method described in the first aspect above.
As can be seen from the above technical solutions, the embodiments of the present application have the following advantages:
The application provides an audio data cleaning method based on speaker identity. The acquired original audio data is first decoded; the valid audio in the decoded audio data is separated out, where the valid audio is the segments that contain speech; the valid audio is segmented into several audio segments, each of which corresponds to a single person; speaker separation is performed on the several audio segments, separating out the first audio data belonging to the same person; and the corresponding first audio data from multiple pieces of original audio is clustered by voiceprint to obtain second audio data, which is then labeled. By segmenting the audio data so that every resulting audio segment corresponds to a single person, and then clustering the audio segments so that segments belonging to the same person are merged, the application separates out the speech of different people in the audio in a fine-grained manner.
Detailed description of the invention
Fig. 1 is a method flowchart of one embodiment of an audio data cleaning method based on speaker identity according to the application;
Fig. 2 is a method flowchart of another embodiment of an audio data cleaning method based on speaker identity according to the application;
Fig. 3 is a schematic diagram of speaker separation in one embodiment of an audio data cleaning method based on speaker identity according to the application;
Fig. 4 is a schematic diagram of the voiceprint clustering process in one embodiment of an audio data cleaning method based on speaker identity according to the application;
Fig. 5 is a structural diagram of one embodiment of an audio data cleaning apparatus based on speaker identity according to the application.
Specific embodiment
The embodiments of the present application provide an audio data cleaning method, apparatus and device based on speaker identity. By segmenting the audio data so that every resulting audio segment corresponds to a single person, and then clustering the audio segments so that segments belonging to the same person are merged, the speech of different people in the audio is separated out in a fine-grained manner.
In order to enable those skilled in the art to better understand the solution of the application, the technical solutions in the embodiments of the application are described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the application, not all of them. Based on the embodiments in the application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the application.
For ease of understanding, refer to Fig. 1. Fig. 1 is a method flowchart of one embodiment of an audio data cleaning method based on speaker identity according to the application. As shown in Fig. 1, the method includes:
101. Decode the acquired original audio data.
It should be noted that audio is often encoded (compressed) during transmission in order to reduce its size and make it easier to store, transmit and exchange. The audio data obtained in a real business scenario may be in pcap format, which cannot be processed directly; the pcap file needs to be decoded into a processable audio format, for example wav.
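As a concrete illustration of this decoding step, the following is a minimal sketch that extracts G.711 A-law RTP payloads from a pcap capture and writes them out as a 16-bit PCM wav file. The codec, the fixed 12-byte RTP header and the file names are assumptions made for illustration; the patent itself does not prescribe a particular decoder.

    # Minimal sketch (assumptions: G.711 A-law payload over RTP/UDP, no RTP
    # header extensions). Requires the scapy package; audioop is in the
    # standard library up to Python 3.12.
    import audioop
    import wave

    from scapy.all import UDP, rdpcap

    RTP_HEADER_LEN = 12      # fixed RTP header size, assuming no CSRC/extensions
    SAMPLE_RATE = 8000       # G.711 narrowband telephony rate

    packets = rdpcap("call_capture.pcap")                        # hypothetical input file
    pcm = bytearray()
    for pkt in packets:
        if UDP in pkt and len(bytes(pkt[UDP].payload)) > RTP_HEADER_LEN:
            payload = bytes(pkt[UDP].payload)[RTP_HEADER_LEN:]   # strip RTP header
            pcm += audioop.alaw2lin(payload, 2)                  # A-law -> 16-bit PCM

    with wave.open("decoded.wav", "wb") as wav:
        wav.setnchannels(1)              # mono
        wav.setsampwidth(2)              # 16-bit samples
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(bytes(pcm))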
102. Separate out the valid audio in the decoded audio data, the valid audio being the segments that contain speech.
It should be noted that the decoded audio data may contain sound clips of speakers as well as non-speech clips. The non-speech segments may be background noise, music or other interfering clips. In order to provide a more reliable signal to the subsequent processing stages, the background noise, music or interfering clips can be filtered out, or removed in other ways. The speech segments can also be cut directly out of the audio data, for example by detecting the start and end points of the speech segments and extracting the valid speech; the extracted valid speech can further be filtered to remove any residual noise, yielding a cleaner valid audio signal.
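One way to realize this step is a frame-level voice activity detector. The sketch below uses the webrtcvad package and keeps only the 30 ms frames classified as speech; the choice of webrtcvad, the frame length and the aggressiveness level are illustrative assumptions, since the patent only requires that non-speech segments be removed.

    import wave

    import webrtcvad

    FRAME_MS = 30                      # webrtcvad accepts 10/20/30 ms frames

    def extract_valid_audio(path: str, aggressiveness: int = 2) -> bytes:
        """Return the concatenation of all speech frames of a mono 16-bit wav."""
        vad = webrtcvad.Vad(aggressiveness)
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()                        # 8000/16000/32000/48000 Hz
            frame_bytes = int(rate * FRAME_MS / 1000) * 2    # 16-bit mono samples
            audio = wav.readframes(wav.getnframes())

        speech = bytearray()
        for start in range(0, len(audio) - frame_bytes + 1, frame_bytes):
            frame = audio[start:start + frame_bytes]
            if vad.is_speech(frame, rate):                   # keep speech frames only
                speech += frame
        return bytes(speech)

    valid_audio = extract_valid_audio("decoded.wav")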
103. Segment the valid audio to obtain several audio segments, each of which corresponds to a single person.
It should be noted that, in order to separate the speech of different people, the valid speech can first be segmented: based on the characteristics of the speech at transitions, the turning points where the speaker changes are located, and the audio is cut at those turning points.
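The patent later describes performing this step with a trained speaker separation model. Purely for illustration, the sketch below locates candidate turning points with a much simpler heuristic, comparing the mean MFCC vectors of adjacent one-second windows and cutting where they diverge; the window length and threshold are assumptions.

    import librosa
    import numpy as np

    def change_points(path: str, win_s: float = 1.0, threshold: float = 0.25):
        """Return candidate speaker-change times (in seconds) for a wav file."""
        y, sr = librosa.load(path, sr=16000)
        hop = 160                                            # 10 ms frame hop
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, hop_length=hop)
        frames_per_win = int(win_s * sr / hop)
        n_win = mfcc.shape[1] // frames_per_win
        if n_win < 2:
            return []
        means = np.stack([mfcc[:, i * frames_per_win:(i + 1) * frames_per_win].mean(axis=1)
                          for i in range(n_win)])
        cuts = []
        for i in range(1, n_win):
            a, b = means[i - 1], means[i]
            cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
            if 1.0 - cos > threshold:                        # large jump => likely new speaker
                cuts.append(i * win_s)
        return cuts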
104. Perform speaker separation on the several audio segments, separating out the first audio data belonging to the same person.
It should be noted that speaker separation is performed on the several audio segments to separate out the first audio data belonging to the same person. The algorithm used for the separation can be a hierarchical agglomerative clustering (HAC) algorithm, although other clustering algorithms can also be used, so that the audio segments belonging to the same speaker are aggregated together out of the several audio segments, thereby separating the audio of different speakers.
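A minimal sketch of this HAC step is shown below, assuming each audio segment has already been turned into a fixed-length voiceprint embedding by an upstream model; the embedding model itself and the cosine distance cut-off are assumptions for illustration.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import pdist

    def group_segments(embeddings: np.ndarray, cut: float = 0.4) -> np.ndarray:
        """Agglomeratively cluster segment embeddings; returns one label per segment."""
        dists = pdist(embeddings, metric="cosine")           # pairwise cosine distances
        tree = linkage(dists, method="average")              # average-linkage HAC
        return fcluster(tree, t=cut, criterion="distance")   # cut the dendrogram

    # Example: labels such as [1, 1, 2, 1, 2] mean segments 0, 1 and 3 belong to
    # one speaker; their concatenation is that person's "first audio data".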
105. Perform voiceprint clustering on the corresponding first audio data from multiple pieces of original audio to obtain second audio data, and label the second audio data; the second audio data comprises multiple pieces of first audio data belonging to the same person.
It should be noted that after multiple pieces of original audio have been processed through the above steps, multiple pieces of first audio data are obtained. Some of these pieces may belong to the same person and some to different people, so the first audio data needs to be clustered by voiceprint: the audio data belonging to the same person is merged to obtain the second audio data, which is then labeled, for example as speaker a or voiceprint a.
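The following sketch illustrates this cross-recording clustering and labeling step, again assuming one voiceprint embedding per piece of first audio data. The use of scikit-learn's AgglomerativeClustering, the distance threshold and the label format ("speaker_a", "speaker_b", ...) are illustrative choices rather than anything mandated by the patent.

    from collections import defaultdict
    from string import ascii_lowercase

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    def label_first_audio(embeddings: np.ndarray, distance_threshold: float = 0.6) -> dict:
        """Group first-audio-data embeddings by speaker and attach a label to each group."""
        clusterer = AgglomerativeClustering(
            n_clusters=None,                       # let the threshold decide the cluster count
            distance_threshold=distance_threshold,
            metric="cosine",                       # older scikit-learn uses affinity="cosine"
            linkage="average",
        )
        cluster_ids = clusterer.fit_predict(embeddings)

        second_audio = defaultdict(list)           # label -> indices of first audio data
        for idx, cid in enumerate(cluster_ids):
            second_audio[f"speaker_{ascii_lowercase[cid]}"].append(idx)
        return dict(second_audio)

    # e.g. {"speaker_a": [0, 2, 3], "speaker_b": [1]}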
In the embodiments of the present application, the audio data is segmented so that every resulting audio segment corresponds to a single person, and the audio segments are then clustered so that segments belonging to the same person are merged, thereby separating out the speech of different people in the audio in a fine-grained manner.
For ease of understanding, refer to Fig. 2. Fig. 2 is a method flowchart of another embodiment of an audio data cleaning method based on speaker identity according to the application. As shown in Fig. 2, specifically:
201. Decode the acquired original audio data.
It should be noted that audio is often encoded (compressed) during transmission in order to reduce its size and make it easier to store, transmit and exchange. The audio data obtained in a real business scenario may be in pcap format, which cannot be processed directly; the pcap file needs to be decoded into a processable audio format, for example wav.
202. Separate out the valid audio in the decoded audio data, the valid audio being the segments that contain speech.
It should be noted that the decoded audio data may contain sound clips of speakers as well as non-speech clips. The non-speech segments may be background noise, music or other interfering clips. In order to provide a more reliable signal to the subsequent processing stages, the background noise, music or interfering clips can be filtered out, or removed in other ways. The speech segments can also be cut directly out of the audio data, for example by detecting the start and end points of the speech segments and extracting the valid speech; the extracted valid speech can further be filtered to remove any residual noise, yielding a cleaner valid audio signal.
203. Perform language identification on the valid audio, and separate out the valid audio of the required language.
It should be noted that the valid audio may be Chinese speech or non-Chinese speech, so it can be screened for the required language and the language that is needed can be selected.
In a preferred embodiment, the valid audio can be input into a language identification model to identify the required language and extract the audio of the corresponding language.
It should be noted that the language identification model can be a model containing a neural network. The model can be obtained by extracting the speech features of Chinese and non-Chinese speech and training it on speech samples that include both Chinese and non-Chinese. For example, when language separation is needed, the valid audio to be tested is input into the language identification model, its features are extracted, and the language identification model scores the valid audio; the valid audio is then classified according to the score, determining whether it is Chinese audio or non-Chinese audio.
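Sketched below is this score-based decision. The language_model stands for a hypothetical pre-trained binary classifier (Chinese vs. non-Chinese) with a scikit-learn-style predict_proba interface; the patent itself only specifies a neural-network language identification model, so both the feature choice and the classifier interface are assumptions.

    import librosa
    import numpy as np

    def is_required_language(path: str, language_model, threshold: float = 0.5) -> bool:
        """Score a valid-audio clip and decide whether it is in the required language."""
        y, sr = librosa.load(path, sr=16000)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
        features = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)]).reshape(1, -1)
        score = language_model.predict_proba(features)[0, 1]   # P(required language)
        return score >= threshold

    # Keep only the clips judged to be in the required language (e.g. Chinese):
    # chinese_clips = [p for p in clip_paths if is_required_language(p, language_model)]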
204. Segment the valid audio to obtain several audio segments, each of which corresponds to a single person.
It should be noted that, in order to separate the speech of different people, the valid speech can first be segmented: based on the characteristics of the speech at transitions, the turning points where the speaker changes are located, and the audio is cut at those turning points.
In a preferred embodiment, the separated valid audio is input into a speaker separation model for segmentation, obtaining several audio segments, each of which corresponds to a single person.
It should be noted that the speaker separation model can be a model containing a neural network, for example a deep neural network (DNN) model. The features of speaker change points can be extracted from a large amount of audio data and input into the DNN model for training, yielding the speaker separation model. The valid audio to be tested is input into the DNN model, the turning points where the speaker changes are extracted from the valid audio, and the valid audio is cut at those turning points, obtaining several audio segments, each of which corresponds to a single person.
205. Perform speaker separation on the several audio segments, separating out the first audio data belonging to the same person.
It should be noted that the several audio segments are clustered by voiceprint. The clustering algorithm used can be a hierarchical agglomerative clustering (HAC) algorithm, although other clustering algorithms can also be used, so that the audio segments belonging to the same speaker are aggregated together out of the several audio segments, thereby separating the audio of different speakers.
In a specific embodiment, Fig. 3 is a schematic diagram of the clustering process in one embodiment of an audio data cleaning method based on speaker identity: the audio sample used in the figure contains the speech of two speakers, a deep-learning DNN model is used as the speaker separation model, and the clustering algorithm used is the HAC clustering algorithm.
The original audio is first input into the DNN-based speaker separation model, i.e. the audio sample to be tested is input into the DNN model; the turning points where the speaker changes are extracted from the valid audio, and the valid audio is cut at those turning points, obtaining several audio segments, each of which corresponds to a single person; the HAC clustering algorithm then aggregates the several segments in temporal order, yielding the speech samples belonging to each of the two people.
206. Perform voiceprint clustering on the corresponding first audio data from multiple pieces of original audio to obtain second audio data, and label the second audio data; the second audio data comprises multiple pieces of first audio data belonging to the same person.
It should be noted that after multiple pieces of original audio have been processed through the above steps, multiple pieces of first audio data are obtained. Some of these pieces may belong to the same person and some to different people, so the first audio data needs to be clustered by voiceprint: the audio data belonging to the same person is merged to obtain the second audio data, which is then labeled, for example as speaker a or voiceprint a.
In a preferred embodiment, the corresponding first audio data from multiple pieces of original audio is clustered by voiceprint according to a hierarchical clustering algorithm to obtain second audio data, which is then labeled; the second audio data comprises multiple pieces of first audio data belonging to the same person.
It should be noted that the hierarchical clustering algorithm here can be CURE (Clustering Using REpresentatives), or another hierarchical clustering algorithm such as AGNES (AGglomerative NESting), BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), ROCK (RObust Clustering using linKs) or CHAMELEON.
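These hierarchical algorithms are interchangeable at this step. As one concrete, readily available option (an implementation choice for illustration, not something the patent mandates), scikit-learn's BIRCH implementation can cluster the voiceprint embeddings of the first audio data:

    import numpy as np
    from sklearn.cluster import Birch

    # `embeddings`: one voiceprint vector per piece of first audio data, assumed
    # to come from an upstream embedding model; random values stand in here.
    embeddings = np.random.rand(4, 192)
    labels = Birch(threshold=0.5, n_clusters=2).fit_predict(embeddings)
    print(labels)   # e.g. [0 0 1 1] -> two clusters, later labeled speaker A / speaker B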
In a specific embodiment, Fig. 4 is a schematic diagram of the voiceprint clustering process in one embodiment of an audio data cleaning method based on speaker identity, in which four pieces of first audio data are used and the CURE hierarchical clustering algorithm is applied; in practice there may be any number of pieces of first audio data, and a different hierarchical clustering algorithm may also be used.
The four pieces of audio data are first clustered with the CURE hierarchical clustering algorithm, i.e. the similarity between the different audio pieces is computed; pieces whose similarity exceeds a preset threshold are grouped into one class, and each class is labeled, for example as speaker A and speaker B.
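A bare-bones version of the rule just described is sketched below (an illustration, not a CURE implementation): two clips whose cosine similarity exceeds a preset threshold are put in the same class, and each class receives a label such as speaker A or speaker B. The greedy grouping strategy and the threshold value are assumptions.

    import numpy as np

    def label_by_similarity(embeddings: np.ndarray, threshold: float = 0.75) -> dict:
        """Greedy grouping: assign each clip to the first class it is similar enough to."""
        classes = []                                   # each class is a list of clip indices
        for i, emb in enumerate(embeddings):
            for members in classes:
                ref = embeddings[members[0]]
                cos = float(np.dot(emb, ref) / (np.linalg.norm(emb) * np.linalg.norm(ref)))
                if cos > threshold:
                    members.append(i)
                    break
            else:
                classes.append([i])
        # Label the classes: speaker A, speaker B, ...
        return {f"speaker {chr(ord('A') + k)}": members for k, members in enumerate(classes)}

    # Example with the four pieces of first audio data of Fig. 4:
    # print(label_by_similarity(np.random.rand(4, 192)))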
For ease of understanding, refer to Fig. 5. Fig. 5 is a structural diagram of one embodiment of an audio data cleaning apparatus based on speaker identity according to the application. As shown in Fig. 5, the apparatus specifically includes:
A decoding module 501, configured to decode the acquired original audio data.
It should be noted that audio is often encoded (compressed) during transmission in order to reduce its size and make it easier to store, transmit and exchange. The audio data obtained in a real business scenario may be in pcap format, which cannot be processed directly; the pcap file needs to be decoded into a processable audio format, for example wav.
An audio separation module 502, configured to separate out the valid audio in the decoded audio data, the valid audio being the segments that contain speech.
It should be noted that the decoded audio data may contain sound clips of speakers as well as non-speech clips. The non-speech segments may be background noise, music or other interfering clips. In order to provide a more reliable signal to the subsequent processing stages, the background noise, music or interfering clips can be filtered out, or removed in other ways. The speech segments can also be cut directly out of the audio data, for example by detecting the start and end points of the speech segments and extracting the valid speech; the extracted valid speech can further be filtered to remove any residual noise, yielding a cleaner valid audio signal.
A segmentation module 503, configured to segment the valid audio to obtain several audio segments, each of which corresponds to a single person.
It should be noted that, in order to separate the speech of different people, the valid speech can first be segmented: based on the characteristics of the speech at transitions, the turning points where the speaker changes are located, and the audio is cut at those turning points.
A speaker separation module 504, configured to perform speaker separation on the several audio segments and separate out the first audio data belonging to the same person.
It should be noted that speaker separation is performed on the several audio segments to separate out the first audio data belonging to the same person. The algorithm used for the separation can be a hierarchical agglomerative clustering (HAC) algorithm, although other clustering algorithms can also be used, so that the audio segments belonging to the same speaker are aggregated together out of the several audio segments, thereby separating the audio of different speakers.
A voiceprint clustering module 505, configured to perform voiceprint clustering on the corresponding first audio data from multiple pieces of original audio to obtain second audio data, and to label the second audio data; the second audio data comprises multiple pieces of first audio data belonging to the same person.
It should be noted that after multiple pieces of original audio have been processed through the above steps, multiple pieces of first audio data are obtained. Some of these pieces may belong to the same person and some to different people, so the first audio data needs to be clustered by voiceprint: the audio data belonging to the same person is merged to obtain the second audio data, which is then labeled, for example as speaker a or voiceprint a.
In the embodiments of the present application, the audio data is segmented so that every resulting audio segment corresponds to a single person, and the audio segments are then clustered so that segments belonging to the same person are merged, thereby separating out the speech of different people in the audio in a fine-grained manner.
The embodiments of the present application also provide another embodiment of an audio data cleaning apparatus based on speaker identity, which specifically includes:
a decoding module, configured to decode the acquired original audio data;
an audio separation module, configured to separate out the valid audio in the decoded audio data, the valid audio being the segments that contain speech;
a language separation module, configured to perform language identification on the valid audio and separate out the valid audio of the required language;
a segmentation module, configured to segment the valid audio to obtain several audio segments, each of which corresponds to a single person;
a speaker separation module, configured to perform speaker separation on the several audio segments and separate out the first audio data belonging to the same person;
a voiceprint clustering module, configured to perform voiceprint clustering on the corresponding first audio data from multiple pieces of original audio to obtain second audio data, and to label the second audio data; the second audio data comprises multiple pieces of first audio data belonging to the same person.
The embodiments of the present application also provide an audio data cleaning device based on speaker identity. The device includes a processor and a memory: the memory is configured to store program code and transmit the program code to the processor; the processor is configured to execute, according to instructions in the program code, the steps of the audio data cleaning method based on speaker identity of the above embodiments.
The embodiments of the present application also provide a computer-readable storage medium configured to store program code, the program code being used to execute any one implementation of the audio data cleaning method based on speaker identity of the foregoing embodiments.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the systems, apparatuses and units described above can refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
The terms "first", "second" and the like in the description of the application and in the above drawings are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of the application described herein can, for example, be implemented in an order other than the one illustrated or described herein. In addition, the terms "comprise" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.
It should be understood that, in the application, "at least one (item)" means one or more, and "multiple" means two or more.
In the several embodiments provided in the application, it should be understood that the disclosed apparatus and method can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division into modules is only a division by logical function, and there may be other ways of dividing them in actual implementation, e.g. multiple modules or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses or units, and may be electrical, mechanical or in other forms.
Furthermore, the functional units in the embodiments of the application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented either in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to execute all or part of the steps of the methods of the embodiments of the application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
The above embodiments are only intended to illustrate the technical solution of the application, not to limit it. Although the application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they can still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their technical features, and these modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the application.

Claims (10)

1. An audio data cleaning method based on speaker identity, characterized by comprising:
decoding acquired original audio data;
separating out the valid audio in the decoded audio data, the valid audio being the segments that contain speech;
segmenting the valid audio to obtain several audio segments, each segment in the several audio segments corresponding to a single person;
performing speaker separation on the several audio segments, and separating out first audio data belonging to the same person;
performing voiceprint clustering on the corresponding first audio data from multiple pieces of original audio to obtain second audio data, and labeling the second audio data, the second audio data comprising multiple pieces of first audio data belonging to the same person.
2. The audio data cleaning method based on speaker identity according to claim 1, characterized in that, after separating out the valid audio in the decoded audio data, the method further comprises:
performing language identification on the valid audio, and separating out the valid audio of the required language.
3. The method according to claim 2, characterized in that performing language identification on the valid audio and extracting the audio of the required language comprises:
inputting the valid audio into a language identification model, identifying the required language, and extracting the audio of the corresponding language.
4. The audio data cleaning method based on speaker identity according to claim 1, characterized in that segmenting the valid audio to obtain several audio segments, each segment in the several audio segments corresponding to a single person, comprises:
inputting the separated valid audio into a speaker separation model for segmentation, obtaining several audio segments, each of which corresponds to a single person.
5. The audio data cleaning method based on speaker identity according to claim 1, characterized in that performing speaker separation on the several audio segments and separating out first audio data belonging to the same person comprises:
performing speaker separation on the several audio segments using a clustering algorithm, separating out the first audio data belonging to the same person, and grouping the audio segments belonging to the same person into one class.
6. The audio data cleaning method based on speaker identity according to claim 1, characterized in that performing voiceprint clustering on the corresponding first audio data from multiple pieces of original audio to obtain second audio data and labeling the second audio data, the second audio data comprising multiple pieces of first audio data belonging to the same person, is specifically:
performing voiceprint clustering on the corresponding first audio data from multiple pieces of original audio according to a hierarchical clustering algorithm to obtain second audio data, and labeling the second audio data, the second audio data comprising multiple pieces of first audio data belonging to the same person.
7. An audio data cleaning apparatus based on speaker identity, characterized by comprising:
a decoding module, configured to decode acquired original audio data;
an audio separation module, configured to separate out the valid audio in the decoded audio data, the valid audio being the segments that contain speech;
a segmentation module, configured to segment the valid audio to obtain several audio segments, each segment in the several audio segments corresponding to a single person;
a speaker separation module, configured to perform speaker separation on the several audio segments and separate out first audio data belonging to the same person;
a voiceprint clustering module, configured to perform voiceprint clustering on the corresponding first audio data from multiple pieces of original audio to obtain second audio data, and to label the second audio data, the second audio data comprising multiple pieces of first audio data belonging to the same person.
8. The audio data cleaning apparatus based on speaker identity according to claim 7, characterized by further comprising:
a language separation module, configured to perform language identification on the valid audio and separate out the valid audio of the required language.
9. An audio data cleaning device based on speaker identity, characterized in that the device comprises a processor and a memory:
the memory is configured to store program code and transmit the program code to the processor;
the processor is configured to execute, according to instructions in the program code, the audio data cleaning method based on speaker identity according to any one of claims 1-6.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium is configured to store program code, the program code being used to execute the audio data cleaning method based on speaker identity according to any one of claims 1-6.
CN201910809574.6A 2019-08-29 2019-08-29 Audio data cleaning method, apparatus and device based on speaker identity Pending CN110491392A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910809574.6A CN110491392A (en) 2019-08-29 2019-08-29 Audio data cleaning method, apparatus and device based on speaker identity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910809574.6A CN110491392A (en) 2019-08-29 2019-08-29 Audio data cleaning method, apparatus and device based on speaker identity

Publications (1)

Publication Number Publication Date
CN110491392A true CN110491392A (en) 2019-11-22

Family

ID=68553683

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910809574.6A Pending CN110491392A (en) 2019-08-29 2019-08-29 Audio data cleaning method, apparatus and device based on speaker identity

Country Status (1)

Country Link
CN (1) CN110491392A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102543080A (en) * 2010-12-24 2012-07-04 索尼公司 Audio editing system and audio editing method
CN103400580A (en) * 2013-07-23 2013-11-20 华南理工大学 Method for estimating importance degree of speaker in multiuser session voice
CN105161093A (en) * 2015-10-14 2015-12-16 科大讯飞股份有限公司 Method and system for determining the number of speakers
CN106683662A (en) * 2015-11-10 2017-05-17 中国电信股份有限公司 Speech recognition method and device
CN106887231A (en) * 2015-12-16 2017-06-23 芋头科技(杭州)有限公司 A kind of identification model update method and system and intelligent terminal
CN106251874A (en) * 2016-07-27 2016-12-21 深圳市鹰硕音频科技有限公司 A kind of voice gate inhibition and quiet environment monitoring method and system

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110992989A (en) * 2019-12-06 2020-04-10 广州国音智能科技有限公司 Voice acquisition method and device and computer readable storage medium
CN110992989B (en) * 2019-12-06 2022-05-27 广州国音智能科技有限公司 Voice acquisition method and device and computer readable storage medium
CN111354346B (en) * 2020-03-30 2023-04-07 上海依图信息技术有限公司 Voice recognition data expansion method and system
CN111354346A (en) * 2020-03-30 2020-06-30 上海依图信息技术有限公司 Voice recognition data expansion method and system
CN111524527A (en) * 2020-04-30 2020-08-11 合肥讯飞数码科技有限公司 Speaker separation method, device, electronic equipment and storage medium
CN111524527B (en) * 2020-04-30 2023-08-22 合肥讯飞数码科技有限公司 Speaker separation method, speaker separation device, electronic device and storage medium
CN111508503A (en) * 2020-06-16 2020-08-07 北京爱数智慧科技有限公司 Method and device for identifying same speaker
CN111785291A (en) * 2020-07-02 2020-10-16 北京捷通华声科技股份有限公司 Voice separation method and voice separation device
CN111899743A (en) * 2020-07-31 2020-11-06 斑马网络技术有限公司 Method and device for acquiring target sound, electronic equipment and storage medium
CN112165599A (en) * 2020-10-10 2021-01-01 广州科天视畅信息科技有限公司 Automatic conference summary generation method for video conference
CN112837690A (en) * 2020-12-30 2021-05-25 科大讯飞股份有限公司 Audio data generation method, audio data transcription method and device
CN112837690B (en) * 2020-12-30 2024-04-16 科大讯飞股份有限公司 Audio data generation method, audio data transfer method and device
CN112966090A (en) * 2021-03-30 2021-06-15 思必驰科技股份有限公司 Dialogue audio data processing method, electronic device, and computer-readable storage medium
CN112966090B (en) * 2021-03-30 2022-07-12 思必驰科技股份有限公司 Dialogue audio data processing method, electronic device, and computer-readable storage medium
CN113891177A (en) * 2021-05-31 2022-01-04 多益网络有限公司 Method, device, equipment and storage medium for generating abstract of audio and video data
CN113891177B (en) * 2021-05-31 2024-01-05 多益网络有限公司 Abstract generation method, device, equipment and storage medium of audio and video data
CN113593578A (en) * 2021-09-03 2021-11-02 北京紫涓科技有限公司 Conference voice data acquisition method and system

Similar Documents

Publication Publication Date Title
CN110491392A Audio data cleaning method, apparatus and device based on speaker identity
CN109522556B (en) Intention recognition method and device
US10666792B1 (en) Apparatus and method for detecting new calls from a known robocaller and identifying relationships among telephone calls
CN101447185B (en) Audio frequency rapid classification method based on content
CN103700370A (en) Broadcast television voice recognition method and system
CN106503805A (en) A kind of bimodal based on machine learning everybody talk with sentiment analysis system and method
CN106547789B (en) Lyric generation method and device
CN110517667A (en) A kind of method of speech processing, device, electronic equipment and storage medium
CN111369981B (en) Dialect region identification method and device, electronic equipment and storage medium
WO2022134798A1 (en) Segmentation method, apparatus and device based on natural language, and storage medium
CN102779510A (en) Speech emotion recognition method based on feature space self-adaptive projection
CN109192225A (en) The method and device of speech emotion recognition and mark
CN109871686A (en) Rogue program recognition methods and device based on icon representation and software action consistency analysis
CN102915729A (en) Speech keyword spotting system and system and method of creating dictionary for the speech keyword spotting system
CN108520752A (en) A kind of method for recognizing sound-groove and device
CN109800418A (en) Text handling method, device and storage medium
CN112861984A (en) Speech emotion classification method based on feature fusion and ensemble learning
CN102201237A (en) Emotional speaker identification method based on reliability detection of fuzzy support vector machine
Van Leeuwen Speaker linking in large data sets
CN111091809B (en) Regional accent recognition method and device based on depth feature fusion
Anders et al. Compensating class imbalance for acoustic chimpanzee detection with convolutional recurrent neural networks
CN106227720B (en) A kind of APP software users comment mode identification method
KR101478146B1 (en) Apparatus and method for recognizing speech based on speaker group
CN102541935A (en) Novel Chinese Web document representing method based on characteristic vectors
KR101092352B1 (en) Method and apparatus for automatic classification of sentence corpus

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20191122