CN108182945A

CN108182945A - Multi-speaker voice separation method and device based on voiceprint features

Info

Publication number: CN108182945A
Application number: CN201810201281.5A
Authority: CN
Inventors: 黎智勇
Original assignee: Guangzhou Speakin Network Technology Co Ltd
Current assignee: Guangzhou Speakin Network Technology Co Ltd
Priority date: 2018-03-12
Filing date: 2018-03-12
Publication date: 2018-06-19

Abstract

The invention discloses a multi-speaker voice separation method and device based on voiceprint features. The method includes: S1, obtaining an audio source file containing the voices of at least two people; S2, converting the audio source file into an audio file in PCM format; S3, cutting the PCM audio file into a number of voice units according to a preset step size and a preset cutting length, where the preset step size is smaller than the preset cutting length; S4, extracting the speech feature parameters of each voice unit in turn; S5, comparing the speech feature parameters of all voice units pairwise in turn, and calculating the matching value between the speech feature parameters of each pair of voice units; S6, determining whether the matching value between the speech feature parameters of two voice units is higher than a preset threshold and, if so, saving the two voice units in order to the same audio set; S7, splicing all voice units in the same audio set into a single-speaker audio subfile and saving it.

Description

Multi-speaker voice separation method and device based on voiceprint features

Technical field

The present invention relates to the technical field of voice separation, and in particular to a multi-speaker voice separation method and device based on voiceprint features.

Background art

Many important meetings are now recorded, in various forms such as audio and text, so that the meeting can be reviewed or played back afterwards. In some scenarios, however, each participant's voice needs to be saved separately, which makes it easier to store and label the recordings and to play back an individual speaker later.

In a text record, each person's remarks can be kept apart by attributing each passage to its speaker, but audio cannot do this: at a meeting everyone speaks, and all the voices are recorded into a single piece of audio. This makes it difficult to label the audio by speaker afterwards. For example, to hear what a particular person said at the time, one can only use the text record to locate the corresponding stretch of audio, which is time-consuming and laborious and cannot rule out mistakes.

At present, meeting records are mostly post-processed manually. Notes must be taken by hand during the meeting, and even when an audio or video recording is made, a large amount of manpower is still needed afterwards before a piece of audio can be labeled by speaker. This consumes considerable manpower and material resources. Moreover, because the human ear's ability to discriminate is error-prone, the range of sound frequencies it can perceive is limited, and what a listener hears is partly subjective, the separated result is affected, leading to the technical problem of large errors in voice separation results.

Summary of the invention

The present invention provides a multi-speaker voice separation method and device based on voiceprint features. They address the technical problem that processing meeting recordings currently consumes a large amount of manpower and material resources and that, because the human ear's ability to discriminate is error-prone, the range of sound frequencies it can perceive is limited, and what a listener hears is partly subjective, the separation results are affected and contain large errors.

The present invention provides a multi-speaker voice separation method based on voiceprint features, including:

S1, obtaining an audio source file containing the voices of at least two people;

S2, converting the audio source file into an audio file in PCM format;

S3, cutting the PCM audio file into a number of voice units according to a preset step size and a preset cutting length, where the preset step size is smaller than the preset cutting length;

S4, extracting the speech feature parameters of each voice unit in turn;

S5, comparing the speech feature parameters of all the voice units pairwise in turn, and calculating the matching value between the speech feature parameters of each pair of voice units;

S6, determining whether the matching value between the speech feature parameters of two voice units is higher than a preset threshold and, if so, saving the two voice units in order to the same audio set;

S7, splicing all the voice units in the same audio set into a single-speaker audio subfile and saving it.

Optionally, step S2 specifically includes:

reading the byte length, sample rate and channel information of the audio source file and storing them in an information base;

stripping the byte length, sample rate and channel information from the audio source file, and converting it into an audio file in PCM format.

Optionally, after step S2 and before step S3, the method further includes:

removing the blank portions of the audio source file according to the byte length, sample rate and channel information of the audio source file in the information base.

Optionally, step S7 specifically includes:

splicing all the voice units in the same audio set into a single-speaker audio subfile, adding information to the single-speaker audio subfile according to the byte length, sample rate and channel information of the audio source file in the information base, and then saving it.

Optionally, after step S1 and before step S2, the method further includes:

performing sampling and/or pre-emphasis and/or pre-filtering and/or windowing and/or endpoint detection on the audio source file.

The present invention provides a multi-speaker voice separation device based on voiceprint features, including:

an acquiring unit, configured to obtain an audio source file containing the voices of at least two people;

a format conversion unit, configured to convert the audio source file into an audio file in PCM format;

a cutting unit, configured to cut the PCM audio file into a number of voice units according to a preset step size and a preset cutting length, where the preset step size is smaller than the preset cutting length;

a feature extraction unit, configured to extract the speech feature parameters of each voice unit in turn;

a comparison and calculation unit, configured to compare the speech feature parameters of all the voice units pairwise in turn and to calculate the matching value between the speech feature parameters of each pair of voice units;

a judging unit, configured to determine whether the matching value between the speech feature parameters of two voice units is higher than a preset threshold and, if so, to save the two voice units in order to the same audio set;

a splicing and saving unit, configured to splice all the voice units in the same audio set into a single-speaker audio subfile and save it.

Optionally, the format conversion unit specifically includes:

a reading subunit, configured to read the byte length, sample rate and channel information of the audio source file and store them in an information base;

a conversion subunit, configured to strip the byte length, sample rate and channel information from the audio source file and convert it into an audio file in PCM format.

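Optionally, the multi-speaker voice separation device based on voiceprint features provided by the present invention further includes:

a blank removal unit, configured to remove the blank portions of the audio source file according to the byte length, sample rate and channel information of the audio source file in the information base.

Optionally, the splicing and saving unit is further configured to:

add information to the single-speaker audio subfile according to the byte length, sample rate and channel information of the audio source file in the information base, and then save it.

Optionally, the multi-speaker voice separation device based on voiceprint features provided by the present invention further includes:

a preprocessing unit, configured to perform sampling and/or pre-emphasis and/or pre-filtering and/or windowing and/or endpoint detection on the audio source file.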

As can be seen from the above technical solutions, the present invention has the following advantages:

The present invention provides a multi-speaker voice separation method based on voiceprint features, including: S1, obtaining an audio source file containing the voices of at least two people; S2, converting the audio source file into an audio file in PCM format; S3, cutting the PCM audio file into a number of voice units according to a preset step size and a preset cutting length, where the preset step size is smaller than the preset cutting length; S4, extracting the speech feature parameters of each voice unit in turn; S5, comparing the speech feature parameters of all the voice units pairwise in turn, and calculating the matching value between the speech feature parameters of each pair of voice units; S6, determining whether the matching value between the speech feature parameters of two voice units is higher than a preset threshold and, if so, saving the two voice units in order to the same audio set; S7, splicing all the voice units in the same audio set into a single-speaker audio subfile and saving it.

In the present invention, the audio source file is divided into a number of voice units, the speech feature parameters of the voice units are extracted in turn, the speech feature parameters are compared pairwise and a matching value is calculated, and whether two voice units belong to the same person's voice is determined by checking whether the matching value is higher than a preset threshold. The audio source file is thereby processed into single-speaker audio subfiles for multiple people. This addresses the technical problem that processing meeting recordings currently consumes a large amount of manpower and material resources and that, because the human ear's ability to discriminate is error-prone, the range of sound frequencies it can perceive is limited, and what a listener hears is partly subjective, the separation results are affected and contain large errors.

Description of the drawings

To describe the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are evidently only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of one embodiment of the multi-speaker voice separation method based on voiceprint features provided by the present invention;

Fig. 2 is a schematic flowchart of another embodiment of the multi-speaker voice separation method based on voiceprint features provided by the present invention;

Fig. 3 is a schematic structural diagram of one embodiment of the multi-speaker voice separation device based on voiceprint features provided by the present invention;

Fig. 4 is a schematic structural diagram of another embodiment of the multi-speaker voice separation device based on voiceprint features provided by the present invention.

Detailed description of the embodiments

The embodiments of the present invention provide a multi-speaker voice separation method and device based on voiceprint features. They address the technical problem that processing meeting recordings currently consumes a large amount of manpower and material resources and that, because the human ear's ability to discriminate is error-prone, the range of sound frequencies it can perceive is limited, and what a listener hears is partly subjective, the separation results are affected and contain large errors.

To make the purpose, features and advantages of the present invention more apparent and easier to understand, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The embodiments described below are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

Referring to Fig. 1, the present invention provides one embodiment of a multi-speaker voice separation method based on voiceprint features, including:

101. Obtain an audio source file containing the voices of at least two people;

102. Convert the audio source file into an audio file in PCM format;

103. Cut the PCM audio file into a number of voice units according to a preset step size and a preset cutting length, where the preset step size is smaller than the preset cutting length;

104. Extract the speech feature parameters of each voice unit in turn;

105. Compare the speech feature parameters of all voice units pairwise in turn, and calculate the matching value between the speech feature parameters of each pair of voice units;

106. Determine whether the matching value between the speech feature parameters of two voice units is higher than a preset threshold and, if so, save the two voice units in order to the same audio set;

107. Splice all voice units in the same audio set into a single-speaker audio subfile and save it.

In this embodiment of the present invention, the audio source file is divided into a number of voice units, the speech feature parameters of the voice units are extracted in turn, the speech feature parameters are compared pairwise and a matching value is calculated, and whether two voice units belong to the same person's voice is determined by checking whether the matching value is higher than a preset threshold. The audio source file is thereby processed into single-speaker audio subfiles for multiple people. This addresses the technical problem that processing meeting recordings currently consumes a large amount of manpower and material resources and that, because the human ear's ability to discriminate is error-prone, the range of sound frequencies it can perceive is limited, and what a listener hears is partly subjective, the separation results are affected and contain large errors.

The above describes one embodiment of the multi-speaker voice separation method based on voiceprint features provided by the present invention; another embodiment of the method is described below.

Referring to Fig. 2, the present invention provides another embodiment of a multi-speaker voice separation method based on voiceprint features, including:

201. Obtain an audio source file containing the voices of at least two people;

Note that when the audio source file being processed is a meeting recording or a report recording, it usually contains the voices of at least two people, and voice separation is therefore required.

202. Perform sampling and/or pre-emphasis and/or pre-filtering and/or windowing and/or endpoint detection on the audio source file;

Note that after the audio source file containing the voices of at least two people is obtained, it needs to be preprocessed by sampling and/or pre-emphasis and/or pre-filtering and/or windowing and/or endpoint detection.
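Two of the preprocessing operations named in step 202, pre-emphasis and windowing, can be sketched as follows. This is a minimal illustration only: the text does not specify the filter or the window, so the first-order filter, the coefficient `alpha=0.97`, the Hamming window and the function names `pre_emphasize` and `apply_hamming` are all assumptions.

```python
import numpy as np

def pre_emphasize(signal, alpha=0.97):
    """First-order pre-emphasis: y[n] = x[n] - alpha * x[n-1] (alpha is an assumed value)."""
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size == 0:
        return signal
    # The first sample has no predecessor and is passed through unchanged.
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def apply_hamming(frame):
    """Taper one analysis frame with a Hamming window (window type is an assumption)."""
    frame = np.asarray(frame, dtype=np.float64)
    return frame * np.hamming(len(frame))
```

Pre-emphasis boosts the higher frequencies relative to the lower ones, and the window reduces spectral leakage at frame edges before any frequency-domain feature extraction.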

203. Read the byte length, sample rate and channel information of the audio source file and store them in an information base;

Note that after the audio source file has been preprocessed, its byte length, sample rate and channel information are read, and all of this information is stored in the information base for subsequent processing.

204. Strip the byte length, sample rate and channel information from the audio source file, and convert it into an audio file in PCM format;

Note that the byte length, sample rate and channel information are stripped from the audio source file, which is converted into an audio file in PCM format, i.e. a PCM audio file with its header removed.
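Steps 203 and 204 can be sketched with Python's standard `wave` module, under the assumption that the audio source file is a WAV container. The dictionary `info` stands in for the information base, "byte length" is interpreted here as the sample width in bytes, and the function name `split_header_and_pcm` is hypothetical; none of these choices comes from the text.

```python
import io
import wave

def split_header_and_pcm(wav_bytes):
    """Return (info, pcm): header fields kept aside, plus the headerless PCM bytes."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        info = {
            "sample_width": wav.getsampwidth(),  # bytes per sample (assumed meaning of "byte length")
            "sample_rate": wav.getframerate(),   # frames per second
            "channels": wav.getnchannels(),      # number of channels
        }
        pcm = wav.readframes(wav.getnframes())   # raw sample data without the header
    return info, pcm
```

Keeping `info` separately is what later allows a valid header to be written back onto each single-speaker subfile.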

205. Remove the blank portions of the audio source file according to the byte length, sample rate and channel information of the audio source file in the information base;

Note that, according to the byte length, sample rate and channel information of the audio source file stored in the information base, blank-space processing is performed on the audio source file to remove its blank portions.
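The text does not say how blank portions are detected. One common approach, shown here purely as an assumed illustration, is to drop short frames whose root-mean-square energy falls below a threshold. The frame duration, the threshold value and the function name `remove_silence` are hypothetical; the sample rate would come from the stored header information.

```python
import numpy as np

def remove_silence(samples, sample_rate, frame_ms=20, threshold=0.01):
    """Drop fixed-length frames whose RMS energy is below `threshold` (assumed criterion)."""
    samples = np.asarray(samples, dtype=np.float64)
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    kept = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = np.sqrt(np.mean(frame ** 2))
        if rms >= threshold:
            kept.append(frame)
    if not kept:
        return samples[:0]  # everything was silent: return an empty array
    return np.concatenate(kept)
```

The samples are assumed to be normalized to the range [-1, 1]; with raw 16-bit integers the threshold would need to be scaled accordingly.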

206. Cut the PCM audio file into a number of voice units according to a preset step size and a preset cutting length, where the preset step size is smaller than the preset cutting length;

Note that after the blank portions of the audio source file have been removed, the PCM audio file is cut into a number of voice units according to the preset step size and the preset cutting length. The preset step size is smaller than the preset cutting length, i.e. the cutting is redundant (adjacent units overlap), which prevents a person's speech from being cut off, or a single word from being split into two syllables, during cutting.
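The redundant cutting in step 206 amounts to a sliding window whose stride is shorter than its length. The sketch below works on any sliceable sequence of samples; the function name `cut_into_units` and the choice to drop trailing samples that do not fill a whole unit are assumptions, since the text does not say how the tail is handled.

```python
def cut_into_units(samples, unit_len, step):
    """Cut `samples` into overlapping units of `unit_len` samples, advancing by `step`."""
    if not 0 < step < unit_len:
        raise ValueError("step must be positive and smaller than unit_len")
    units = []
    # Only full-length units are produced; a shorter tail is dropped (assumed behavior).
    for start in range(0, len(samples) - unit_len + 1, step):
        units.append(samples[start:start + unit_len])
    return units
```

With `unit_len=4` and `step=2`, for example, each unit shares its last two samples with the start of the next one, so a sound that straddles one unit boundary still appears whole in a neighboring unit.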

207. Extract the speech feature parameters of each voice unit in turn;

Note that the speech feature parameters of each voice unit are extracted in turn; the speech feature parameters include, but are not limited to, Mel-frequency cepstral coefficients (MFCCs).
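For readers unfamiliar with Mel-frequency cepstral coefficients, a textbook-style computation can be sketched with NumPy alone: frame the unit, window each frame, take the power spectrum, apply a triangular mel filterbank, take logarithms, and apply a DCT-II. The frame size, hop, filter count, coefficient count and all function names below are assumed values for illustration; the text specifies none of them.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters with centers spaced evenly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):       # rising slope
            bank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):      # falling slope
            bank[i - 1, k] = (right - k) / max(right - center, 1)
    return bank

def mfcc(unit, sample_rate, n_fft=512, hop=256, n_filters=26, n_coeffs=13):
    """Return an (n_frames, n_coeffs) array of MFCCs for one voice unit."""
    unit = np.asarray(unit, dtype=np.float64)
    if len(unit) < n_fft:                   # pad very short units to one full frame
        unit = np.pad(unit, (0, n_fft - len(unit)))
    starts = range(0, len(unit) - n_fft + 1, hop)
    frames = np.stack([unit[s:s + n_fft] for s in starts]) * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_energies = np.log(energies + 1e-10)  # small offset avoids log(0)
    # DCT-II basis, keeping the first n_coeffs coefficients
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), np.arange(n_filters) + 0.5) / n_filters)
    return log_energies @ basis.T
```

A single feature vector per voice unit could then be obtained by, for example, averaging over frames with `mfcc(unit, sample_rate).mean(axis=0)`; that pooling choice is likewise an assumption.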

208. Compare the speech feature parameters of all voice units pairwise in turn, and calculate the matching value between the speech feature parameters of each pair of voice units;

Note that after the speech feature parameters of each voice unit have been extracted, the speech feature parameters of all voice units are compared pairwise in turn, and the matching value between the speech feature parameters of each pair of voice units is calculated. For example, if 5 voice units are separated out, they are compared in turn, which requires 5+4+3+2+1 comparisons, and for each comparison the matching value between the speech feature parameters of the two voice units is calculated.
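The pairwise comparison can be sketched as below. The text does not name the matching measure, so cosine similarity between per-unit feature vectors is used here purely as an assumed stand-in, and the function names are hypothetical. The sketch enumerates each unordered pair of distinct units once, which gives n(n-1)/2 pairs for n units (10 for five units); also comparing every unit with itself would add n more.

```python
from itertools import combinations

import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors; 0.0 if either is all zeros."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def pairwise_matches(features):
    """Return {(i, j): matching value} for every pair of unit indices with i < j."""
    return {
        (i, j): cosine_similarity(features[i], features[j])
        for i, j in combinations(range(len(features)), 2)
    }
```

Keying the result by index pairs keeps the original order of the units available to the later grouping and splicing steps.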

209. Determine whether the matching value between the speech feature parameters of two voice units is higher than a preset threshold and, if so, save the two voice units in order to the same audio set;

Note that it is determined whether the matching value between the speech feature parameters of two voice units is higher than the preset threshold. If so, the two voice units belong to the same person's voice, and they are saved in order to the same audio set, i.e. that person's audio set.
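One way to turn the thresholded pairs into audio sets is a union-find (disjoint-set) pass, sketched below. The text only states that two units whose matching value exceeds the threshold go into the same audio set; treating that relation as transitive (if A matches B and B matches C, all three share a set) is an assumption of this sketch, as are the function name `group_units` and the input shape `{(i, j): score}`.

```python
def group_units(n_units, scores, threshold):
    """Group unit indices into audio sets; `scores` maps (i, j) pairs to matching values."""
    parent = list(range(n_units))

    def find(x):
        # Follow parent links to the set representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), score in scores.items():
        if score > threshold:            # "higher than the preset threshold"
            parent[find(j)] = find(i)

    groups = {}
    for idx in range(n_units):           # ascending index preserves the original order
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())
```

Each returned list holds the indices of one presumed speaker's units in their original temporal order, which is the order needed for splicing.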

210. Splice all voice units in the same audio set into a single-speaker audio subfile, add information to the single-speaker audio subfile according to the byte length, sample rate and channel information of the audio source file in the information base, and then save it;

Note that all voice units in the same audio set are spliced into a single-speaker audio subfile, information is added to the single-speaker audio subfile according to the byte length, sample rate and channel information of the audio source file in the information base, and the subfile is then saved.
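Step 210 can be sketched with the standard `wave` module, again assuming a WAV container for the output. The units are taken to be raw PCM byte strings, and `info` is a hypothetical dictionary holding the stored header fields under the keys `sample_width`, `sample_rate` and `channels`; the function name is likewise an assumption.

```python
import io
import wave

def splice_to_wav_bytes(units, info):
    """Concatenate PCM units in order and prepend a WAV header built from `info`."""
    pcm = b"".join(units)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(info["channels"])
        wav.setsampwidth(info["sample_width"])
        wav.setframerate(info["sample_rate"])
        wav.writeframes(pcm)             # header sizes are patched when the writer closes
    return buf.getvalue()
```

Because the units were cut with overlap, plain concatenation repeats the overlapped samples. The text does not say how this is handled, so a fuller implementation would probably trim the overlap before joining; this sketch does not.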

The above describes another embodiment of the multi-speaker voice separation method based on voiceprint features provided by the present invention; one embodiment of the multi-speaker voice separation device based on voiceprint features provided by the present invention is described below.

Referring to Fig. 3, the present invention provides one embodiment of a multi-speaker voice separation device based on voiceprint features, including:

an acquiring unit 301, configured to obtain an audio source file containing the voices of at least two people;

a format conversion unit 302, configured to convert the audio source file into an audio file in PCM format;

a cutting unit 303, configured to cut the PCM audio file into a number of voice units according to a preset step size and a preset cutting length, where the preset step size is smaller than the preset cutting length;

a feature extraction unit 304, configured to extract the speech feature parameters of each voice unit in turn;

a comparison and calculation unit 305, configured to compare the speech feature parameters of all voice units pairwise in turn and to calculate the matching value between the speech feature parameters of each pair of voice units;

a judging unit 306, configured to determine whether the matching value between the speech feature parameters of two voice units is higher than a preset threshold and, if so, to save the two voice units in order to the same audio set;

a splicing and saving unit 307, configured to splice all voice units in the same audio set into a single-speaker audio subfile and save it.

In this embodiment of the present invention, the cutting unit 303 divides the audio source file into a number of voice units, the feature extraction unit 304 extracts the speech feature parameters of the voice units in turn, the comparison and calculation unit 305 compares the speech feature parameters of the voice units pairwise and calculates a matching value, and finally the judging unit 306 determines whether two voice units belong to the same person's voice by checking whether the matching value is higher than a preset threshold. The audio source file is thereby processed into single-speaker audio subfiles for multiple people. This addresses the technical problem that processing meeting recordings currently consumes a large amount of manpower and material resources and that, because the human ear's ability to discriminate is error-prone, the range of sound frequencies it can perceive is limited, and what a listener hears is partly subjective, the separation results are affected and contain large errors.

The above describes one embodiment of the multi-speaker voice separation device based on voiceprint features provided by the present invention; another embodiment of the device is described below.

Referring to Fig. 4, the present invention provides another embodiment of a multi-speaker voice separation device based on voiceprint features, including:

an acquiring unit 401, configured to obtain an audio source file containing the voices of at least two people;

a format conversion unit 402, configured to convert the audio source file into an audio file in PCM format;

The format conversion unit 402 specifically includes:

a reading subunit 4021, configured to read the byte length, sample rate and channel information of the audio source file and store them in an information base;

a conversion subunit 4022, configured to strip the byte length, sample rate and channel information from the audio source file and convert it into an audio file in PCM format;

a blank removal unit 403, configured to remove the blank portions of the audio source file according to the byte length, sample rate and channel information of the audio source file in the information base;

a cutting unit 404, configured to cut the PCM audio file into a number of voice units according to a preset step size and a preset cutting length, where the preset step size is smaller than the preset cutting length;

a feature extraction unit 405, configured to extract the speech feature parameters of each voice unit in turn;

a comparison and calculation unit 406, configured to compare the speech feature parameters of all voice units pairwise in turn and to calculate the matching value between the speech feature parameters of each pair of voice units;

a judging unit 407, configured to determine whether the matching value between the speech feature parameters of two voice units is higher than a preset threshold and, if so, to save the two voice units in order to the same audio set;

a splicing and saving unit 408, configured to splice all voice units in the same audio set into a single-speaker audio subfile, add information to the single-speaker audio subfile according to the byte length, sample rate and channel information of the audio source file in the information base, and then save it.

Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed systems, devices and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or of another form.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or some of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disc.

The above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A multi-speaker voice separation method based on voiceprint features, characterized by comprising:

S1, obtaining an audio source file containing the voices of at least two people;

S2, converting the audio source file into an audio file in PCM format;

S3, cutting the PCM audio file into a number of voice units according to a preset step size and a preset cutting length, wherein the preset step size is smaller than the preset cutting length;

S4, extracting the speech feature parameters of each voice unit in turn;

S5, comparing the speech feature parameters of all the voice units pairwise in turn, and calculating the matching value between the speech feature parameters of each pair of voice units;

S6, determining whether the matching value between the speech feature parameters of two voice units is higher than a preset threshold and, if so, saving the two voice units in order to the same audio set;

S7, splicing all the voice units in the same audio set into a single-speaker audio subfile and saving it.

2. The multi-speaker voice separation method based on voiceprint features according to claim 1, characterized in that step S2 specifically comprises:

reading the byte length, sample rate and channel information of the audio source file and storing them in an information base;

stripping the byte length, sample rate and channel information from the audio source file, and converting it into an audio file in PCM format.

3. The multi-speaker voice separation method based on voiceprint features according to claim 2, characterized in that, after step S2 and before step S3, the method further comprises:

removing the blank portions of the audio source file according to the byte length, sample rate and channel information of the audio source file in the information base.

4. The multi-speaker voice separation method based on voiceprint features according to claim 3, characterized in that step S7 specifically comprises:

splicing all the voice units in the same audio set into a single-speaker audio subfile, adding information to the single-speaker audio subfile according to the byte length, sample rate and channel information of the audio source file in the information base, and then saving it.

5. The multi-speaker voice separation method based on voiceprint features according to claim 1, characterized in that, after step S1 and before step S2, the method further comprises:

performing sampling and/or pre-emphasis and/or pre-filtering and/or windowing and/or endpoint detection on the audio source file.

6. A multi-speaker voice separation device based on voiceprint features, characterized by comprising:

an acquiring unit, configured to obtain an audio source file containing the voices of at least two people;

a format conversion unit, configured to convert the audio source file into an audio file in PCM format;

a cutting unit, configured to cut the PCM audio file into a number of voice units according to a preset step size and a preset cutting length, wherein the preset step size is smaller than the preset cutting length;

a feature extraction unit, configured to extract the speech feature parameters of each voice unit in turn;

a comparison and calculation unit, configured to compare the speech feature parameters of all the voice units pairwise in turn and to calculate the matching value between the speech feature parameters of each pair of voice units;

a judging unit, configured to determine whether the matching value between the speech feature parameters of two voice units is higher than a preset threshold and, if so, to save the two voice units in order to the same audio set;

a splicing and saving unit, configured to splice all the voice units in the same audio set into a single-speaker audio subfile and save it.

7. The multi-speaker voice separation device based on voiceprint features according to claim 6, characterized in that the format conversion unit specifically comprises:

a reading subunit, configured to read the byte length, sample rate and channel information of the audio source file and store them in an information base;

a conversion subunit, configured to strip the byte length, sample rate and channel information from the audio source file and convert it into an audio file in PCM format.

8. The multi-speaker voice separation device based on voiceprint features according to claim 7, characterized by further comprising:

a blank removal unit, configured to remove the blank portions of the audio source file according to the byte length, sample rate and channel information of the audio source file in the information base.

9. The multi-speaker voice separation device based on voiceprint features according to claim 8, characterized in that the splicing and saving unit is further configured to:

add information to the single-speaker audio subfile according to the byte length, sample rate and channel information of the audio source file in the information base, and then save it.

10. The multi-speaker voice separation device based on voiceprint features according to claim 6, characterized by further comprising:

a preprocessing unit, configured to perform sampling and/or pre-emphasis and/or pre-filtering and/or windowing and/or endpoint detection on the audio source file.