CN111326162B - Voiceprint feature acquisition method, device and equipment - Google Patents

Voiceprint feature acquisition method, device and equipment

Info

Publication number
CN111326162B
Authority
CN
China
Prior art keywords
voice data
spectrogram
voice
module
voiceprint
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010293620.4A
Other languages
Chinese (zh)
Other versions
CN111326162A (en)
Inventor
肖龙源
李稀敏
刘晓葳
谭玉坤
叶志坚
Current Assignee
Xiamen Kuaishangtong Technology Co Ltd
Original Assignee
Xiamen Kuaishangtong Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Xiamen Kuaishangtong Technology Co Ltd
Priority to CN202010293620.4A
Publication of CN111326162A
Application granted
Publication of CN111326162B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 - Speaker identification or verification
    • G10L 17/02 - Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method, an apparatus, and a device for acquiring voiceprint features. The method comprises the following steps: acquiring voice data of a user; generating a spectrogram from the acquired voice data; refining universal features of the acquired voice data according to the spectrogram; restoring, according to the universal features, the live speech of the voice data corresponding to those features; and extracting voiceprint features of the live speech. In this way, the accuracy of the acquired voice data of the user, and in turn the accuracy of the voiceprint features extracted from that data, can be improved.

Description

Voiceprint feature acquisition method, device and equipment
Technical Field
The invention relates to the technical field of voiceprints, and in particular to a method, an apparatus, and a device for acquiring voiceprint features.
Background
A voiceprint is the spectrum of sound waves carrying speech information, as displayed by an electro-acoustic instrument. Modern research shows that a voiceprint is not only distinctive but also relatively stable: after adulthood, a person's voice remains relatively stable for a long time. Experiments show that no two people have the same voiceprint, and a speaker's voiceprint remains distinct whether the speaker deliberately imitates another person's voice and tone or speaks in a whisper, even when the imitation is vivid and lifelike.
In existing voiceprint feature acquisition schemes, voice data of a user is generally obtained and the voiceprint features are extracted from it directly. During acquisition, the accuracy of the collected voiceprint features is mainly determined by the accuracy of the obtained voice data.
However, existing voiceprint feature acquisition schemes cannot improve the accuracy of the obtained voice data of the user, and therefore cannot improve the accuracy of the voiceprint features extracted from that data.
Disclosure of Invention
In view of this, the present invention provides a method, an apparatus, and a device for acquiring voiceprint features, which can improve the accuracy of the acquired voice data of a user and, in turn, the accuracy of the voiceprint features extracted from that data.
According to an aspect of the present invention, there is provided a method for acquiring voiceprint features, comprising: acquiring voice data of a user; generating a spectrogram from the acquired voice data; refining universal features of the acquired voice data according to the spectrogram; restoring, according to the universal features, the live speech of the voice data corresponding to those features; and extracting voiceprint features of the live speech.
Wherein generating the spectrogram from the acquired voice data comprises: performing a Fourier transform on the acquired voice data, framing and windowing the transformed voice data, performing acoustic feature mapping on the framed and windowed voice data, and generating the spectrogram from the mapped voice data.
Wherein refining the universal features of the acquired voice data according to the spectrogram comprises: obtaining the distribution maps of all acoustic features on the spectrogram, designating as universal features those acoustic features whose distribution-map area is not smaller than a preset threshold, and thereby refining the universal features of the acquired voice data.
Wherein restoring, according to the universal features, the live speech of the voice data corresponding to those features comprises: restoring, by way of voice encapsulation, the live speech of the voice data at each time point of the universal features' time sequence, and then restoring the live speech of the voice data corresponding to the universal features by seamlessly splicing the individually restored segments in time order.
Wherein, after extracting the voiceprint features of the live speech, the method further comprises: optimizing the generated spectrogram.
According to another aspect of the present invention, there is provided an apparatus for acquiring voiceprint features, comprising: an acquisition module, a generation module, a refining module, a restoring module, and an extraction module. The acquisition module is used for acquiring voice data of a user; the generation module is used for generating a spectrogram from the acquired voice data; the refining module is used for refining universal features of the acquired voice data according to the spectrogram; the restoring module is used for restoring, according to the universal features, the live speech of the voice data corresponding to those features; and the extraction module is used for extracting voiceprint features of the live speech.
The generation module is specifically configured to: perform a Fourier transform on the acquired voice data, frame and window the transformed voice data, perform acoustic feature mapping on the framed and windowed voice data, and generate the spectrogram from the mapped voice data.
The refining module is specifically configured to: obtain the distribution maps of all acoustic features on the spectrogram, designate as universal features those acoustic features whose distribution-map area is not smaller than a preset threshold, and thereby refine the universal features of the acquired voice data.
The restoring module is specifically configured to: restore, by way of voice encapsulation, the live speech of the voice data at each time point of the universal features' time sequence, and restore the live speech of the voice data corresponding to the universal features by seamlessly splicing the individually restored segments in time order.
The apparatus for acquiring voiceprint features further comprises an optimization module, which is used for optimizing the generated spectrogram.
According to still another aspect of the present invention, there is provided a voiceprint feature acquisition apparatus comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any of the above methods of voiceprint feature acquisition.
According to a further aspect of the present invention, there is provided a computer readable storage medium storing a computer program which, when executed by a processor, implements the method for acquiring voiceprint characteristics according to any one of the above.
It can be seen from the above scheme that voice data of a user is acquired, a spectrogram is generated from the acquired voice data, universal features of the voice data are refined according to the spectrogram, the live speech of the voice data corresponding to those features is restored, and voiceprint features of the live speech are extracted; in this way, the accuracy of the acquired voice data of the user, and in turn the accuracy of the voiceprint features extracted from it, can be improved.
Furthermore, the above scheme can perform a Fourier transform on the acquired voice data, frame and window the transformed voice data, apply acoustic feature mapping to the framed and windowed voice data, and generate the spectrogram from the mapped voice data. The resulting spectrogram exhibits good within-class similarity and between-class difference, reflects well the differences between different classes of acoustic features, and makes it convenient to refine the universal features of the acquired voice data from the generated spectrogram.
Further, the above scheme can obtain the distribution maps of all acoustic features on the spectrogram, designate as universal features those acoustic features whose distribution-map area is not smaller than a preset threshold, and thereby refine the universal features of the acquired voice data, which makes it convenient to restore, from the universal features, the live speech of the voice data corresponding to them.
Further, the above scheme can restore, by way of voice encapsulation, the live speech of the voice data at each time point of the universal features' time sequence, and restore the live speech of the voice data corresponding to the universal features by seamlessly splicing the individually restored segments in time order.
Furthermore, the above scheme can optimize the generated spectrogram, which can further improve the accuracy of the acquired voice data of the user.
Drawings
To illustrate the embodiments of the present invention and the technical solutions of the prior art more clearly, the drawings used in their description are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of a voiceprint feature acquisition method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a voiceprint feature acquisition method according to another embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an embodiment of the voiceprint feature acquisition apparatus of the present invention;
FIG. 4 is a schematic structural diagram of another embodiment of the voiceprint feature acquisition apparatus of the present invention;
FIG. 5 is a schematic structural diagram of an embodiment of the voiceprint feature acquisition device of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and embodiments. It should be noted that the following embodiments only illustrate the invention and do not limit its scope. Likewise, the following embodiments are only some, not all, of the embodiments of the invention; all other embodiments obtained by those skilled in the art without creative effort fall within the scope of the invention.
The invention provides a voiceprint feature acquisition method, which can improve the accuracy of the acquired voice data of a user and, in turn, the accuracy of the voiceprint features extracted from that data.
Referring to fig. 1, fig. 1 is a schematic flow chart of an embodiment of the voiceprint feature acquisition method of the present invention. It should be noted that, provided substantially the same results are obtained, the method of the present invention is not limited to the flow sequence shown in fig. 1. As shown in fig. 1, the method comprises the following steps:
s101: voice data of a user is acquired.
In this embodiment, the user may be a single user or multiple users, which the present invention does not limit.
In this embodiment, the voice data of multiple users may be obtained at once, in several batches, or user by user, which the present invention does not limit.
In this embodiment, multiple pieces of voice data of the same user, a single piece of voice data of the same user, or multiple pieces of voice data of multiple users may be acquired, which the present invention does not limit.
S102: and generating a spectrogram by using the acquired voice data.
The generating of the spectrogram from the acquired voice data may include:
the method has the advantages that the sound spectrogram has good characteristics of similarity in classes and difference between the classes, can well reflect the difference between different classes of acoustic characteristics, and is convenient for refining the universality characteristic of the acquired voice data according to the generated sound spectrogram.
In this embodiment, the spectrogram may be a time-varying spectrogram, and may be formed by three-dimensional information of frequency, time, and amplitude of an acoustic feature, which is not limited in the present invention.
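The patent gives no formulas for these steps, and it lists the Fourier transform before framing; the sketch below instead follows the conventional short-time ordering (frame, window, then transform each frame). The function name, frame length, hop size, and Hamming window are illustrative assumptions, not the patent's method.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Hypothetical minimal spectrogram: overlapping frames, Hamming
    window, FFT magnitude per frame. Returns (frames, freq bins)."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames * window, axis=1))

sr = 8000                              # sample rate in Hz
t = np.arange(sr) / sr                 # one second of samples
spec = spectrogram(np.sin(2 * np.pi * 1000 * t))
# a 1 kHz tone should peak at bin 1000 / (8000 / 256) = 32
```

The frequency, time, and magnitude axes of `spec` correspond to the three-dimensional information mentioned above.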
S103: and refining the universality characteristic of the acquired voice data according to the spectrogram.
Wherein, the refining the universality characteristic of the acquired voice data according to the spectrogram can include:
the method has the advantages that the field voice of the voice data corresponding to the universal characteristic can be conveniently restored according to the universal characteristic.
S104: and restoring the live voice of the voice data corresponding to the universal characteristic according to the universal characteristic.
The restoring the live speech of the speech data corresponding to the universal feature according to the universal feature may include:
according to the time sequence of the universal characteristic, the field voices of the voice data on the time points corresponding to the time sequence are respectively restored in a voice packaging mode, and the field voices of the voice data corresponding to the universal characteristic are restored in a seamless splicing mode according to the time sequence of the field voices obtained through the respective restoration.
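"Voice encapsulation" is not defined in the patent, so the sketch below models only the splicing half of the step: per-time-point restored segments, represented here as plain sample lists, are ordered by their time points and concatenated. The segment format is an assumption.

```python
def restore_live_speech(segments):
    """Splice per-time-point segments into one waveform by sorting on
    the time point and concatenating, modelling the 'seamless splicing
    according to the time sequence' described above."""
    ordered = sorted(segments, key=lambda pair: pair[0])
    spliced = []
    for _, samples in ordered:
        spliced.extend(samples)
    return spliced

# segments may arrive out of order; splicing restores the time order
restored = restore_live_speech([(2, [50, 60]), (0, [10, 20]), (1, [30, 40])])
```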
S105: voiceprint features of the live speech are extracted.
Wherein, after the extracting the voiceprint feature of the live voice, the method further comprises:
optimizing the generated spectrogram has the advantage that the accuracy of the acquired voice data of the user can be further improved.
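The patent does not specify how voiceprint features are extracted from the live speech. As a deliberately toy stand-in, the sketch below collapses a magnitude spectrogram into a fixed-length vector by averaging log-magnitudes over time; a real system would use a dedicated speaker-embedding model, and nothing here is claimed to be the patent's extractor.

```python
import numpy as np

def voiceprint_embedding(spec):
    """Average the log-magnitudes of a (frames x bins) spectrogram over
    time to get one fixed-length vector per utterance (toy stand-in)."""
    return np.log(spec + 1e-10).mean(axis=0)

emb = voiceprint_embedding(np.ones((10, 129)))
```

Two such vectors could be compared with a cosine distance to decide whether two utterances share a speaker.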
It can be seen that, in this embodiment, voice data of a user is acquired, a spectrogram is generated from it, universal features of the voice data are refined according to the spectrogram, the live speech of the voice data corresponding to those features is restored, and voiceprint features of the live speech are extracted; in this way, the accuracy of the acquired voice data of the user, and in turn the accuracy of the voiceprint features extracted from it, can be improved.
Further, in this embodiment, a Fourier transform may be performed on the acquired voice data, the transformed voice data framed and windowed, acoustic feature mapping applied, and the spectrogram generated from the mapped voice data.
Further, in this embodiment, the distribution maps of all acoustic features may be obtained on the spectrogram, and the acoustic features whose distribution-map area is not smaller than the preset threshold designated as universal features, so as to refine the universal features of the acquired voice data; this makes it convenient to restore, from the universal features, the live speech of the voice data corresponding to them.
Further, in this embodiment, the live speech of the voice data at each time point of the universal features' time sequence may be restored by way of voice encapsulation, and the live speech of the voice data corresponding to the universal features restored by seamlessly splicing the individually restored segments in time order.
Referring to fig. 2, fig. 2 is a schematic flow chart of another embodiment of the voiceprint feature acquisition method of the present invention. In this embodiment, the method comprises the following steps:
S201: Voice data of a user is acquired.
As described above for S101; details are not repeated here.
S202: A spectrogram is generated from the acquired voice data.
As described above for S102; details are not repeated here.
S203: Universal features of the acquired voice data are refined according to the spectrogram.
As described above for S103; details are not repeated here.
S204: The live speech of the voice data corresponding to the universal features is restored according to those features.
As described above for S104; details are not repeated here.
S205: Voiceprint features of the live speech are extracted.
As described above for S105; details are not repeated here.
S206: The generated spectrogram is optimized.
In this embodiment, the generated spectrogram may be optimized with an optimization algorithm, with a cross-entropy loss function, or in other ways, which the present invention does not limit.
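Of the optimization options listed, only the cross-entropy loss is concrete enough to illustrate; how it would be applied to a spectrogram is not specified, so the sketch below shows just the loss itself.

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i), in nats; eps guards log(0)."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))

# when the prediction matches the target exactly, the loss reduces to
# the entropy of the target: log(2) for a uniform two-way distribution
loss = cross_entropy([0.5, 0.5], [0.5, 0.5])
```

Minimizing such a loss over the parameters of whatever produces `q` is the usual way a cross-entropy objective drives an optimization.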
It can be seen that, in this embodiment, the generated spectrogram may be optimized, which has the advantage of further improving the accuracy of the acquired voice data of the user.
The invention also provides a voiceprint feature acquisition apparatus, which can improve the accuracy of the acquired voice data of a user and, in turn, the accuracy of the voiceprint features extracted from that data.
Referring to fig. 3, fig. 3 is a schematic structural diagram of an embodiment of the voiceprint feature acquisition apparatus of the present invention. In this embodiment, the apparatus 30 for acquiring voiceprint features includes an obtaining module 31, a generating module 32, a refining module 33, a restoring module 34, and an extracting module 35.
The obtaining module 31 is configured to obtain voice data of a user.
The generating module 32 is configured to generate a spectrogram from the acquired voice data.
The refining module 33 is configured to refine the universality characteristic of the acquired voice data according to the spectrogram.
The restoring module 34 is configured to restore the live speech of the speech data corresponding to the universal feature according to the universal feature.
The extracting module 35 is configured to extract a voiceprint feature of the live speech.
Optionally, the generating module 32 may be specifically configured to:
perform a Fourier transform on the acquired voice data, frame and window the transformed voice data, perform acoustic feature mapping on the framed and windowed voice data, and generate the spectrogram from the mapped voice data.
Optionally, the refining module 33 may be specifically configured to:
obtain the distribution maps of all acoustic features on the spectrogram, designate as universal features those acoustic features whose distribution-map area is not smaller than a preset threshold, and thereby refine the universal features of the acquired voice data.
Optionally, the restoring module 34 may be specifically configured to:
restore, by way of voice encapsulation, the live speech of the voice data at each time point of the universal features' time sequence, and restore the live speech of the voice data corresponding to the universal features by seamlessly splicing the individually restored segments in time order.
Referring to fig. 4, fig. 4 is a schematic structural diagram of another embodiment of the voiceprint feature acquisition apparatus of the present invention. Unlike the previous embodiment, the apparatus 40 for acquiring voiceprint features according to this embodiment further includes an optimization module 41.
The optimizing module 41 is configured to optimize the generated spectrogram.
Each module of the voiceprint feature acquisition apparatus 30/40 can execute the corresponding steps of the above method embodiments; details are therefore not repeated here, and reference is made to the description of the corresponding steps above.
The present invention further provides a voiceprint feature acquisition apparatus, as shown in fig. 5, including: at least one processor 51; and a memory 52 communicatively coupled to the at least one processor 51; the memory 52 stores instructions executable by the at least one processor 51, and the instructions are executed by the at least one processor 51 to enable the at least one processor 51 to execute the above-mentioned voiceprint feature acquisition method.
The memory 52 and the processor 51 are connected by a bus, which may comprise any number of interconnected buses and bridges linking one or more of the various circuits of the processor 51 and the memory 52 together. The bus may also connect various other circuits, such as peripherals, voltage regulators, and power management circuits, which are well known in the art and therefore not described further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a unit for communicating with various other apparatus over a transmission medium. Data processed by the processor 51 is transmitted over a wireless medium via an antenna, which also receives data and passes it to the processor 51.
The processor 51 is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And the memory 52 may be used to store data used by the processor 51 in performing operations.
The present invention further provides a computer-readable storage medium storing a computer program. The computer program realizes the above-described method embodiments when executed by a processor.
It can be seen from the above scheme that voice data of a user is acquired, a spectrogram is generated from the acquired voice data, universal features of the voice data are refined according to the spectrogram, the live speech of the voice data corresponding to those features is restored, and voiceprint features of the live speech are extracted; in this way, the accuracy of the acquired voice data of the user, and in turn the accuracy of the voiceprint features extracted from it, can be improved.
Furthermore, the above scheme can perform a Fourier transform on the acquired voice data, frame and window the transformed voice data, apply acoustic feature mapping to the framed and windowed voice data, and generate the spectrogram from the mapped voice data. The resulting spectrogram exhibits good within-class similarity and between-class difference, reflects well the differences between different classes of acoustic features, and makes it convenient to refine the universal features of the acquired voice data from the generated spectrogram.
Further, the above scheme can obtain the distribution maps of all acoustic features on the spectrogram, designate as universal features those acoustic features whose distribution-map area is not smaller than a preset threshold, and thereby refine the universal features of the acquired voice data, which makes it convenient to restore, from the universal features, the live speech of the voice data corresponding to them.
Further, the above scheme can restore, by way of voice encapsulation, the live speech of the voice data at each time point of the universal features' time sequence, and restore the live speech of the voice data corresponding to the universal features by seamlessly splicing the individually restored segments in time order, so that restoring the live speech of the voice data improves the accuracy of the acquired voice data of the user.
Furthermore, the above scheme can optimize the generated spectrogram, which can further improve the accuracy of the acquired voice data of the user.
In the embodiments provided by the present invention, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into modules or units is only a logical division, and an actual implementation may divide them differently; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, devices, or units, and may be electrical, mechanical, or in another form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied, in whole or in part, in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above description is only a part of the embodiments of the present invention, and not intended to limit the scope of the present invention, and all equivalent devices or equivalent processes performed by the present invention through the contents of the specification and the drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (8)

1. A method for acquiring voiceprint features, comprising:
acquiring voice data of a user;
generating a spectrogram from the acquired voice data;
refining a universality feature of the acquired voice data according to the spectrogram;
restoring, according to the universality feature, live voice of the voice data corresponding to the universality feature, wherein the restoring the live voice of the voice data corresponding to the universality feature according to the universality feature comprises:
restoring, in a voice encapsulation manner, the live voice of the voice data at each time point corresponding to the time sequence of the universality feature, and seamlessly splicing the restored live voice according to its time sequence, thereby restoring the live voice of the voice data corresponding to the universality feature;
and extracting voiceprint features of the live voice.
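The seamless-splicing step of claim 1 can be illustrated as follows. This is a minimal sketch, not the patented implementation: it assumes, hypothetically, that each restored segment is a `(start_time, samples)` pair, and it realizes "according to the time sequence" by sorting segments by start time before concatenating them.

```python
import numpy as np

def splice_segments(segments):
    """Splice restored speech segments in chronological order.

    segments: list of (start_time, samples) pairs, one per
    universality-feature time point (hypothetical representation).
    Sorting by start time before concatenation yields one continuous
    waveform in the original time sequence.
    """
    ordered = sorted(segments, key=lambda pair: pair[0])
    return np.concatenate([samples for _, samples in ordered])

# Three out-of-order segments of 4 samples each
parts = [(0.50, np.full(4, 2.0)),
         (0.00, np.full(4, 1.0)),
         (1.00, np.full(4, 3.0))]
speech = splice_segments(parts)
print(speech)  # segments concatenated in time order: 1s, then 2s, then 3s
```

The design choice here is deliberately simple: "seamless" is taken to mean direct end-to-end concatenation; a real system might crossfade at segment boundaries.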
2. The method for acquiring voiceprint features of claim 1, wherein the generating a spectrogram from the acquired voice data comprises:
performing a Fourier transform on the acquired voice data, framing and windowing the transformed voice data, performing acoustic feature mapping on the framed and windowed voice data, and generating the spectrogram from the voice data subjected to acoustic feature mapping.
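The spectrogram generation of claim 2 resembles a conventional short-time Fourier transform. The sketch below is illustrative only, with hypothetical parameters (16 kHz sampling, 25 ms Hann-windowed frames, 10 ms hop); note that a standard STFT frames and windows the signal before the per-frame FFT, whereas the claim lists the Fourier transform first, so this is not a literal rendering of the claimed ordering.

```python
import numpy as np

def spectrogram(signal, sr=16000, frame_ms=25, hop_ms=10):
    """Log-magnitude spectrogram via a short-time Fourier transform."""
    frame_len = int(sr * frame_ms / 1000)  # samples per frame (400 at 16 kHz)
    hop = int(sr * hop_ms / 1000)          # samples between frame starts (160)
    window = np.hanning(frame_len)         # taper each frame to reduce leakage
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    spectra = np.fft.rfft(frames * window, axis=1)  # FFT of each windowed frame
    return np.log1p(np.abs(spectra))  # shape: (n_frames, frame_len // 2 + 1)

# 1 second of a 440 Hz tone as a stand-in for acquired voice data
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
S = spectrogram(sig)
print(S.shape)  # (98, 201)
```

The claim's "acoustic feature mapping" step is not modeled here; in practice it might map the magnitude spectrum onto, e.g., a mel filter bank before the image is rendered.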
3. The method for acquiring voiceprint features of claim 1, wherein the refining a universality feature of the acquired voice data according to the spectrogram comprises:
acquiring the distribution graph of each acoustic feature on the spectrogram, and setting, as the universality feature, each acoustic feature whose distribution graph has an area not smaller than a preset threshold, thereby refining the universality feature of the acquired voice data.
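The area-threshold selection of claim 3 can be sketched as follows, under the hypothetical assumption that each acoustic feature's "distribution graph" is a boolean mask over the spectrogram's time-frequency cells, whose area is its count of active cells.

```python
import numpy as np

def refine_universality_features(feature_maps, area_threshold):
    """Keep features whose distribution on the spectrogram covers enough area.

    feature_maps: dict of feature name -> boolean mask over time-frequency
    cells (hypothetical stand-in for the claim's "distribution graph").
    A feature is kept when its mask's active-cell count is not smaller
    than area_threshold.
    """
    return [name for name, mask in feature_maps.items()
            if int(np.count_nonzero(mask)) >= area_threshold]

# Two hypothetical feature maps over a 10x8 spectrogram grid
maps = {
    "pitch_contour": np.ones((10, 8), dtype=bool),   # area 80: widespread
    "burst_noise":   np.zeros((10, 8), dtype=bool),  # area 0: absent
}
print(refine_universality_features(maps, area_threshold=40))  # ['pitch_contour']
```

The feature names and the mask representation are assumptions for illustration; the patent does not specify how distribution graphs are encoded.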
4. The method for acquiring voiceprint features of claim 1, further comprising, after the extracting voiceprint features of the live voice:
optimizing the generated spectrogram.
5. An apparatus for acquiring voiceprint features, comprising:
an acquisition module, a generation module, a refining module, a restoring module, and an extraction module;
the acquisition module is configured to acquire voice data of a user;
the generation module is configured to generate a spectrogram from the acquired voice data;
the refining module is configured to refine a universality feature of the acquired voice data according to the spectrogram;
the restoring module is configured to restore, according to the universality feature, live voice of the voice data corresponding to the universality feature, and is specifically configured to: restore, in a voice encapsulation manner, the live voice of the voice data at each time point corresponding to the time sequence of the universality feature, and seamlessly splice the restored live voice according to its time sequence, thereby restoring the live voice of the voice data corresponding to the universality feature;
and the extraction module is configured to extract voiceprint features of the live voice.
6. The apparatus for acquiring voiceprint features of claim 5, wherein the generation module is specifically configured to:
perform a Fourier transform on the acquired voice data, frame and window the transformed voice data, perform acoustic feature mapping on the framed and windowed voice data, and generate the spectrogram from the voice data subjected to acoustic feature mapping.
7. The apparatus for acquiring voiceprint features of claim 5, wherein the refining module is specifically configured to:
acquire the distribution graph of each acoustic feature on the spectrogram, set, as the universality feature, each acoustic feature whose distribution graph has an area not smaller than a preset threshold, and thereby refine the universality feature of the acquired voice data.
8. The apparatus for acquiring voiceprint features of claim 5, further comprising:
an optimization module;
the optimization module is configured to optimize the generated spectrogram.
CN202010293620.4A 2020-04-15 2020-04-15 Voiceprint feature acquisition method, device and equipment Active CN111326162B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010293620.4A CN111326162B (en) 2020-04-15 2020-04-15 Voiceprint feature acquisition method, device and equipment

Publications (2)

Publication Number Publication Date
CN111326162A CN111326162A (en) 2020-06-23
CN111326162B (en) 2022-10-28

Family

ID=71168122





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant