CN112581967B - Voiceprint retrieval method, front-end device and back-end server - Google Patents

Voiceprint retrieval method, front-end device and back-end server

Info

Publication number
CN112581967B
CN112581967B (application CN202011228722.4A)
Authority
CN
China
Prior art keywords
voiceprint
speaker
voice
database
end equipment
Prior art date
Legal status
Active
Application number
CN202011228722.4A
Other languages
Chinese (zh)
Other versions
CN112581967A (en)
Inventor
叶林勇
肖龙源
李稀敏
Current Assignee
Xiamen Kuaishangtong Technology Co Ltd
Original Assignee
Xiamen Kuaishangtong Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Xiamen Kuaishangtong Technology Co Ltd filed Critical Xiamen Kuaishangtong Technology Co Ltd
Priority to CN202011228722.4A priority Critical patent/CN112581967B/en
Publication of CN112581967A publication Critical patent/CN112581967A/en
Application granted granted Critical
Publication of CN112581967B publication Critical patent/CN112581967B/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a voiceprint retrieval method, a front-end device and a back-end server. In the method, collected voice data is labeled with speaker IDs, and voiceprint features are extracted from the voice data at the back-end server; a voiceprint database is constructed from the speaker IDs and the voiceprint features; the voiceprint database is exported and registered on the front-end device; and voiceprint retrieval is performed on current voice data extracted by the front-end device to obtain the current speaker ID. This improves the efficiency of voiceprint retrieval on voice collected through chat tools such as QQ and WeChat on the front-end device, and solves the technical problem that existing voiceprint retrieval methods depend too heavily on a back-end server and cannot realize offline retrieval.

Description

Voiceprint retrieval method, front-end device and back-end server
Technical Field
The invention relates to the technical field of voiceprint recognition, and in particular to a voiceprint retrieval method, and to a front-end device and a back-end server applying the method.
Background
Voiceprint recognition (Voiceprint Recognition, VPR), also known as speaker recognition (Speaker Recognition), is a technique for identifying a speaker from his or her voice: each person's voice carries unique biological characteristics. Like fingerprint recognition, voiceprint recognition offers high safety and reliability and can be applied wherever identity verification is needed, for example in criminal investigation and in financial fields such as banking, securities and insurance. Compared with traditional identification technologies, voiceprint identification has a simple extraction process, low cost, uniqueness, and is difficult to counterfeit or impersonate.
The main tasks of voiceprint recognition include voice signal processing, voiceprint feature extraction, voiceprint modeling, voiceprint matching and decision making. Voiceprint matching currently falls into two application scenarios: 1:1 comparison and 1:N retrieval.
The 1:N voiceprint retrieval scenario is prepared in two steps:
step one: using a pre-trained voiceprint model, extract N-dimensional voiceprint feature vectors from the labeled voices of M speakers;
step two: store the extracted N-dimensional feature vectors of the M speakers in a database.
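The two preparation steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the pre-trained voiceprint model is abstracted as a hypothetical `extract_embedding` stub that returns a deterministic N-dimensional vector, and the database is a plain dictionary keyed by speaker ID.

```python
import random

N_DIM = 8  # N: embedding dimensionality (the patent leaves N model-dependent)

def extract_embedding(voice_samples, n_dim=N_DIM):
    # Stand-in for the pre-trained voiceprint model: maps a speaker's
    # labeled voice data to an N-dimensional feature vector.
    rng = random.Random(hash(tuple(voice_samples)) & 0xFFFFFFFF)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_dim)]

def build_voiceprint_database(labeled_voices):
    # Steps one and two: extract an N-dim vector per labeled speaker and
    # store it keyed by speaker ID, giving an M x N database.
    return {speaker_id: extract_embedding(samples)
            for speaker_id, samples in labeled_voices.items()}

# M = 2 speakers, each with (toy) labeled voice data
db = build_voiceprint_database({"001": [0.1, 0.2], "002": [0.3, 0.4]})
```

In a real system the stub would be replaced by an actual embedding model and the dictionary by persistent storage, but the M-entries-of-N-dimensions shape is the same.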
In traditional 1:N voiceprint retrieval, voice collected by a front-end device such as a mobile phone must be transmitted to the back-end server, where voiceprint features are extracted from the voice and compared in sequence against the N-dimensional feature vectors of the M speakers in the voiceprint database, until all M have been compared.
With this traditional scheme, performing voiceprint retrieval on voice extracted by a front-end device from QQ or WeChat requires sending the voice to the back-end server, performing 1:N voiceprint retrieval there, and returning the result to the front-end device once the retrieval finishes. Because the voices of all front-end devices must be sent to the back-end server, the bandwidth pressure on the server is high and transmission is time-consuming; the front-end device must also wait a long time for the server to analyze and return the retrieval result, which is inefficient. In particular, in the absence of a network, an offline front-end device cannot perform 1:N voiceprint retrieval at all.
Disclosure of Invention
The invention mainly aims to provide a voiceprint retrieval method, a front-end device and a back-end server, so as to solve the technical problem that existing voiceprint retrieval methods depend too heavily on the back-end server and cannot realize offline retrieval.
In order to achieve the above object, the present invention provides a voiceprint retrieval method, which includes the steps of:
labeling the collected voice data by using a speaker ID, and extracting voiceprint features from the voice data at a back-end server;
constructing a voiceprint database according to the speaker ID and the voiceprint characteristics;
exporting and registering the voiceprint database on front-end equipment;
and performing voiceprint retrieval on the current voice data extracted by the front-end equipment to obtain the current speaker ID.
Preferably, the voiceprint database is constructed by inputting the voiceprint features corresponding to M speaker IDs into a pre-trained model and outputting an N-dimensional voiceprint feature vector for each speaker ID; the N-dimensional feature vectors of the M speaker IDs are stored in a database, establishing a voiceprint database with capacity M × N; in the voiceprint database, each speaker's voiceprint feature vector is indexed by the speaker ID.
Preferably, registering the voiceprint database on the front-end device means using a voiceprint database export tool to export each speaker ID and its corresponding voiceprint features and register them on the front-end device; the voiceprint features of the M speakers are exported to M ark files respectively.
Further, the data storage format of the voiceprint features is: model name|model version|[X1, X2, X3, ..., Xn], where X1 to Xn are the components of the N-dimensional voiceprint feature vector extracted for each speaker. When the voiceprint features are imported into a front-end device, the model name and model version of the ark file are further matched against the local model name and local model version of the front-end device; if the model names and/or model versions are inconsistent, the import of the speaker's voiceprint features fails.
Further, the ark file is named with the speaker ID. When the voiceprint features are imported into the front-end device, the speaker ID of the ark file is further matched against the local speaker IDs; if a matching speaker ID already exists in the local voiceprint database, the import of the voiceprint features fails.
Preferably, the current voice data is extracted by using a voice extraction tool to obtain the voice file of the instant messaging software; the voice file is format-converted by a voice conversion tool and stored in the cache of the front-end device.
Furthermore, the voice file extracted by the voice extraction tool is in SILK compression format, and the voice conversion tool converts it from SILK to WAV format.
Further, voiceprint retrieval on the front-end device means extracting voiceprint features from the voice files in the front-end device's cache, comparing the extracted features with those in the locally registered voiceprint database, and judging, according to similarity and/or confidence, that the speaker ID corresponding to a voiceprint feature in the database and the current voice data belong to the same speaker.
Corresponding to the voiceprint retrieval method, the invention provides front-end equipment, which comprises:
the voice acquisition module is used for acquiring current voice data;
the data storage module is used for importing a pre-constructed voiceprint database; the voiceprint database is constructed by labeling pre-collected voice data with speaker IDs, extracting voiceprint features from the voice data at a back-end server, and constructing the voiceprint database according to the speaker IDs and the voiceprint features;
And the voiceprint retrieval module is used for extracting current voiceprint characteristics from the current voice data, and performing voiceprint retrieval on the voiceprint database according to the current voiceprint characteristics to obtain the current speaker ID.
In addition, to achieve the above object, the present invention also provides a back-end server, including:
the data importing module is used for importing pre-acquired voice data;
the data processing module is used for marking the speaker ID of the voice data and extracting voiceprint features of the voice data;
the voiceprint database construction module is used for constructing a voiceprint database according to the speaker ID and the voiceprint characteristics;
and the data export module is used for exporting and registering the voiceprint database to the front-end equipment.
The beneficial effects of the invention are as follows:
(1) The voiceprint database is constructed on the back-end server and then exported and registered on the front-end device, so that during retrieval the voiceprint features are matched directly on the front-end device to identify the current speaker ID; this solves the problem that, because voice data could not previously be registered on the front-end device, offline devices could not carry out voiceprint retrieval;
(2) Voiceprint retrieval is performed directly on the front-end device, without data interaction with the back-end server during the retrieval process, which greatly improves the efficiency of voiceprint retrieval on voice collected through chat tools such as QQ and WeChat on the front-end device;
(3) When the voiceprint feature file is imported, the model name and model version are further matched, effectively protecting the data security of the voiceprint data registered on the back-end server;
(4) The voiceprint feature file is named with the speaker ID, and speaker ID matching is further performed when the file is imported, which avoids repeated registration of voiceprint data occupying the storage space of the front-end device.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions in the embodiments of the present invention will be clearly and completely described below in conjunction with the specific embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
First embodiment (retrieval method)
The voiceprint retrieval method of the embodiment comprises the following steps:
a. labeling the collected voice data by using a speaker ID, and extracting voiceprint features from the voice data at a back-end server;
b. constructing a voiceprint database according to the speaker ID and the voiceprint characteristics;
c. exporting and registering the voiceprint database on front-end equipment;
d. and performing voiceprint retrieval on the current voice data extracted by the front-end equipment to obtain the current speaker ID.
In step a, the voice data may be collected by a recording device, collected by an intelligent terminal with a microphone such as a mobile phone, or imported from a third-party voice database. Speaker ID labeling can be performed on external equipment or on the back-end server, and can be done manually or automatically by a model.
In step b, the voiceprint database is constructed by inputting the voiceprint features corresponding to M speaker IDs into a pre-trained model and outputting an N-dimensional voiceprint feature vector for each speaker ID; the N-dimensional feature vectors of the M speaker IDs are stored in a database, establishing a voiceprint database with capacity M × N. In the voiceprint database, each speaker's voiceprint feature vector is indexed by the speaker ID, so a speaker's N-dimensional vector can be looked up by ID. In this embodiment, the voiceprint features are MFCC (Mel-Frequency Cepstral Coefficient) features.
In step c, registering the voiceprint database on the front-end device means using a voiceprint database export tool to export each speaker ID and its corresponding voiceprint features and register them on the front-end device; the voiceprint features of the M speakers are exported to M ark files respectively.
In this embodiment, the data storage format of the voiceprint features is: model name|model version|[X1, X2, X3, ..., Xn], where X1 to Xn are the components of the N-dimensional voiceprint feature vector extracted for each speaker. When the voiceprint features are imported into a front-end device, the model name and model version of the ark file are further matched against the local model name and local model version of the front-end device; if the model names and/or model versions are inconsistent, the import of the speaker's voiceprint features fails.
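A minimal sketch of this storage format and the import-time model check is shown below. The field layout follows the format stated above (model name|model version|[X1, ..., Xn]); the function names and the text-based encoding are illustrative assumptions, not the patent's actual on-disk ark encoding.

```python
def serialize_voiceprint(model_name, model_version, vector):
    # Write one feature record in the stated layout:
    # model name|model version|[X1, X2, ..., Xn]
    body = "[" + ", ".join(f"{x:.6f}" for x in vector) + "]"
    return f"{model_name}|{model_version}|{body}"

def import_voiceprint(record, local_model_name, local_model_version):
    # Import-time check: reject the record when the model name and/or
    # model version does not match the front-end device's local model.
    name, version, body = record.split("|", 2)
    if name != local_model_name or version != local_model_version:
        return None  # import fails
    return [float(x) for x in body.strip("[]").split(",")]

rec = serialize_voiceprint("vpr-model", "1.0", [0.1, 0.2, 0.3])
ok = import_voiceprint(rec, "vpr-model", "1.0")   # versions match
bad = import_voiceprint(rec, "vpr-model", "2.0")  # version mismatch
```

Rejecting mismatched model versions matters because embedding vectors from different model versions live in different feature spaces and cannot be meaningfully compared.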
The ark file is named with the speaker ID. When the voiceprint features are imported into the front-end device, the speaker ID of the ark file is further matched against the local speaker IDs; if a matching speaker ID already exists in the local voiceprint database, the import of the voiceprint features fails. For example, if an ark file indicates a speaker voiceprint registration ID of 001, and a speaker with ID 001 already exists in the local voiceprint database, the speaker's voiceprint features need not be imported again.
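The deduplication rule above can be sketched in a few lines. This is an illustrative stub: the speaker ID is taken from the ark file name, and the local database is a plain dictionary standing in for the device's registered voiceprints.

```python
import os

def register_ark_file(ark_path, local_db):
    # Ark files are named by speaker ID; skip the import when that ID
    # is already registered locally, avoiding duplicate registration
    # that would waste the front-end device's storage.
    speaker_id = os.path.splitext(os.path.basename(ark_path))[0]
    if speaker_id in local_db:
        return False  # import fails: speaker already registered
    local_db[speaker_id] = ark_path  # placeholder for the parsed features
    return True

local_db = {"001": "registered"}                      # 001 already present
first = register_ark_file("/export/001.ark", local_db)   # rejected
second = register_ark_file("/export/002.ark", local_db)  # accepted
```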
In step d, the current voice data is extracted by using a voice extraction tool to obtain the voice file of the instant messaging software; the voice file is format-converted by a voice conversion tool and stored in the cache of the front-end device. The voice file extracted by the voice extraction tool is in SILK compression format, and the voice conversion tool converts it from SILK to WAV format.
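The SILK-to-WAV conversion step might be wrapped as below. The `silk2wav` binary name is a placeholder for whatever decoder the conversion tool actually uses, and the cache layout is an assumption; only the shape of the pipeline (decode into the device cache, then hand the WAV path to feature extraction) comes from the text.

```python
import os

def build_silk_to_wav_cmd(silk_path, cache_dir):
    # Construct the conversion step: SILK-compressed IM voice file ->
    # WAV file placed in the front-end device's cache.
    # "silk2wav" is hypothetical; substitute the actual decoder binary.
    base = os.path.splitext(os.path.basename(silk_path))[0]
    wav_path = os.path.join(cache_dir, base + ".wav")
    return ["silk2wav", silk_path, wav_path], wav_path

cmd, out = build_silk_to_wav_cmd("/chats/msg_0001.silk", "/cache")
# the command would then be executed with subprocess.run(cmd, check=True)
# before extracting MFCC features from the WAV file at `out`
```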
Voiceprint retrieval on the front-end device means extracting voiceprint features from the voice files in the front-end device's cache, comparing the extracted features with those in the locally registered voiceprint database, and judging, according to similarity and/or confidence, that the speaker ID corresponding to a voiceprint feature in the database and the current voice data belong to the same speaker.
Second embodiment (front-end equipment):
the embodiment also correspondingly provides front-end equipment, which comprises:
the voice acquisition module is used for acquiring current voice data;
the data storage module is used for importing a pre-constructed voiceprint database; the voiceprint database is constructed by labeling pre-collected voice data with speaker IDs, extracting voiceprint features from the voice data at a back-end server, and constructing the voiceprint database according to the speaker IDs and the voiceprint features;
And the voiceprint retrieval module is used for extracting current voiceprint characteristics from the current voice data, and performing voiceprint retrieval on the voiceprint database according to the current voiceprint characteristics to obtain the current speaker ID.
To extract the current voiceprint features from the current voice data, the voice files of instant messaging software such as WeChat and QQ, which are stored in SILK compression format, are first obtained; a purpose-built SILK-to-WAV voice conversion tool then converts them to WAV format and stores them in the cache of the front-end device, after which MFCC features are extracted from the cached WAV audio.
In this embodiment, performing voiceprint retrieval on the voiceprint database according to the current voiceprint features means running 1:N voiceprint retrieval between the extracted MFCC features and the locally registered voiceprint database: feature similarity is compared one by one, the 5 voiceprint features with the highest scores are output together with confidence scores, and when the confidence score of one of the 5 results exceeds a threshold x, the corresponding voice data in the voiceprint database and the retrieved current voice data are judged to belong to the same speaker.
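The 1:N matching loop described above can be sketched as follows. The patent does not specify the similarity measure, so cosine similarity is used here as a common choice; the top-5 cutoff and the acceptance threshold x follow the text, while the concrete threshold value is an illustrative assumption.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve_speaker(query_vec, voiceprint_db, top_k=5, threshold=0.8):
    # 1:N retrieval: score the query against every registered vector,
    # keep the top-k scores, and accept a speaker only when the best
    # score clears the confidence threshold x.
    scored = sorted(((cosine(query_vec, vec), sid)
                     for sid, vec in voiceprint_db.items()), reverse=True)
    top = scored[:top_k]
    best_score, best_id = top[0]
    return (best_id if best_score > threshold else None), top

# toy 2-dimensional database with M = 3 registered speakers
db = {"001": [1.0, 0.0], "002": [0.0, 1.0], "003": [0.7, 0.7]}
match, top5 = retrieve_speaker([0.9, 0.1], db)  # closest to speaker 001
```

Because the comparison runs entirely against the locally registered database, no round trip to the back-end server is needed, which is the efficiency point the embodiment makes.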
The front-end device may be a mobile or fixed terminal with voice input, such as a mobile phone, tablet computer or smart speaker, and may comprise a memory, a processor, an input unit, a display unit, a power supply and other components.
The voice acquisition module can be a recording tool or an instant messaging tool such as QQ and WeChat, and the voiceprint retrieval module can be APP software with an identity verification function or APP software special for voiceprint retrieval. The voiceprint database can be imported through a voiceprint importing tool on the front-end equipment or imported through voiceprint database importing equipment of a third party.
The memory may be used to store software programs and modules, and the processor executes the software programs and modules stored in the memory to perform various functional applications and data processing. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, application programs required for at least one function, and the like. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, the memory may also include a memory controller to provide access to the memory by the processor and the input unit.
The input unit, in addition to a voice acquisition module such as a microphone, may be used to receive input digital, character or image information, and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
The display unit may be used to display information entered by a user or provided to a user and various graphical user interfaces of the back-end server, which may be composed of graphics, text, icons, video and any combination thereof. The display unit may include a display panel, and alternatively, the display panel may be configured in the form of an LCD (Liquid Crystal Display), an OLED (Organic Light-Emitting Diode), or the like.
Third embodiment (backend server):
the embodiment also provides a backend server, which includes:
the data importing module is used for importing pre-acquired voice data;
the data processing module is used for marking the speaker ID of the voice data and extracting voiceprint features of the voice data;
the voiceprint database construction module is used for constructing a voiceprint database according to the speaker ID and the voiceprint characteristics;
and the data export module is used for exporting and registering the voiceprint database to the front-end equipment.
It should be noted that the embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the others, and identical or similar parts among the embodiments may be referred to one another. The front-end device embodiment and the back-end server embodiment are described relatively briefly because they are substantially similar to the method embodiment; for relevant details, refer to the description of the method embodiment.
Also, herein, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
While the foregoing describes the preferred embodiments of the present invention, the invention is not limited to the forms disclosed herein; it is capable of use in various other combinations, modifications and environments, and of changes within the scope of the inventive concept, whether guided by the teachings above or by the skill or knowledge of the relevant art. Modifications and variations that do not depart from the spirit and scope of the invention are intended to fall within the scope of the appended claims.

Claims (9)

1. A voiceprint retrieval method, characterized by comprising the following steps:
labeling the collected voice data by using a speaker ID, and extracting voiceprint features from the voice data at a back-end server;
constructing a voiceprint database according to the speaker ID and the voiceprint characteristics;
exporting and registering the voiceprint database on front-end equipment;
performing voiceprint retrieval on the current voice data extracted by the front-end equipment to obtain a current speaker ID;
the voiceprint database being registered on the front-end equipment means that each speaker ID and the voiceprint features corresponding to the speaker ID are exported and registered on the front-end equipment by using a voiceprint database export tool; the data storage format of the voiceprint features is: model name|model version|[X1, X2, X3, ..., Xn], where X1 to Xn are the components of the N-dimensional voiceprint feature vector extracted for each speaker; when the voiceprint features are imported into the front-end equipment, the model name and model version are further matched against the local model name and local model version of the front-end equipment; if the model names and/or model versions are inconsistent, the import of the speaker's voiceprint features fails.
2. The voiceprint retrieval method of claim 1, wherein: the voiceprint database is constructed by inputting voiceprint features corresponding to M speaker IDs into a pre-trained model, and outputting N-dimensional voiceprint feature vectors corresponding to each speaker ID; the N-dimensional feature vectors of the M speaker IDs are stored in a database, and a voiceprint database with the capacity of M x N is established; in the voiceprint database, voiceprint feature vectors of each speaker are mapped by using speaker IDs.
3. The voiceprint retrieval method of claim 1, wherein: the voiceprint features of the M speakers are exported to M ark files, respectively.
4. A voiceprint retrieval method according to claim 3, wherein: the ark file is named with a speaker ID; when the voiceprint features are imported into front-end equipment, the speaker ID of the ark file is further matched against the local speaker IDs, and if a matching speaker ID exists in the local voiceprint database, the import of the voiceprint features fails.
5. The voiceprint retrieval method of claim 1, wherein: the current voice data is extracted by extracting a voice file of instant messaging software through a voice extracting tool, and the voice file is subjected to format conversion through a voice conversion tool and is stored in a cache of front-end equipment.
6. The voiceprint retrieval method of claim 5, wherein: and the voice file extracted by the voice extraction tool adopts a SILK compression format, and the voice conversion tool converts the voice file from the SILK compression format to a WAV format.
7. The voiceprint retrieval method of claim 5, wherein: performing voiceprint retrieval on the front-end equipment comprises extracting voiceprint features from voice files in a cache of the front-end equipment, comparing the extracted voiceprint features with the voiceprint features in a locally registered voiceprint database, and judging, according to the similarity and/or confidence, that the speaker ID corresponding to a voiceprint feature in the voiceprint database and the current voice data belong to the same speaker.
8. A front-end device, comprising:
the voice acquisition module is used for acquiring current voice data;
the data storage module is used for importing and registering a pre-constructed voiceprint database on the front-end device; the voiceprint database is formed by labeling pre-collected voice data with speaker IDs, extracting voiceprint features from the voice data at a back-end server, and constructing the voiceprint database according to the speaker IDs and the voiceprint features;
the voiceprint retrieval module is used for extracting current voiceprint characteristics from the current voice data, and performing voiceprint retrieval on the voiceprint database according to the current voiceprint characteristics to obtain a current speaker ID;
the voiceprint database being registered on the front-end equipment means that each speaker ID and the voiceprint features corresponding to the speaker ID are exported and registered on the front-end equipment by using a voiceprint database export tool; the data storage format of the voiceprint features is: model name|model version|[X1, X2, X3, ..., Xn], where X1 to Xn are the components of the N-dimensional voiceprint feature vector extracted for each speaker; when the voiceprint features are imported into the front-end equipment, the model name and model version are further matched against the local model name and local model version of the front-end equipment; if the model names and/or model versions are inconsistent, the import of the speaker's voiceprint features fails.
9. A back-end server, comprising:
the data importing module is used for importing pre-acquired voice data;
the data processing module is used for marking the speaker ID of the voice data and extracting voiceprint features of the voice data;
the voiceprint database construction module is used for constructing a voiceprint database according to the speaker ID and the voiceprint characteristics;
the data export module is used for exporting and registering the voiceprint database to front-end equipment;
the voiceprint database being registered on the front-end equipment means that each speaker ID and the voiceprint features corresponding to the speaker ID are exported and registered on the front-end equipment by using a voiceprint database export tool; the data storage format of the voiceprint features is: model name|model version|[X1, X2, X3, ..., Xn], where X1 to Xn are the components of the N-dimensional voiceprint feature vector extracted for each speaker; when the voiceprint features are imported into the front-end equipment, the model name and model version are further matched against the local model name and local model version of the front-end equipment; if the model names and/or model versions are inconsistent, the import of the speaker's voiceprint features fails.
CN202011228722.4A 2020-11-06 2020-11-06 Voiceprint retrieval method, front-end device and back-end server Active CN112581967B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011228722.4A CN112581967B (en) 2020-11-06 2020-11-06 Voiceprint retrieval method, front-end back-end server and back-end server


Publications (2)

Publication Number Publication Date
CN112581967A CN112581967A (en) 2021-03-30
CN112581967B (en) 2023-06-23

Family

ID=75120252

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011228722.4A Active CN112581967B (en) 2020-11-06 2020-11-06 Voiceprint retrieval method, front-end back-end server and back-end server

Country Status (1)

Country Link
CN (1) CN112581967B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112992152B (en) * 2021-04-22 2021-09-14 北京远鉴信息技术有限公司 Individual-soldier voiceprint recognition system and method, storage medium and electronic equipment
CN113407768B (en) * 2021-06-24 2024-02-02 深圳市声扬科技有限公司 Voiceprint retrieval method, voiceprint retrieval device, voiceprint retrieval system, voiceprint retrieval server and storage medium

Citations (3)

Publication number Priority date Publication date Assignee Title
CN101950564A (en) * 2010-10-13 2011-01-19 镇江华扬信息科技有限公司 Remote digital voice acquisition, analysis and identification system
CN106682090A (en) * 2016-11-29 2017-05-17 上海智臻智能网络科技股份有限公司 Active interaction implementing device, active interaction implementing method and intelligent voice interaction equipment
CN109273008A (en) * 2018-10-15 2019-01-25 腾讯科技(深圳)有限公司 Processing method, device, computer storage medium and the terminal of voice document

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US8825482B2 (en) * 2005-09-15 2014-09-02 Sony Computer Entertainment Inc. Audio, video, simulation, and user interface paradigms
CN107104994B (en) * 2016-02-22 2021-07-20 华硕电脑股份有限公司 Voice recognition method, electronic device and voice recognition system


Also Published As

Publication number Publication date
CN112581967A (en) 2021-03-30

Similar Documents

Publication Publication Date Title
CN107038220B (en) Method, intelligent robot and system for generating memorandum
CN107492379B (en) Voiceprint creating and registering method and device
CN109493850B (en) Growing type dialogue device
KR102241972B1 (en) Answering questions using environmental context
CN111243603B (en) Voiceprint recognition method, system, mobile terminal and storage medium
CN112581967B (en) Voiceprint retrieval method, front-end back-end server and back-end server
US7949651B2 (en) Disambiguating residential listing search results
US10810278B2 (en) Contextual deep bookmarking
US11756572B2 (en) Self-supervised speech representations for fake audio detection
CN112669842A (en) Man-machine conversation control method, device, computer equipment and storage medium
CN112836521A (en) Question-answer matching method and device, computer equipment and storage medium
CN104361311A (en) Multi-modal online incremental access recognition system and recognition method thereof
CN112465144A (en) Multi-modal demonstration intention generation method and device based on limited knowledge
CN109637529A (en) Voice-based functional localization method, apparatus, computer equipment and storage medium
CN113051384B (en) User portrait extraction method based on dialogue and related device
CN111400463A (en) Dialog response method, apparatus, device and medium
WO2021159734A1 (en) Data processing method and apparatus, device, and medium
CN116205749A (en) Electronic policy information data management method, device, equipment and readable storage medium
CN112287134B (en) Search model training and recognition method, electronic device and storage medium
CN112328871B (en) Reply generation method, device, equipment and storage medium based on RPA module
CN111625636B (en) Method, device, equipment and medium for rejecting man-machine conversation
CN114528851A (en) Reply statement determination method and device, electronic equipment and storage medium
CN111916086B (en) Voice interaction control method, device, computer equipment and storage medium
CN103928024B (en) A kind of voice inquiry method and electronic equipment
JP5850886B2 (en) Information processing apparatus and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant