
Electronic device and voice recognition method

Info

Publication number: CN111583938A
Application number: CN202010424050.8A
Authority: CN (China)
Prior art keywords: voiceprint feature, node, candidate, feature vector
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN111583938B
Inventors: 刘恕 (Liu Shu), 寻亮 (Xun Liang), 廖文伟 (Liao Wenwei)
Assignee (current and original): Via Technologies Inc
Application filed by Via Technologies Inc
Priority: CN202010424050.8A; TW109120791A (TWI725877B)

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/22 - Interactive procedures; Man-machine interfaces
    • G10L17/06 - Decision making techniques; Pattern matching strategies

Abstract

The invention provides an electronic device and a voice recognition method. The method comprises the following steps: receiving voice data and generating a corresponding voiceprint feature vector according to the voice data; loading a voiceprint feature clustering model from a voiceprint feature clustering model database; inserting a plurality of nodes of the voiceprint feature clustering model into a candidate node sequence as a plurality of candidate nodes, and identifying the target candidate node closest to the voiceprint feature vector among the plurality of candidate nodes; finding, among all candidate voiceprint feature vector samples in the target candidate node, the target candidate voiceprint feature vector sample matching the voiceprint feature vector; and identifying the target user and the target user information corresponding to the target candidate voiceprint feature vector sample. In this way, the user information corresponding to the voice data can be found accurately and efficiently, improving the electronic device's ability to recognize voice data.

Description

Electronic device and voice recognition method
Technical Field
The present invention relates to an electronic device, and more particularly, to an electronic device for recognizing received voice and a voice recognition method used by the electronic device.
Background
A voiceprint is a spectrogram of the characteristics of sound waves, drawn by a special electroacoustic conversion instrument (such as a sound spectrograph or speech spectrograph), and is a collection of various acoustic feature maps. For the human body, the voiceprint is a stable, long-term characteristic signal; because of innate physiological differences in the vocal organs and acquired behavioral differences, the voiceprint corresponding to each person's voice is highly individual. Voiceprint data collected and digitized in this way is called voiceprint information.
Voiceprint recognition is thus a biometric technology that extracts a speaker's voice features/speech content (also referred to as voice information), converts the voice information into corresponding voiceprint information, and recognizes the speaker's identity based on the converted voiceprint information. Voiceprint recognition mainly collects human voice information, extracts specific voice features, and converts them into digital symbols. A common recognition system converts received voice information into a set of multi-dimensional feature vectors (hereinafter referred to as voiceprint feature vectors).
A prior-art speech recognition operation typically matches received voice information one by one against all voiceprint information in a pre-established voiceprint database. In detail, such a voiceprint database merely stores a plurality of voiceprint information entries (also called voiceprint information samples) corresponding to a plurality of enrolled persons. When voiceprint recognition is performed on a voiceprint to be recognized, converted from received voice information, the voiceprint to be recognized must be matched one by one against all voiceprint information stored in the database to find the matching entry, and thereby identify the relevant information of the enrolled person corresponding to the voiceprint to be recognized.
However, the total amount of voiceprint information stored in the voiceprint database is huge, and the above prior-art voice recognition operation consumes a great deal of time in this one-by-one matching process.
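To make the cost concrete, the following is a minimal sketch (in Python; not part of the patent) of such a one-by-one scan, assuming fixed-length voiceprint feature vectors compared by Euclidean distance; all names and values are hypothetical:

```python
import numpy as np

def brute_force_match(query, samples, threshold):
    """Prior-art linear scan: compare the query voiceprint feature vector
    against every enrolled sample (samples has shape (P, M))."""
    distances = np.linalg.norm(samples - query, axis=1)  # one distance per enrolled voiceprint
    best = int(np.argmin(distances))
    return best if distances[best] < threshold else None  # None: no enrolled match

# Hypothetical scale: 100,000 enrolled 256-dimensional voiceprints.
rng = np.random.default_rng(0)
db = rng.normal(size=(100_000, 256))
query = db[42] + 0.01 * rng.normal(size=256)
print(brute_force_match(query, db, threshold=5.0))  # expected: 42
```

The cost of this scan grows linearly with the number of enrolled persons P; the method described below replaces it with a tree search.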
In view of this, providing a faster and more accurate speech recognition method is a goal of those skilled in the art.
Disclosure of Invention
An embodiment of the invention provides an electronic device. The electronic device comprises an input/output device, a storage device, and a processor. The storage device is used for recording a user information database and a voiceprint feature clustering model database. The processor is configured to receive voice data via the input/output device and to generate a corresponding voiceprint feature vector according to the voice data. The processor is further configured to load a voiceprint feature clustering model from the voiceprint feature clustering model database, wherein the voiceprint feature clustering model comprises a plurality of nodes, wherein the plurality of nodes are partitioned into a plurality of layers to form a multi-level tree structure, wherein the plurality of nodes comprise a root node, a plurality of leaf nodes, and a plurality of relay nodes, and each of the plurality of nodes comprises a plurality of voiceprint feature vector samples. The processor is further configured to insert a plurality of first child nodes of the root node into a candidate node sequence to become a plurality of candidate nodes, and to calculate a candidate distance between the node average voiceprint feature vector of each of the plurality of candidate nodes and the voiceprint feature vector. The processor is further configured to sort all candidate nodes according to the calculated plurality of candidate distances and retain only the first N candidate nodes in the candidate node sequence, where the first of the N candidate nodes has the smallest candidate distance and N is a candidate node upper limit value. The processor is further configured to determine whether each of the plurality of candidate nodes is one of the plurality of leaf nodes, wherein in response to determining that each of the plurality of candidate nodes is one of the plurality of leaf nodes, the processor is further configured to perform the following steps (1) to (3): (1) identifying a plurality of candidate voiceprint feature vector samples of all candidate nodes in the candidate node sequence; (2) comparing all the candidate voiceprint feature vector samples with the voiceprint feature vector to find the target candidate voiceprint feature vector sample matching the voiceprint feature vector among all the candidate voiceprint feature vector samples; and (3) identifying the target user and the target user information corresponding to the target candidate voiceprint feature vector sample, and mapping the target user and the target user information to the voice data, thereby completing the recognition operation corresponding to the voice data.
An embodiment of the invention provides a speech recognition method. The method comprises the following steps: receiving voice data and generating a corresponding voiceprint feature vector according to the voice data; loading a voiceprint feature clustering model from a voiceprint feature clustering model database, wherein the voiceprint feature clustering model comprises a plurality of nodes, wherein the plurality of nodes are partitioned into a plurality of layers to form a multi-level tree structure, wherein the plurality of nodes comprise a root node, a plurality of leaf nodes, and a plurality of relay nodes, each of the plurality of nodes comprising a plurality of voiceprint feature vector samples; inserting a plurality of first child nodes of the root node into a candidate node sequence to become a plurality of candidate nodes, and calculating a candidate distance between the node average voiceprint feature vector of each of the plurality of candidate nodes and the voiceprint feature vector; sorting all candidate nodes according to the calculated plurality of candidate distances and retaining only the first N candidate nodes in the candidate node sequence, wherein the first of the N candidate nodes has the smallest candidate distance and N is a candidate node upper limit value; determining whether each of the candidate nodes is one of the leaf nodes; and in response to determining that each of the plurality of candidate nodes is one of the plurality of leaf nodes, performing the following steps (1) to (3): (1) identifying a plurality of candidate voiceprint feature vector samples of all candidate nodes in the candidate node sequence; (2) comparing all the candidate voiceprint feature vector samples with the voiceprint feature vector to find the target candidate voiceprint feature vector sample matching the voiceprint feature vector among all the candidate voiceprint feature vector samples; and (3) identifying the target user and the target user information corresponding to the target candidate voiceprint feature vector sample, and mapping the target user and the target user information to the voice data, thereby completing the recognition operation corresponding to the voice data.
In an embodiment of the invention, in response to determining that the plurality of candidate nodes are not all leaf nodes, the processor is further configured to determine whether the total number of candidate voiceprint feature vector samples in the plurality of candidate nodes is less than a candidate sample number threshold, wherein in response to determining that the total number of candidate voiceprint feature vector samples in the plurality of candidate nodes is not less than the candidate sample number threshold, the processor is further configured to insert a plurality of second child nodes of each of the plurality of candidate nodes into the candidate node sequence to become a plurality of new candidate nodes, and to calculate a candidate distance between the node average voiceprint feature vector of each of the new candidate nodes and the voiceprint feature vector, and the processor is further configured to perform again the step of sorting all candidate nodes according to the calculated plurality of candidate distances and retaining only the first N candidate nodes in the candidate node sequence.
In an embodiment of the invention, in response to determining that the total number of candidate voiceprint feature vector samples in the plurality of candidate nodes is less than the candidate sample number threshold, the processor is further configured to then perform the aforementioned steps (1) to (3).
In an embodiment of the present invention, wherein the step (2) includes: the processor identifying a plurality of candidate voiceprint feature vectors for each of the plurality of candidate voiceprint feature vector samples, wherein the plurality of candidate voiceprint feature vector samples correspond to a plurality of candidate users; the processor calculating a plurality of distances between the plurality of candidate voiceprint feature vectors corresponding to each candidate voiceprint feature vector sample and a voiceprint feature vector of the received speech data; the processor identifying a smallest distance of the plurality of distances as a target distance; and the processor judges whether the target distance is smaller than a matching distance threshold value, wherein in response to judging that the target distance is smaller than the matching distance threshold value, the processor judges the candidate voiceprint feature vector sample to which the candidate voiceprint feature vector corresponding to the target distance belongs as a target candidate voiceprint feature vector sample matching the voiceprint feature vector.
In an embodiment of the invention, in response to determining that the target distance is not less than the matching distance threshold, the processor determines that none of the plurality of candidate voiceprint feature vector samples is a target candidate voiceprint feature vector sample matching the voiceprint feature vector; the processor identifies the target node containing the candidate voiceprint feature vector sample to which the candidate voiceprint feature vector corresponding to the target distance belongs, and the target parent node of that target node; and the processor generates a new child node connected to the target parent node and adds the voiceprint feature vector, as a voiceprint feature vector sample of a new user corresponding to the speech data, to the new child node.
In an embodiment of the present invention, in response to determining that the target distance is not less than a matching distance threshold, the processor determines that the received voice data cannot be matched, and determines that the user corresponding to the voice data is an unregistered user.
In an embodiment of the present invention, the voiceprint feature clustering model is created by the processor executing a voiceprint feature clustering model creating operation, wherein in the voiceprint feature clustering model creating operation, the processor extracts a plurality of voice data of each of a plurality of users from a plurality of pieces of user information of the user information database corresponding to the plurality of users; the processor generating a plurality of voiceprints for each of the plurality of users from the plurality of speech data for each of the plurality of users; the processor calculating, from the voiceprints of each of the plurality of users, a plurality of M-dimensional voiceprint feature vectors of each of the plurality of users corresponding to the voiceprints of each of the plurality of users, wherein M is a positive integer; the processor calculates an average voiceprint feature vector of each of the plurality of users according to the M-dimensional voiceprint feature vectors of each of the plurality of users, so that the average voiceprint feature vectors of the plurality of users are used as sample average voiceprint feature vectors of each of a plurality of voiceprint feature vector samples; and the processor performs multilevel unsupervised clustering operation on the plurality of voiceprint feature vector samples based on the average voiceprint feature vectors of the plurality of samples, and groups the plurality of voiceprint feature vector samples into a plurality of nodes of a plurality of layers so as to establish a voiceprint feature clustering model of a multilevel tree structure.
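As a rough, hypothetical sketch of this preparation step (the patent does not prescribe an implementation), each user's sample average voiceprint feature vector can be computed as follows:

```python
import numpy as np

def sample_average_vectors(user_vectors):
    """Map each user id to the mean of that user's M-dimensional voiceprint
    feature vectors; this mean serves as the user's sample average voiceprint
    feature vector when building the clustering model."""
    return {uid: np.asarray(vecs).mean(axis=0) for uid, vecs in user_vectors.items()}

# Hypothetical data: two users, M = 4, several utterances each.
rng = np.random.default_rng(1)
users = {"alice": rng.normal(size=(5, 4)), "bob": rng.normal(size=(3, 4))}
averages = sample_average_vectors(users)  # one 4-dimensional average per user
```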
In an embodiment of the invention, the total number of voiceprint feature vector samples is P, where P is a positive integer. In the operation of performing the multi-level unsupervised clustering operation on the voiceprint feature vector samples based on the sample average voiceprint feature vectors and grouping the voiceprint feature vector samples into the nodes of the layers to establish the voiceprint feature clustering model of the multi-level tree structure: the processor calculates the distances among the P sample average voiceprint feature vectors of the P voiceprint feature vector samples as the initial distances among the P voiceprint feature vector samples; the processor sets each voiceprint feature vector sample as an independent node and calculates the node distances among the nodes according to the plurality of initial distances; the processor selects, according to the plurality of node distances, the Q mutually closest nodes among all nodes without a parent node as target nodes, and merges the Q target nodes into a parent node of the Q target nodes, wherein the Q target nodes become child nodes of that parent node, and Q is a positive integer greater than 1; the processor records node information corresponding to the parent node, including the node average voiceprint feature vector of the parent node, the node radius of the parent node, and the total number of samples of the parent node; the processor estimates the node distances between the parent node and the other nodes according to the initial distances between all voiceprint feature vector samples in the parent node and all voiceprint feature vector samples of the other nodes; the processor determines whether the merged parent node has all P voiceprint feature vector samples; and in response to determining that the merged parent node has the P voiceprint feature vector samples, the processor performs a pruning operation on the current first multi-level tree structure containing all nodes to update it into a second multi-level tree structure, thereby completing establishment of the voiceprint feature clustering model, wherein the total number of nodes and the total number of layers of the second multi-level tree structure are smaller than those of the first multi-level tree structure, and the parent node having the P voiceprint feature vector samples is the root node of the established voiceprint feature clustering model.
In an embodiment of the invention, in response to determining that the merged parent node does not have the P voiceprint feature vector samples, the processor performs again the steps of selecting, according to the plurality of node distances, the Q mutually closest nodes among all nodes without a parent node as the target nodes and merging the Q target nodes into a parent node of the Q target nodes.
In an embodiment of the present invention, the processor calculates an average of the sample average voiceprint feature vectors of each of the Q target nodes as the node average voiceprint feature vector of the parent node, wherein the processor calculates Q distances between the node average voiceprint feature vector of the parent node and the sample average voiceprint feature vectors of each of the Q target nodes, and takes a largest one of the Q distances as the node radius of the parent node, wherein the processor identifies a total number of all voiceprint feature vector samples of all child nodes in the parent node as the total number of samples of the parent node.
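Putting these paragraphs together, the following simplified sketch illustrates the described bottom-up merging for the Q = 2 case, with Euclidean distance between node average vectors standing in for the patent's initial-distance-based node-distance estimate, and with the pruning operation omitted; all names are hypothetical:

```python
import numpy as np

class ClusterNode:
    def __init__(self, avg, radius=0.0, total=1, children=()):
        self.avg = np.asarray(avg, dtype=float)  # node average voiceprint feature vector
        self.radius = radius                     # node radius
        self.total = total                       # total number of samples under the node
        self.children = list(children)

def merge_closest_pair(nodes):
    """Merge the two mutually closest parentless nodes (the Q = 2 case) into a
    parent node and record the parent's node information."""
    best, pair = np.inf, (0, 1)
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            d = np.linalg.norm(nodes[i].avg - nodes[j].avg)
            if d < best:
                best, pair = d, (i, j)
    i, j = pair
    kids = [nodes[i], nodes[j]]
    avg = np.mean([k.avg for k in kids], axis=0)                    # parent's node average vector
    radius = max(float(np.linalg.norm(avg - k.avg)) for k in kids)  # parent's node radius
    total = sum(k.total for k in kids)                              # parent's total sample count
    parent = ClusterNode(avg, radius, total, kids)
    return [n for k, n in enumerate(nodes) if k not in pair] + [parent]

def build_clustering_model(sample_avgs):
    """Merge repeatedly until a single root holds all P samples; the patent's
    subsequent pruning of the resulting tree is omitted here."""
    nodes = [ClusterNode(v) for v in sample_avgs]
    while len(nodes) > 1:
        nodes = merge_closest_pair(nodes)
    return nodes[0]  # root node of the (unpruned) multi-level tree

root = build_clustering_model([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
```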
Based on the above, the electronic device and the voice recognition method provided by the embodiments of the invention can match the voiceprint feature vector of received voice data against the clustered nodes in the voiceprint feature clustering model, so as to search the user information corresponding to the voice data accurately and efficiently, thereby improving the electronic device's ability to recognize voice data. In addition, the pruning operation reduces the scale of the established voiceprint feature clustering model without greatly reducing performance, improving the efficiency of searching the voiceprint feature clustering model.
In order to make the aforementioned and other features and advantages of the invention more comprehensible, embodiments accompanied with figures are described in detail below.
Drawings
Fig. 1 is a block diagram of an electronic device according to an embodiment of the invention.
Fig. 2 is a flowchart illustrating a speech recognition method according to an embodiment of the invention.
Fig. 3A is a flowchart illustrating a method for calculating a node average voiceprint feature vector of a node according to an embodiment of the invention.
Fig. 3B is a flowchart illustrating a process of searching for a target candidate voiceprint feature vector sample that matches a voiceprint feature vector of received speech data according to an embodiment of the invention.
Fig. 4A is a flowchart illustrating a process of registering a user according to an embodiment of the invention.
Fig. 4B is a schematic diagram of registering a user according to an embodiment of the invention.
Fig. 5A is a flowchart illustrating a process of establishing a voiceprint feature clustering model according to an embodiment of the invention.
Fig. 5B is a flowchart illustrating a multi-level unsupervised clustering operation according to an embodiment of the invention.
Fig. 6A to 6C are schematic diagrams illustrating node cluster merging according to an embodiment of the invention.
Fig. 7A is a schematic diagram of a multi-level tree structure according to an embodiment of the invention.
Fig. 7B is a schematic diagram illustrating a pruning operation according to an embodiment of the invention.
Fig. 8 is a schematic diagram illustrating a distance between nodes according to an embodiment of the invention.
Fig. 9A is a diagram illustrating a user recognizing voice data according to an embodiment of the invention.
Fig. 9B-9C are schematic diagrams illustrating matching voiceprint feature vectors according to an embodiment of the invention.
Wherein the symbols in the drawings are briefly described as follows:
10: an electronic device; 110: a processor; 120: a storage device; 121: a user information database; 122: a voiceprint feature clustering model database; 130: a main memory; 140: an input/output device; 150: a communication circuit unit; S21-S30: flow steps of the voice recognition method; S311, S312: steps of calculating the node average voiceprint feature vector of a candidate node; S321-S329: steps of searching for the target candidate voiceprint feature vector sample matching the voiceprint feature vector of received voice data; S41-S48: steps of registering a user to the voiceprint feature clustering model; A, B, C, D, E, F, AB, EF, EFD, EFDC, EFDCAB, T: nodes; a41-a46, a91-a94: arrows; S51-S55: steps of establishing a voiceprint feature clustering model; S551-S557: the flow of step S55 of fig. 5A; T: the voiceprint feature vector of voice data to be recognized/registered; d(T, C1) to d(T, D2): distances between the voiceprint feature vector T and candidate voiceprint feature vectors.
Detailed Description
Fig. 1 is a block diagram of an electronic device according to an embodiment of the invention. Referring to fig. 1, in the present embodiment, an electronic device 10 includes a processor 110, a storage device 120, a main memory 130, an input/output device 140, and a communication circuit unit 150.
The processor 110 is hardware with computing capability for managing the overall operation of the electronic device 10. That is, the processor 110 is the main hardware element managing the other elements of the electronic device 10. In this embodiment, the processor 110 is, for example, a single-core or multi-core central processing unit (CPU), a microprocessor, another programmable processing unit, a digital signal processor (DSP), a programmable controller, an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or the like.
The storage device 120 records, as instructed by the processor 110, data that needs to be stored long-term, for example, firmware or software for controlling the electronic device 10; one or more program code modules; and one or more databases. The storage device 120 may be any type of hard disk drive (HDD), nonvolatile memory storage device (e.g., a solid state drive), or other storage circuit unit. The one or more program code modules include a speech recognition program code module. In this embodiment, the processor 110 may perform the speech recognition operation by accessing and executing the speech recognition program code module, so as to implement the speech recognition method provided by the embodiments of the invention.
In the present embodiment, the one or more databases include a user information database 121 and a voiceprint feature clustering model database 122. The user information database 121 is configured to record user information of each of a plurality of users, where the user information corresponding to each user at least includes identity information associated with the corresponding user and voice information (or voiceprint information) of the corresponding user. The identity information includes one or more of name, gender, age, identification number, telephone number, residential address, native place, birthday, blood type, special verification information, or any other information relevant to the identity of the corresponding user. The voice information (also referred to as voice data) includes digital information/files of one or more voices/sounds spoken by the corresponding user.
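Purely for illustration (the patent fixes no schema), one record of the user information database 121 might be modeled as follows; all field names are hypothetical:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class UserInfo:
    # Identity information associated with the user.
    name: str
    gender: Optional[str] = None
    phone: Optional[str] = None
    special_verification: Optional[str] = None  # e.g., a password or other credential
    # Voice information: digital recordings of speech spoken by the user.
    voice_data: List[bytes] = field(default_factory=list)

record = UserInfo(name="Alice", gender="F", special_verification="1234")
```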
The voiceprint information is information/data of a spectrogram showing the sound wave characteristics of the voice information, drawn/converted by a special electroacoustic conversion instrument (such as a sound spectrograph or speech spectrograph) from the voice information of the corresponding user. Because each user's physiological voice-production structure is different, the same words spoken by different users are transformed into voiceprints of different shapes/features/aspects. In addition, because different words produce different sounds (different voice information), different voice information from the same user is converted into different voiceprints.
The special authentication information is information for performing an authentication procedure on a corresponding user, such as a password, a face image, a fingerprint, a pupil image, and the like. In addition, the special authentication information may also be a combination of multiple kinds of identity information of the corresponding user.
On the other hand, the voiceprint feature clustering model database 122 is used to record one or more established voiceprint feature clustering models. The processor 110 may access the voiceprint feature clustering model database 122 to select and load the established voiceprint feature clustering model into the main memory 130 to update or use the voiceprint feature clustering model. The updated voiceprint feature clustering model may then be stored in the voiceprint feature clustering model database 122.
The processor 110 may establish a new voiceprint feature clustering model based on one or more pieces of voice information (voice data)/voiceprint information of each of the plurality of users in the user information database. In addition, in other embodiments, the voiceprint feature clustering models recorded in the voiceprint feature clustering model database 122 can also be obtained through connections established with other electronic devices. The plurality of users may also be referred to as registered users or authenticated users.
In the present embodiment, the main memory 130 is used for temporarily storing data. The main memory 130 is, for example, a dynamic random access memory. The data includes firmware for managing the electronic device, software for performing the voice recognition operation, and various other data (e.g., voiceprint feature clustering models, or voice data/voiceprint feature vectors to be recognized).
The input/output device 140 is used for inputting/outputting data/information. The information may be voice information, text information, image information or other forms of multimedia information. In the present embodiment, the I/O device 140 includes one or more of a microphone, a keyboard, a mouse, a touch screen, a speaker, a liquid crystal screen, a touch pad, a camera, a physical button, and a radio frequency identification device.
For example, the input/output device 140 may instruct the processor 110 through a received input operation to trigger performance of a speech recognition operation. For example, a physical button of the input/output device 140 is pressed by a user to cause the processor 110 to start performing a voice recognition operation, and a voice spoken by the user is received through a microphone of the input/output device 140 to generate corresponding voice information to the processor 110.
The communication circuit unit 150 is used for receiving data wirelessly. In the present embodiment, the communication circuit unit 150 is, for example, a wireless communication circuit unit supporting a WiFi communication protocol, Bluetooth, near field communication (NFC), or third-, fourth-, or fifth-generation (3G/4G/5G) mobile communication standards, and the like. In this embodiment, the communication circuit unit 150 may transmit data through a wireless network connection established with a cloud server or other electronic devices. In one embodiment, the communication circuit unit 150 is further configured to connect to a network (e.g., a telecommunication network, the internet, etc.) so that the electronic device 10 can download or upload data, such as voice data to be recognized/registered or a voiceprint feature clustering model, from or to the connected network.
Fig. 2 is a flowchart illustrating a speech recognition method according to an embodiment of the invention. Referring to fig. 2, in step S21, a voice data is received, and a corresponding voiceprint feature vector is generated according to the voice data.
Specifically, the processor 110 may receive voice data from the input/output device 140 or the communication circuit unit 150. For example, a user (also referred to as a user to be identified or authenticated) speaks into the electronic device 10; the voice is received via a microphone of the electronic device 10 to generate voice data, which is transmitted to the processor 110. The processor 110 then performs a voiceprint conversion operation on the received voice data to obtain voiceprint information corresponding to the voice data. The voiceprint information includes information such as the sound intensity of the voice data at different frequencies and different times. Then, the processor 110 may identify, in the voiceprint information, a plurality of feature values corresponding to a plurality of feature conditions to generate a voiceprint feature vector corresponding to the voiceprint information. The voiceprint feature vector may include feature values in M dimensions, where M is a positive integer.
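The patent does not specify how the M feature values are derived. As one hedged possibility, time-averaged MFCCs, a common voiceprint front end, produce a fixed-length M-dimensional vector:

```python
import numpy as np
import librosa  # assumed available; any M-dimensional embedding would do

def voiceprint_feature_vector(wav_path, M=20):
    """Convert voice data into an M-dimensional voiceprint feature vector.
    Here: MFCCs averaged over time; real systems often use learned embeddings."""
    y, sr = librosa.load(wav_path, sr=16000)           # decode the audio at 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=M)  # shape (M, num_frames)
    return mfcc.mean(axis=1)                           # average over frames -> (M,)
```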
Next, in step S22, the processor 110 loads a voiceprint feature clustering model, wherein the voiceprint feature clustering model includes a plurality of nodes, the plurality of nodes are divided into a plurality of layers to form a multi-level tree structure, the plurality of nodes include a root node, a plurality of leaf nodes, and a plurality of relay nodes, and each of the plurality of nodes includes one or more voiceprint feature vector samples. For example, the voiceprint feature clustering model of the multi-level tree structure shown in fig. 7A has a plurality of nodes A, B, C, D, E, F, AB, EF, EFD, EFDC, EFDCAB, grouped into a plurality of layers (5 layers) (e.g., a first layer having only the root node EFDCAB, and a last layer, on the opposite side of the first layer, having nodes E and F). A node A, B, C, D, E, F without any child node is considered a leaf node; the node EFDCAB without any parent node is considered the root node; and the other nodes AB, EF, EFD, EFDC, which are neither leaf nodes nor the root node, are considered relay nodes. The processor 110 may load the voiceprint feature clustering model into the main memory 130. Further, each node may include one or more voiceprint feature vector samples. It is worth mentioning that the voiceprint feature clustering model in fig. 7A is a 2-way tree, i.e., each parent node has 2 child nodes. In addition, it should be noted that the voiceprint feature clustering model shown in fig. 7A is a simple model for convenience of description (e.g., nodes A-F each have only one voiceprint feature vector sample). In practice, the voiceprint feature clustering model provided by the invention may have a branching factor greater than 2, more than 5 layers, and a total number of voiceprint feature vector samples per node far exceeding 1 (e.g., 50 or more). The invention is not limited to any particular number of layers or total number of voiceprint feature vector samples per node. The method for establishing the voiceprint feature clustering model is described in detail below with reference to the embodiments of figs. 6A to 7B.
Next, in step S23, the processor 110 inserts the plurality of first child nodes of the root node into the candidate node sequence to become a plurality of candidate nodes, and calculates the candidate distance between the node average voiceprint feature vector of each candidate node and the voiceprint feature vector. Specifically, the processor 110 generates a candidate node sequence to manage the candidate nodes selected into it for subsequent detailed matching.
First, the processor 110 may perform the selection of the candidate nodes from the root node, that is, the processor 110 may directly select a plurality of child nodes of the root node, insert the plurality of child nodes into the candidate node sequence, and calculate a distance between an average voiceprint feature vector of each candidate node and the generated voiceprint feature vector corresponding to the speech data. The calculation method of the average voiceprint feature vector of the candidate node is described below.
Fig. 3A is a flowchart illustrating a method for calculating a node average voiceprint feature vector of a node according to an embodiment of the invention. Referring to fig. 3A, in step S311, one or more candidate voiceprint feature vector samples in the candidate node and a sample average voiceprint feature vector of each of the one or more voiceprint feature vector samples are identified, wherein the voiceprint feature vector samples respectively correspond to different users (different registered users), and the sample average voiceprint feature vector is an average value calculated by a plurality of voiceprint feature vectors of the corresponding users. Specifically, to calculate the node-averaged voiceprint feature vector of a candidate node, the processor 110 identifies all voiceprint feature vector samples possessed by the candidate node, and the average voiceprint feature vector (also called sample-averaged voiceprint feature vector) of each voiceprint feature vector sample (each corresponding to a user). The average voiceprint feature vector is an average value of a plurality of voiceprint feature vectors of the corresponding user.
Next, in step S312, the processor 110 calculates an average value of the sample average voiceprint feature vectors as the node average voiceprint feature vector of the candidate node.
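A minimal sketch of steps S311 and S312, assuming each sample's average vector is already available:

```python
import numpy as np

def node_average_vector(sample_avg_vectors):
    """Step S312: the node average voiceprint feature vector is the mean of the
    sample average voiceprint feature vectors identified in step S311 (each of
    which is itself the mean of one registered user's voiceprint feature vectors)."""
    return np.mean(np.asarray(sample_avg_vectors, dtype=float), axis=0)
```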
Referring back to fig. 2, in the present embodiment, the processor 110 may utilize a plurality of distance measurement methods to calculate a distance (also referred to as a candidate distance) between an average voiceprint feature vector of a candidate node and the voiceprint feature vector corresponding to the speech data.
In the present embodiment, the distance metric is, for example, a probabilistic linear discriminant analysis score (hereinafter referred to as PLDAS). The processor 110 may calculate the PLDAS of the average voiceprint feature vector of a candidate node and the voiceprint feature vector corresponding to the speech data via the probabilistic linear discriminant analysis scoring method. The larger the obtained PLDAS value, the more similar the two feature vectors are, i.e., the closer the "distance" between them. The invention is not limited to the distance metric described above.
For example, in one embodiment, the distance metric is the cosine distance method, the Euclidean distance method, or the like.
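For instance, sketches of the cosine and Euclidean distances named above (a PLDA scorer could be substituted, with a larger score meaning a smaller "distance"):

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance in [0, 2]; 0 means the two vectors point the same way."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))
```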
After calculating the candidate distance between the node average voiceprint feature vector of each candidate node and the voiceprint feature vector, the flow continues to step S24: the processor 110 sorts all candidate nodes according to the calculated candidate distances and keeps only the first N candidate nodes in the candidate node sequence, where the first of the N candidate nodes has the smallest candidate distance and N is the candidate node upper limit value. N may be regarded as the maximum number of candidate nodes the candidate node sequence may hold. For example, assume that N is 4 and the candidate node sequence has 10 candidate nodes in total. After sorting the 10 candidate nodes from smallest to largest by the candidate distance between each node's average voiceprint feature vector and the voiceprint feature vector of the speech data, the processor 110 retains only the first 4 candidate nodes, i.e., those with the smallest candidate distances.
Next, in step S25, it is determined whether each of the candidate nodes is one of the leaf nodes. In response to determining that the plurality of candidate nodes are all leaf nodes (none having child nodes) (step S25 → yes), performing step S28; in response to determining that the plurality of candidate nodes are not all leaf nodes (e.g., one of the plurality of candidate nodes has a child node) (step S25 → NO), step S26 is performed.
In step S26, the processor 110 determines whether the total number of candidate voiceprint feature vector samples in the candidate nodes is smaller than a candidate sample number threshold. In response to determining that the total number is less than the candidate sample number threshold (step S26 → yes), step S28 is performed; in response to determining that it is not less than the threshold (step S26 → no), step S27 is performed. The candidate sample number threshold may be predetermined. The purpose of step S26 is that, when the number of candidate voiceprint feature vector samples across the candidate nodes is small (step S26 → yes), the detailed matching procedure can be performed directly without waiting for all candidate nodes to become leaf nodes.
Briefly, if it is determined that the N candidate nodes all belong to leaf nodes (step S25 → yes) or the total number of the candidate voiceprint feature vector samples in the candidate nodes is less than the threshold value of the number of candidate samples, the processor 110 may start a detailed matching procedure (e.g., steps S28-S30).
In step S27, the processor 110 inserts the second child nodes of each of the candidate nodes into the candidate node sequence to become new candidate nodes, and calculates the candidate distance between the node average voiceprint feature vector of each new candidate node and the voiceprint feature vector.
Specifically, similar to step S23, in step S27 the processor 110 inserts the child nodes (also referred to as second child nodes) of the N candidate nodes currently retained in the candidate node sequence into the candidate node sequence as new candidate nodes. The processor 110 then also calculates a new candidate distance between the node average voiceprint feature vector of each newly inserted candidate node (second child node) and the voiceprint feature vector of the speech data. At this point, every candidate node in the candidate node sequence has a corresponding candidate distance.
Then, the process returns to step S24 again, in which the processor 110 sorts all the candidate nodes in the current candidate node sequence according to all the candidate distances again, and retains the top N candidate nodes.
That is, steps S24, S25, S26, and S27 are performed in sequence repeatedly until all candidate nodes are leaf nodes (step S25 → yes) or the total number of candidate voiceprint feature vector samples in all candidate nodes is less than the candidate sample number threshold (step S26 → yes).
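The loop of steps S23 to S27 thus amounts to a breadth-limited descent of the tree. A simplified sketch, under the same illustrative assumptions as before (Euclidean distance; a hypothetical Node type whose samples list holds all voiceprint feature vector samples under the node):

```python
import numpy as np

class Node:
    def __init__(self, avg, children=None, samples=None):
        self.avg = np.asarray(avg, dtype=float)  # node average voiceprint feature vector
        self.children = children or []           # empty for leaf nodes
        self.samples = samples or []             # all samples under this node

def collect_candidates(root, t, N=4, sample_limit=8):
    """Steps S23-S27: expand candidates layer by layer, sort by candidate
    distance, keep the first N, and stop once every candidate is a leaf (S25)
    or few enough samples remain for detailed matching (S26)."""
    t = np.asarray(t, dtype=float)
    candidates = list(root.children)  # S23: insert the root's first child nodes
    while True:
        candidates.sort(key=lambda n: np.linalg.norm(n.avg - t))  # S24: sort by distance
        candidates = candidates[:N]                               # S24: keep the first N
        if all(not n.children for n in candidates):               # S25: all leaves?
            return candidates
        if sum(len(n.samples) for n in candidates) < sample_limit:  # S26: few samples?
            return candidates
        # S27: replace non-leaf candidates by their child nodes; leaves stay.
        nxt = []
        for n in candidates:
            nxt.extend(n.children if n.children else [n])
        candidates = nxt
```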
In step S28, the processor 110 identifies one or more candidate voiceprint feature vector samples for each of all candidate nodes in the sequence of candidate nodes. Specifically, the processor 110 identifies one or more voiceprint feature vector samples (also referred to as candidate voiceprint feature vector samples) that each candidate node currently in the sequence of candidate nodes has, each candidate voiceprint feature vector sample corresponding to a user (also referred to as a candidate user or candidate).
Next, in step S29, the processor 110 compares all the candidate voiceprint feature vector samples with the voiceprint feature vector to find a target candidate voiceprint feature vector sample matching the voiceprint feature vector in all the candidate voiceprint feature vector samples. The following description will be made in detail with reference to fig. 3B.
Fig. 3B is a flowchart illustrating a process of searching for a target candidate voiceprint feature vector sample that matches a voiceprint feature vector of received speech data according to an embodiment of the invention. Referring to fig. 3B, in step S321, the processor 110 identifies a plurality of candidate voiceprint feature vectors of a plurality of candidate voiceprint feature vector samples, wherein the plurality of candidate voiceprint feature vector samples correspond to a plurality of candidate users. Next, in step S322, the processor 110 calculates a plurality of distances between the plurality of candidate voiceprint feature vectors corresponding to each sample of candidate voiceprint feature vectors and the voiceprint feature vector of the received speech data. In step S323, the processor 110 identifies a minimum distance of the plurality of distances as a target distance.
That is, similar to the above-described manner of calculating the distances, the processor 110 may calculate the distances between each of the plurality of recognized candidate voiceprint feature vectors and the voiceprint feature vector of the speech data, and obtain a target distance (the smallest of the calculated plurality of distances).
Next, in step S324, the processor 110 determines whether the target distance is smaller than a matching distance threshold. The matching distance threshold may be predetermined.
In response to determining that the target distance is less than the matching distance threshold (step S324 → yes), performing step S325; in response to determining that the target distance is not less than the matching distance threshold (step S324 → no), step S326 (or step S329) is performed.
In step S325, the processor 110 determines the candidate voiceprint feature vector sample to which the candidate voiceprint feature vector corresponding to the target distance belongs as the target candidate voiceprint feature vector sample matching the voiceprint feature vector. Specifically, when the target distance is smaller than the matching distance threshold, the processor 110 may consider that the candidate voiceprint feature vector corresponding to the target distance is approximate to/matches the voiceprint feature vector of the speech data. Thus, the processor 110 may determine that the candidate voiceprint feature vector sample corresponding to the target distance has a matching relationship with the speech data, i.e., the processor 110 has found the target candidate voiceprint feature vector sample matching the voiceprint feature vector among all candidate voiceprint feature vector samples.
Conversely, when the target distance is not less than the matching distance threshold, in step S326, the processor 110 determines that none of the candidate voiceprint feature vector samples has a target candidate voiceprint feature vector sample matching the voiceprint feature vector. That is, none of the candidate voiceprint feature vectors of each sample of candidate voiceprint feature vectors approximates the voiceprint feature vector of the speech data. Next, in step S327, the processor 110 identifies a target node including a candidate voiceprint feature vector sample to which a candidate voiceprint feature vector corresponding to the target distance belongs, and a target parent node of the target node. Briefly, the processor 110 will identify the node (also called target node) to which the candidate voiceprint feature vector sample corresponding to the target distance belongs and the parent node (also called target parent node) of the target node (the child node of the target parent node is the target node).
In step S328, the processor 110 generates a new child node connected to the target parent node, and adds the voiceprint feature vector, as a voiceprint feature vector sample of the new user corresponding to the voice data, to the new child node. That is, steps S326-S328 can be understood as follows: after determining that none of the candidate voiceprint feature vector samples matches the voiceprint feature vector, the processor 110 treats the voice data as coming from a new user (also referred to as an unregistered user) and generates a new child node connected to the target parent node, so that the new child node contains the voiceprint feature vector sample corresponding to the voiceprint feature vector of the new user's voice data.
In another embodiment, processor 110 does not generate a new node for recording/containing voiceprint feature vectors of the speech data from new/unregistered users. That is, in response to determining that the target distance is not less than the matching distance threshold, the processor 110 does not perform step S326 but performs step S329. In step S329, the processor 110 determines that the received voice data cannot be matched, and determines that the user corresponding to the voice data is an unregistered user.
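Steps S321 to S329 can be sketched as follows, reusing the hypothetical Node type above, with each sample stored as a (user_id, vectors) pair; the simpler S329 fallback (reporting an unregistered user) is shown in place of creating a new child node:

```python
import numpy as np

def detailed_match(candidates, t, match_threshold):
    """S321-S323: find the candidate voiceprint feature vector closest to t and
    take that smallest distance as the target distance. S324-S325: if it is
    below the matching distance threshold, return the matching sample's user;
    otherwise (S329) report an unregistered user by returning None."""
    t = np.asarray(t, dtype=float)
    best_d, best_user = np.inf, None
    for node in candidates:                    # the retained leaf candidate nodes
        for user_id, vectors in node.samples:  # each sample: (user_id, its vectors)
            for v in vectors:
                d = float(np.linalg.norm(np.asarray(v, dtype=float) - t))
                if d < best_d:
                    best_d, best_user = d, user_id
    return best_user if best_d < match_threshold else None
```

Chained with collect_candidates above, `detailed_match(collect_candidates(root, t), t, 0.5)` would complete one search; the threshold value is purely illustrative.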
Referring back to fig. 2, after finding the target candidate voiceprint feature vector sample, in step S30, the processor 110 identifies the target user and the target user information corresponding to the target candidate voiceprint feature vector sample, and maps the target user and the target user information to the voice data, thereby completing the identification operation corresponding to the voice data. Specifically, in response to determining that the target candidate voiceprint feature vector sample matches the speech data, processor 110 considers the user corresponding to the target candidate voiceprint feature vector sample (also referred to as the target user) to be the user who produced speech corresponding to the speech data. Processor 110 may map (e.g., associate) the voice data and the target user/target user information with one another. In addition, the voice data and the voiceprint information/voiceprint feature vector corresponding to the voice data can be further updated to the target user information stored in the user information database 121. At this time, the processor 110 may also clearly recognize that the voice data comes from the target user (a target registered user of the plurality of registered users), and the processor 110 may recognize the authority/instruction set corresponding to the target user according to the target user information (e.g., special authentication information in the target user information). The processor 110 may further perform corresponding subsequent operations according to different application contexts and identified permissions/instruction sets.
Fig. 9A is a diagram illustrating recognition of a user's voice data according to an embodiment of the invention. Referring to fig. 9A, assume that the loaded voiceprint feature clustering model has a plurality of nodes A, B, C, AB, EFD, EFDC, EFDCAB belonging to 3 layers, that the candidate node upper limit value is preset to 2, and that the candidate sample number threshold is predetermined. Further, assume that the processor 110 receives a voiceprint feature vector T of speech data to be recognized. Note that the leaf nodes of the voiceprint feature clustering model in this example are nodes A, B, C, and EFD.
First, the processor 110 inserts the child nodes AB and EFDC of the root node EFDCAB into the candidate node sequence, calculates the candidate distance between the node average voiceprint feature vector of each of the candidate nodes AB, EFDC and the voiceprint feature vector T, and sorts the candidate nodes AB, EFDC according to these candidate distances (assuming the candidate distance of candidate node EFDC is smaller than that of candidate node AB, candidate node EFDC is sorted before candidate node AB). As indicated by arrow a91, the processor 110 may consider the voiceprint feature vector T most closely associated with candidate node EFDC, since its candidate distance is the smallest.
Next, in response to determining that the candidate nodes AB and EFDC are not both leaf nodes and that the total number of candidate voiceprint feature vector samples in the candidate nodes AB and EFDC is not less than the candidate sample number threshold, the processor 110 inserts the child nodes A and B of candidate node AB and the child nodes C and EFD of candidate node EFDC into the candidate node sequence (step S27), calculates the candidate distance between the node average voiceprint feature vector of each of the candidate nodes A, B, C, EFD and the voiceprint feature vector T, and sorts the candidate nodes A, B, C, EFD according to these candidate distances (assuming the candidate distance of candidate node EFD is smaller than that of candidate node C, candidate node EFD is sorted before candidate node C, and candidate nodes A and B are sorted after the others). At this point, since the candidate node upper limit value is 2, the processor 110 retains only the first two candidate nodes EFD and C in the candidate node sequence (step S24).
Then, in response to the candidate nodes EFD, C being both leaf nodes (step S25 → yes), the processor 110 starts to perform detailed matching (e.g., steps S28-S30) as indicated by arrows a92, a 93. Details of the matching will be described below with reference to fig. 9B to 9C.
Fig. 9B-9C are schematic diagrams illustrating matching voiceprint feature vectors according to an embodiment of the invention. First, the processor 110 identifies one or more candidate voiceprint feature vector samples for each of all candidate nodes in the sequence of candidate nodes.
Referring to fig. 9B and 9C, assume that each voiceprint feature vector has 2-dimensional feature values (a first-dimension feature value and a second-dimension feature value). Because the vectors are two-dimensional, for convenience of illustration fig. 9B marks the candidate nodes C and EFD, and the voiceprint feature vector samples C, E, F, D they contain, in a rectangular coordinate system. The voiceprint feature vector T of the speech data to be recognized is also marked in the coordinate systems of figs. 9B and 9C. As shown in fig. 9C, the processor 110 may identify the voiceprint feature vectors C1 and C2 of the voiceprint feature vector sample of candidate node C, and the voiceprint feature vectors E1, E2, F1, F2, D1, D2 of the voiceprint feature vector samples E, F, D of candidate node EFD.
Then, the processor 110 calculates the distances d(T, C1), d(T, C2), d(T, E1), d(T, E2), d(T, F1), d(T, F2), d(T, D1), d(T, D2) between the voiceprint feature vector T and each of the voiceprint feature vectors C1, C2, E1, E2, F1, F2, D1, D2, and compares all the candidate voiceprint feature vectors with the voiceprint feature vector, thereby finding, among all the candidate voiceprint feature vectors, the target candidate voiceprint feature vector matching the voiceprint feature vector. For example, the processor 110 may identify the minimum distance d(T, D1) as the target distance and determine whether the target distance d(T, D1) is smaller than the matching distance threshold (S324). Assuming the processor 110 determines that the target distance d(T, D1) is less than the matching distance threshold, the processor 110 determines that the corresponding target candidate voiceprint feature vector D1 matches the voiceprint feature vector T, and determines that the candidate voiceprint feature vector sample D, to which the candidate voiceprint feature vector D1 belongs, is the target candidate voiceprint feature vector sample matching the voiceprint feature vector T.
Briefly, as shown in the flow of fig. 3B, the processor 110 identifies the distances between all candidate voiceprint feature vectors in the candidate nodes and the voiceprint feature vector of the speech data to be recognized, so as to find whether, among all the candidate voiceprint feature vectors, there is a target candidate voiceprint feature vector that matches the voiceprint feature vector (by means of the matching distance threshold), together with the corresponding target candidate voiceprint feature vector sample. The matched target candidate voiceprint feature vector corresponds to a voiceprint, and information associated with that voiceprint can be used in subsequent operations, such as control operations with respect to the speech data. In addition, the user information associated with the target candidate voiceprint feature vector sample may also be used in other subsequent operations. For example, from the target candidate voiceprint feature vector, the processor 110 can identify a voice control command or voice input corresponding to the voice data; from the target candidate voiceprint feature vector sample, the processor 110 can identify the speaker's personal information/special verification information (user information).
On the other hand, referring back to fig. 9A, assume that in another embodiment the processor 110 determines that the target distance is not smaller than the matching distance threshold (S326). In that case, as indicated by the arrow A94, the processor 110 identifies the target parent node EFDC of the target node EFD, generates a new child node T connected to the target parent node EFDC (S327), and adds the voiceprint feature vector T to a voiceprint feature vector sample of the new child node (S328).
Voiceprint registration refers to a process of adding acquired voiceprint information (voiceprint feature vectors) to a current voiceprint feature clustering model to prepare for subsequent voice recognition operation.
Fig. 4A is a flowchart illustrating a process of registering a user according to an embodiment of the invention. Referring to fig. 4A, in step S41, the processor 110 receives registration voice data and generates a corresponding registration voiceprint feature vector according to the registration voice data. Specifically, the processor 110 may receive voice data from the input/output device 140 or the communication circuit unit 150. For example, a user who wants to be registered in the system (also called a registered user) speaks to the electronic device 10; the electronic device 10 receives the voice through its microphone to generate the registration voice data, which is transmitted to the processor 110. Then, the processor 110 performs a voiceprint conversion operation on the received registration voice data and identifies a plurality of feature values in the registration voiceprint information corresponding to a plurality of feature conditions, so as to generate a registration voiceprint feature vector corresponding to the registration voiceprint information. The registration voiceprint feature vector may include feature values in M dimensions, where M is a positive integer.
In step S42, the processor 110 loads a voiceprint feature clustering model, wherein the voiceprint feature clustering model includes a plurality of nodes, wherein the plurality of nodes are divided into a plurality of layers to form a multi-level tree structure, wherein the plurality of nodes includes a root node, a plurality of leaf nodes, and a plurality of relay nodes, and each of the plurality of nodes includes one or more voiceprint feature vector samples. Step S42 is the same as step S22, and details thereof are not repeated.
Next, in step S43, the processor 110 identifies a target registration node closest to the registered voiceprint feature vector in each layer from the first layer to the last layer of the multi-layer tree structure, registers the registered voiceprint feature vector into a plurality of target registration nodes, and updates the total number of samples of each of the plurality of target registration nodes, wherein two target registration nodes in two adjacent layers have a parent-child relationship therebetween.
Fig. 4B is a schematic diagram of registering a user according to an embodiment of the invention. Referring to fig. 4B, assume that the loaded voiceprint feature clustering model has nodes A, B, C, D, E, F, AB, EF, EFD, EFDC, EFDCAB, and that these nodes are grouped into five layers (e.g., the first layer contains only the root node EFDCAB, and the last layer, at the opposite end from the first layer, contains nodes such as E and F). The processor 110 calculates the distance between the registration voiceprint feature vector T and the node average voiceprint feature vector of each node to find the target registration node in each layer that is closest to the registration voiceprint feature vector T.
For example, the target registration node EFDCAB in the first layer is closest to the registration voiceprint feature vector T (the target registration node EFDCAB, the root node, is the only node of the first layer), and the processor 110 logs the registration voiceprint feature vector T into the target registration node EFDCAB. It should be noted that in the step of registering the registration voiceprint feature vector T to the target registration node EFDCAB, the processor 110 may generate a new voiceprint feature vector sample in the target registration node EFDCAB for recording the registration voiceprint feature vector T, and update the total number of samples corresponding to the target registration node EFDCAB (e.g., add 1 to the original total number of samples).
Next (as indicated by arrow A41), the processor 110 compares the distances between the child nodes AB, EFDC of the target registration node EFDCAB in the next layer and the registration voiceprint feature vector T, determines that the target registration node EFDC in the second layer is closest to the registration voiceprint feature vector T, and registers the registration voiceprint feature vector T to the target registration node EFDC.
Next (as indicated by arrow A42), the processor 110 compares the distances between the respective child nodes C, EFD of the target registration node EFDC in the next layer and the registration voiceprint feature vector T, determines that the target registration node EFD in the third layer is closest to the registration voiceprint feature vector T, and logs the registration voiceprint feature vector T to the target registration node EFD.
Next (as indicated by arrow A43), the processor 110 compares the distances between the respective child nodes D, EF of the target registration node EFD in the next layer and the registration voiceprint feature vector T, determines that the target registration node EF in the fourth layer is closest to the registration voiceprint feature vector T, and registers the registration voiceprint feature vector T to the target registration node EF.
Next (as indicated by arrow A44), the processor 110 compares the distances between the child nodes E, F of the target registration node EF in the next layer and the registration voiceprint feature vector T, determines that the target registration node F in the last layer is closest to the registration voiceprint feature vector T (e.g., in the last layer, the distance between the node average voiceprint feature vector of target registration node F and the registration voiceprint feature vector T is less than the distance between the node average voiceprint feature vector of target registration node E and the registration voiceprint feature vector T), and registers the registration voiceprint feature vector T with the target registration node F.
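The layer-by-layer descent of step S43 can be sketched as follows. The Node structure, with mean, children, samples, and total_samples fields, and the function name register_descend are assumptions for illustration (the embodiments do not define a data structure), and reg_vec is assumed to be a numpy array.

```python
import numpy as np

def register_descend(root, reg_vec):
    """Layer-by-layer registration descent (step S43): log the vector
    into the current node, update its sample count, then move to the
    child whose node average voiceprint feature vector is closest to
    the registration vector, until a leaf is reached."""
    path, node = [], root
    while True:
        node.samples.append(reg_vec)   # log the registration vector here
        node.total_samples += 1        # update the node's total sample count
        path.append(node)
        if not node.children:          # leaf reached: the first target leaf
            return path
        node = min(node.children,
                   key=lambda c: np.linalg.norm(np.asarray(c.mean) - reg_vec))
```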
After logging the registration voiceprint feature vector T into the target registration nodes, in step S44 the processor 110 identifies the target registration node located at the last layer among the target registration nodes as a first target leaf node, and identifies a target distance between the node average voiceprint feature vector of the first target leaf node and the registration voiceprint feature vector. Next, in step S45, the processor 110 determines whether the target distance is smaller than a registration distance threshold. Specifically, after the registration voiceprint feature vector T has been registered to a leaf node among the target registration nodes (also referred to as the first target leaf node), no further child node can be found below it, so the first target leaf node is provisionally treated as the final node to which the registration voiceprint feature vector T should belong. The processor 110 therefore performs step S45 to confirm this with the registration distance threshold.
In response to determining that the target distance is smaller than the registration distance threshold (step S45 → yes), step S46 is performed; in response to determining that the target distance is not smaller than the registration distance threshold (step S45 → no), step S47 is performed.
In step S46, the registered user information of the registered user corresponding to the registration voiceprint feature vector is logged into the voiceprint feature vector sample corresponding to the registered user in the first target leaf node. Specifically, in this embodiment, the registered user information is logged only into the final node that records the registration voiceprint feature vector T, so as to save resources. Once the registered user information has been logged in, the registration operation of the registered user is complete.
On the other hand, if it is determined that the target distance is not less than the registration distance threshold (step S45 → no), the processor 110 determines that the first target leaf node is not the final node to which the registered voiceprint feature vector T should belong, and the processor 110 performs steps S47-S48 to perform corresponding processing.
Specifically, in step S47, processor 110 removes the registered voiceprint feature vector from the first target leaf node and identifies the parent node of the first target leaf node as the target parent node.
That is, when the processor 110 determines that the first target leaf node is not the final node to which the registered voiceprint feature vector T should belong, the processor 110 deletes the voiceprint feature vector samples corresponding to the registered voiceprint feature vector T from the first target leaf node, and the first target leaf node is split into two new leaf nodes, wherein one new leaf node contains the node average voiceprint feature vector corresponding to the first target leaf node, and the other new leaf node contains the registered voiceprint feature vector T and contains the voiceprint feature vector samples for registering the registered voiceprint feature vector T. That is, in step S48, the processor 110 generates a second target leaf node connected to the target parent node, and logs the registered user information of the registered user corresponding to the registered voiceprint feature vector and the registered voiceprint feature vector into the voiceprint feature vector sample corresponding to the registered user in the second target leaf node.
For example, as shown by arrow A45 of fig. 4B, assume that the registration voiceprint feature vector T is eventually registered at node D, but the processor 110 determines that the distance between the node average voiceprint feature vector of the first target leaf node D and the registration voiceprint feature vector T is not smaller than the registration distance threshold. In this example, the processor 110 deletes the voiceprint feature vector sample corresponding to the registration voiceprint feature vector T from the first target leaf node D, and the first target leaf node D is split into two new leaf nodes: one new leaf node DT contains the node average voiceprint feature vector corresponding to the first target leaf node D, and another new leaf node (the second target leaf node T) is connected under the first target leaf node D; the registration voiceprint feature vector T and the corresponding registered user information are logged into the second target leaf node T (as shown by arrow A46).
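A minimal sketch of the split in steps S47-S48 follows, written against the fig. 4B example (where the two new leaves hang under the first target leaf node D; the general wording of step S48 instead attaches the second target leaf node to the target parent node). The class and function names are illustrative assumptions.

```python
import numpy as np

class Node:
    """Illustrative node structure; not a structure defined by the embodiments."""
    def __init__(self, mean, samples=None, children=None):
        self.mean = np.asarray(mean, dtype=float)
        self.samples = list(samples) if samples is not None else []
        self.children = list(children) if children is not None else []

def split_leaf_for_new_user(first_leaf, reg_vec, user_info):
    """Remove the registration vector from the first target leaf (S47) and
    hang two new leaves under it: one keeping the old node data, one
    holding the new user's vector and user info (S48, fig. 4B example)."""
    first_leaf.samples = [s for s in first_leaf.samples if s is not reg_vec]
    old_leaf = Node(first_leaf.mean, samples=first_leaf.samples)
    new_leaf = Node(reg_vec, samples=[reg_vec])
    new_leaf.user_info = user_info            # user info is stored only here
    first_leaf.children = [old_leaf, new_leaf]
    return new_leaf                           # the second target leaf node
```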
The method for establishing a voiceprint feature clustering model is described below with reference to several drawings.
Fig. 5A is a flowchart illustrating a process of establishing a voiceprint feature clustering model according to an embodiment of the invention. Referring to fig. 5A, in step S51, the processor 110 extracts a plurality of voice data of each of a plurality of users from a plurality of pieces of user information corresponding to the plurality of users in the user information database. In step S52, the processor 110 generates a plurality of voiceprints for each of the plurality of users from the plurality of voice data of each of the plurality of users. The plurality of users should be as diverse as possible.
In step S53, the processor 110 calculates a plurality of M-dimensional voiceprint feature vectors corresponding to the respective plurality of voiceprints for the respective plurality of users from the respective plurality of voiceprints for the respective plurality of users, where M is a positive integer. For example, M may be 400.
Next, in step S54, the processor 110 calculates an average voiceprint feature vector for each of the plurality of users from the plurality of M-dimensional voiceprint feature vectors of that user, so that the plurality of average voiceprint feature vectors of the plurality of users serve as the sample average voiceprint feature vectors of a plurality of voiceprint feature vector samples. Specifically, the processor 110 obtains the average voiceprint feature vector of a user by, for each dimension, summing the feature values of that dimension across the user's M-dimensional voiceprint feature vectors and dividing by the total number of those vectors, yielding the average feature value of each dimension. For example, for a user having the two voiceprint feature vectors (1, 1, 1) and (3, 3, 5) (assuming M is 3), the calculated average voiceprint feature vector is (2, 2, 3).
After calculating the average voiceprint feature vector for each user, the processor 110 can generate a plurality of voiceprint feature vector samples corresponding to the plurality of users according to the plurality of average voiceprint feature vectors.
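The dimension-wise averaging of step S54 is simple arithmetic; a one-function sketch using the document's own example (the function name is illustrative):

```python
import numpy as np

def sample_average_vector(vectors):
    """Dimension-wise average of a user's M-dimensional voiceprint
    feature vectors (step S54)."""
    return np.mean(np.asarray(vectors, dtype=float), axis=0)

# The document's example: (1, 1, 1) and (3, 3, 5) average to (2, 2, 3).
print(sample_average_vector([[1, 1, 1], [3, 3, 5]]))  # [2. 2. 3.]
```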
Next, in step S55, the processor 110 performs a multi-level unsupervised clustering operation on the voiceprint feature vector samples based on the average voiceprint feature vector of the samples, and groups the voiceprint feature vector samples into a plurality of nodes of a plurality of layers to build a voiceprint feature clustering model of a multi-level tree structure. Details of step S55 are described below with reference to fig. 5B.
Fig. 5B is a flowchart illustrating a multi-level unsupervised clustering operation according to an embodiment of the invention. Referring to fig. 5B, in step S551, the processor 110 calculates the distances between the P sample average voiceprint feature vectors, according to the sample average voiceprint feature vectors of the P voiceprint feature vector samples, as the initial distances between the P voiceprint feature vector samples. Specifically, for each voiceprint feature vector sample, the processor 110 calculates the initial distances between that sample and the other P-1 voiceprint feature vector samples by using the average voiceprint feature vector (also called the sample average voiceprint feature vector) of each voiceprint feature vector sample. That is, after completing step S551, the processor 110 can identify the distance (initial distance) between every pair of voiceprint feature vector samples.
Next, in step S552, the processor 110 initially sets each voiceprint feature vector sample as an independent node, and calculates the node distances between the plurality of nodes according to the plurality of initial distances. Briefly, in step S552 the processor 110 establishes the basic nodes of the voiceprint feature clustering model (as shown in fig. 6A, the basic nodes are the nodes A, B, C, D, E, F) and identifies the node distances between them (the distance between two basic nodes is equal to the initial distance between the corresponding voiceprint feature vector samples).
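Steps S551-S552 amount to a pairwise distance matrix over the P sample average voiceprint feature vectors; a sketch, with Euclidean distance assumed for illustration:

```python
import numpy as np

def initial_distances(sample_means):
    """P x P matrix of initial distances between the sample average
    voiceprint feature vectors (step S551). Each sample then starts as
    its own basic node (step S552), so the node distances between basic
    nodes equal these initial distances."""
    X = np.asarray(sample_means, dtype=float)   # shape (P, M)
    diff = X[:, None, :] - X[None, :, :]        # pairwise differences
    return np.linalg.norm(diff, axis=-1)        # shape (P, P)
```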
Next, in step S553, the processor 110 selects, according to the plurality of node distances, the Q closest nodes from all nodes having no parent node as target nodes, and merges the Q target nodes into a parent node of the Q target nodes. Q describes the number of branches per node of the model. The Q target nodes are each child nodes of the parent node, where Q is a positive integer greater than 1.
Fig. 6A to 6C are schematic diagrams illustrating node cluster merging according to an embodiment of the invention. Referring to fig. 6A, 6B, and 6C, assume that the processor 110 calculates the average voiceprint feature vectors A, B, C, D, E, F corresponding to the users A, B, C, D, E, F and calculates the corresponding distances. For convenience of illustration, it is assumed that the average voiceprint feature vectors A, B, C, D, E, F are all 2-dimensional, so that the distance/position relationships between them can be depicted on a rectangular coordinate system, as shown in fig. 6A. From fig. 6A it can be seen that the two closest basic nodes are nodes E and F (in this example, Q equals 2), and the next closest two are nodes A and B.
Next, as shown in fig. 6B, according to a plurality of node distances (see the distance relationship between the basic nodes of fig. 6A), the closest 2 nodes E, F are selected from all nodes without parent nodes as target nodes respectively, and the 2 target nodes E, F are merged into the parent node EF of the 2 target nodes.
Next, in step S554, the processor 110 records node information corresponding to the parent node, wherein the node information corresponding to the parent node includes a node average voiceprint feature vector of the parent node, a node radius of the parent node, and a total sample number of the parent node. The node average voiceprint feature vector is the average of the sample average voiceprint feature vectors of the voiceprint feature vector samples belonging to the parent node. The total sample number is the total number of voiceprint feature vector samples possessed by the parent node. The processor 110 calculates the Q distances between the node average voiceprint feature vector of the parent node and the sample average voiceprint feature vectors of the Q target nodes (i.e., the Q child nodes of the parent node), and takes the largest of these distances as the node radius of the parent node. For example, after the processor 110 calculates the node average voiceprint feature vector of the node EF, it calculates the two distances from that vector to the nodes E and F, and selects the larger of the two as the node radius. The node radius thus describes, relative to the node average voiceprint feature vector of the node EF, the range of distances within which the members of the node EF fall.
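A sketch of the node information recorded in step S554. Following claim 10, the parent's node average voiceprint feature vector is taken here as the average of the child node means (for children with unequal sample counts a sample-weighted average may be intended; the embodiments leave this open), and the radius is the largest distance from that mean to a child mean.

```python
import numpy as np

def parent_node_info(child_means, child_sample_counts):
    """Node information recorded for a merged parent (step S554): node
    average voiceprint feature vector, node radius, total sample number."""
    means = np.asarray(child_means, dtype=float)        # Q child mean vectors
    node_mean = means.mean(axis=0)                      # parent's average vector
    radius = max(np.linalg.norm(m - node_mean) for m in means)
    total_samples = sum(child_sample_counts)            # samples over children
    return node_mean, radius, total_samples
```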
In step S555, the processor 110 estimates node distances between the parent node and the other nodes according to the initial distances between all the voiceprint feature vector samples in the parent node and all the voiceprint feature vector samples of the other nodes.
Fig. 8 is a schematic diagram illustrating a distance between nodes according to an embodiment of the invention. Referring to fig. 8, after merging the nodes E, F into the node EF, the processor 110 may estimate the node distances (also referred to as estimated distances) between the node EF and the other nodes. For example, assume that the processor 110 wants to estimate the distances between the node EF and the nodes C, D. The processor 110 may estimate these distances using the initial distances from all the basic nodes of the node EF to all the basic nodes of the nodes C, D. In more detail, the processor 110 may identify the initial distances d(E, C) and d(E, D) between the node E and the nodes C, D, and identify the initial distances d(F, C) and d(F, D) between the node F and the nodes C, D. Next, the processor 110 may calculate the average of the initial distances d(E, C) and d(F, C) as the estimated distance d(EF, C) between the node EF and the node C, and calculate the average of the initial distances d(E, D) and d(F, D) as the estimated distance d(EF, D) between the node EF and the node D.
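This estimate is the average of all member-to-member initial distances (average linkage, in clustering terms); a sketch, assuming the samples are indexed into the initial distance matrix:

```python
import numpy as np

def estimated_node_distance(init_dist, members_a, members_b):
    """Estimated distance between two nodes (step S555): the average of
    the initial distances between all their member samples. For nodes
    EF = {E, F} and C = {C} this averages d(E, C) and d(F, C)."""
    d = np.asarray(init_dist, dtype=float)
    return float(np.mean([d[i, j] for i in members_a for j in members_b]))

# e.g. estimated_node_distance(D, [0, 1], [2]) == (D[0, 2] + D[1, 2]) / 2
```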
After estimating the node distances between the parent node and the other nodes, in step S556 the processor 110 determines whether the merged parent node has P voiceprint feature vector samples. With step S556 the processor 110 determines whether the merged parent node is the root node of the model (which would contain all P voiceprint feature vector samples).
In response to determining that the merged parent node has P voiceprint feature vector samples, performing step S557; in response to determining that the merged parent node does not have P voiceprint feature vector samples (the merged parent node is not the root node), steps S553, S554, S555, and S556 are performed again until the merged parent node has P voiceprint feature vector samples.
When the merged parent node has all the voiceprint feature vector samples (i.e., the root node has been generated), the processor 110 may obtain the 2-way tree model shown in fig. 6C. As shown in phantom in fig. 6A, the merged parent node EF includes the voiceprint feature vector samples E, F possessed by the corresponding child nodes E, F; the merged parent node AB includes the voiceprint feature vector samples A, B possessed by the corresponding child nodes A, B; the merged parent node EFD includes the voiceprint feature vector samples E, F, D possessed by the corresponding child nodes EF, D; the merged parent node EFDC includes the voiceprint feature vector samples E, F, D, C possessed by the corresponding child nodes EFD, C; and the merged parent node EFDCAB includes the voiceprint feature vector samples E, F, D, C, A, B possessed by the corresponding child nodes EFDC, AB, i.e., the root node EFDCAB includes all the voiceprint feature vector samples A, B, C, D, E, F.
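Putting steps S551-S556 together, the construction is a bottom-up agglomerative loop. The sketch below fixes Q = 2 (so the selection of the Q closest nodes reduces to a closest-pair search) and represents a node as a (member_indices, children) tuple; both are simplifications for illustration, and build_tree is an illustrative name.

```python
import numpy as np
from itertools import combinations

def build_tree(sample_means):
    """Bottom-up construction of the tree (steps S551-S556) for Q = 2:
    repeatedly merge the closest pair of parentless nodes, using the
    average of member-to-member initial distances as the node distance,
    until a single root holds all P samples."""
    X = np.asarray(sample_means, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # initial distances

    def node_dist(a, b):  # average-of-members estimate (step S555)
        return np.mean([D[i, j] for i in a[0] for j in b[0]])

    nodes = [([i], []) for i in range(len(X))]  # basic nodes: (members, children)
    while len(nodes) > 1:
        a, b = min(combinations(nodes, 2), key=lambda pair: node_dist(*pair))
        parent = (a[0] + b[0], [a, b])          # merged parent node
        nodes = [n for n in nodes if n is not a and n is not b]
        nodes.append(parent)
    return nodes[0]                             # the root node
```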
Next, in step S557, the processor 110 performs a pruning operation on the current first multi-level tree structure having all nodes to update the first multi-level tree structure to a second multi-level tree structure, thereby completing the establishment of the voiceprint feature clustering model, wherein the total number of nodes and the total number of layers of the second multi-level tree structure are less than the total number of nodes and the total number of layers of the first multi-level tree structure. Specifically, to reduce the size of the model, the processor 110 further prunes the model to reduce the total number of layers and nodes of the model, thereby improving the efficiency of searching the model.
Fig. 7B is a schematic diagram illustrating a pruning operation according to an embodiment of the invention. Referring to fig. 7A and 7B, assume that the processor 110 performs a pruning operation on the first multi-level tree structure (having 5 layers) in fig. 7A. The processor 110 determines, according to a predetermined total number of layers (e.g., 3), that the first multi-level tree structure needs to have 2 layers removed (5 − 3 = 2). As shown in fig. 7B, the processor 110 starts removing/pruning from the last layer, and designates all nodes in layers 4 and 5 (the last layer) as pruning/removal targets. In this embodiment, the node information of the leaf nodes in the removed layers and the user information of all their voiceprint feature vector samples are recorded to the most relevant parent node (e.g., node EFD). The node EFD thereby becomes a leaf node of the pruned second multi-level tree structure. After the pruning operation is completed, the building of the voiceprint feature clustering model is complete, and the processor 110 records the pruned second multi-level tree structure as the voiceprint feature clustering model. The voiceprint feature clustering model can be recorded to the voiceprint feature clustering model database 122.
In the above embodiment, the processor 110 identifies the pruning targets according to the predetermined total number of layers, but the invention is not limited thereto. For example, in another embodiment, the processor 110 may use other rules to identify the pruning targets. For instance, the processor 110 may identify, according to the total sample number of each parent node, one or more parent nodes whose total sample number is less than a sample-count threshold, and set the child nodes of those parent nodes as pruning targets. That is, when the total sample number (the number of users) of a parent node is small, the processor 110 may directly turn the parent node into a leaf node (by removing/pruning its child nodes from the model).
For another example, in yet another embodiment, the processor 110 may use the node radius of each parent node to determine whether its child nodes should be set as pruning targets. Specifically, the processor 110 may identify one or more parent nodes whose node radius is smaller than a node radius threshold, and set the child nodes of those parent nodes as pruning targets. That is, when the node radius of a parent node is smaller than the node radius threshold, the processor 110 recognizes that the child nodes of that parent node are very close to one another, so that subsequently searching among those child nodes would provide little additional discrimination. Accordingly, the processor 110 may directly turn the parent node into a leaf node, prune/remove all of its child nodes, and add the user information of the leaf nodes among the pruned child nodes to the parent node.
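The three pruning criteria described above (target layer count, minimum total sample number, minimum node radius) can be sketched as one recursive pass. The Node fields and function name are illustrative assumptions, and folding the user information of pruned leaves into the surviving parent is assumed to be handled elsewhere.

```python
def prune(node, depth=1, max_depth=3, min_samples=None, min_radius=None):
    """Turn a node into a leaf (cutting its subtree) when it reaches the
    target depth, or when its total sample number or node radius falls
    below the respective threshold. Node is assumed to expose .children,
    .total_samples, and .radius."""
    cut = (depth >= max_depth
           or (min_samples is not None and node.total_samples < min_samples)
           or (min_radius is not None and node.radius < min_radius))
    if cut:
        node.children = []   # the parent becomes a leaf of the pruned tree
        return
    for child in node.children:
        prune(child, depth + 1, max_depth, min_samples, min_radius)
```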
It should be noted that the above steps S551 to S557 may be referred to as a multi-level unsupervised clustering operation, and the multi-level unsupervised clustering operation does not require any manual supervision (i.e., does not require a manager to select, classify, and correct) and can establish a voiceprint feature clustering model.
In addition, in an embodiment, after the target user information matching the voiceprint feature vector of the voice data to be recognized has been identified, the processor 110 may further read special verification information (e.g., a mobile phone number, an identity card number, an employee ID number, etc.) from the target user information, and compare whether information in the voice data to be recognized (or other input information obtained in the voice recognition operation) corresponds to the special verification information, thereby increasing the recognition accuracy.
For example, a user may place a telephone call to the electronic device 10 to transmit voice data to it. Besides receiving the voice data to identify the corresponding target user information, the electronic device can identify the telephone number of the caller, and further compare whether the telephone number in the identified target user information matches the number of the telephone the user is currently calling from, thereby completing a further verification operation.
The electronic device 10 is, for example, a security system for a gate of an office. The user may speak his or her own employee ID number in front of the security system, i.e., the voice data received by the electronic device 10 may include the additional information "employee ID number" (which may be the full number or a partial number). At this time, in addition to receiving the voice data to identify the corresponding target user information, the electronic device 10 may identify the employee ID number spoken by the user, and further compare whether the employee ID number in the identified target user information matches the additional information (employee ID number) carried by the voice data, thereby completing a further verification operation. When the two match, the electronic device 10 confirms that the user is a legitimate user (passes verification) and unlocks the gate.
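A sketch of this secondary check; the field name and the dict layout of the user information are illustrative assumptions, not structures defined by the embodiments.

```python
def verify_special_info(target_user_info, presented_value, field="employee_id"):
    """Secondary verification after the voiceprint match: compare a piece
    of special verification information carried with the request (e.g. an
    employee ID spoken to the gate, or a caller's phone number) against
    the value stored in the identified target user information."""
    stored = target_user_info.get(field)
    return stored is not None and stored == presented_value

# e.g. verify_special_info({"employee_id": "12345"}, "12345")  -> True
```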
In addition, it should be noted that the setting of the threshold values mentioned in any of the above embodiments depends on the existing model parameters, the actual application scenario, and the computing performance of the server.
In summary, through the voiceprint feature clustering model, the voice recognition method and the electronic device provided by the embodiments of the invention can match the voiceprint feature vector of the received voice data to be recognized against the clustered nodes in the model, so as to accurately and efficiently find the user information corresponding to the voice data, thereby improving the capability of the electronic device in recognizing voice data. In addition, the pruning operation reduces the scale of the established voiceprint feature clustering model without greatly reducing performance, thereby improving the efficiency of searching the model.
The above description is only for the preferred embodiment of the present invention, and it is not intended to limit the scope of the present invention, and any person skilled in the art can make further modifications and variations without departing from the spirit and scope of the present invention, therefore, the scope of the present invention should be determined by the claims of the present application.

Claims (20)

1. An electronic device, comprising:
an input/output device;
the storage device is used for recording the user information database and the voiceprint feature clustering model database; and
a processor for processing the received data, wherein the processor is used for processing the received data,
wherein the processor is configured to receive voice data via the input/output device and is further configured to generate corresponding voiceprint feature vectors from the voice data,
wherein the processor is further configured to load a voiceprint feature clustering model from the voiceprint feature clustering model database, wherein the voiceprint feature clustering model comprises a plurality of nodes, wherein the plurality of nodes are partitioned into a plurality of layers to form a multi-level tree structure, wherein the plurality of nodes comprise a root node, a plurality of leaf nodes, and a plurality of relay nodes, each of the plurality of nodes comprises a plurality of voiceprint feature vector samples,
wherein the processor is further configured to insert a plurality of first child nodes of the root node into a sequence of candidate nodes to become a plurality of candidate nodes, and to calculate a candidate distance between a node mean voiceprint feature vector and the voiceprint feature vector for each of the plurality of candidate nodes,
wherein the processor is further to rank all candidate nodes according to the calculated plurality of candidate distances and retain only a first N candidate nodes in the sequence of candidate nodes, wherein a first candidate node in the first N candidate nodes has a smallest candidate distance, wherein N is a candidate node upper limit value,
wherein the processor is further configured to determine whether each of the plurality of candidate nodes is one of the plurality of leaf nodes,
wherein, in response to determining that each of the plurality of candidate nodes is one of the plurality of leaf nodes, the processor is further configured to perform the following steps one to three:
step one: identifying a plurality of candidate voiceprint feature vector samples of all candidate nodes in the sequence of candidate nodes;
step two: comparing all the candidate voiceprint feature vector samples with the voiceprint feature vectors respectively to find out target candidate voiceprint feature vector samples which are matched with the voiceprint feature vectors in all the candidate voiceprint feature vector samples; and
step three: and identifying target users and target user information corresponding to the target candidate voiceprint feature vector samples, and mapping the target users and the target user information to the voice data, thereby completing the identification operation corresponding to the voice data.
2. The electronic device of claim 1, wherein, in response to determining that each of the plurality of candidate nodes is one of the plurality of leaf nodes,
the processor is further configured to determine whether a total number of the plurality of candidate voiceprint feature vector samples in the plurality of candidate nodes is less than a candidate sample number threshold,
wherein, in response to determining that the total number of the plurality of candidate voiceprint feature vector samples in the plurality of candidate nodes is not less than the candidate sample number threshold, the processor is further configured to insert a plurality of second child nodes of each of the plurality of candidate nodes into the sequence of candidate nodes to become a new plurality of candidate nodes, and calculate a candidate distance between a node average voiceprint feature vector of each of the new plurality of candidate nodes and the voiceprint feature vector, and the processor is further configured to perform the step of sorting all candidate nodes according to the calculated plurality of candidate distances again, and retaining only the first N candidate nodes in the sequence of candidate nodes.
3. The electronic device of claim 2, wherein the processor is further configured to perform steps one through three again in response to determining that the total number of the candidate voiceprint feature vector samples in the candidate nodes is less than the candidate sample number threshold.
4. The electronic device of claim 3, wherein the second step comprises:
the processor identifying a plurality of candidate voiceprint feature vectors for each of the plurality of candidate voiceprint feature vector samples, wherein the plurality of candidate voiceprint feature vector samples correspond to a plurality of candidate users;
the processor calculating a plurality of distances between the plurality of candidate voiceprint feature vectors corresponding to each candidate voiceprint feature vector sample and a voiceprint feature vector of the received speech data;
the processor identifying a smallest distance of the plurality of distances as a target distance; and
the processor determines whether the target distance is less than a matching distance threshold,
and in response to the judgment that the target distance is smaller than the matching distance threshold value, the processor judges the candidate voiceprint feature vector sample to which the candidate voiceprint feature vector corresponding to the target distance belongs as the target candidate voiceprint feature vector sample matching the voiceprint feature vector.
5. The electronic device of claim 4, wherein, in response to determining that the target distance is not less than the matching distance threshold,
the processor determining that none of the plurality of candidate voiceprint feature vector samples have the target candidate voiceprint feature vector sample that matches the voiceprint feature vector;
the processor identifies a target node containing a candidate voiceprint feature vector sample to which a candidate voiceprint feature vector corresponding to the target distance belongs and a target parent node of the target node; and
the processor generates a new child node connected to the target parent node and adds the voiceprint feature vector to a voiceprint feature vector sample of a new user of the new child node corresponding to the voice data.
6. The electronic device of claim 4, wherein in response to determining that the target distance is not less than the match distance threshold, the processor determines that the received voice data cannot be matched and determines that the user corresponding to the voice data is an unregistered user.
7. The electronic device of claim 2, wherein the voiceprint feature clustering model is built via the processor performing a voiceprint feature clustering model building operation in which,
the processor extracts a plurality of voice data of each of a plurality of users from a plurality of pieces of user information of the user information database corresponding to the plurality of users;
the processor generating a plurality of voiceprints for each of the plurality of users from the plurality of speech data for each of the plurality of users;
the processor calculating, from the voiceprints of each of the plurality of users, a plurality of M-dimensional voiceprint feature vectors of each of the plurality of users corresponding to the voiceprints of each of the plurality of users, wherein M is a positive integer;
the processor calculates an average voiceprint feature vector of each of the plurality of users according to the M-dimensional voiceprint feature vectors of each of the plurality of users, so that the average voiceprint feature vectors of the plurality of users are used as sample average voiceprint feature vectors of each of a plurality of voiceprint feature vector samples; and
the processor carries out multilevel unsupervised clustering operation on a plurality of voiceprint characteristic vector samples based on a plurality of sample average voiceprint characteristic vectors, and groups the voiceprint characteristic vector samples into a plurality of nodes of a plurality of layers so as to establish a voiceprint characteristic clustering model of a multilevel tree structure.
8. The electronic device according to claim 7, wherein the total number of the voiceprint feature vector samples is P, wherein in the operation of performing the multi-level unsupervised clustering operation on the voiceprint feature vector samples based on the sample average voiceprint feature vectors to group the voiceprint feature vector samples into the nodes of the layers and build the voiceprint feature clustering model of the multi-layered tree structure,
the processor calculates the distance between the P sample average voiceprint feature vectors according to the respective sample average voiceprint feature vectors of the P voiceprint feature vector samples to serve as the initial distance between the P voiceprint feature vector samples, wherein P is a positive integer;
the processor initially sets that each voiceprint feature vector sample is divided into independent nodes, and calculates node distances among the plurality of nodes according to a plurality of initial distances, wherein the node average voiceprint feature vector of each node in the plurality of nodes is the sample average voiceprint feature vector of the corresponding voiceprint feature vector sample;
the processor selects Q nodes which are closest to all nodes without parent nodes as target nodes respectively according to a plurality of node distances, and combines the Q target nodes into the parent nodes of the Q target nodes, wherein the Q target nodes are child nodes of the parent nodes respectively, and Q is a positive integer larger than 1;
the processor records node information corresponding to the parent node, wherein the node information corresponding to the parent node comprises a node average voiceprint feature vector of the parent node, a node radius of the parent node, and a total sample number of the parent node;
the processor estimates node distances between the parent node and the other nodes according to the initial distances between all the voiceprint feature vector samples in the parent node and all the voiceprint feature vector samples of the other nodes;
the processor determines whether the merged parent node has P voiceprint feature vector samples; and
in response to determining that the merged parent node has the P voiceprint feature vector samples, the processor performs a pruning operation on a current first multi-level tree structure having all nodes to update the first multi-level tree structure to a second multi-level tree structure, so as to complete the establishment of the voiceprint feature clustering model, wherein the total number of nodes and the total number of layers of the second multi-level tree structure are smaller than the total number of nodes and the total number of layers of the first multi-level tree structure, and the parent node having the P voiceprint feature vector samples is the root node of the established voiceprint feature clustering model.
9. The electronic device of claim 8, wherein in response to determining that the merged parent node does not have the P voiceprint feature vector samples, the processor performs the step of selecting the Q closest nodes from all nodes not having the parent node as the target nodes, respectively, according to the plurality of node distances, and merging the Q target nodes into the parent node of the Q target nodes again.
10. The electronic device of claim 8,
the processor calculating an average of the sample average voiceprint feature vectors for each of the Q target nodes as the node average voiceprint feature vector for the parent node,
wherein the processor calculates Q distances between the node average voiceprint feature vector of the parent node and the sample average voiceprint feature vectors of the respective Q target nodes, and takes the largest of the Q distances as the node radius of the parent node,
wherein the processor identifies a total number of all voiceprint feature vector samples for all child nodes in the parent node as the total number of samples for the parent node.
11. A speech recognition method, characterized in that the speech recognition method comprises:
receiving voice data, and generating corresponding voiceprint characteristic vectors according to the voice data;
loading a voiceprint feature clustering model from a voiceprint feature clustering model database, wherein the voiceprint feature clustering model comprises a plurality of nodes, wherein the plurality of nodes are partitioned into a plurality of layers to form a multi-level tree structure, wherein the plurality of nodes comprise one root node, a plurality of leaf nodes, and a plurality of relay nodes, each of the plurality of nodes comprising a plurality of voiceprint feature vector samples;
inserting a plurality of first child nodes of the root node into a candidate node sequence to become a plurality of candidate nodes, and calculating a candidate distance between a node average voiceprint feature vector and the voiceprint feature vector of each of the plurality of candidate nodes;
ranking all candidate nodes according to the calculated plurality of candidate distances and retaining only the first N candidate nodes in the sequence of candidate nodes, wherein the first candidate node in the first N candidate nodes has the smallest candidate distance, wherein N is a candidate node upper limit value;
determining whether each of the candidate nodes is one of the leaf nodes; and
in response to determining that each of the candidate nodes is one of the leaf nodes, performing the following steps one to three:
step one: identifying a plurality of candidate voiceprint feature vector samples of all candidate nodes in the sequence of candidate nodes;
step two: comparing all the candidate voiceprint feature vector samples with the voiceprint feature vectors respectively to find out target candidate voiceprint feature vector samples which are matched with the voiceprint feature vectors in all the candidate voiceprint feature vector samples; and
step three: and identifying target users and target user information corresponding to the target candidate voiceprint feature vector samples, and mapping the target users and the target user information to the voice data, thereby completing the identification operation corresponding to the voice data.
12. The speech recognition method of claim 11, wherein in response to determining that each of the plurality of candidate nodes is one of the plurality of leaf nodes,
judging whether the total number of a plurality of candidate voiceprint feature vector samples in the plurality of candidate nodes is smaller than a candidate sample number threshold value or not;
in response to determining that the total number of the plurality of candidate voiceprint feature vector samples in the plurality of candidate nodes is not less than the candidate sample number threshold, inserting a plurality of second child nodes of each of the plurality of candidate nodes into the sequence of candidate nodes to become a new plurality of candidate nodes, and calculating a candidate distance between a node average voiceprint feature vector of each of the new plurality of candidate nodes and the voiceprint feature vector, and performing the step of sorting all candidate nodes according to the calculated plurality of candidate distances again and retaining only the first N candidate nodes in the sequence of candidate nodes.
13. The speech recognition method of claim 12, wherein steps one through three are performed again in response to determining that the total number of the candidate voiceprint feature vector samples in the candidate nodes is less than the threshold number of candidate samples.
14. The speech recognition method of claim 13, wherein the second step comprises:
identifying a plurality of candidate voiceprint feature vectors for each of the plurality of candidate voiceprint feature vector samples, wherein the plurality of candidate voiceprint feature vector samples correspond to a plurality of candidate users;
calculating a plurality of distances between the plurality of candidate voiceprint feature vectors corresponding to each candidate voiceprint feature vector sample and a voiceprint feature vector of the received voice data;
identifying a smallest distance of the plurality of distances as a target distance; and
determining whether the target distance is less than a matching distance threshold,
and in response to the judgment that the target distance is smaller than the matching distance threshold value, judging the candidate voiceprint feature vector sample to which the candidate voiceprint feature vector corresponding to the target distance belongs as a target candidate voiceprint feature vector sample matching the voiceprint feature vector.
15. The speech recognition method of claim 14, wherein in response to determining that the target distance is not less than the matching distance threshold,
determining that none of the plurality of candidate voiceprint feature vector samples have the target candidate voiceprint feature vector sample that matches the voiceprint feature vector;
identifying a target node containing a candidate voiceprint feature vector sample to which a candidate voiceprint feature vector corresponding to the target distance belongs and a target parent node of the target node; and
generating a new child node connected to the target parent node and adding the voiceprint feature vector to a voiceprint feature vector sample of a new user of the new child node corresponding to the speech data.
16. The speech recognition method of claim 14, wherein in response to determining that the target distance is not less than the matching distance threshold, determining that the received speech data cannot be matched, and determining that the user corresponding to the speech data is an unregistered user.
17. The speech recognition method of claim 12, wherein the voiceprint feature clustering model is built by performing a voiceprint feature clustering model building operation, wherein the voiceprint feature clustering model building operation comprises:
extracting a plurality of voice data of each of a plurality of users from a plurality of pieces of user information corresponding to the plurality of users in a user information database;
generating a plurality of voiceprints for each of the plurality of users from the plurality of speech data for each of the plurality of users;
calculating a plurality of M-dimensional voiceprint feature vectors for each of the plurality of users corresponding to the plurality of voiceprints from the plurality of voiceprints for each of the plurality of users, wherein M is a positive integer;
calculating respective average voiceprint feature vectors of the plurality of users according to the respective plurality of M-dimensional voiceprint feature vectors of the plurality of users to take the plurality of average voiceprint feature vectors of the plurality of users as respective sample average voiceprint feature vectors of a plurality of voiceprint feature vector samples; and
and performing multilevel unsupervised clustering operation on the voiceprint characteristic vector samples based on the average voiceprint characteristic vectors of the samples, and grouping the voiceprint characteristic vector samples into a plurality of nodes of a plurality of layers to establish a voiceprint characteristic clustering model of a multilevel tree structure.
18. The speech recognition method of claim 17, wherein the total number of the voiceprint feature vector samples is P, wherein the step of performing the multi-level unsupervised clustering operation on the voiceprint feature vector samples based on the sample average voiceprint feature vectors to group the voiceprint feature vector samples into the plurality of nodes of the plurality of layers and build the voiceprint feature clustering model of the multi-level tree structure comprises:
calculating the distance between the average vocal print characteristic vectors of P samples according to the average vocal print characteristic vectors of the samples of the P vocal print characteristic vector samples to serve as the initial distance between the P vocal print characteristic vector samples, wherein P is a positive integer;
initially setting each voiceprint feature vector sample to be divided into independent nodes, and calculating node distances among the plurality of nodes according to a plurality of initial distances, wherein the node average voiceprint feature vector of each node in the plurality of nodes is the sample average voiceprint feature vector of the corresponding voiceprint feature vector sample;
selecting Q nodes which are closest to all nodes without parent nodes as target nodes respectively according to a plurality of node distances, and combining the Q target nodes into the parent nodes of the Q target nodes, wherein the Q target nodes are child nodes of the parent nodes respectively, and Q is a positive integer larger than 1;
recording node information corresponding to the parent node, wherein the node information corresponding to the parent node comprises a node average voiceprint feature vector of the parent node, a node radius of the parent node, and a total sample number of the parent node;
estimating node distances between the parent node and the other nodes according to the initial distances between all the voiceprint feature vector samples in the parent node and all the voiceprint feature vector samples of the other nodes;
judging whether the combined father node has P voiceprint characteristic vector samples or not; and
and in response to the judgment that the merged father node has the P voiceprint feature vector samples, performing a pruning operation on a current first multi-level tree structure with all nodes to update the first multi-level tree structure into a second multi-level tree structure, so as to complete the establishment of the voiceprint feature clustering model, wherein the total number of nodes and the total number of layers of the second multi-level tree structure are smaller than the total number of nodes and the total number of layers of the first multi-level tree structure, and the father node with the P voiceprint feature vector samples is the root node of the established voiceprint feature clustering model.
19. The speech recognition method of claim 18, wherein the step of selecting the Q closest nodes from all nodes not having the parent node as the target nodes, respectively, and merging the Q target nodes into the parent node of the Q target nodes, according to the plurality of node distances, is performed again in response to determining that the merged parent node does not have the P voiceprint feature vector samples.
20. The speech recognition method of claim 18,
calculating an average of the sample average voiceprint feature vectors for each of the Q target nodes as the node average voiceprint feature vector for the parent node,
wherein Q distances between the node average voiceprint feature vector of the parent node and the sample average voiceprint feature vectors of the respective Q target nodes are calculated, and the largest of the Q distances is taken as the node radius of the parent node,
wherein the total number of all voiceprint feature vector samples identifying all child nodes in the parent node is the total number of samples of the parent node.
CN202010424050.8A 2020-05-19 2020-05-19 Electronic device and voice recognition method Active CN111583938B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010424050.8A CN111583938B (en) 2020-05-19 2020-05-19 Electronic device and voice recognition method
TW109120791A TWI725877B (en) 2020-05-19 2020-06-19 Electronic device and voice recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010424050.8A CN111583938B (en) 2020-05-19 2020-05-19 Electronic device and voice recognition method

Publications (2)

Publication Number Publication Date
CN111583938A (en)
CN111583938B CN111583938B (en) 2023-02-03

Family

ID=72112632

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010424050.8A Active CN111583938B (en) 2020-05-19 2020-05-19 Electronic device and voice recognition method

Country Status (2)

Country Link
CN (1) CN111583938B (en)
TW (1) TWI725877B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106157953B (en) * 2015-04-16 2020-02-07 科大讯飞股份有限公司 Continuous speech recognition method and system
CN107610707B (en) * 2016-12-15 2018-08-31 平安科技(深圳)有限公司 A kind of method for recognizing sound-groove and device
CN110364144B (en) * 2018-10-25 2022-09-02 腾讯科技(深圳)有限公司 Speech recognition model training method and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004053821A (en) * 2002-07-18 2004-02-19 Univ Waseda Method, system, and program for identifying speaker
JP2009116278A (en) * 2007-11-09 2009-05-28 Toshiba Corp Method and device for register and evaluation of speaker authentication
CN102509547A (en) * 2011-12-29 2012-06-20 辽宁工业大学 Method and system for voiceprint recognition based on vector quantization based
CN105304087A (en) * 2015-09-15 2016-02-03 北京理工大学 Voiceprint recognition method based on zero-crossing separating points
CN107248410A (en) * 2017-07-19 2017-10-13 浙江联运知慧科技有限公司 The method that Application on Voiceprint Recognition dustbin opens the door

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112735432A (en) * 2020-12-24 2021-04-30 北京百度网讯科技有限公司 Audio recognition method and device, electronic equipment and storage medium
CN113096669A (en) * 2021-03-31 2021-07-09 重庆风云际会智慧科技有限公司 Voice recognition system based on role recognition
CN113096669B (en) * 2021-03-31 2022-05-27 重庆风云际会智慧科技有限公司 Speech recognition system based on role recognition

Also Published As

Publication number Publication date
TW202145037A (en) 2021-12-01
CN111583938B (en) 2023-02-03
TWI725877B (en) 2021-04-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant