CN114547367A - Electronic equipment, searching method based on audio instruction and storage medium

Info

Publication number: CN114547367A
Application number: CN202210061388.0A
Authority: CN (China)
Prior art keywords: cluster, vector, voiceprint, parameter, similarity
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 刘宇, 马明
Current Assignee: Hisense Visual Technology Co Ltd
Original Assignee: Hisense Visual Technology Co Ltd
Application filed by Hisense Visual Technology Co Ltd

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60: Information retrieval of audio data
    • G06F 16/68: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/683: Retrieval using metadata automatically derived from the content
    • G06F 16/63: Querying
    • G06F 16/638: Presentation of query results
    • G06F 16/65: Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure relates to an electronic device, a search method based on audio instructions, and a storage medium, and in particular to the field of information interaction technology. The electronic device includes a controller configured to: in response to a search audio instruction input by a user, extract a first voiceprint vector from the search audio instruction; when the first voiceprint vector matches at least one pre-stored second voiceprint vector, determine a first cluster to which the at least one second voiceprint vector belongs, where the first cluster includes a plurality of second voiceprint vectors; acquire user preference information corresponding to the first cluster; and respond to the search audio instruction according to the user preference information. The embodiments of the disclosure address the cumbersome operation of the registration stage in existing voiceprint recognition.

Description

Electronic equipment, searching method based on audio instruction and storage medium
Technical Field
The present disclosure relates to the field of information interaction technologies, and in particular, to an electronic device, a search method based on an audio instruction, and a storage medium.
Background
With the rapid development of smart-home technologies, more and more users issue instructions to smart home appliances by voice to meet diversified requirements. At present, the voiceprint recognition technology of a smart home appliance can accurately match a user's voiceprint from the user's audio only if the user actively registers. However, the registration process requires the user to upload more than three audio recordings, so the operation is cumbersome; moreover, the voiceprint used for registration cannot be updated as the user's age, physical condition, emotion, and other conditions change, which affects the accuracy of voiceprint recognition.
Disclosure of Invention
In order to solve the above technical problem, or at least partially solve it, the present disclosure provides an electronic device, a search method based on audio instructions, and a storage medium, which implement imperceptible registration (registration without explicit user enrollment), thereby improving the user experience.
In order to achieve the above purpose, the technical solutions provided by the embodiments of the present disclosure are as follows:
in a first aspect, an electronic device is provided, which includes:
a controller configured to:
in response to a search audio instruction input by a user, extracting a first voiceprint vector from the search audio instruction;
under the condition that the first voiceprint vector is matched with at least one second voiceprint vector stored in advance, determining a first cluster to which the at least one second voiceprint vector belongs, wherein the first cluster comprises a plurality of second voiceprint vectors;
acquiring user preference information corresponding to the first cluster;
the search audio instruction is responded to according to the user preference information.
As an optional implementation manner of the embodiment of the present disclosure, the controller is further configured to: store the first voiceprint vector in the first cluster.
As an optional implementation manner of the embodiment of the present disclosure, the at least one second voiceprint vector is a first center vector of the first cluster, and the first center vector is an average vector calculated from the plurality of second voiceprint vectors.
As an optional implementation manner of the embodiment of the present disclosure, the controller is specifically configured to:
a plurality of clusters are stored in advance, wherein the plurality of clusters comprise the first cluster;
obtaining center vectors of the plurality of clusters, wherein the center vectors comprise a first center vector;
respectively calculating the similarity between the center vectors of the plurality of clusters and the first voiceprint vector to obtain a plurality of first similarity parameters;
determining a target similarity parameter from the plurality of first similarity parameters;
and under the condition that the target similarity parameter is greater than a first similarity threshold, taking the cluster corresponding to the target similarity parameter as the first cluster.
As an optional implementation manner of the embodiment of the present disclosure, the controller is further configured to:
and under the condition that the target similarity parameter is smaller than a second similarity threshold, establishing a second cluster, and storing the first voiceprint vector in the second cluster.
As an optional implementation manner of the embodiment of the present disclosure, the controller is further configured to:
under the condition that the target similarity parameter is less than or equal to the first similarity threshold and greater than or equal to a second similarity threshold, calculating a first divergence parameter and a second divergence parameter;
taking the first voiceprint vector and the voiceprint vectors in the cluster corresponding to the target similarity parameter as a cluster to be clustered;
calculating the first divergence parameter according to the cluster to be clustered and first other clusters, wherein the first other clusters are the clusters, among the plurality of clusters, other than the cluster corresponding to the target similarity parameter, and the divergence parameter is used for representing the degree of cohesion of different clusters;
taking the first voiceprint vector as a new cluster;
calculating the second divergence parameter according to the new cluster and the plurality of clusters;
when the first divergence parameter is smaller than the second divergence parameter, taking the cluster corresponding to the target similarity parameter as the first cluster;
and when the first divergence parameter is larger than the second divergence parameter, establishing a second cluster, and storing the first voiceprint vector in the second cluster.
As an optional implementation manner of the embodiment of the present disclosure, the controller is further configured to:
acquiring a first center vector of any cluster;
acquiring second center vectors of second other clusters other than that cluster;
calculating a similarity parameter between the first center vector and the second center vector of at least one second other cluster to obtain at least one second similarity parameter;
and if the second similarity parameter is greater than a third similarity threshold, merging the cluster with the at least one second other cluster to obtain a merged cluster.
In a second aspect, a method for searching based on audio instructions is provided, the method comprising:
in response to an audio instruction input by a user, extracting a first voiceprint vector from the audio instruction;
under the condition that the first voiceprint vector is matched with at least one second voiceprint vector stored in advance, determining a first cluster to which the at least one second voiceprint vector belongs, wherein the first cluster comprises a plurality of second voiceprint vectors;
acquiring user preference information corresponding to the first cluster;
responding to the audio instruction according to the user preference information.
As an optional implementation manner of the embodiment of the present disclosure, after the first cluster to which the at least one second voiceprint vector belongs is determined, the first voiceprint vector is stored in the first cluster.
As an optional implementation manner of the embodiment of the present disclosure, the at least one second voiceprint vector is a first center vector of the first cluster, and the first center vector is an average vector calculated from the plurality of second voiceprint vectors.
As an optional implementation manner of the embodiment of the present disclosure, determining the first cluster to which the at least one second voiceprint vector belongs includes:
a plurality of clusters are stored in advance, wherein the plurality of clusters comprise the first cluster;
obtaining center vectors of the plurality of clusters, wherein the center vectors comprise a first center vector;
respectively calculating the similarity between the center vectors of the plurality of clusters and the first voiceprint vector to obtain a plurality of first similarity parameters;
determining a target similarity parameter from the plurality of first similarity parameters;
and under the condition that the target similarity parameter is greater than a first similarity threshold, taking the cluster corresponding to the target similarity parameter as the first cluster.
As an optional implementation manner of the embodiment of the present disclosure, after determining the target similarity parameter from the plurality of first similarity parameters, the method further includes:
and under the condition that the target similarity parameter is smaller than a second similarity threshold, establishing a second cluster, and storing the first voiceprint vector in the second cluster.
As an optional implementation manner of the embodiment of the present disclosure, after determining the target similarity parameter from the plurality of first similarity parameters, the method further includes:
under the condition that the target similarity parameter is less than or equal to the first similarity threshold and greater than or equal to a second similarity threshold, calculating a first divergence parameter and a second divergence parameter;
taking the first voiceprint vector and the voiceprint vectors in the cluster corresponding to the target similarity parameter as a cluster to be clustered;
calculating the first divergence parameter according to the cluster to be clustered and first other clusters, wherein the first other clusters are the clusters, among the plurality of clusters, other than the cluster corresponding to the target similarity parameter, and the divergence parameter is used for representing the degree of cohesion of different clusters;
taking the first voiceprint vector as a new cluster;
calculating the second divergence parameter according to the new cluster and the plurality of clusters;
when the first divergence parameter is smaller than the second divergence parameter, taking the cluster corresponding to the target similarity parameter as the first cluster;
and when the first divergence parameter is larger than the second divergence parameter, establishing a second cluster, and storing the first voiceprint vector in the second cluster.
As an optional implementation manner of the embodiment of the present disclosure, the method further includes: acquiring a first center vector of any cluster;
acquiring second center vectors of second other clusters other than that cluster;
calculating a similarity parameter between the first center vector and the second center vector of at least one second other cluster to obtain at least one second similarity parameter;
and if the second similarity parameter is greater than a third similarity threshold, merging the cluster with the at least one second other cluster to obtain a merged cluster.
In a fourth aspect, a computer-readable storage medium is provided, having stored thereon a computer program which, when executed by a processor, implements the audio-instruction-based search method according to the second aspect or any optional implementation thereof.
In a fifth aspect, a computer program product is provided which, when run on a computer, causes the computer to implement the audio-instruction-based search method according to the second aspect or any optional implementation thereof.
Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has the following advantages:
the embodiment of the disclosure provides electronic equipment, which comprises a controller, wherein in an intelligent home scene, the electronic equipment receives an audio instruction of a user, the controller can respond to the audio instruction input by the user, extract a first voiceprint vector of the user from the audio instruction, then match the first voiceprint vector with a second voiceprint vector stored in a voiceprint database, and under the condition that the first voiceprint vector is matched with at least one second voiceprint vector, because the second voiceprint vector matched with the first voiceprint vector belongs to a cluster, the cluster is determined as a first cluster, which indicates that the first voiceprint vector and the first cluster have a corresponding relationship; further, according to the user preference information corresponding to the first cluster, the preference information of the user who inputs the audio instruction can be determined. Through the electronic equipment, when a user sends an audio instruction to the smart home, the smart home acquires the voiceprint vector of the user and stores the voiceprint vector to realize the non-inductive registration, so that the diversified requirements of the user are met, and the use experience of the user is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below; those skilled in the art can obtain other drawings from these drawings without inventive effort.
FIG. 1A is a schematic diagram of a scenario in some embodiments;
FIG. 1B is a schematic diagram of an application scenario in accordance with one or more embodiments of the present application;
fig. 2 is a block diagram of a hardware configuration of a terminal device according to one or more embodiments of the present application;
fig. 3 is a schematic diagram of a software configuration in a terminal device according to one or more embodiments of the present application;
fig. 4 is a first flowchart illustrating a search method based on audio commands according to one or more embodiments of the present application;
FIG. 5 is a schematic view of a voiceprint model provided by one or more embodiments of the present application;
FIG. 6 is a schematic diagram of clustered information provided in one or more embodiments of the present application;
FIG. 7 is a schematic diagram of a merged cluster provided in one or more embodiments of the present application;
fig. 8 is a second flowchart illustrating a search method based on an audio instruction according to one or more embodiments of the present application.
Detailed Description
To make the objects, embodiments, and advantages of the present application clearer, the following clearly and completely describes exemplary embodiments of the present application with reference to the accompanying drawings. It is to be understood that the described exemplary embodiments are only a part, not all, of the embodiments of the present application.
All other embodiments that a person skilled in the art can derive from the exemplary embodiments described herein without inventive effort fall within the scope of the appended claims. In addition, while the disclosure herein is presented in terms of one or more exemplary examples, it should be appreciated that each aspect of the disclosure can also constitute a complete embodiment on its own. It should be noted that the brief descriptions of terms in the present application are only for convenience in understanding the embodiments described below and are not intended to limit the embodiments of the present application. Unless otherwise indicated, these terms should be understood in their ordinary and customary meanings.
At present, when voiceprint recognition is applied to smart homes, a user needs to read a fixed registration text multiple times so that the user's audio can be collected and voiceprint information extracted from it for registration. This registration process requires the user's active participation and is cumbersome, which affects the user experience; moreover, the voiceprint information used for registration cannot be updated as the user's age, physical condition, emotion, and other conditions change, which affects the accuracy of voiceprint recognition.
In order to solve the above problem, an embodiment of the present disclosure provides an electronic device that includes a controller. In a smart-home scenario, the electronic device receives an audio instruction from a user. In response to the audio instruction, the controller extracts the user's first voiceprint vector from the instruction and matches it against the second voiceprint vectors stored in a voiceprint database. When the first voiceprint vector is determined to match at least one second voiceprint vector, the cluster to which the matching second voiceprint vector belongs is determined as the first cluster, which indicates that the first voiceprint vector corresponds to the first cluster, and the first voiceprint vector is then stored in the first cluster. Further, according to the user preference information corresponding to the first cluster, the preference information of the user who input the audio instruction can be determined. With this electronic device, while the user issues an audio instruction to the smart home, the smart home acquires and stores the user's voiceprint vector, achieving imperceptible registration, which meets users' diversified requirements and improves the user experience.
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the following will briefly introduce the technical terms used in the description of the embodiments or the prior art:
Voiceprint Recognition (VPR) is the process of automatically determining whether a speaker belongs to an established speaker set, and who the speaker is, by analyzing the received speech signal and extracting features from it.
Clustering is the division of a data set into different classes or clusters according to a certain criterion, such that the similarity of data objects within the same cluster is as large as possible and the difference between data objects in different clusters is as large as possible.
In some embodiments, the electronic device may be, for example: a server, a smartphone (such as an Android phone, an iOS phone, or a Windows phone), a smart refrigerator, a smart television, a smart speaker, a tablet computer, a palmtop computer, a notebook computer, or a wearable device. These are examples only and are not exhaustive; the electronic device includes but is not limited to the above devices.
FIG. 1A is a schematic diagram of a scenario in some embodiments. As shown in fig. 1A, a user may operate the electronic device 200 through the smart device 300 or the control apparatus 100, and the electronic device 200 performs data communication with the server 400.
In some embodiments, the control apparatus 100 may be a remote controller. Communication between the remote controller and the electronic device includes infrared protocol communication, Bluetooth protocol communication, or other short-distance communication methods, and the electronic device 200 is controlled wirelessly or by wire. The user may input user commands through keys on the remote controller, voice input, control panel input, and the like to control the electronic device 200.
In some embodiments, the smart device 300 (e.g., mobile terminal, tablet, computer, laptop, etc.) may also be used to control the electronic device 200. For example, the electronic device 200 is controlled using an application running on the smart device.
In some embodiments, instead of using the smart device or the control apparatus described above to receive instructions, the electronic device 200 may also receive the user's control through touch, gestures, and the like.
In some embodiments, the electronic device 200 may also be controlled in manners other than through the control apparatus 100 and the smart device 300; for example, a module configured inside the electronic device 200 may directly receive the user's voice command, or a voice control device provided outside the electronic device 200 may receive it.
In some embodiments, the electronic device 200 is allowed to make communication connections over a Local Area Network (LAN), a Wireless Local Area Network (WLAN), and other networks. The server 400 may provide various contents and interactions to the electronic device 200. The server 400 may be one cluster or a plurality of clusters, and may include one or more types of servers.
Fig. 1B is a schematic view of an application scenario of an electronic device according to one or more embodiments of the present application. As shown in fig. 1B, the smart devices include a smart refrigerator 110 and a smart speaker 120. A user may speak to issue an audio instruction to a smart device in the smart-home environment. For example, after the user issues the audio instruction "play a hit song" to the smart speaker 120, a controller configured in the smart speaker 120, in response to the audio instruction, extracts the user's first voiceprint vector from the instruction, then determines at least one second voiceprint vector matching the first voiceprint vector in a voiceprint database, and takes the first cluster to which the second voiceprint vector belongs as the cluster of the first voiceprint vector, that is, determines that the first voiceprint vector belongs to the first cluster.
Further, the user preference information corresponding to the first cluster is obtained, and the smart speaker 120 then searches for hit songs, that is, songs with a high play count, based on the user preference information. For example, if the user preference information indicates that the user likes songs by Zhou Jielun (Jay Chou), the smart speaker 120 determines Zhou Jielun songs with high play counts from the Internet and plays them.
Fig. 2 shows a block diagram of a hardware configuration of an electronic device according to an exemplary embodiment. The electronic device shown in fig. 2 includes at least one of a tuner-demodulator 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, a memory, a power supply, and a user interface 280. The controller includes a central processor, a video processor, an audio processor, a graphics processor, RAM, ROM, and first to n-th input/output interfaces. The display 260 may be at least one of a liquid crystal display, an OLED display, a touch display, and a projection display, and may also be a projection device with a projection screen. The tuner-demodulator 210 receives broadcast television signals in a wired or wireless manner and demodulates audio/video signals, such as EPG data signals, from a plurality of wireless or wired broadcast television signals. The detector 230 is used to collect signals of the external environment or of interaction with the outside. The controller 250 and the tuner-demodulator 210 may be located in separate devices; that is, the tuner-demodulator 210 may also be located in a device external to the main device where the controller 250 is located, such as an external set-top box.
In some embodiments, the controller 250 controls the operation of the electronic device and responds to user operations through various software control programs stored in memory. The controller 250 controls the overall operation of the electronic device 200. The user may input a user command through a Graphical User Interface (GUI) displayed on the display 260, and the user input interface receives the user input command through the GUI. Alternatively, the user may input a user command through a specific sound or gesture, and the user input interface recognizes the sound or gesture through a sensor to receive the command.
In some embodiments, a "user interface" is a media interface for interaction and information exchange between an application or operating system and a user that enables conversion between an internal form of information and a form that is acceptable to the user. A commonly used presentation form of the User Interface is a Graphical User Interface (GUI), which refers to a User Interface related to computer operations and displayed in a graphical manner. It may be an interface element such as an icon, a window, a control, etc. displayed in the display screen of the electronic device, where the control may include at least one of an icon, a button, a menu, a tab, a text box, a dialog box, a status bar, a navigation bar, a Widget, etc. visual interface elements.
An embodiment of the present disclosure provides an electronic device, including:
a controller 250 configured to: in response to an audio instruction input by a user, extract a first voiceprint vector from the audio instruction; when the first voiceprint vector matches at least one pre-stored second voiceprint vector, determine a first cluster to which the at least one second voiceprint vector belongs, where the first cluster includes a plurality of second voiceprint vectors; acquire user preference information corresponding to the first cluster; and respond to the audio instruction according to the user preference information.
With this device, while the user issues an audio instruction to the smart home, the smart home acquires and stores the user's voiceprint vector to complete imperceptible registration, improving the user experience.
Other steps or functions in the embodiments of the present disclosure may also be implemented by the controller 250 and are not described in detail here.
Fig. 3 is a schematic diagram of the software configuration in an electronic device according to one or more embodiments of the present application. As shown in fig. 3, the system is divided into four layers, which are, from top to bottom, an Applications layer (the "application layer"), an Application Framework layer (the "framework layer"), an Android runtime and system library layer (the "system runtime library layer"), and a kernel layer. The kernel layer contains at least one of the following drivers: audio driver, display driver, Bluetooth driver, camera driver, Wi-Fi driver, USB driver, HDMI driver, sensor drivers (such as fingerprint sensor, temperature sensor, and pressure sensor), and power driver.
Fig. 4 is a schematic flowchart of a search method based on audio instructions according to one or more embodiments of the present application. As shown in fig. 4, the method includes:
S401, responding to an audio instruction input by a user, and extracting a first voiceprint vector from the audio instruction.
An audio instruction is an instruction issued by a user to a smart home appliance by voice control in a smart-home scenario. For example, if the user says "play the midday news", then "play the midday news" is an audio instruction.
The first voiceprint vector is used to represent a voiceprint feature of the user. It should be noted that voiceprint features are sound-spectrum features that carry speech information; they are specific and relatively stable, that is, the voiceprint features of different users are distinct, while the voiceprint features of the same user are relatively stable. Therefore, voiceprint features can be extracted and used to distinguish users, for example to determine the audio segment corresponding to a first user.
In some embodiments, in response to an audio instruction input by a user, the audio instruction is input into a voiceprint model, and the voiceprint vector output by the voiceprint model is obtained. The embodiment of the present application provides an implementation for extracting a voiceprint vector with a voiceprint model. As shown in fig. 5, the voiceprint model is trained using pre-labeled audio samples and voiceprint vectors, and includes, but is not limited to, a neural network 501, a pooling layer 502, and a fully connected layer 503. In the training process, the audio instruction is first split frame by frame to obtain a plurality of audio frames, and the audio features of these frames are extracted. The audio features are then input into the neural network 501 to obtain frame-level features, the frame-level features are input into the pooling layer 502 for pooling aggregation to obtain a sentence feature, and finally the sentence feature is input into the fully connected layer 503 to obtain a predicted voiceprint vector. The predicted voiceprint vector is compared with the labeled voiceprint vector, and the model parameters are updated accordingly, so that a converged voiceprint model is obtained by training. The audio features may be short-time spectral features such as Mel-frequency cepstral coefficients (MFCC), Perceptual Linear Prediction (PLP), or Filter Banks (FBank), or feature vectors extracted by a pre-trained model.
Further, after the converged voiceprint model is obtained, the voiceprint vector is obtained by taking the audio instruction as its input. It should be emphasized that the present application does not limit the voiceprint model: any voiceprint model capable of extracting a voiceprint vector from audio is suitable for the present application.
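As an illustration of the pipeline just described, the following is a minimal PyTorch sketch of such a voiceprint model. The layer types and sizes (a 1-D convolutional frame network, mean-plus-std statistics pooling, a 256-dimensional embedding) are assumptions for illustration only; the patent does not prescribe a specific architecture, and any model that maps audio features to a voiceprint vector fits the description above.

```python
import torch
import torch.nn as nn

class VoiceprintModel(nn.Module):
    """Sketch of the model in fig. 5: frame-level network (501),
    statistics pooling (502), fully connected layer (503)."""

    def __init__(self, feat_dim=40, frame_dim=512, embed_dim=256):
        super().__init__()
        # Neural network 501: maps per-frame audio features
        # (MFCC/PLP/FBank) to frame-level features.
        self.frame_net = nn.Sequential(
            nn.Conv1d(feat_dim, frame_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(frame_dim, frame_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Fully connected layer 503: pooled sentence feature -> voiceprint vector.
        self.fc = nn.Linear(2 * frame_dim, embed_dim)

    def forward(self, feats):                 # feats: (batch, feat_dim, frames)
        frame_feats = self.frame_net(feats)
        # Pooling layer 502: aggregate frame-level features into one
        # fixed-size sentence feature (mean + std over the frame axis).
        sentence = torch.cat([frame_feats.mean(dim=2),
                              frame_feats.std(dim=2)], dim=1)
        return self.fc(sentence)              # predicted voiceprint vector

model = VoiceprintModel()
feats = torch.randn(1, 40, 200)               # 200 frames of 40-dim features
first_voiceprint_vector = model(feats)        # shape: (1, 256)
```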
In this embodiment, by responding to the audio instruction input by the user and extracting the first voiceprint vector from it, the voiceprint feature is obtained from the audio for subsequent processing.
S402, under the condition that the first voiceprint vector is matched with at least one second voiceprint vector stored in advance, determining a first cluster to which the at least one second voiceprint vector belongs.
In some embodiments, different clusters are pre-stored in the voiceprint database. As shown in fig. 6, the voiceprint database includes n clusters; each cluster includes at least one voiceprint vector, the voiceprint vectors within a cluster are similar to one another, and one cluster may correspond to one user. The at least one pre-stored second voiceprint vector is a voiceprint vector included in one of the clusters stored in the voiceprint database. The disclosed embodiment provides an implementation: the first voiceprint vector is matched against at least one second voiceprint vector in the database by calculating the similarity parameter between the two, so as to determine whether the first voiceprint vector belongs to the first cluster to which the second voiceprint vector belongs, where the first cluster includes a plurality of second voiceprint vectors. When the first voiceprint vector matches the second voiceprint vector, the two are determined to belong to the same cluster, that is, the first voiceprint vector belongs to the first cluster, and the first voiceprint vector is then stored in the first cluster. In this way the first voiceprint vector is matched against the voiceprint information stored in the database, the voiceprint vector can be verified when the user issues an audio instruction again, and multiple users can be distinguished in a smart-home scenario, which makes it convenient to meet each user's different requirements.
Illustratively, after the first voiceprint vector M is extracted from the audio instruction, since the voiceprint database includes 3 clusters A, B, and C, 3 second voiceprint vectors a0, b0, and c0 can be obtained from those clusters. The distance measures D_a0, D_b0, and D_c0 between the first voiceprint vector M and the 3 second voiceprint vectors a0, b0, and c0 are then calculated. Distance and similarity correspond inversely: the smaller the distance measure, the larger the similarity parameter. The minimum of D_a0, D_b0, and D_c0 is therefore determined; if D_a0 is the minimum distance measure, it is determined that the first voiceprint vector M belongs to cluster A, to which the second voiceprint vector a0 belongs. The distance measure may be any one of the Euclidean distance, Manhattan distance, Chebyshev distance, or Mahalanobis distance; these distance measures are exemplary and not limiting.
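The distance measures named above map directly onto a few lines of NumPy. The sketch below uses made-up 3-dimensional vectors (real voiceprint vectors are high-dimensional model outputs) and picks the cluster with the minimum distance, mirroring the example; the Mahalanobis distance is omitted because it additionally requires a covariance matrix.

```python
import numpy as np

def euclidean(x, y): return float(np.linalg.norm(x - y))
def manhattan(x, y): return float(np.abs(x - y).sum())
def chebyshev(x, y): return float(np.abs(x - y).max())

# Hypothetical vectors standing in for M and the second voiceprint
# vectors a0, b0, c0 of clusters A, B, C.
M = np.array([0.2, 0.9, 0.4])
reps = {"A": np.array([0.3, 0.8, 0.5]),
        "B": np.array([0.9, 0.1, 0.2]),
        "C": np.array([0.5, 0.5, 0.5])}

# Smaller distance <=> larger similarity parameter, so the cluster
# whose representative vector is nearest to M is the best match.
distances = {name: euclidean(M, v) for name, v in reps.items()}
best = min(distances, key=distances.get)
print(distances, "-> M assigned to cluster", best)
```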
In some embodiments, the second voiceprint vector may be any one of the voiceprint vectors in the first cluster, or may be a center vector obtained by averaging all the voiceprint vectors in the first cluster. The center vector represents the center of the first cluster; it is a special voiceprint vector in cluster analysis used to represent a given cluster, and other voiceprint vectors determine whether they belong to that cluster by calculating their distance from the center vector. It should be noted that, as shown in fig. 6, a cluster includes at least one of the following items: center vector, number of vectors, voiceprint vectors, covariance matrix, audio storage location, and user preference information. The number of vectors refers to how many voiceprint vectors the cluster contains. The covariance matrix is calculated from the voiceprint vectors; for example, if cluster 1 contains 5 voiceprint vectors, the order of the covariance matrix of cluster 1 is 5 x 5. The audio storage location is where the audio corresponding to the cluster's voiceprint vectors is stored. The user preference information is stored in the cluster in vector form to represent the preference of the user corresponding to the cluster. It can be understood that the information included in the n clusters in the voiceprint database differs according to the corresponding users; the information included in a cluster will not be described repeatedly in the following discussion.
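The items listed above can be read as the fields of a per-cluster record. A minimal sketch of such a record follows; the field names and the list-based storage are illustrative assumptions, not the patent's data layout.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Cluster:
    """One cluster in the voiceprint database (field names assumed)."""
    center: np.ndarray                               # center vector
    vectors: list = field(default_factory=list)      # member voiceprint vectors
    audio_paths: list = field(default_factory=list)  # audio storage locations
    preference: np.ndarray | None = None             # user preference info (vector)

    @property
    def n_vectors(self) -> int:                      # number of vectors
        return len(self.vectors)

    def covariance(self) -> np.ndarray:
        # Covariance matrix computed from the member vectors; with one
        # row per vector this is n_vectors x n_vectors, matching the
        # 5 x 5 example above.
        return np.atleast_2d(np.cov(np.stack(self.vectors), rowvar=True))
```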
When the second voiceprint vector is the center vector of the first cluster, a plurality of clusters including the first cluster are pre-stored in the voiceprint database. In the process of determining whether the first voiceprint vector matches the second voiceprint vector (that is, the first center vector), the center vectors of the clusters in the voiceprint database are first obtained, including the first center vector of the first cluster. The similarity between the first voiceprint vector and each of these center vectors is then calculated, yielding a plurality of first similarity parameters. The maximum similarity parameter is determined from the plurality of first similarity parameters, representing the center vector that the first voiceprint vector matches best, and this maximum is taken as the target similarity parameter. A preset similarity threshold, the first similarity threshold, is then applied: if the target similarity parameter is greater than the first similarity threshold, it is determined that the first voiceprint vector belongs to the cluster corresponding to the target similarity parameter. By calculating the similarity between the first voiceprint vector and the first center vector, the cluster that best matches the first voiceprint vector is determined from the plurality of clusters in the voiceprint database, which improves the accuracy of voiceprint clustering.
Illustratively, after the first voiceprint vector M is extracted from the audio instruction, since the voiceprint database includes 3 clusters A, B, and C, calculating the center vectors of the 3 clusters yields center vectors a1, b1, and c1. The distance measures between M and a1, b1, and c1 are then calculated, yielding similarity parameters a, b, and c, from which the maximum similarity parameter is determined. If the maximum similarity parameter is a, it is further judged whether a is greater than the preset threshold; if so, it is determined that the first voiceprint vector M belongs to the cluster A corresponding to similarity parameter a.
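A sketch of this matching step follows. Cosine similarity and the 0.8 threshold are assumptions; the patent only requires some similarity parameter compared against a preset first similarity threshold.

```python
import numpy as np

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def match_first_cluster(first_vec, centers, first_threshold=0.8):
    """Compare the first voiceprint vector with every cluster's center
    vector; return (index of the first cluster, target similarity
    parameter), or (None, target) if the threshold test fails."""
    sims = [cosine(first_vec, c) for c in centers]   # first similarity parameters
    target_idx = int(np.argmax(sims))                # best-matching center
    target = sims[target_idx]                        # target similarity parameter
    if target > first_threshold:
        return target_idx, target
    return None, target

centers = [np.array([0.3, 0.8, 0.5]), np.array([0.9, 0.1, 0.2])]
idx, sim = match_first_cluster(np.array([0.2, 0.9, 0.4]), centers)
```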
In addition, a second similarity threshold is set as a lower bound for judging the similarity parameter. When the target similarity parameter is smaller than the second similarity threshold, even the best-matching cluster does not satisfy the matching condition, so it is determined that the first voiceprint vector does not belong to any cluster in the current voiceprint database.
Exemplarily, when a user issues an audio instruction to the smart speaker for the first time, no cluster corresponding to that user's voiceprint vectors is stored in the voiceprint database. The similarity parameters calculated between the first voiceprint vector extracted from the audio instruction and the first center vector of every cluster in the voiceprint database are therefore all smaller than the second similarity threshold.
In general, because a user's audio varies with factors such as age, physical condition, and emotion, the first voiceprint vectors extracted from audio instructions also vary, and the similarity parameter alone may not determine the cluster to which the first voiceprint vector belongs. When the target similarity parameter satisfies neither condition, that is, when it is less than or equal to the first similarity threshold and greater than or equal to the second similarity threshold, a first divergence parameter and a second divergence parameter are calculated. The divergence parameter reflects the degree of cohesion of all clusters in the voiceprint database; its calculation formula is shown in equation (1):
G(c) = \sum_{j=1}^{c} N_j \log \det(M_j)    (1)
where G(c) is the divergence parameter, c is the number of clusters, N_j is the number of vectors in the j-th cluster, and det(M_j) is the determinant of the covariance matrix of the j-th cluster. The cluster to which the first voiceprint vector belongs is determined by calculating the divergence parameter, so as to reduce the influence of factors such as the user's age, physical condition, and emotion.
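Equation (1) can be evaluated directly from each cluster's vector count and covariance determinant. The sketch below uses the natural logarithm and slogdet for numerical stability; both choices, and the reconstruction of equation (1) from the surrounding definitions, are assumptions. Degenerate clusters can make det(M_j) zero, so a real system would regularize the covariance.

```python
import numpy as np

def divergence(clusters):
    """Divergence parameter G(c) of equation (1).

    clusters: list of 2-D arrays, one per cluster, one row per
    voiceprint vector."""
    g = 0.0
    for members in clusters:
        n_j = members.shape[0]                             # N_j
        m_j = np.atleast_2d(np.cov(members, rowvar=True))  # covariance M_j
        sign, logdet = np.linalg.slogdet(m_j)              # log det(M_j)
        g += n_j * logdet
    return g

cluster_a = np.array([[0.3, 0.8, 0.5], [0.2, 0.9, 0.4]])
cluster_b = np.array([[0.9, 0.1, 0.2], [0.8, 0.2, 0.3]])
g = divergence([cluster_a, cluster_b])
```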
In the process of calculating the first divergence parameter and the second divergence parameter, the first voiceprint vector and the voiceprint vectors in the cluster corresponding to the target similarity parameter are first taken together as a cluster to be clustered; that is, the first voiceprint vector is assumed to belong to the cluster corresponding to the target similarity parameter. The first divergence parameter is then calculated from the cluster to be clustered and the first other clusters, where the first other clusters are the clusters, among the plurality of clusters, other than the cluster corresponding to the target similarity parameter. Meanwhile, the first voiceprint vector is taken as a new cluster, and the second divergence parameter is calculated from this new cluster and the plurality of clusters. It is then judged whether the first divergence parameter is smaller than the second divergence parameter: if so, the cluster corresponding to the target similarity parameter is taken as the first cluster; if the first divergence parameter is larger than the second divergence parameter, a second cluster is established and the first voiceprint vector is stored in the second cluster.
Exemplarily, following the above example: assuming the first voiceprint vector M belongs to the cluster A corresponding to the target similarity parameter yields a cluster S that includes M, and the first divergence parameter G(1) is calculated from S and the other clusters B and C in the voiceprint database. Meanwhile, assuming M does not belong to cluster A, B, or C, M is taken as a cluster T of its own, and the second divergence parameter G(2) is calculated from cluster T together with clusters A, B, and C. G(1) and G(2) are then compared: if G(1) is smaller than G(2), it is determined that the first voiceprint vector M belongs to cluster A; if G(1) is larger than G(2), the cluster T containing the first voiceprint vector M is established.
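Putting the two hypotheses side by side gives the following sketch; the array-based cluster representation and the divergence() helper repeat the assumptions of the previous sketch.

```python
import numpy as np

def divergence(clusters):
    # G(c) from equation (1): sum over clusters of N_j * log det(M_j).
    return sum(c.shape[0] *
               np.linalg.slogdet(np.atleast_2d(np.cov(c, rowvar=True)))[1]
               for c in clusters)

def assign_by_divergence(first_vec, clusters, target_idx):
    """Ambiguous case: decide whether first_vec joins the cluster
    corresponding to the target similarity parameter or starts a new
    one. Returns the updated cluster list."""
    others = [c for i, c in enumerate(clusters) if i != target_idx]
    # Hypothesis 1: the cluster to be clustered (target cluster plus
    # first_vec) together with the first other clusters -> G(1).
    merged = np.vstack([clusters[target_idx], first_vec])
    g1 = divergence([merged] + others)
    # Hypothesis 2: first_vec as a new cluster alongside all existing
    # clusters -> G(2).
    g2 = divergence(clusters + [first_vec[None, :]])
    if g1 < g2:
        clusters[target_idx] = merged          # store into the first cluster
    else:
        clusters.append(first_vec[None, :])    # establish a second cluster
    return clusters
```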
In addition, after the first voiceprint vector is stored in the first cluster, the data in the first cluster changes, and the center vector of the first cluster must be recalculated for the clustering of subsequent voiceprint vectors. Because the clusters in the voiceprint database are updated from the voiceprint vectors contained in users' audio instructions, the database keeps up with voiceprint changes caused by factors such as the user's age, physical condition, and emotion.
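Because the center vector is the mean of the member vectors, it need not be recomputed from scratch: a running-mean update suffices. A minimal sketch:

```python
import numpy as np

def update_center(center, n_vectors, new_vec):
    """Center vector after one new voiceprint vector is stored;
    equivalent to re-averaging all n_vectors + 1 members."""
    return (center * n_vectors + new_vec) / (n_vectors + 1)

old_center = np.array([0.3, 0.8, 0.5])   # hypothetical 3-dim center
new_center = update_center(old_center, 5, np.array([0.2, 0.9, 0.4]))
```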
In some embodiments, after it is determined that the first voiceprint vector belongs to the first cluster, the center vector of any one cluster in the voiceprint database and the center vectors of the other clusters are obtained, the similarity parameters between pairs of cluster centers are calculated to obtain a plurality of similarity parameters, the maximum similarity parameter is determined from them, and it is judged whether the maximum similarity parameter is greater than a third similarity threshold. It should be noted that the third similarity threshold is less than the second similarity threshold.
If the maximum similarity parameter is greater than or equal to the third similarity threshold, the preference similarity parameter between the two clusters corresponding to the maximum similarity parameter is further calculated; if the preference similarity parameter is greater than a preset fourth similarity threshold, it is determined that those clusters can be merged, and a merged cluster is obtained after merging. The preference similarity parameter is calculated from the user preference information included in the clusters corresponding to the maximum similarity parameter; the user preference information is obtained by semantic analysis of the audio content and stored in vector form in the corresponding cluster. If the maximum similarity parameter is smaller than the third similarity threshold, it is determined that the clusters corresponding to the maximum similarity parameter cannot be merged.
Illustratively, as shown in fig. 7, the clusters in the figure include C1, C2, and C3, whose center vectors are c1, c2, and c3, respectively. The similarity parameters between c1 and c2, between c1 and c3, and between c2 and c3 are calculated, yielding similarity parameters k1, k2, and k3; the distances between center vectors shown in fig. 7 may be inversely proportional to the similarity parameters. Suppose k1 is the largest of the three similarity parameters; k1 is then compared with the third similarity threshold k0. If k1 is greater than k0, the preference similarity parameter p between the user preference information u1 of cluster C1 and the user preference information u2 of cluster C2 is calculated, and it is judged whether p is greater than the preset fourth similarity threshold p0. If p is greater than p0, it is determined that clusters C1 and C2 are merged. Through this embodiment, similar clusters that correspond to the same user are merged, errors in the clustering process are eliminated, and the accuracy of voiceprint recognition is improved.
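A sketch of the merge check just illustrated. Cosine similarity is used for both the center similarity and the preference similarity, and the threshold values k0 and p0 are made up; the patent only requires a third and a fourth similarity threshold.

```python
import numpy as np

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def find_merge_pair(centers, prefs, k0=0.6, p0=0.7):
    """Return the indices (i, j) of two clusters to merge, or None.

    centers: cluster center vectors; prefs: user preference vectors.
    The most similar pair of centers must exceed the third similarity
    threshold k0, and their preference similarity parameter must
    exceed the fourth similarity threshold p0."""
    best, best_k = None, -1.0
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            k = cosine(centers[i], centers[j])
            if k > best_k:
                best, best_k = (i, j), k
    if best is None or best_k < k0:
        return None                          # centers too dissimilar
    i, j = best
    if cosine(prefs[i], prefs[j]) > p0:      # preference similarity p
        return (i, j)
    return None
```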
S403, acquiring user preference information corresponding to the first cluster.
The user preference information is obtained from the audio content of the audio instruction and is stored in vector form in the corresponding cluster.
For example, if a user searches multiple times for songs by Zhou Jielun, the smart speaker can derive the user's preference information, namely "likes Zhou Jielun", which may be stored in the voiceprint database as a tag together with the cluster corresponding to the user.
In some embodiments, the audio instruction input by the user is first denoised to remove environmental noise, white noise, and the like. Speech detection is then performed, for example through an automatic gain control (AGC) based speech-sensing algorithm or with a voice endpoint detector, to extract the speech audio. The speech audio is further converted into text, and semantic analysis is performed through a neural network to obtain the user's preference information. The preference information and the cluster are stored correspondingly in the voiceprint database to form a user portrait, which makes it convenient to respond to the user's audio instructions.
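The preference-extraction pipeline above can be sketched as a chain of stages. Every stage below is a hypothetical stub: the patent does not name a specific denoiser, speech detector, speech-to-text engine, or semantic model, so real components must be substituted.

```python
import numpy as np

def denoise(audio):
    return audio    # stub: remove environmental and white noise

def detect_speech(audio):
    return audio    # stub: AGC-based speech sensing or voice endpoint detection

def transcribe(audio):
    return "play Zhou Jielun's hit songs"    # stub: speech-to-text

def analyze_preference(text):
    return np.zeros(16)    # stub: neural semantic analysis -> preference vector

def extract_preference(audio):
    """Denoise -> detect speech -> transcribe -> semantic analysis.
    The resulting vector is stored with the user's cluster to form
    the user portrait."""
    return analyze_preference(transcribe(detect_speech(denoise(audio))))
```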
S404, responding to the audio instruction according to the user preference information.
In some embodiments, responding to the audio instruction according to the user preference information may include adjusting the voice used when responding to the user. For example, the user's preference information may indicate that the user likes audio instructions to be answered in the voice of a certain celebrity.
Illustratively, if the user is determined to be user 1 according to the audio instruction "play the midday news", and the preference information of user 1 in the voiceprint database is domestic news and the voice of Ding Zhiling, then in response to the audio instruction the domestic midday news is played using Ding Zhiling's voice.
As shown in fig. 8, the present disclosure also provides an audio instruction based search method, which includes the following steps S801 to S809:
S801, responding to an audio instruction input by a user, and extracting a first voiceprint vector from the audio instruction.
S802, obtaining the center vectors of a plurality of clusters.
Optionally, the center vectors included in the plurality of pre-stored clusters are obtained from the voiceprint database.
S803, respectively calculating the similarity between the first voiceprint vector and the center vectors of the plurality of clusters to obtain a plurality of first similarity parameters.
The similarity parameter is obtained by calculating the distance between the first voiceprint vector and the second voiceprint vector.
S804, determining a target similarity parameter from the plurality of first similarity parameters.
The target similarity parameter is the largest similarity parameter among the plurality of first similarity parameters, and indicates that the matching degree of the first voiceprint vector and the central vector corresponding to the target similarity parameter is the highest.
S805, judging whether the target similarity parameter is larger than a first similarity threshold value, and judging whether the target similarity parameter is smaller than a second similarity threshold value.
If the target similarity parameter is less than or equal to the first similarity threshold and greater than or equal to the second similarity threshold, S806 is performed.
In the case where the target similarity parameter is greater than the first similarity threshold value, S808 is performed.
In the case where the target similarity parameter is smaller than the second similarity threshold value, S809 is executed.
S806, calculating a first divergence parameter and a second divergence parameter.
S807, judging whether the first divergence parameter is smaller than the second divergence parameter.
When the first divergence parameter is smaller than the second divergence parameter, S808 is performed.
When the first divergence parameter is smaller than the second divergence parameter, storing the first voiceprint vector in the first cluster makes the overall divergence of the clusters in the voiceprint database smaller than establishing a new cluster for it would, so it is determined that the first voiceprint vector is stored in the first cluster already included in the current voiceprint database.
If the first divergence parameter is greater than the second divergence parameter, step S809 is executed.
S808, taking the cluster corresponding to the target similarity parameter as the first cluster.
The first cluster is the cluster to which the first voiceprint vector belongs, meaning the first voiceprint vector is stored in this first cluster contained in the voiceprint database.
S809, establishing a second cluster, and storing the first voiceprint vector in the second cluster.
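The branches S802 to S809 can be read as one dispatch function over the extracted first voiceprint vector. The sketch below ties together the pieces sketched earlier; cosine similarity, the threshold values, and the array-based cluster representation remain illustrative assumptions.

```python
import numpy as np

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def divergence(clusters):
    # G(c) from equation (1).
    return sum(c.shape[0] *
               np.linalg.slogdet(np.atleast_2d(np.cov(c, rowvar=True)))[1]
               for c in clusters)

def cluster_instruction_vector(m, clusters, t1=0.8, t2=0.5):
    """Steps S802-S809 for a first voiceprint vector m.

    clusters: list of 2-D arrays (rows = member voiceprint vectors);
    t1, t2: first and second similarity thresholds (values assumed).
    Returns (index of the cluster m was stored in, clusters)."""
    if not clusters:                                    # first-ever instruction
        clusters.append(m[None, :])
        return 0, clusters
    centers = [c.mean(axis=0) for c in clusters]        # S802
    sims = [cosine(m, ctr) for ctr in centers]          # S803
    t = int(np.argmax(sims))                            # S804
    if sims[t] > t1:                                    # S805 -> S808
        clusters[t] = np.vstack([clusters[t], m])
        return t, clusters
    if sims[t] < t2:                                    # S805 -> S809
        clusters.append(m[None, :])
        return len(clusters) - 1, clusters
    others = [c for i, c in enumerate(clusters) if i != t]
    g1 = divergence([np.vstack([clusters[t], m])] + others)   # S806
    g2 = divergence(clusters + [m[None, :]])
    if g1 < g2:                                         # S807 -> S808
        clusters[t] = np.vstack([clusters[t], m])
        return t, clusters
    clusters.append(m[None, :])                         # S807 -> S809
    return len(clusters) - 1, clusters
```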
It should be noted that the detailed implementation manner of the above steps is the same as or similar to the audio instruction based search method set forth in steps S401 to S404, and is not repeated herein.
In summary, in the audio-instruction-based search method provided by the present disclosure, in response to an audio instruction input by a user, the user's first voiceprint vector is extracted from the audio instruction and matched against the second voiceprint vectors stored in a voiceprint database. When the first voiceprint vector is determined to match at least one second voiceprint vector, the cluster to which the matching second voiceprint vector belongs is determined as the first cluster, which indicates that the first voiceprint vector corresponds to the first cluster, and the first voiceprint vector is then stored in the first cluster. Further, according to the user preference information corresponding to the first cluster, the preference information of the user who input the audio instruction can be determined. Through this method, the user is registered imperceptibly: while the user issues an audio instruction to the smart home, the smart home acquires and stores the user's voiceprint vector to complete registration, and once the cluster to which the voiceprint vector belongs is determined, the user's preference information can be determined so that the user's audio instruction is answered in a targeted manner, meeting users' diversified requirements and improving the user experience.
An embodiment of the present disclosure provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements each process of the audio-instruction-based search method in the foregoing method embodiments and achieves the same technical effects; the details are not repeated here to avoid repetition.
The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
An embodiment of the present disclosure provides a computer program product storing a computer program which, when executed by a processor, implements each process of the audio-instruction-based search method in the foregoing method embodiments and achieves the same technical effects; the details are not repeated here to avoid repetition.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media having computer-usable program code embodied in the medium.
In the present disclosure, the processor may be a Central Processing Unit (CPU), or may be another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
In the present disclosure, the memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read-Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
In the present disclosure, computer-readable media include permanent and non-permanent, removable and non-removable storage media. A storage medium may implement information storage by any method or technology, and the information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, Phase-change Random Access Memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It is noted that, in this document, relational terms such as "first" and "second" may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. An electronic device, comprising:
a controller configured to:
in response to an audio instruction input by a user, extracting a first voiceprint vector from the audio instruction;
under the condition that the first voiceprint vector is matched with at least one second voiceprint vector stored in advance, determining a first cluster to which the at least one second voiceprint vector belongs, wherein the first cluster comprises a plurality of second voiceprint vectors;
acquiring user preference information corresponding to the first cluster;
responding to the audio instruction according to the user preference information.
2. The electronic device of claim 1, wherein the controller is further configured to:
storing the first voiceprint vector to the first cluster.
3. The electronic device of claim 1, wherein the at least one second voiceprint vector is a first center vector of the first cluster, and wherein the first center vector is an average vector calculated from the plurality of second voiceprint vectors.
4. The electronic device of claim 3, wherein the controller is specifically configured to:
storing a plurality of clusters in advance, wherein the plurality of clusters comprise the first cluster;
obtaining center vectors of the plurality of clusters, wherein the center vectors of the plurality of clusters comprise the first center vector;
respectively calculating the similarity between each of the center vectors of the plurality of clusters and the first voiceprint vector to obtain a plurality of first similarity parameters;
determining a target similarity parameter from the plurality of first similarity parameters;
and, in a case where the target similarity parameter is greater than a first similarity threshold, taking the cluster corresponding to the target similarity parameter as the first cluster.
5. The electronic device of claim 4, wherein the controller is further configured to:
and under the condition that the target similarity parameter is smaller than a second similarity threshold value, establishing a second cluster, and storing the first voiceprint vector to the second cluster.
6. The electronic device of claim 4, wherein the controller is further configured to:
calculating a first divergence parameter and a second divergence parameter when the target similarity parameter is less than or equal to the first similarity threshold and greater than or equal to a second similarity threshold;
taking the first voiceprint vector and the voiceprint vectors in the cluster corresponding to the target similarity parameter as a cluster to be clustered;
calculating the first divergence parameter according to the cluster to be clustered and a first other cluster, wherein the first other cluster is a cluster, among the plurality of clusters, other than the cluster corresponding to the target similarity parameter;
taking the first voiceprint vector as a new cluster;
calculating the second divergence parameter according to the new cluster and the plurality of clusters, wherein the divergence parameter is used for representing the degree of cohesion between different clusters;
when the first divergence parameter is smaller than the second divergence parameter, taking the cluster corresponding to the target similarity parameter as the first cluster;
and when the first divergence parameter is larger than the second divergence parameter, establishing a second cluster, and storing the first voiceprint vector to the second cluster.
7. The electronic device of any of claims 1-6, wherein the controller is further configured to:
acquiring a first center vector of any cluster;
acquiring a second center vector of a second other cluster, the second other cluster being a cluster other than the any cluster;
calculating a similarity parameter between the first center vector and the second center vector of at least one second other cluster to obtain at least one second similarity parameter;
if the second similarity parameter is greater than or equal to a third similarity threshold, calculating a preference similarity parameter between first user preference information corresponding to the cluster to which the first center vector belongs and second user preference information corresponding to the cluster to which the second center vector belongs;
and if the preference similarity parameter is greater than a fourth similarity threshold, merging the cluster to which the first center vector belongs and the cluster to which the second center vector belongs to obtain a merged cluster.
8. A search method based on audio instructions is characterized by comprising the following steps:
in response to an audio instruction input by a user, extracting a first voiceprint vector from the audio instruction;
under the condition that the first voiceprint vector is matched with at least one second voiceprint vector stored in advance, determining a first cluster to which the at least one second voiceprint vector belongs, wherein the first cluster comprises a plurality of second voiceprint vectors;
acquiring user preference information corresponding to the first cluster;
responding to the audio instruction according to the user preference information.
9. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, implements the audio-instruction-based search method as claimed in claim 8.
10. A computer program product, wherein, when the computer program product is run on a computer, the computer is caused to implement the audio-instruction-based search method as claimed in claim 8.
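Purely as an illustration of the cluster-merging logic recited in claim 7, the following sketch merges clusters whose center vectors and user preference information are both sufficiently similar. Cosine similarity, the threshold values, the dict-based preference records, and the pref_similarity callable are all assumptions for the sketch, not features fixed by the claims.

```python
import numpy as np

THIRD_SIM_THRESHOLD = 0.9   # placeholder values; the claims do not fix them
FOURTH_SIM_THRESHOLD = 0.7

def maybe_merge(clusters, prefs, pref_similarity):
    """clusters: list of lists of voiceprint vectors; prefs: per-cluster
    preference records (dicts here); pref_similarity: an assumed callable
    scoring two preference records in [0, 1]."""
    i = 0
    while i < len(clusters):
        ci = np.mean(clusters[i], axis=0)              # first center vector
        j = i + 1
        while j < len(clusters):
            cj = np.mean(clusters[j], axis=0)          # second center vector
            sim = float(np.dot(ci, cj) /
                        (np.linalg.norm(ci) * np.linalg.norm(cj)))
            if (sim >= THIRD_SIM_THRESHOLD and
                    pref_similarity(prefs[i], prefs[j]) > FOURTH_SIM_THRESHOLD):
                clusters[i].extend(clusters.pop(j))    # merge cluster j into i
                prefs[i] = {**prefs[i], **prefs.pop(j)}  # naive preference merge
                ci = np.mean(clusters[i], axis=0)      # refresh merged center
            else:
                j += 1
        i += 1
    return clusters, prefs
```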

Priority Applications (1)

Application Number | Publication | Priority Date | Filing Date | Title
CN202210061388.0A | CN114547367A (en) | 2022-01-19 | 2022-01-19 | Electronic equipment, searching method based on audio instruction and storage medium

Applications Claiming Priority (1)

Application Number | Publication | Priority Date | Filing Date | Title
CN202210061388.0A | CN114547367A (en) | 2022-01-19 | 2022-01-19 | Electronic equipment, searching method based on audio instruction and storage medium

Publications (1)

Publication Number | Publication Date
CN114547367A | 2022-05-27

Family

ID=81671202

Family Applications (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
CN202210061388.0A | Pending | CN114547367A (en) | 2022-01-19 | 2022-01-19 | Electronic equipment, searching method based on audio instruction and storage medium

Country Status (1)

Country | Link
CN (1) | CN114547367A (en)

Similar Documents

Publication Publication Date Title
US10923130B2 (en) Electronic device and method of performing function of electronic device
US11610585B2 (en) Embedded instructions for voice user interface
US11238870B2 (en) Interaction method, electronic device, and server
US20230040938A1 (en) Method for processing voice signals of multiple speakers, and electronic device according thereto
US9953648B2 (en) Electronic device and method for controlling the same
US11100922B1 (en) System and methods for triggering sequences of operations based on voice commands
US20200135194A1 (en) Electronic device
US9953645B2 (en) Voice recognition device and method of controlling same
US20180088902A1 (en) Coordinating input on multiple local devices
US20230176813A1 (en) Graphical interface for speech-enabled processing
US10699706B1 (en) Systems and methods for device communications
US10902001B1 (en) Contact presence aggregator
US11256463B2 (en) Content prioritization for a display array
US10950221B2 (en) Keyword confirmation method and apparatus
US11830501B2 (en) Electronic device and operation method for performing speech recognition
US20230385377A1 (en) Device, method, and computer program for performing actions on iot devices
US20230362026A1 (en) Output device selection
CN114547367A (en) Electronic equipment, searching method based on audio instruction and storage medium
CN114694661A (en) First terminal device, second terminal device and voice awakening method
CN111344664B (en) Electronic apparatus and control method thereof
US20230213896A1 (en) Electronic device and operating method thereof
US20200357414A1 (en) Display apparatus and method for controlling thereof
US20240119088A1 (en) Handling Contradictory Queries on a Shared Device
KR20230103577A (en) Electronic apparatus and operating method for the electronic apparatus
CN117809652A (en) Electronic device and audio data processing method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination