CN112542173A - Voice interaction method, device, equipment and medium - Google Patents

Voice interaction method, device, equipment and medium

Info

Publication number
CN112542173A
CN112542173A
Authority
CN
China
Prior art keywords
interest field
information
target
determining
voice information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011371761.XA
Other languages
Chinese (zh)
Inventor
李卓茜
陶武超
唐光远
罗琴
张俊杰
李润静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Gree Electric Appliances Inc of Zhuhai
Zhuhai Lianyun Technology Co Ltd
Original Assignee
Gree Electric Appliances Inc of Zhuhai
Zhuhai Lianyun Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Gree Electric Appliances Inc of Zhuhai, Zhuhai Lianyun Technology Co Ltd filed Critical Gree Electric Appliances Inc of Zhuhai
Priority to CN202011371761.XA priority Critical patent/CN112542173A/en
Publication of CN112542173A publication Critical patent/CN112542173A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L17/00 Speaker identification or verification
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G10L17/22 Interactive procedures; Man-machine interfaces

Abstract

The invention discloses a voice interaction method, device, equipment and medium. According to the first identification information of the user, determined from a voiceprint feature vector, a target search library corresponding to that identification information is determined, and a target response is determined within the target search library and output. This narrows the search range when determining the target response, reduces the delay of voice interaction and improves the user experience.

Description

Voice interaction method, device, equipment and medium
Technical Field
The present invention relates to the field of voice interaction technologies, and in particular, to a voice interaction method, apparatus, device, and medium.
Background
The gradual maturation of artificial intelligence technology has promoted the application of voice interaction technology. Applying a voice interaction function to air-conditioning equipment can greatly improve the competitiveness of the equipment in an increasingly diversified market environment, and air-conditioning equipment to which a voice interaction function has been added becomes electronic equipment capable of voice interaction.
In the prior art, when voice interaction is performed with electronic equipment capable of voice interaction, the equipment in a dormant state is first awakened by a wake-up word; the electronic equipment then understands the intention of the user and makes a corresponding response based on semantic recognition technology, thereby realizing interaction between human and machine.
However, in current voice interaction, when the electronic device responds it needs to perform a global search on the text information converted from the voice information, based on an existing response model, in order to make a corresponding response. Because the search range of a global search is large, the delay of voice interaction is large and the user experience is reduced.
Disclosure of Invention
The embodiment of the invention provides a voice interaction method, a voice interaction device, voice interaction equipment and a voice interaction medium, which are used for solving the problems of large voice interaction delay and poor user experience in the prior art.
The embodiment of the invention provides a voice interaction method, which comprises the following steps:
determining first identification information of a user corresponding to a voiceprint feature vector in the collected voice information based on a voiceprint recognition model trained in advance;
determining a target search library corresponding to the first identification information according to the first identification information and the corresponding relation between the identification information and a search library generated in advance;
and determining a target response in the target search library according to the text information corresponding to the acquired voice information and outputting the target response.
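The three claimed steps (voiceprint feature vector to first identification information, identification information to target search library, response search restricted to that library) can be sketched as follows. This is a minimal illustration; all names (`answer`, `libraries`, `recognize_speaker`) and the substring-matching search are assumptions, not terms from the patent:

```python
# Hypothetical sketch of the claimed pipeline: restrict the response
# search to the identified user's own library instead of a global search.

def answer(voice_embedding, text, libraries, recognize_speaker):
    """Return a response searched only within the speaker's own library."""
    user_id = recognize_speaker(voice_embedding)   # step 1: voiceprint -> first identification info
    library = libraries.get(user_id)               # step 2: identification info -> target search library
    if library is None:
        return None                                # no pre-generated library for this user
    # step 3: search only this user's library (reduced search range)
    for keyword, response in library.items():
        if keyword in text:
            return response
    return None
```

Because only one user's keyword-to-response entries are scanned, the search range per query shrinks with the number of registered users, which is the latency benefit the claims describe.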
Further, before determining a target response in the target search library and outputting the target response according to the text information corresponding to the collected voice information, the method further includes:
determining a target language corresponding to the collected voice information based on a pre-trained language identification model;
and determining text information corresponding to the voice information of the target language according to the target language and the collected voice information.
Further, the training process for training the language identification model includes:
aiming at any sample voice information in a sample set, obtaining the sample voice information and first label information corresponding to the sample voice information, wherein the first label information identifies the language of the sample voice information;
inputting the sample voice information into an original deep learning model, and acquiring second label information of the output sample voice information;
and adjusting parameter values of all parameters of the original deep learning model according to the first label information and the second label information to obtain the trained language identification model.
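The training loop described above (obtain the first label, predict a second label, adjust parameters where the two disagree) can be sketched as follows. The patent trains an original deep learning model; here a toy perceptron over acoustic feature vectors stands in for it purely for illustration, and every name is an assumption:

```python
# Minimal sketch of the described loop: for each sample, compare the
# ground-truth first label with the model's predicted second label and
# adjust the parameters. A perceptron replaces the deep model for brevity.

def train_language_model(samples, n_langs, dim, epochs=20, lr=0.1):
    """samples: list of (feature_vector, first_label) pairs."""
    # one weight vector per language, initialised to zero
    weights = [[0.0] * dim for _ in range(n_langs)]

    def predict(x):
        # second label: the language with the highest score
        scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]
        return max(range(n_langs), key=scores.__getitem__)

    for _ in range(epochs):
        for x, first_label in samples:
            second_label = predict(x)
            if second_label != first_label:  # labels disagree: adjust parameters
                for i in range(dim):
                    weights[first_label][i] += lr * x[i]
                    weights[second_label][i] -= lr * x[i]
    return predict
```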
Further, the determining a target response in the target search library according to the text information corresponding to the collected voice information includes:
determining each first keyword of which the part of speech is a target part of speech in the text information according to the text information corresponding to the collected voice information;
determining a target interest field according to each first keyword in the text information and a corresponding relation between a keyword prestored in the target search library and the interest field;
and determining the target response of the text message based on a response model corresponding to the target interest field generated in advance.
Further, the determining a target field of interest according to each first keyword in the text information and a correspondence between a keyword pre-stored in the target search library and the field of interest includes:
determining each weight value of each first interest field corresponding to each first keyword according to each first keyword in the text information, the corresponding relation between the keyword and the interest field pre-stored in the target search library and the corresponding relation between the interest field and the weight value, and determining the first interest field with the largest weight value as the target interest field; or
Determining each first quantity of the first keywords included in each first interest field corresponding to each first keyword according to each first keyword in the text information and the corresponding relation between the keywords pre-stored in the target search library and the interest fields, and determining the first interest field with the largest first quantity as the target interest field; or
Determining a product value of a first quantity of the first keywords and the weight value included in each first interest field corresponding to each first keyword according to each first keyword in the text information, a corresponding relation between the keywords pre-stored in the target search library and the interest field and a corresponding relation between the interest field and the weight value, and determining the first interest field with the largest product value as the target interest field.
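The three alternative selection strategies above (largest weight value, largest keyword count, largest product of count and weight) can be sketched as follows. The dictionary shapes and the `strategy` parameter are illustrative assumptions:

```python
from collections import Counter

def target_field(keywords, kw_to_field, field_weight, strategy="weight"):
    """Pick the target interest field from the first keywords in the text."""
    # count how many first keywords fall into each candidate interest field
    counts = Counter(kw_to_field[k] for k in keywords if k in kw_to_field)
    if not counts:
        return None
    if strategy == "weight":   # first interest field with the largest weight value
        return max(counts, key=lambda f: field_weight[f])
    if strategy == "count":    # field containing the largest first quantity of keywords
        return max(counts, key=counts.__getitem__)
    # otherwise: largest product of keyword count and weight value
    return max(counts, key=lambda f: counts[f] * field_weight[f])
```

Note that the three strategies can disagree: a field matched by many low-weight keywords wins under "count" but may lose under "weight" or the product rule.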
Further, the method further comprises:
and aiming at each interest field in the search library, determining a first quantity of first keywords corresponding to the interest field and a second quantity of the first keywords corresponding to the interest field in the search library according to first keywords included in voice information collected within a set time period, and updating the weight value of the interest field according to the ratio of the first quantity to the second quantity.
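The described weight update, the ratio of the first quantity (matching keywords collected within the set time period) to the second quantity (keywords stored for the field in the search library), can be sketched as follows; the dictionary layout is an assumption:

```python
# Hypothetical sketch: refresh each interest field's weight from the
# ratio of recently heard keywords to keywords stored for that field.

def update_weights(recent_counts, library_counts, weights):
    """recent_counts: first quantity per field; library_counts: second quantity."""
    for field in weights:
        first_qty = recent_counts.get(field, 0)    # heard in the set time period
        second_qty = library_counts.get(field, 0)  # stored in the search library
        if second_qty:                             # avoid division by zero
            weights[field] = first_qty / second_qty
    return weights
```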
Further, if it is determined that the target search library corresponding to the first identification information does not exist, the method further includes:
if a registration request of a user is received, storing the voiceprint feature vector in the collected voice information;
determining each second interest field corresponding to each second keyword according to each second keyword with the part of speech as the target part of speech in the text information corresponding to the voice information and a corresponding relation between the pre-stored keyword and the interest field;
and generating a target search library corresponding to the first identification information according to the corresponding relation between each second keyword and each second interest field and each pre-generated response model corresponding to each second interest field.
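The registration flow above, storing the voiceprint feature vector and generating a search library from the second keywords, their interest fields, and the pre-generated response models, can be sketched as follows. All data structures here are illustrative assumptions about how such a library might be laid out:

```python
# Hypothetical sketch of registration: persist the voiceprint and build
# a per-user target search library keyed by the user's identification info.

def register_user(user_id, voiceprint, keywords, kw_to_field, field_models,
                  voiceprints, libraries):
    """On a registration request, store the voiceprint and build a library."""
    voiceprints[user_id] = voiceprint  # store the voiceprint feature vector
    # second keywords -> second interest fields (unknown keywords are dropped)
    kw_map = {k: kw_to_field[k] for k in keywords if k in kw_to_field}
    fields = set(kw_map.values())
    libraries[user_id] = {
        "keywords": kw_map,
        # attach the pre-generated response model for each interest field
        "models": {f: field_models[f] for f in fields},
    }
    return libraries[user_id]
```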
Further, the training process for training the voiceprint recognition model comprises:
aiming at any sample voice information in a sample set, obtaining the sample voice information and second identification information of a user corresponding to the sample voice information, wherein the second identification information identifies the user identity;
inputting the sample voice information into an original deep learning model, and acquiring third identification information of the output sample voice information;
and adjusting parameter values of all parameters of the original deep learning model according to the second identification information and the third identification information to obtain the trained voiceprint recognition model.
Correspondingly, an embodiment of the present invention provides a voice interaction apparatus, where the apparatus includes:
the first determining module is used for determining first identification information of the user corresponding to the voiceprint feature vector in the collected voice information based on a voiceprint recognition model which is trained in advance;
the second determining module is used for determining a target search library corresponding to the first identification information according to the first identification information and the corresponding relation between the identification information and a search library generated in advance;
and the third determining module is used for determining and outputting a target response in the target search library according to the text information corresponding to the acquired voice information.
Further, the first determining module is further configured to determine a target language type corresponding to the acquired voice information based on a pre-trained language type recognition model; and determining text information corresponding to the voice information of the target language according to the target language and the collected voice information.
Further, the apparatus further comprises:
the training module is used for acquiring the sample voice information and first label information corresponding to the sample voice information aiming at any sample voice information in a sample set, wherein the first label information identifies the language of the sample voice information; inputting the sample voice information into an original deep learning model, and acquiring second label information of the output sample voice information; and adjusting parameter values of all parameters of the original deep learning model according to the first label information and the second label information to obtain the trained language identification model.
Further, the third determining module is specifically configured to determine, according to text information corresponding to the acquired voice information, each first keyword whose part of speech in the text information is a target part of speech; determining a target interest field according to each first keyword in the text information and a corresponding relation between a keyword prestored in the target search library and the interest field; and determining the target response of the text message based on a response model corresponding to the target interest field generated in advance.
Further, the third determining module is specifically configured to determine, according to each first keyword in the text information, a correspondence between a keyword and an interest field pre-stored in the target search library, and a correspondence between an interest field and a weight value, each weight value of each first interest field corresponding to each first keyword, and determine the first interest field with the largest weight value as the target interest field; or determining each first quantity of the first keywords included in each first interest field corresponding to each first keyword according to each first keyword in the text information and the corresponding relation between the keywords pre-stored in the target search library and the interest fields, and determining the first interest field with the largest first quantity as the target interest field; or determining a product value of a first number of the first keywords and the weight value included in each first interest field corresponding to each first keyword according to each first keyword in the text information, a corresponding relation between the keywords and the interest fields pre-stored in the target search library and a corresponding relation between the interest fields and the weight values, and determining the first interest field with the largest product value as the target interest field.
Further, the apparatus further comprises:
and the updating module is used for determining a first quantity of the first keywords corresponding to the interest field and a second quantity of the first keywords corresponding to the interest field in the search library according to the first keywords included in the collected voice information in a set time period aiming at each interest field in the search library, and updating the weight value of the interest field according to the ratio of the first quantity to the second quantity.
Further, the apparatus further comprises:
the registration module is used for storing the voiceprint feature vector in the collected voice information if the target search library corresponding to the first identification information does not exist, and if a registration request of a user is received; determining each second interest field corresponding to each second keyword according to each second keyword with the part of speech as the target part of speech in the text information corresponding to the voice information and a corresponding relation between the pre-stored keyword and the interest field; and generating a target search library corresponding to the first identification information according to the corresponding relation between each second keyword and each second interest field and each pre-generated response model corresponding to each second interest field.
Further, the training module is specifically configured to, for any sample voice information in a sample set, obtain the sample voice information and second identification information of a user corresponding to the sample voice information, where the second identification information identifies a user identity; inputting the sample voice information into an original deep learning model, and acquiring third identification information of the output sample voice information; and adjusting parameter values of all parameters of the original deep learning model according to the second identification information and the third identification information to obtain the trained voiceprint recognition model.
Accordingly, an embodiment of the present invention provides an electronic device, which includes a processor and a memory, where the memory is used to store program instructions, and the processor is used to implement the steps of any one of the above voice interaction methods when executing a computer program stored in the memory.
Accordingly, an embodiment of the present invention provides a computer-readable storage medium, which stores a computer program, and the computer program, when executed by a processor, implements the steps of any one of the above voice interaction methods.
The embodiments of the invention provide a voice interaction method, apparatus, device and medium. By determining the target search library corresponding to the first identification information of the user and determining and outputting the target response within that library, the search range when determining the target response is reduced, the delay of voice interaction is reduced, and the user experience is improved.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on them without inventive effort.
Fig. 1 is a schematic process diagram of a voice interaction method according to an embodiment of the present invention;
fig. 2 is a schematic process diagram of a voice interaction method according to an embodiment of the present invention;
FIG. 3 is a process diagram of a complete voice interaction method according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a voice interaction apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
In order to reduce delay of voice interaction and improve user experience, embodiments of the present invention provide a voice interaction method, apparatus, device, and medium.
Example 1:
fig. 1 is a schematic process diagram of a voice interaction method according to an embodiment of the present invention, where the process includes the following steps:
s101: and determining first identification information of the user corresponding to the voiceprint feature vector in the collected voice information based on the voiceprint recognition model trained in advance.
The voice interaction method provided by the embodiment of the invention is applied to electronic equipment capable of voice interaction, where the electronic equipment may be intelligent terminal equipment capable of voice interaction, such as a mobile phone, a tablet computer, a PC (personal computer), an intelligent air conditioner, an intelligent sound box and the like.
In order to reduce the delay of voice interaction, in the embodiment of the present invention, the electronic device stores, for each user registered in advance, a search library corresponding to the user, so that when voice information is collected, a target response of the voice information may be determined based on the search library of the user corresponding to the voice information.
The electronic device may determine, for each user registered in advance, first identification information of the user based on a voiceprint feature vector of the user, where the first identification information may be information that uniquely identifies the user, such as a voiceprint feature vector, and specifically, this is not limited in the embodiment of the present invention.
Because the electronic equipment cannot perform voice interaction while in a dormant state, if the equipment is dormant it must first receive a fixed wake-up word and be awakened from the dormant state. If the electronic equipment is in a non-dormant state, then for the collected voice information, the equipment stores a pre-trained voiceprint recognition model in order to determine the first identification information of the user corresponding to the voiceprint feature vector in the collected voice information: the voice information is input into the pre-trained voiceprint recognition model, which processes it to determine the first identification information of the user corresponding to the voiceprint feature vector of the voice information.
The voiceprint recognition model may be an identity vector (i-vector) model, an x-vector model based on a time delay neural network (TDNN), a residual network (ResNet), or another neural network model. Preferably, the voiceprint recognition model in the embodiment of the present invention is an x-vector model.
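As an illustration of how such a model's output might be used, the following sketch identifies a user by cosine similarity between an input embedding (e.g. an x-vector) and enrolled voiceprint feature vectors. The function and data names are assumptions, not part of the patent, and the vectors are assumed non-zero:

```python
import math

def identify_user(embedding, enrolled):
    """Return the identification info of the enrolled user whose stored
    voiceprint feature vector is most similar (cosine) to the input."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)  # assumes non-zero vectors
    return max(enrolled, key=lambda uid: cos(embedding, enrolled[uid]))
```

A real system would additionally threshold the similarity so that an unregistered speaker is rejected rather than mapped to the nearest enrolled user.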
S102: and determining a target search library corresponding to the first identification information according to the first identification information and the corresponding relation between the identification information and a search library generated in advance.
Because the electronic device stores a search library corresponding to each pre-registered user, in order to determine a target search library corresponding to the first identification information, in the embodiment of the present invention, the electronic device stores a correspondence between the identification information and a pre-generated search library in advance; the search library is determined according to historical search of the user, and the search library comprises keywords corresponding to the interest field of the user and a response model.
After the electronic equipment determines the first identification information, the electronic equipment determines the first identification information in the corresponding relation according to the corresponding relation between the pre-stored identification information and a pre-generated search library, and determines the search library corresponding to the first identification information as a target search library.
S103: and determining a target response in the target search library according to the text information corresponding to the acquired voice information and outputting the target response.
After the electronic equipment determines the target search library, it converts the collected voice information into text information. The method for converting voice information into text information belongs to the prior art and is not described in detail in the embodiments of the present invention.
And according to the text information converted by the voice information, the electronic equipment determines a target response corresponding to the text information in the target search library and outputs the target response. The process of determining a response according to the text information belongs to the prior art, and is not described herein in detail in the embodiments of the present invention.
When the electronic device outputs the target response, the target response may be played by voice, displayed on a display screen of the electronic device, or both displayed on the display screen and played by voice at the same time.
According to the embodiment of the invention, the target search library corresponding to the first identification information can be determined according to the first identification information of the user corresponding to the determined voiceprint feature vector, and the target response is determined and output in the target search library, so that the search range when the target response is determined is reduced, the delay of voice interaction is reduced, and the user experience is improved.
Example 2:
in order to improve flexibility of voice interaction, on the basis of the above embodiment, in an embodiment of the present invention, before determining a target response in the target search library according to text information corresponding to the collected voice information and outputting the target response, the method further includes:
determining a target language corresponding to the collected voice information based on a language identification model trained in advance;
and determining text information corresponding to the voice information of the target language according to the target language and the collected voice information.
In the prior art, voice interaction is usually performed in a single language. However, globalization has accelerated the flow and fusion of populations around the world, so situations arise in which multiple languages are used within one household: for example, family members may come from the United Kingdom, China and Japan respectively; may be of Han, Uygur and Kazakh ethnicity respectively; or may speak Mandarin, Cantonese and so on respectively. Single-language voice interaction cannot meet the voice interaction requirements of such multi-ethnic, multi-national fusion, so the flexibility of voice interaction is poor.
In the embodiment of the invention, the electronic equipment can also recognize voice information of different languages. The languages may be different national languages, such as Chinese, English, Russian, Vietnamese, Japanese, Korean and Uygur, or different dialects of the same language, such as Mandarin, Cantonese and the Minnan dialect.
In order to realize that the electronic equipment recognizes the voice information of different languages, a trained language recognition model is stored in the electronic equipment in advance. The electronic equipment inputs the voice information into the language recognition model trained in advance, and the language recognition model processes the voice information to determine a target language corresponding to the voice information.
The language identification model may be an identity vector (i-vector) model, an x-vector model based on a time delay neural network (TDNN), a residual network (ResNet), or another neural network model. The language identification model and the voiceprint recognition model may be of the same model type or of different model types; preferably, in order to reduce resource consumption, they are of the same model type, i.e. the language identification model in the embodiment of the present invention is also an x-vector model.
According to the determined target language and the collected voice information, the electronic device converts the voice information of the target language into corresponding text information. The method for converting the voice information of each language into text information belongs to the prior art and is not limited by the embodiments of the present invention.
The voice interaction method according to the embodiment of the present invention is described below with a specific embodiment, and fig. 2 is a schematic process diagram of the voice interaction method according to the embodiment of the present invention, where the method includes the following steps:
s201: and determining first identification information of the user corresponding to the voiceprint feature vector in the collected voice information based on the voiceprint recognition model trained in advance.
S202: and determining a target search library corresponding to the first identification information according to the first identification information and the corresponding relation between the identification information and a search library generated in advance.
S203: and determining a target language corresponding to the collected voice information based on a pre-trained language identification model.
S204: and determining text information corresponding to the voice information of the target language according to the target language and the collected voice information.
S205: and determining a target response in the target search library according to the text information corresponding to the acquired voice information and outputting the target response.
Example 3:
In order to implement the training of the language identification model, on the basis of the above embodiments, in an embodiment of the present invention, the training process of the language identification model includes:
aiming at any sample voice information in a sample set, obtaining the sample voice information and first label information corresponding to the sample voice information, wherein the first label information identifies the language of the sample voice information;
inputting the sample voice information into an original deep learning model, and acquiring second label information of the output sample voice information;
and adjusting parameter values of all parameters of the original deep learning model according to the first label information and the second label information to obtain the trained language identification model.
In order to implement training of a language recognition model, in an embodiment of the present invention, a sample set for training is stored, sample voice information in the sample set includes sample voice information of each language, and first tag information of the sample voice information in the sample set is manually pre-labeled, where the first tag information is used to identify the language of the sample voice information.
In the embodiment of the present invention, after any sample voice information in a sample set and first tag information corresponding to the sample voice information are acquired, the sample voice information is input into an original deep learning model, and the original deep learning model outputs second tag information of the sample voice information, where the second tag information identifies a language of the sample voice information recognized by the original deep learning model.
And after determining second label information of the sample voice information according to the original deep learning model, training the original deep learning model according to the second label information and the first label information of the sample voice information so as to adjust parameter values of various parameters of the original deep learning model.
The above operation is performed on each piece of sample voice information contained in the sample set to train the original deep learning model, and the trained language identification model is obtained when a preset condition is met. The preset condition may be that, after the sample voice information in the sample set has been trained on by the original deep learning model, the number of samples whose first label information and second label information are consistent is greater than a set number; or that the number of iterations of training the original deep learning model reaches the set maximum number of iterations, and so on. Specifically, in the embodiment of the present invention, the open-source toolkit Kaldi is selected to train the original deep learning model.
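A minimal sketch of this per-sample training loop and its stopping conditions, with `predict` and `update` standing in for the real deep-learning model (which the text trains with Kaldi); the function name and signature are illustrative assumptions:

```python
def train_until_condition(samples, predict, update, min_correct, max_iters):
    """Toy training loop mirroring the described procedure: for every sample,
    compare the model's output (second label) against the ground truth
    (first label) and adjust parameters on mistakes; stop once the number
    of consistent labels reaches a set number, or after a maximum number
    of iterations."""
    iterations = 0
    for _ in range(max_iters):
        iterations += 1
        correct = 0
        for features, first_label in samples:
            second_label = predict(features)     # model's hypothesised language
            if second_label == first_label:
                correct += 1
            else:
                update(features, first_label)    # adjust parameter values
        if correct >= min_correct:               # preset condition met
            break
    return iterations
```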
As a possible implementation, when the original deep learning model is trained, the sample voice information in the sample set may be divided into training samples and test samples; the original deep learning model is trained on the training samples, and the reliability of the trained language identification model is then tested on the test samples. Because the language identification model performs better the larger the data volume is, when the training samples in the sample set are few, they can be preprocessed to obtain more training samples, specifically by data-enhancement preprocessing, which includes changing the speed, the duration, and the like of the training sample voice information.
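The speed-change style of data enhancement mentioned above can be sketched as follows; the linear-interpolation resampling and the factors 0.9 and 1.1 are illustrative simplifications of what a real toolkit would do:

```python
import numpy as np

def speed_perturb(waveform, factor):
    """Simulate a speed change by resampling the waveform with linear
    interpolation; a factor below 1 slows the utterance down (more
    samples), a factor above 1 speeds it up (fewer samples)."""
    n_out = int(round(len(waveform) / factor))
    old_idx = np.arange(len(waveform))
    new_idx = np.linspace(0, len(waveform) - 1, n_out)
    return np.interp(new_idx, old_idx, waveform)

def augment(waveform, factors=(0.9, 1.1)):
    # one augmented copy per speed factor, plus the original utterance
    wave = np.asarray(waveform, dtype=float)
    return [wave] + [speed_perturb(wave, f) for f in factors]
```

Each original utterance thus yields several training samples, enlarging a small training set.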
When the trained language identification model identifies the collected voice information, the feature vector of the voice information is firstly determined, and the analysis is carried out based on the feature vector, so that the identification result is determined. In order to obtain the target feature vector of the voice information, in the embodiment of the present invention, the voice information is processed based on the language identification model trained in advance, and the target feature vector of the voice information is obtained.
When the language identification model is of a different model type, the obtained target feature vector of the voice information is also of a different type; for example, the target feature vector may be a Mel Frequency Cepstral Coefficient (MFCC) feature, a Bottleneck Feature (BNF), or a Mel-scale Filter Bank (Fbank) feature. Preferably, in the embodiment of the present invention, the BNF feature vector is selected as the target feature vector, because BNF features have better robustness and noise resistance.
After the target feature vector of the voice information is determined, the target feature vector is input into the back-end classifier of the language identification model, and the language of the feature vector whose similarity to the target feature vector is greater than a set threshold and whose similarity value is highest is determined as the target language of the voice information.
The back-end classifier may be a generative model or a discriminant model: the generative model may be a traditional Probabilistic Linear Discriminant Analysis (PLDA) model, and the discriminant model may be eXtreme Gradient Boosting (XGBoost), Random Forest (RF), Support Vector Machine (SVM), Cosine Distance Scoring (CDS), and the like. Preferably, the back-end classifier selects the PLDA model.
Specifically, in the embodiment of the present invention, the electronic device inputs the voice information into the language identification model trained in advance and extracts the embedding feature vector from the penultimate layer of the language identification model's TDNN network, where the dimension of the embedding feature vector is 512.
The 512-dimensional embedding feature vector is a high-dimensional feature vector. When matching such high-dimensional feature vectors, the excessive feature dimensionality makes it difficult to accurately locate the key features, and the larger dimensionality increases the amount of computation when calculating similarity, which slows down feature-vector matching. Therefore, in the embodiment of the present invention, after the language identification model determines the high-dimensional feature vector of the voice information, it also performs dimension-reduction processing on that vector to obtain a low-dimensional feature vector of the voice information.
When the language identification model performs dimension reduction on the 512-dimensional embedding feature vector, a 23-dimensional embedding vector can be obtained using a prior-art method such as Principal Component Analysis (PCA) or Singular Value Decomposition (SVD). The 23-dimensional embedding feature vector is then input into the PLDA back-end classifier of the language identification model, which determines as the target language of the voice information the language of the feature vector whose similarity to the 23-dimensional embedding vector is greater than a set threshold and whose similarity value is highest.
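A minimal sketch of the described 512-to-23 dimension reduction via PCA, computed here with NumPy's SVD on mean-centred data (the text permits either PCA or SVD; the PLDA scoring stage is omitted):

```python
import numpy as np

def pca_reduce(embeddings, out_dim=23):
    """Reduce a batch of high-dimensional embeddings (e.g. 512-d vectors
    from the TDNN's penultimate layer) to `out_dim` dimensions with PCA."""
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:out_dim]
    return centered @ components.T
```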
Example 4:
In order to improve the efficiency of determining the target response, on the basis of the foregoing embodiments, in an embodiment of the present invention, the determining the target response in the target search library according to the text information corresponding to the collected voice information includes:
determining each first keyword of which the part of speech is a target part of speech in the text information according to the text information corresponding to the collected voice information;
determining a target interest field according to each first keyword in the text information and a corresponding relation between a keyword prestored in the target search library and the interest field;
and determining the target response of the text message based on a response model corresponding to the target interest field generated in advance.
Since the interest fields in which a user frequently performs voice interaction are relatively fixed (for example, the music field, the basketball field, the celebrity field, and the like), after the electronic device determines the target search library, in order to improve the efficiency of determining the target response in it, the electronic device may first determine the target interest field corresponding to the voice information.
After the electronic device determines the text information corresponding to the collected voice information, in order to determine the target interest field corresponding to the voice information, it determines the part of speech of each word in the text information and takes the words whose part of speech is the target part of speech as the keywords of the text information. For example, the target part of speech includes nouns, numerals, and the like.
In order to determine a target interest field corresponding to the text information, a corresponding relation between keywords and the interest field is also pre-stored in a target search library of the electronic device, the interest field corresponding to each first keyword is determined according to each determined first keyword in the text information, and the target interest field corresponding to the text information is determined from the interest field corresponding to each first keyword.
Specifically, the electronic device may determine, according to the interest field corresponding to each first keyword, the interest field corresponding to any one of the first keywords as the target interest field.
After the target interest field corresponding to the text information is determined, in order to determine a target response corresponding to the text information, a response model corresponding to the target interest field which is generated in advance is also stored in a target search library of the electronic device, and the electronic device determines the target response corresponding to the text information after inputting the text information into the target response model corresponding to the target interest field.
Example 5:
In order to determine a target interest field corresponding to text information, on the basis of the foregoing embodiments, in an embodiment of the present invention, the determining a target interest field according to each first keyword in the text information and a correspondence between a keyword prestored in the target search library and an interest field includes:
determining each weight value of each first interest field corresponding to each first keyword according to each first keyword in the text information, a corresponding relation between a keyword and an interest field pre-stored in the target search library and a corresponding relation between the interest field and the weight value, and determining the first interest field with the largest weight value as a target interest field, wherein each interest field in the corresponding relation has a corresponding weight value; or
Determining each first quantity of the first keywords included in each first interest field corresponding to each first keyword according to each first keyword in the text information and the corresponding relation between the keywords pre-stored in the target search library and the interest fields, and determining the first interest field with the largest first quantity as the target interest field; or
Determining a product value of a first quantity of the first keywords and the weight value included in each first interest field corresponding to each first keyword according to each first keyword in the text information, a corresponding relation between the keywords pre-stored in the target search library and the interest field and a corresponding relation between the interest field and the weight value, and determining the first interest field with the largest product value as the target interest field.
After the electronic device determines each first keyword in the text information, because the target search library stores the correspondence between keywords and interest fields as well as the correspondence between interest fields and weight values (each interest field in the correspondence has a corresponding weight value), the electronic device can determine each first interest field corresponding to each first keyword; from each determined first interest field and its weight value, the electronic device can determine the weight value of each first interest field.
Because the weight value represents the search frequency of the user for the interest field, after the weight value of each first interest field is determined, the electronic device may determine the maximum weight value, and determine the first interest field corresponding to the maximum weight value as the target interest field corresponding to the text information.
When the corresponding relation between the interest field and the weight value does not exist, in the embodiment of the invention, the electronic equipment can also determine the target interest field corresponding to the text information according to the number of the determined first keywords.
Specifically, after determining each first keyword in the text information, the electronic device may determine each first interest field corresponding to each first keyword according to a correspondence relationship between the keyword and the interest field stored in the target search library.
Because multiple first keywords may correspond to the same first interest field, for each first interest field the electronic device may determine the first number of first keywords corresponding to it. The greater the first number, the greater the possibility that the first interest field is the target interest field corresponding to the text information; therefore, the electronic device determines the maximum first number and determines the first interest field corresponding to it as the target interest field of the text information.
When there is a corresponding relationship between the interest field and the weight value, in the embodiment of the present invention, in order to improve the accuracy of determining the target interest field corresponding to the text information, the electronic device may further determine the target interest field corresponding to the text information by comprehensively considering the weight value and the first quantity.
Specifically, after the electronic device determines each first keyword in the text information, since the target search library stores the corresponding relationship between the keyword and the interest field and the corresponding relationship between the interest field and the weight value, the electronic device can determine each first interest field corresponding to each first keyword; according to each determined first interest field and the weight value of the interest field, the electronic equipment can determine the weight value of each first interest field; for each first interest field, the electronic device may determine a first number of first keywords corresponding to the first interest field.
Because a greater first number means a greater possibility that the first interest field is the target interest field corresponding to the text information, and a greater weight value means a higher frequency of that field in the user's historical searches, when the weight value and the first number are considered together, the electronic device determines, for each first interest field corresponding to each first keyword, the product of the first number of first keywords included in that field and its weight value; it then determines the maximum product value and determines the first interest field corresponding to the maximum product value as the target interest field of the text information.
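The three selection rules above can be sketched together in one function; the function name, signature, and dictionary shapes are illustrative assumptions, not structures named by this application:

```python
def pick_interest_field(first_keywords, kw_to_field, field_weight=None,
                        strategy="product"):
    """Select the target interest field by (a) largest weight value,
    (b) largest keyword count (first number), or (c) largest
    count-times-weight product, as described in the text."""
    counts = {}
    for kw in first_keywords:
        field = kw_to_field.get(kw)
        if field is not None:
            counts[field] = counts.get(field, 0) + 1   # first number per field
    if not counts:
        return None
    if strategy == "count" or field_weight is None:
        scores = counts
    elif strategy == "weight":
        scores = {f: field_weight[f] for f in counts}
    else:  # "product": first number times weight value
        scores = {f: n * field_weight[f] for f, n in counts.items()}
    return max(scores, key=scores.get)
```

Note that the three rules can disagree: a field matched by many keywords may still lose under the weight or product rule if its historical search weight is low.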
Example 6:
In order to improve the accuracy of the weight values of the interest fields, on the basis of the foregoing embodiments, in an embodiment of the present invention, the method further includes:
and aiming at each interest field in the search library, determining a first quantity of first keywords corresponding to the interest field and a second quantity of the first keywords corresponding to the interest field in the search library according to first keywords included in voice information collected within a set time period, and updating the weight value of the interest field according to the ratio of the first quantity to the second quantity.
In order to improve the accuracy of the weight value of each interest field in the search base of the user, in the embodiment of the present invention, the electronic device may further update the weight value of the interest field in the search base every time a set time period elapses.
Specifically, the electronic device may determine, from the voice information collected within the set time period, all the first keywords contained in that voice information. Then, for each interest field in the search library and for each such first keyword, it checks, according to the correspondence between keywords and interest fields stored in the search library, whether that first keyword is among the keywords of the interest field; if so, the first number is increased by 1. In this way, the first number of first keywords corresponding to each interest field is finally determined.
The second number is then determined as the sum of the first numbers over all interest fields included in the search library, i.e., the total number of keyword matches across the fields. From the first number of each interest field and the second number, the electronic device determines the ratio of the first number to the second number and updates the weight value of that interest field based on this ratio.
Specifically, when the weighted value of the interest field is updated according to the ratio, the ratio may be determined as the updated weighted value of the interest field, or an average value of the ratio and the weighted value may be determined, and the average value is determined as the updated weighted value of the interest field.
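A sketch of this weight update covering both variants (the ratio itself, or the average of the ratio and the previous weight); the data shapes are assumptions:

```python
def update_weights(period_keywords, field_keywords, old_weights, average=True):
    """Recompute each interest field's weight from the first keywords
    collected over the set time period: the first number counts keyword
    hits in that field, the second number totals hits over all fields,
    and the new weight is either the ratio itself or the average of the
    ratio and the previous weight."""
    first = {f: sum(1 for kw in period_keywords if kw in kws)
             for f, kws in field_keywords.items()}
    second = sum(first.values())                      # total matched keywords
    new_weights = {}
    for f, n in first.items():
        ratio = n / second if second else 0.0
        new_weights[f] = ((ratio + old_weights.get(f, 0.0)) / 2
                          if average else ratio)
    return new_weights
```

The averaging variant changes weights more gradually, so a single unusual period of searches does not overwrite the user's long-term profile.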
Example 7:
In order to reduce the delay of voice interaction, on the basis of the foregoing embodiments, in an embodiment of the present invention, if it is determined that there is no target search library corresponding to the first identification information, the method further includes:
if a registration request of a user is received, storing the voiceprint feature vector in the collected voice information;
determining each second interest field corresponding to each second keyword according to each second keyword with the part of speech as the target part of speech in the text information corresponding to the voice information and a corresponding relation between the pre-stored keyword and the interest field;
and generating a target search library corresponding to the first identification information according to the corresponding relation between each second keyword and each second interest field and each pre-generated response model corresponding to each second interest field.
When the electronic device determines that no target search library corresponding to the first identification information exists, this indicates that the electronic device has not saved a search library for this user. Therefore, if the electronic device receives a registration request from the user, it can generate the user's target search library according to the collected voice information.
After the electronic device receives the user's registration request, in order to identify the corresponding user from subsequently collected voice information, it also stores the voiceprint feature vector in the collected voice information. Specifically, the voiceprint feature vector in the collected voice information is determined by the pre-trained voiceprint recognition model. Preferably, when the electronic device stores the voiceprint feature vector, the collected voice information may include utterances of three different durations: shorter than 3 seconds, between 3 and 10 seconds, and longer than 10 seconds.
In order to generate the search base of the user, the electronic equipment determines the part of speech of each word in the text information according to the text information corresponding to the collected voice information, determines each second keyword with the part of speech being a target part of speech according to the part of speech of each word, and determines each second interest field corresponding to each second keyword according to the determined second keyword and the corresponding relation between the pre-stored keyword and the interest field.
Because multiple second keywords may correspond to the same second interest field, the electronic device may determine, for each second interest field, the second keywords corresponding to it, i.e., the correspondence between each second keyword and each second interest field, which is a many-to-one correspondence.
And the electronic equipment stores the corresponding relation and each response model according to the corresponding relation between each second keyword and each second interest field and each response model corresponding to each pre-generated second interest field, so as to generate a target search library corresponding to the first identification information. Each response model corresponding to each second interest field is generated in advance, and the response models corresponding to the same interest field in the search libraries of different users are the same response model.
The maximum number of the interest fields that can be included in the search base of the user is preset, and may be any number, and preferably, the maximum number is set to 10 in order to improve the efficiency of the voice interaction.
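Registration and search-library generation can be sketched as follows; the returned dictionary structure is an assumption, while the many-to-one keyword-to-field mapping, the shared pre-generated response models, and the cap of 10 fields follow the text:

```python
def build_search_library(second_keywords, kw_to_field, response_models,
                         max_fields=10):
    """Sketch of registration: map each second keyword from the new user's
    utterance to its interest field (many-to-one) and attach the shared,
    pre-generated response model of each field, capping the number of
    fields at the preset maximum (preferably 10)."""
    kw_field = {kw: kw_to_field[kw]
                for kw in second_keywords if kw in kw_to_field}
    fields = sorted(set(kw_field.values()))[:max_fields]
    return {
        "keyword_to_field": {kw: f for kw, f in kw_field.items() if f in fields},
        "response_models": {f: response_models[f] for f in fields},
    }
```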
Example 8:
In order to implement training of a voiceprint recognition model, on the basis of the above embodiments, in an embodiment of the present invention, the training process of the voiceprint recognition model includes:
aiming at any sample voice information in a sample set, obtaining the sample voice information and second identification information of a user corresponding to the sample voice information, wherein the second identification information identifies the user identity;
inputting the sample voice information into an original deep learning model, and acquiring third identification information of the output sample voice information;
and adjusting parameter values of all parameters of the original deep learning model according to the second identification information and the third identification information to obtain the trained voiceprint recognition model.
In order to implement training of the voiceprint recognition model, in the embodiment of the present invention, a sample set for training is stored, sample voice information in the sample set includes sample voice information of each user, second identification information of the sample voice information in the sample set is manually pre-labeled, and the second identification information is used for identifying an identity of the user.
In the embodiment of the present invention, after any sample voice information in a sample set and second identification information corresponding to the sample voice information are acquired, the sample voice information is input into an original deep learning model, and the original deep learning model outputs third identification information of the sample voice information. Wherein the third identification information identifies the identity of the user of the sample speech information identified by the original deep learning model.
After determining the third identification information of the sample voice information according to the original deep learning model, training the original deep learning model according to the third identification information and the second identification information of the sample voice information to adjust parameter values of various parameters of the original deep learning model.
The above operation is performed on each piece of sample voice information contained in the sample set to train the original deep learning model, and the trained voiceprint recognition model is obtained when a preset condition is met. The preset condition may be that, after the sample voice information in the sample set has been trained on by the original deep learning model, the number of samples whose second identification information and third identification information are consistent is greater than a set number; or that the number of iterations of training the original deep learning model reaches the set maximum number of iterations, and so on. Specifically, the embodiments of the present invention are not limited in this respect.
As a possible implementation, when the original deep learning model is trained, the sample voice information in the sample set may be divided into training samples and test samples; the original deep learning model is trained on the training samples, and the reliability of the trained voiceprint recognition model is then tested on the test samples. Because the performance of the voiceprint recognition model is directly related to the amount of training data, and the model is better the larger the data volume is, when the training samples in the sample set are few, they can be preprocessed to obtain more training samples, specifically by data-enhancement preprocessing, which includes changing the speed, the duration, and the like of the training sample voice information.
When the trained voiceprint recognition model performs voiceprint recognition on the collected voice information, it first determines the feature vector of the voice information and performs analysis based on that feature vector to determine the recognition result. In order to obtain the target feature vector of the voice information, in the embodiment of the present invention, the voice information is processed based on the voiceprint recognition model trained in advance to obtain the target feature vector of the voice information.
When the voiceprint recognition model is of a different model type, the obtained target Feature vector of the speech information is also of a different type, for example, the target Feature vector may be a Mel Frequency Cepstral Coefficient (MFCC), a Bottleneck layer Feature (BNF), or a Mel-scale Filter Bank (Fbank) Feature. Preferably, in the embodiment of the present invention, the target feature vector is an Fbank feature vector.
After the target feature vector of the voice information is determined, the target feature vector is input into the back-end classifier of the voiceprint recognition model, and the identification information of the user whose voiceprint feature vector has a similarity to the target feature vector greater than a set threshold and has the highest similarity value is determined as the target identification information of the voice information.
The back-end classifier may be a generative model or a discriminant model: the generative model may be a traditional Probabilistic Linear Discriminant Analysis (PLDA) model, and the discriminant model may be eXtreme Gradient Boosting (XGBoost), Random Forest (RF), Support Vector Machine (SVM), Cosine Distance Scoring (CDS), and the like. Preferably, the back-end classifier selects the PLDA model.
Specifically, in the embodiment of the present invention, the electronic device inputs voice information into the pre-trained voiceprint recognition model, and extracts an embedding feature vector from a penultimate layer of the TDNN network of the voiceprint recognition model, where a dimension of the embedding feature vector is 512 dimensions.
After determining the high-dimensional feature vector of the voice information, the voiceprint recognition model also needs to perform dimensionality reduction on the high-dimensional feature vector, so as to obtain the low-dimensional feature vector of the voice information.
When the voiceprint recognition model performs dimension reduction on the 512-dimensional embedding feature vector, a 23-dimensional embedding vector can be obtained using a prior-art method such as Principal Component Analysis (PCA) or Singular Value Decomposition (SVD). The 23-dimensional embedding feature vector is then input into the PLDA back-end classifier of the voiceprint recognition model, which determines as the target identification information of the voice information the identification information of the feature vector whose similarity to the 23-dimensional embedding vector is greater than the set threshold and whose similarity value is highest.
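For illustration, here is Cosine Distance Scoring (CDS), one of the listed back-end options and simpler than the preferred PLDA, applied to dimension-reduced embeddings; the enrollment-dictionary shape and the default threshold are assumptions:

```python
import numpy as np

def identify_speaker(probe, enrolled, threshold=0.5):
    """Return the enrolled identity whose stored embedding is most similar
    (by cosine similarity) to the probe embedding, provided the best
    similarity exceeds the set threshold; otherwise no identity matches."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = {uid: cos(probe, emb) for uid, emb in enrolled.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else None
```

The threshold is what distinguishes "closest enrolled user" from "no registered user", which is the case handled by the registration flow of Example 7.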
Example 9:
the voice interaction method according to the embodiment of the present invention is described below with a complete embodiment, and fig. 3 is a schematic process diagram of the complete voice interaction method according to the embodiment of the present invention, as shown in fig. 3, the method includes the following steps:
s301: and determining first identification information of the user corresponding to the voiceprint feature vector in the collected voice information based on the voiceprint recognition model trained in advance.
S302: and if the target search library corresponding to the first identification information is determined according to the first identification information and the corresponding relation between the identification information and the pre-generated search library, performing S303, and if not, performing S311.
S303: and determining a target language corresponding to the collected voice information based on a pre-trained language identification model.
S304: and determining text information corresponding to the voice information of the target language according to the target language and the collected voice information.
S305: determining each first keyword of which the part of speech is a target part of speech in the text information according to the text information corresponding to the collected voice information; and determining the target interest field according to each first keyword in the text information and the corresponding relation between the keyword and the interest field pre-stored in the target search library, and performing any one of the steps S306, S307 and S308.
S306: determining each weight value of each first interest field corresponding to each first keyword according to each first keyword in the text information, the corresponding relation between the keyword and the interest field pre-stored in the target search library, and the corresponding relation between the interest field and the weight value, determining the first interest field with the largest weight value as the target interest field, and performing S309.
S307: determining each first quantity of the first keywords included in each first interest field corresponding to each first keyword according to each first keyword in the text information and the corresponding relation between the keywords pre-stored in the target search library and the interest fields, determining the first interest field with the largest first quantity as the target interest field, and performing S309.
S308: determining a product value of a first number of the first keywords included in each first interest field corresponding to each first keyword and the weight value according to each first keyword in the text information, a corresponding relation between the keywords pre-stored in the target search library and the interest field and a corresponding relation between the interest field and the weight value, determining the first interest field with the largest product value as the target interest field, and performing S309.
S309: and determining the target response to the text information based on a response model corresponding to the target interest field generated in advance.
S310: according to each interest field in the search library, determining a first number of first keywords corresponding to the interest field and a second number of the first keywords corresponding to the interest field in the search library according to first keywords included in voice information collected within a set time period, and updating the weight value of the interest field according to the ratio of the first number to the second number.
S311: if the target search library corresponding to the first identification information does not exist and a registration request of the user is received, storing the voiceprint feature vector in the collected voice information; determining each second interest field corresponding to each second keyword according to each second keyword whose part of speech is the target part of speech in the text information corresponding to the voice information and a pre-stored correspondence between keywords and interest fields; and generating a target search library corresponding to the first identification information according to the correspondence between each second keyword and each second interest field and each pre-generated response model corresponding to each second interest field.
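The three alternative selection strategies in S306, S307, and S308 (largest weight value, largest keyword count, and largest count-times-weight product) can be sketched as follows. The correspondences `KEYWORD_TO_FIELD` and `FIELD_WEIGHT` are hypothetical stand-ins for the keyword-to-interest-field and interest-field-to-weight correspondences pre-stored in the target search library.

```python
from collections import Counter

# Hypothetical correspondences stored in a user's target search library.
KEYWORD_TO_FIELD = {"song": "music", "singer": "music", "recipe": "cooking"}
FIELD_WEIGHT = {"music": 0.8, "cooking": 0.5}

def target_field(first_keywords, strategy="product"):
    """Pick the target interest field from the first keywords in the text
    information, using one of the three strategies of S306-S308."""
    counts = Counter(KEYWORD_TO_FIELD[k] for k in first_keywords
                     if k in KEYWORD_TO_FIELD)
    if not counts:
        return None
    if strategy == "weight":   # S306: first interest field with largest weight
        return max(counts, key=lambda f: FIELD_WEIGHT[f])
    if strategy == "count":    # S307: largest number of first keywords
        return max(counts, key=counts.get)
    # S308: largest product of keyword count and weight value
    return max(counts, key=lambda f: counts[f] * FIELD_WEIGHT[f])
```

Note how the strategies can disagree: two "cooking" keywords outscore one "music" keyword by count and by product, but "music" wins by weight alone.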
Example 10:
on the basis of the foregoing embodiments, fig. 4 is a schematic structural diagram of a voice interaction apparatus according to an embodiment of the present invention, where the apparatus includes:
a first determining module 401, configured to determine, based on a pre-trained voiceprint recognition model, first identification information of a user corresponding to a voiceprint feature vector in the collected voice information;
a second determining module 402, configured to determine, according to the first identification information and a correspondence between the identification information and a search library generated in advance, a target search library corresponding to the first identification information;
and a third determining module 403, configured to determine a target response in the target search library according to the text information corresponding to the acquired voice information, and output the target response.
Further, the first determining module is further configured to determine a target language corresponding to the acquired voice information based on a pre-trained language identification model; and determine text information corresponding to the voice information in the target language according to the target language and the collected voice information.
Further, the apparatus further comprises:
the training module is used for acquiring the sample voice information and first label information corresponding to the sample voice information aiming at any sample voice information in a sample set, wherein the first label information identifies the language of the sample voice information; inputting the sample voice information into an original deep learning model, and acquiring second label information of the output sample voice information; and adjusting parameter values of all parameters of the original deep learning model according to the first label information and the second label information to obtain the trained language identification model.
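The parameter-adjustment step performed by the training module (compare the output second label information with the ground-truth first label information, then adjust parameter values) can be illustrated with a minimal sketch. The embodiment trains an original deep learning model; purely for illustration, the sketch below substitutes a single linear layer updated by a cross-entropy gradient step, where `true_label` plays the role of the first label information and the predicted distribution plays the role of the second label information. All names are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train_step(W, features, true_label, lr=0.1):
    """One parameter update of an illustrative linear language classifier.
    W: (num_languages, num_features) parameter matrix."""
    probs = softmax(W @ features)          # second label information (output)
    target = np.zeros_like(probs)
    target[true_label] = 1.0               # first label information (ground truth)
    grad = np.outer(probs - target, features)  # cross-entropy gradient
    return W - lr * grad                   # adjusted parameter values
```

Repeating this step over the sample set drives the output labels toward the ground-truth language labels, which is the stated goal of the training process.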
Further, the third determining module is specifically configured to determine, according to text information corresponding to the acquired voice information, each first keyword whose part of speech in the text information is a target part of speech; determine a target interest field according to each first keyword in the text information and a correspondence between keywords pre-stored in the target search library and interest fields; and determine the target response to the text information based on a response model corresponding to the target interest field generated in advance.
Further, the third determining module is specifically configured to determine, according to each first keyword in the text information, a correspondence between a keyword and an interest field pre-stored in the target search library, and a correspondence between an interest field and a weight value, each weight value of each first interest field corresponding to each first keyword, and determine the first interest field with the largest weight value as the target interest field; or determining each first quantity of the first keywords included in each first interest field corresponding to each first keyword according to each first keyword in the text information and the corresponding relation between the keywords pre-stored in the target search library and the interest fields, and determining the first interest field with the largest first quantity as the target interest field; or determining a product value of a first number of the first keywords and the weight value included in each first interest field corresponding to each first keyword according to each first keyword in the text information, a corresponding relation between the keywords and the interest fields pre-stored in the target search library and a corresponding relation between the interest fields and the weight values, and determining the first interest field with the largest product value as the target interest field.
Further, the apparatus further comprises:
and the updating module is used for determining a first quantity of the first keywords corresponding to the interest field and a second quantity of the first keywords corresponding to the interest field in the search library according to the first keywords included in the collected voice information in a set time period aiming at each interest field in the search library, and updating the weight value of the interest field according to the ratio of the first quantity to the second quantity.
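The weight update performed by the updating module can be sketched as follows. The embodiment only states that the weight value is updated according to the ratio of the first number (keywords for the field heard within the set time period) to the second number (keywords for the field stored in the search library), so taking the raw ratio as the new weight is an assumption.

```python
def update_weights(recent_counts, library_counts):
    """Update each interest field's weight as first_number / second_number.
    recent_counts: field -> first keywords collected in the set time period.
    library_counts: field -> first keywords stored in the search library."""
    weights = {}
    for field, second_number in library_counts.items():
        first_number = recent_counts.get(field, 0)
        weights[field] = first_number / second_number if second_number else 0.0
    return weights
```

Fields the user has recently talked about thus gain weight, biasing the S306/S308 strategies toward current interests.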
Further, the apparatus further comprises:
the registration module is used for storing the voiceprint feature vector in the collected voice information if the target search library corresponding to the first identification information does not exist, and if a registration request of a user is received; determining each second interest field corresponding to each second keyword according to each second keyword with the part of speech as the target part of speech in the text information corresponding to the voice information and a corresponding relation between the pre-stored keyword and the interest field; and generating a target search library corresponding to the first identification information according to the corresponding relation between each second keyword and each second interest field and each pre-generated response model corresponding to each second interest field.
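The registration flow described above (storing the voiceprint feature vector, mapping each second keyword to its second interest field, and generating the per-user target search library) might look like the following sketch. The dictionary layout of the search library and all names are assumptions for illustration only.

```python
def register_user(first_id, voiceprint_vec, second_keywords,
                  keyword_to_field, response_models, store):
    """Generate a target search library for a newly registered user."""
    # Second keyword -> second interest field, via the pre-stored correspondence.
    field_map = {k: keyword_to_field[k] for k in second_keywords
                 if k in keyword_to_field}
    fields = set(field_map.values())
    store[first_id] = {
        "voiceprint": voiceprint_vec,      # stored voiceprint feature vector
        "keyword_to_field": field_map,     # second keyword correspondences
        # Only the pre-generated response models for this user's fields.
        "response_models": {f: response_models[f] for f in fields},
    }
    return store[first_id]
```

On the next interaction, the voiceprint match on `first_id` retrieves this library directly, so responses are scoped to the registered user's interest fields.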
Further, the training module is specifically configured to, for any sample voice information in a sample set, obtain the sample voice information and second identification information of a user corresponding to the sample voice information, where the second identification information identifies a user identity; inputting the sample voice information into an original deep learning model, and acquiring third identification information of the output sample voice information; and adjusting parameter values of all parameters of the original deep learning model according to the second identification information and the third identification information to obtain the trained voiceprint recognition model.
Example 11:
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. On the basis of the foregoing embodiments, an embodiment of the present invention further provides an electronic device, which includes a processor 501, a communication interface 502, a memory 503, and a communication bus 504, where the processor 501, the communication interface 502, and the memory 503 communicate with one another through the communication bus 504;
the memory 503 has stored therein a computer program which, when executed by the processor 501, causes the processor 501 to perform the steps of:
determining first identification information of a user corresponding to a voiceprint feature vector in the collected voice information based on a voiceprint recognition model trained in advance;
determining a target search library corresponding to the first identification information according to the first identification information and the corresponding relation between the identification information and a search library generated in advance;
and determining a target response in the target search library according to the text information corresponding to the acquired voice information and outputting the target response.
Further, before determining a target response in the target search library and outputting the target response according to the text information corresponding to the collected voice information, the processor 501 is further configured to perform the following steps:
determining a target language corresponding to the collected voice information based on a pre-trained language identification model;
and determining text information corresponding to the voice information of the target language according to the target language and the collected voice information.
Further, the processor 501 is specifically configured to train the language identification model through the following process:
aiming at any sample voice information in a sample set, obtaining the sample voice information and first label information corresponding to the sample voice information, wherein the first label information identifies the language of the sample voice information;
inputting the sample voice information into an original deep learning model, and acquiring second label information of the output sample voice information;
and adjusting parameter values of all parameters of the original deep learning model according to the first label information and the second label information to obtain the trained language identification model.
Further, when determining a target response in the target search library according to the text information corresponding to the collected voice information, the processor 501 is specifically configured to perform the following steps:
determining each first keyword of which the part of speech is a target part of speech in the text information according to the text information corresponding to the collected voice information;
determining a target interest field according to each first keyword in the text information and a corresponding relation between a keyword prestored in the target search library and the interest field;
and determining the target response to the text information based on a response model corresponding to the target interest field generated in advance.
Further, when determining a target interest field according to each first keyword in the text information and the correspondence between keywords and interest fields pre-stored in the target search library, the processor 501 is specifically configured to perform any one of the following:
determining each weight value of each first interest field corresponding to each first keyword according to each first keyword in the text information, the corresponding relation between the keyword and the interest field pre-stored in the target search library and the corresponding relation between the interest field and the weight value, and determining the first interest field with the largest weight value as the target interest field; or
Determining each first quantity of the first keywords included in each first interest field corresponding to each first keyword according to each first keyword in the text information and the corresponding relation between the keywords pre-stored in the target search library and the interest fields, and determining the first interest field with the largest first quantity as the target interest field; or
Determining a product value of a first quantity of the first keywords and the weight value included in each first interest field corresponding to each first keyword according to each first keyword in the text information, a corresponding relation between the keywords pre-stored in the target search library and the interest field and a corresponding relation between the interest field and the weight value, and determining the first interest field with the largest product value as the target interest field.
Further, the processor 501 is further configured to, for each interest field in the search library, determine, according to a first keyword included in the collected voice information within a set time period, a first number of the first keyword corresponding to the interest field and a second number of the first keyword corresponding to the interest field included in the search library, and update the weight value of the interest field according to a ratio of the first number to the second number.
Further, if it is determined that the target search library corresponding to the first identification information does not exist, the processor 501 is further configured to perform the following steps:
if a registration request of a user is received, storing the voiceprint feature vector in the collected voice information;
determining each second interest field corresponding to each second keyword according to each second keyword with the part of speech as the target part of speech in the text information corresponding to the voice information and a corresponding relation between the pre-stored keyword and the interest field;
and generating a target search library corresponding to the first identification information according to the corresponding relation between each second keyword and each second interest field and each pre-generated response model corresponding to each second interest field.
Further, the processor 501 is specifically configured to train the voiceprint recognition model through the following process:
aiming at any sample voice information in a sample set, obtaining the sample voice information and second identification information of a user corresponding to the sample voice information, wherein the second identification information identifies the user identity;
inputting the sample voice information into an original deep learning model, and acquiring third identification information of the output sample voice information;
and adjusting parameter values of all parameters of the original deep learning model according to the second identification information and the third identification information to obtain the trained voiceprint recognition model.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface 502 is used for communication between the above-described electronic apparatus and other apparatuses.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk memory. Alternatively, the memory may be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an application-specific integrated circuit, a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or the like.
Example 12:
on the basis of the foregoing embodiments, an embodiment of the present invention further provides a computer-readable storage medium, which stores a computer program, where the computer program is executed by a processor to perform the following steps:
determining first identification information of a user corresponding to a voiceprint feature vector in the collected voice information based on a voiceprint recognition model trained in advance;
determining a target search library corresponding to the first identification information according to the first identification information and the corresponding relation between the identification information and a search library generated in advance;
and determining a target response in the target search library according to the text information corresponding to the acquired voice information and outputting the target response.
Further, before determining a target response in the target search library and outputting the target response according to the text information corresponding to the collected voice information, the method further includes:
determining a target language corresponding to the collected voice information based on a pre-trained language identification model;
and determining text information corresponding to the voice information of the target language according to the target language and the collected voice information.
Further, the training process for training the language identification model includes:
aiming at any sample voice information in a sample set, obtaining the sample voice information and first label information corresponding to the sample voice information, wherein the first label information identifies the language of the sample voice information;
inputting the sample voice information into an original deep learning model, and acquiring second label information of the output sample voice information;
and adjusting parameter values of all parameters of the original deep learning model according to the first label information and the second label information to obtain the trained language identification model.
Further, the determining a target response in the target search library according to the text information corresponding to the collected voice information includes:
determining each first keyword of which the part of speech is a target part of speech in the text information according to the text information corresponding to the collected voice information;
determining a target interest field according to each first keyword in the text information and a corresponding relation between a keyword prestored in the target search library and the interest field;
and determining the target response to the text information based on a response model corresponding to the target interest field generated in advance.
Further, the determining a target field of interest according to each first keyword in the text information and a correspondence between a keyword pre-stored in the target search library and the field of interest includes:
determining each weight value of each first interest field corresponding to each first keyword according to each first keyword in the text information, the corresponding relation between the keyword and the interest field pre-stored in the target search library and the corresponding relation between the interest field and the weight value, and determining the first interest field with the largest weight value as the target interest field; or
Determining each first quantity of the first keywords included in each first interest field corresponding to each first keyword according to each first keyword in the text information and the corresponding relation between the keywords pre-stored in the target search library and the interest fields, and determining the first interest field with the largest first quantity as the target interest field; or
Determining a product value of a first quantity of the first keywords and the weight value included in each first interest field corresponding to each first keyword according to each first keyword in the text information, a corresponding relation between the keywords pre-stored in the target search library and the interest field and a corresponding relation between the interest field and the weight value, and determining the first interest field with the largest product value as the target interest field.
Further, the method further comprises:
and aiming at each interest field in the search library, determining a first quantity of first keywords corresponding to the interest field and a second quantity of the first keywords corresponding to the interest field in the search library according to first keywords included in voice information collected within a set time period, and updating the weight value of the interest field according to the ratio of the first quantity to the second quantity.
Further, if it is determined that the target search library corresponding to the first identification information does not exist, the method further includes:
if a registration request of a user is received, storing the voiceprint feature vector in the collected voice information;
determining each second interest field corresponding to each second keyword according to each second keyword with the part of speech as the target part of speech in the text information corresponding to the voice information and a corresponding relation between the pre-stored keyword and the interest field;
and generating a target search library corresponding to the first identification information according to the corresponding relation between each second keyword and each second interest field and each pre-generated response model corresponding to each second interest field.
Further, the training process for training the voiceprint recognition model comprises:
aiming at any sample voice information in a sample set, obtaining the sample voice information and second identification information of a user corresponding to the sample voice information, wherein the second identification information identifies the user identity;
inputting the sample voice information into an original deep learning model, and acquiring third identification information of the output sample voice information;
and adjusting parameter values of all parameters of the original deep learning model according to the second identification information and the third identification information to obtain the trained voiceprint recognition model.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (18)

1. A method of voice interaction, the method comprising:
determining first identification information of a user corresponding to a voiceprint feature vector in the collected voice information based on a voiceprint recognition model trained in advance;
determining a target search library corresponding to the first identification information according to the first identification information and the corresponding relation between the identification information and a search library generated in advance;
and determining a target response in the target search library according to the text information corresponding to the acquired voice information and outputting the target response.
2. The method according to claim 1, wherein before determining and outputting a target response in the target search library according to the text information corresponding to the collected voice information, the method further comprises:
determining a target language corresponding to the collected voice information based on a pre-trained language identification model;
and determining text information corresponding to the voice information of the target language according to the target language and the collected voice information.
3. The method of claim 2, wherein the training process for training the language identification model comprises:
aiming at any sample voice information in a sample set, obtaining the sample voice information and first label information corresponding to the sample voice information, wherein the first label information identifies the language of the sample voice information;
inputting the sample voice information into an original deep learning model, and acquiring second label information of the output sample voice information;
and adjusting parameter values of all parameters of the original deep learning model according to the first label information and the second label information to obtain the trained language identification model.
4. The method according to claim 1 or 2, wherein the determining a target response in the target search library according to the text information corresponding to the collected voice information comprises:
determining each first keyword of which the part of speech is a target part of speech in the text information according to the text information corresponding to the collected voice information;
determining a target interest field according to each first keyword in the text information and a corresponding relation between a keyword prestored in the target search library and the interest field;
and determining the target response to the text information based on a response model corresponding to the target interest field generated in advance.
5. The method according to claim 4, wherein the determining a target field of interest according to each first keyword in the text message and a correspondence between a keyword pre-stored in the target search library and the field of interest comprises:
determining each weight value of each first interest field corresponding to each first keyword according to each first keyword in the text information, the corresponding relation between the keyword and the interest field pre-stored in the target search library and the corresponding relation between the interest field and the weight value, and determining the first interest field with the largest weight value as the target interest field; or
Determining each first quantity of the first keywords included in each first interest field corresponding to each first keyword according to each first keyword in the text information and the corresponding relation between the keywords pre-stored in the target search library and the interest fields, and determining the first interest field with the largest first quantity as the target interest field; or
Determining a product value of a first quantity of the first keywords and the weight value included in each first interest field corresponding to each first keyword according to each first keyword in the text information, a corresponding relation between the keywords pre-stored in the target search library and the interest field and a corresponding relation between the interest field and the weight value, and determining the first interest field with the largest product value as the target interest field.
6. The method of claim 5, further comprising:
and aiming at each interest field in the search library, determining a first quantity of first keywords corresponding to the interest field and a second quantity of the first keywords corresponding to the interest field in the search library according to first keywords included in voice information collected within a set time period, and updating the weight value of the interest field according to the ratio of the first quantity to the second quantity.
7. The method of claim 1, wherein, if it is determined that no target search library corresponding to the first identification information exists, the method further comprises:
if a registration request of a user is received, storing the voiceprint feature vector extracted from the collected voice information;
determining each second interest field corresponding to each second keyword whose part of speech is the target part of speech in the text information corresponding to the voice information, according to a pre-stored correspondence between keywords and interest fields; and
generating the target search library corresponding to the first identification information according to the correspondence between each second keyword and each second interest field and each pre-generated response model corresponding to each second interest field.
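The registration flow of claim 7 builds a per-user search library from the user's first utterances. The sketch below is a hypothetical stand-in: the global keyword-to-field map, the per-field "response models" (represented as plain strings), and all names are invented.

```python
# Hypothetical sketch of the registration flow in claim 7. The global
# keyword-to-field map and per-field response models are invented stand-ins.
GLOBAL_KEYWORD_TO_FIELD = {"recipe": "cooking", "goal": "sports"}
RESPONSE_MODELS = {"cooking": "cooking_model", "sports": "sports_model"}

def register_user(user_id, voiceprint, keywords, store):
    """Create a target search library keyed by the user's identification info:
    store the voiceprint feature vector, map each second keyword to its
    second interest field, and attach the pre-generated response models."""
    keyword_to_field = {k: GLOBAL_KEYWORD_TO_FIELD[k]
                        for k in keywords if k in GLOBAL_KEYWORD_TO_FIELD}
    fields = set(keyword_to_field.values())
    store[user_id] = {
        "voiceprint": voiceprint,
        "keyword_to_field": keyword_to_field,
        "response_models": {f: RESPONSE_MODELS[f] for f in fields},
    }
    return store[user_id]
```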
8. The method of claim 1, wherein training the voiceprint recognition model comprises:
for any sample voice information in a sample set, acquiring the sample voice information and second identification information of the user corresponding to the sample voice information, wherein the second identification information identifies the user identity;
inputting the sample voice information into an original deep learning model, and acquiring third identification information output by the model for the sample voice information; and
adjusting the parameter values of the parameters of the original deep learning model according to the second identification information and the third identification information to obtain the trained voiceprint recognition model.
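Claim 8 describes a standard supervised identification loop: features of a sample utterance are pushed through the model, the predicted speaker id ("third identification information") is compared with the labelled id ("second identification information"), and parameters are adjusted on mismatch. The toy sketch below replaces the deep learning model with a linear classifier and a perceptron-style update; the data, dimensions, and names are all invented.

```python
import numpy as np

# Toy stand-in for the training loop of claim 8: a linear per-speaker score
# matrix W replaces the deep learning model, and a perceptron-style update
# replaces gradient-based parameter adjustment.
def train_speaker_model(samples, labels, n_speakers, epochs=50, lr=0.1):
    dim = samples.shape[1]
    W = np.zeros((n_speakers, dim))
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            pred = int(np.argmax(W @ x))   # predicted id ("third identification")
            if pred != y:                  # compare with label ("second identification")
                W[y] += lr * x             # adjust parameters toward the true speaker
                W[pred] -= lr * x          # and away from the wrongly predicted one
    return W

def identify(W, x):
    # Inference: return the id of the highest-scoring enrolled speaker.
    return int(np.argmax(W @ x))
```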
9. A voice interaction apparatus, comprising:
the first determining module is used for determining, based on a pre-trained voiceprint recognition model, first identification information of the user corresponding to the voiceprint feature vector in the collected voice information;
the second determining module is used for determining a target search library corresponding to the first identification information according to the first identification information and the correspondence between identification information and pre-generated search libraries; and
the third determining module is used for determining and outputting a target response from the target search library according to the text information corresponding to the collected voice information.
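The three modules of claim 9 form a pipeline: identify the speaker from the voiceprint, select that speaker's search library, then answer from it. A hypothetical end-to-end sketch follows; nearest-vector matching and keyword lookup are invented stand-ins for the trained voiceprint model and the response models.

```python
# Hypothetical end-to-end sketch of claim 9's three modules. Matching by
# nearest enrolled voiceprint and answering by keyword lookup are invented
# stand-ins for the patent's trained models.
def first_module(voiceprint, enrolled):
    """Return the identification info of the closest enrolled voiceprint."""
    return min(enrolled, key=lambda uid: sum(
        (a - b) ** 2 for a, b in zip(voiceprint, enrolled[uid])))

def second_module(user_id, libraries):
    """Look up the target search library for this identification info."""
    return libraries.get(user_id)

def third_module(text, library):
    """Return the target response whose keyword occurs in the text."""
    for keyword, response in library.items():
        if keyword in text:
            return response
    return None
```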
10. The apparatus according to claim 9, wherein the first determining module is further configured to determine a target language corresponding to the collected voice information based on a pre-trained language recognition model; and determining text information corresponding to the voice information of the target language according to the target language and the collected voice information.
11. The apparatus of claim 10, further comprising:
the training module is used for, for any sample voice information in a sample set, acquiring the sample voice information and first label information corresponding to the sample voice information, wherein the first label information identifies the language of the sample voice information; inputting the sample voice information into an original deep learning model, and acquiring second label information output by the model for the sample voice information; and adjusting the parameter values of the parameters of the original deep learning model according to the first label information and the second label information to obtain the trained language identification model.
12. The apparatus according to claim 9 or 10, wherein the third determining module is specifically configured to determine, according to text information corresponding to the collected voice information, each first keyword whose part of speech is a target part of speech in the text information; determining a target interest field according to each first keyword in the text information and a corresponding relation between a keyword prestored in the target search library and the interest field; and determining the target response of the text message based on a response model corresponding to the target interest field generated in advance.
12. The apparatus according to claim 9 or 10, wherein the third determining module is specifically configured to: determine, according to the text information corresponding to the collected voice information, each first keyword whose part of speech is the target part of speech in the text information; determine a target interest field according to each first keyword in the text information and the correspondence between keywords pre-stored in the target search library and interest fields; and determine the target response for the text information based on a pre-generated response model corresponding to the target interest field.
14. The apparatus of claim 13, further comprising:
the updating module is used for, for each interest field in the search library, determining, according to the first keywords included in the voice information collected within a set time period, a first quantity of first keywords corresponding to the interest field and a second quantity of first keywords corresponding to the interest field in the search library, and updating the weight value of the interest field according to the ratio of the first quantity to the second quantity.
15. The apparatus of claim 9, further comprising:
the registration module is used for, if no target search library corresponding to the first identification information exists and a registration request of a user is received, storing the voiceprint feature vector extracted from the collected voice information; determining each second interest field corresponding to each second keyword whose part of speech is the target part of speech in the text information corresponding to the voice information, according to a pre-stored correspondence between keywords and interest fields; and generating the target search library corresponding to the first identification information according to the correspondence between each second keyword and each second interest field and each pre-generated response model corresponding to each second interest field.
16. The apparatus according to claim 9, wherein the training module is specifically configured to: for any sample voice information in a sample set, acquire the sample voice information and second identification information of the user corresponding to the sample voice information, wherein the second identification information identifies the user identity; input the sample voice information into an original deep learning model, and acquire third identification information output by the model for the sample voice information; and adjust the parameter values of the parameters of the original deep learning model according to the second identification information and the third identification information to obtain the trained voiceprint recognition model.
17. An electronic device, characterized in that the electronic device comprises a processor and a memory, the memory being adapted to store program instructions, the processor being adapted to execute a computer program stored in the memory to implement the steps of the voice interaction method according to any of claims 1-8.
18. A computer-readable storage medium, characterized in that it stores a computer program which, when being executed by a processor, carries out the steps of the voice interaction method according to any one of claims 1 to 8.
CN202011371761.XA 2020-11-30 2020-11-30 Voice interaction method, device, equipment and medium Pending CN112542173A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011371761.XA CN112542173A (en) 2020-11-30 2020-11-30 Voice interaction method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN112542173A true CN112542173A (en) 2021-03-23

Family

ID=75016595

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011371761.XA Pending CN112542173A (en) 2020-11-30 2020-11-30 Voice interaction method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN112542173A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113129900A (en) * 2021-04-29 2021-07-16 科大讯飞股份有限公司 Voiceprint extraction model construction method, voiceprint identification method and related equipment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107357875A (en) * 2017-07-04 2017-11-17 北京奇艺世纪科技有限公司 A kind of voice search method, device and electronic equipment
CN107993134A (en) * 2018-01-23 2018-05-04 北京知行信科技有限公司 A kind of smart shopper exchange method and system based on user interest
CN109256133A (en) * 2018-11-21 2019-01-22 上海玮舟微电子科技有限公司 A kind of voice interactive method, device, equipment and storage medium
CN110232114A (en) * 2019-05-06 2019-09-13 平安科技(深圳)有限公司 Sentence intension recognizing method, device and computer readable storage medium
CN111445898A (en) * 2020-03-17 2020-07-24 科大讯飞股份有限公司 Language identification method and device, electronic equipment and storage medium
CN111462735A (en) * 2020-04-10 2020-07-28 网易(杭州)网络有限公司 Voice detection method and device, electronic equipment and storage medium
CN111475714A (en) * 2020-03-17 2020-07-31 北京声智科技有限公司 Information recommendation method, device, equipment and medium
CN111986675A (en) * 2020-08-20 2020-11-24 深圳Tcl新技术有限公司 Voice conversation method, device and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN110838289B (en) Wake-up word detection method, device, equipment and medium based on artificial intelligence
US10991366B2 (en) Method of processing dialogue query priority based on dialog act information dependent on number of empty slots of the query
CN109313892B (en) Robust speech recognition method and system
CN105976812A (en) Voice identification method and equipment thereof
CN111710337B (en) Voice data processing method and device, computer readable medium and electronic equipment
CN109741734B (en) Voice evaluation method and device and readable medium
CN111653274B (en) Wake-up word recognition method, device and storage medium
CN111402862A (en) Voice recognition method, device, storage medium and equipment
CN113035231A (en) Keyword detection method and device
CN110992929A (en) Voice keyword detection method, device and system based on neural network
US20230096805A1 (en) Contrastive Siamese Network for Semi-supervised Speech Recognition
CN112017648A (en) Weighted finite state converter construction method, speech recognition method and device
CN112634866A (en) Speech synthesis model training and speech synthesis method, apparatus, device and medium
CN111241820A (en) Bad phrase recognition method, device, electronic device, and storage medium
CN111128174A (en) Voice information processing method, device, equipment and medium
Wu et al. Chinese-English phone set construction for code-switching ASR using acoustic and DNN-extracted articulatory features
Toyama et al. Use of Global and Acoustic Features Associated with Contextual Factors to Adapt Language Models for Spontaneous Speech Recognition.
CN110853669B (en) Audio identification method, device and equipment
CN112542173A (en) Voice interaction method, device, equipment and medium
CN115132195B (en) Voice wakeup method, device, equipment, storage medium and program product
CN115132170A (en) Language classification method and device and computer readable storage medium
CN112037772B (en) Response obligation detection method, system and device based on multiple modes
CN114566156A (en) Keyword speech recognition method and device
CN113421573A (en) Identity recognition model training method, identity recognition method and device
CN115700871A (en) Model training and speech synthesis method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination