CN111768769A - Voice interaction method, device, equipment and storage medium - Google Patents

Voice interaction method, device, equipment and storage medium

Info

Publication number
CN111768769A
Authority
CN
China
Prior art keywords
voice
voiceprint
speech
voiceprint feature
recognition result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910196765.XA
Other languages
Chinese (zh)
Inventor
曹元斌
张智超
徐涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201910196765.XA
Publication of CN111768769A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/22 Interactive procedures; Man-machine interfaces
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Abstract

The disclosure provides a voice interaction method, device, equipment and storage medium. Wake-up detection is performed on a received first voice; if wake-up succeeds, a first voiceprint feature of the first voice is acquired; a second voice following the first voice is received; a speech recognition result is determined for the part of the second voice that matches the first voiceprint feature; and a service is provided to the user based on the speech recognition result. In this way, speech recognition errors caused by the inability to distinguish between speakers can be avoided.

Description

Voice interaction method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of voice interaction, and in particular, to a voice interaction method, apparatus, device, and storage medium.
Background
Voice interaction belongs to the field of human-computer interaction and is one of the leading-edge interaction modes to which human-computer interaction has evolved. In voice interaction, a user gives instructions to a machine through natural language in order to achieve an objective.
Current voice interaction schemes mainly use ASR (Automatic Speech Recognition) technology, which recognizes text content from features extracted from the frequency spectrum of the input voice. As a result, if people other than the user speak nearby during voice interaction, speech recognition errors can occur and the user experience is degraded.
For example, unlike a mobile phone held by a single user, a smart speaker may receive voice signals from all users and from all directions. In such a scenario, when a user issues a command to the smart speaker while other people are speaking, the speech may be recognized incorrectly, leading to a poor user experience.
Therefore, a voice interaction scheme capable of improving the accuracy of voice recognition is required.
Disclosure of Invention
An object of the present disclosure is to provide a voice interaction scheme capable of improving accuracy of voice recognition.
According to a first aspect of the present disclosure, a voice interaction method is provided, including: performing wake-up detection on a received first voice; acquiring a first voiceprint feature of the first voice if wake-up succeeds; receiving a second voice after the first voice; determining a speech recognition result of the voice part in the second voice that matches the first voiceprint feature; and providing a service to the user based on the speech recognition result.
Optionally, the step of determining a speech recognition result of the voice part in the second voice that matches the first voiceprint feature includes: recognizing the text content of the second voice; acquiring a second voiceprint feature of the voice segment in the second voice corresponding to each character or word in the text content; determining the voice segments whose second voiceprint feature has a similarity to the first voiceprint feature greater than a predetermined threshold; and obtaining the speech recognition result based on the characters or words corresponding to the determined voice segments.
Optionally, the method further includes: removing the speech recognition result of the voice part in the second voice that does not match the first voiceprint feature.
Optionally, the step of removing the speech recognition result of the voice part in the second voice that does not match the first voiceprint feature includes: recognizing the text content of the second voice; acquiring a second voiceprint feature of the voice segment in the second voice corresponding to each character or word in the text content; and removing the characters or words corresponding to the voice segments whose second voiceprint feature has a similarity to the first voiceprint feature smaller than a predetermined threshold.
Optionally, the first voiceprint feature is a voiceprint feature of a portion of the first speech corresponding to the wake word.
Optionally, the method further includes: comparing the first voiceprint feature with acoustic features in a voiceprint feature library, where the acoustic features in the voiceprint feature library are acoustic features of registered users; and, if no acoustic feature matching the first voiceprint feature exists in the voiceprint feature library, registering a new user with the first voiceprint feature and saving the first voiceprint feature in the voiceprint feature library.
According to a second aspect of the present disclosure, a voice interaction method is also provided, including: performing wake-up detection on received voice; if wake-up succeeds, acquiring a first voiceprint feature of a first voice part corresponding to the wake-up word in the voice; determining a speech recognition result of a second voice part in the voice that matches the first voiceprint feature; and providing a service to the user based on the speech recognition result.
Optionally, the step of determining a speech recognition result of a second voice part in the voice that matches the first voiceprint feature includes: recognizing the text content of the voice; acquiring a second voiceprint feature of the voice segment in the voice corresponding to each character or word in the text content; determining the voice segments whose second voiceprint feature has a similarity to the first voiceprint feature greater than a predetermined threshold; and obtaining the speech recognition result based on the characters or words corresponding to the determined voice segments.
Optionally, the method further includes: removing the speech recognition result of a third voice part in the voice that does not match the first voiceprint feature.
Optionally, the step of removing the speech recognition result of the third voice part in the voice that does not match the first voiceprint feature includes: recognizing the text content of the voice; acquiring a second voiceprint feature of the voice segment in the voice corresponding to each character or word in the text content; and removing the characters or words corresponding to the voice segments whose second voiceprint feature has a similarity to the first voiceprint feature smaller than a predetermined threshold.
Optionally, the method further includes: comparing the first voiceprint feature with acoustic features in a voiceprint feature library, where the acoustic features in the voiceprint feature library are acoustic features of registered users; and, if no acoustic feature matching the first voiceprint feature exists in the voiceprint feature library, registering a new user with the first voiceprint feature and saving the first voiceprint feature in the voiceprint feature library.
According to a third aspect of the present disclosure, there is also provided a voice interaction method, including: performing wake-up detection on a received first voice; acquiring a first acoustic feature of the first voice if wake-up succeeds; receiving a second voice after the first voice; determining a speech recognition result of the voice part in the second voice that matches the first acoustic feature; and providing a service to the user based on the speech recognition result.
According to a fourth aspect of the present disclosure, there is also provided a voice interaction method, including: performing wake-up detection on received voice; if wake-up succeeds, acquiring a first acoustic feature of a first voice part corresponding to the wake-up word in the voice; determining a speech recognition result of a second voice part in the voice that matches the first acoustic feature; and providing a service to the user based on the speech recognition result.
According to a fifth aspect of the present disclosure, there is also provided an electronic device for providing a voice interaction service, including: a voice receiving device for receiving a user's voice; a wake-up detection device for performing wake-up detection on the received voice; an acoustic feature acquisition device for acquiring, if wake-up succeeds, a first voiceprint feature of a first voice part corresponding to the wake-up word in the voice; a speech recognition result determining device for determining a speech recognition result of a second voice part in the voice that matches the first voiceprint feature; and a service device for providing a service to the user based on the speech recognition result.
Optionally, the electronic device is a smart speaker.
According to a sixth aspect of the present disclosure, there is also provided a voice interaction apparatus, including: a wake-up detection module for performing wake-up detection on a received first voice; an acquiring module for acquiring a first acoustic feature of the first voice if wake-up succeeds; a receiving module for receiving a second voice after the first voice; a determining module for determining a speech recognition result of the voice part in the second voice that matches the first acoustic feature; and a service module for providing a service to the user based on the speech recognition result.
According to a seventh aspect of the present disclosure, there is also provided a voice interaction apparatus, including: a wake-up detection module for performing wake-up detection on received voice; an acquiring module for acquiring, if wake-up succeeds, a first acoustic feature of a first voice part corresponding to the wake-up word in the voice; a determining module for determining a speech recognition result of a second voice part in the voice that matches the first acoustic feature; and a service module for providing a service to the user based on the speech recognition result.
According to an eighth aspect of the present disclosure, there is also presented a computing device comprising: a processor; and a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method as set forth in any one of the first to fourth aspects of the disclosure.
According to a ninth aspect of the present disclosure, there is also provided a non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method as set forth in any one of the first to fourth aspects of the present disclosure.
According to the voice interaction scheme of the present disclosure, after wake-up succeeds, the acoustic feature (for example, a voiceprint) of the awakener is acquired; during speech recognition, the voice signals belonging only to the awakener are then screened out according to that acoustic feature and recognized. In this way, speech recognition errors caused by the inability to distinguish between speakers can be avoided.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in greater detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.
FIG. 1 shows a schematic flow chart diagram of a voice interaction method according to an embodiment of the present disclosure.
Fig. 2 is a schematic flow chart illustrating a voice interaction process in an application scenario of a smart sound box.
FIG. 3 shows a schematic flow chart diagram of a voice interaction method according to another embodiment of the present disclosure.
Fig. 4 shows a schematic block diagram of the structure of an electronic device according to an embodiment of the present disclosure.
Fig. 5 shows a schematic block diagram of the structure of a voice interaction device according to an embodiment of the present disclosure.
FIG. 6 shows a schematic structural diagram of a computing device according to an embodiment of the present disclosure.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In a voice interaction scenario, the voice interaction device may receive the voices of multiple speakers. When someone issues a voice command while someone else is speaking nearby, the speech may be recognized incorrectly and the user's voice interaction experience degrades. For example, a smart sound box does not serve a fixed single user but potentially many users, and because it can receive voice signals from all directions, a command issued to the sound box while another person is speaking nearby may be recognized incorrectly, which affects the correct operation of the sound box's subsequent dialogue system.
In view of the above problems, the inventors of the present disclosure found, after intensive research, that in a voice interaction scenario the awakener is generally the same person as the speaker who subsequently issues the voice command. For example, when using a smart speaker, a user generally speaks the wake-up word first and then speaks a specific voice instruction to make the smart speaker execute the corresponding service (e.g., playing a song).
In view of this, the present disclosure proposes that after the device is successfully woken up, the acoustic feature of the awakener can be acquired; when performing speech recognition, the voice signal belonging only to the awakener can be screened out according to that acoustic feature, and speech recognition can be performed on the screened-out voice signal. By tracking the acoustic feature of the speaker at wake-up time, speech recognition errors caused by the inability to distinguish between speakers can be avoided.
The acoustic feature referred to in this disclosure is preferably a voiceprint feature. In alternative embodiments, other types of acoustic features such as volume, pitch or timbre may also be used. For example, since a female voice is generally higher-pitched than a male voice, when the awakener is a woman and the other nearby speakers are men, the voice signal belonging only to the awakener can be screened out according to the pitch of the wake-up command she issued. As another example, when the awakener and the surrounding speakers are at different distances from the voice interaction device, or the awakener's volume differs clearly from that of the surrounding speakers, the voice signal belonging only to the awakener can be screened out according to the volume of the wake-up command. A minimal sketch of this alternative is given below.
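The following sketch illustrates, under stated assumptions, how a speech segment could be kept or discarded based on pitch and volume alone. The use of librosa's pyin pitch estimator and RMS energy, as well as the tolerance values, are illustrative choices and not part of the disclosure.

```python
# Illustrative filter using pitch and volume instead of a voiceprint. The wake-up
# command's average pitch (Hz) and RMS energy are assumed to have been measured
# beforehand; the tolerances below are arbitrary example values.
import numpy as np
import librosa

def matches_wake_pitch_volume(segment, sr, wake_pitch, wake_rms,
                              pitch_tol=50.0, rms_tol=0.5):
    f0, _, _ = librosa.pyin(segment, fmin=65.0, fmax=400.0, sr=sr)
    pitch = float(np.nanmean(f0)) if np.any(~np.isnan(f0)) else 0.0
    rms = float(np.mean(librosa.feature.rms(y=segment)))
    same_pitch = abs(pitch - wake_pitch) < pitch_tol
    same_volume = abs(rms - wake_rms) < rms_tol * wake_rms
    return same_pitch and same_volume
```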
The voice interaction scheme of the present disclosure is exemplified below by taking an acoustic feature as a voiceprint feature as an example. It should be understood that in the schemes described below, the voiceprint features can also be replaced with acoustic features such as volume, pitch, timbre, etc.
FIG. 1 shows a schematic flow chart diagram of a voice interaction method according to an embodiment of the present disclosure. The method shown in fig. 1 may be performed by a device supporting voice interaction services, for example, a smart speaker.
Referring to fig. 1, in step S110, wake-up detection is performed for a received first voice.
Wake-up detection mainly checks whether a specific wake-up word exists in the first voice. If the wake-up word is detected in the first voice, wake-up is judged successful, and the subsequent speech recognition process can then be run to recognize the user's voice interaction instruction. If no wake-up word is detected in the first voice, wake-up fails, and the wake-up detection procedure continues for subsequently received speech.
Wake-up detection can be performed in a variety of ways. For example, the received first voice may first undergo signal processing and feature extraction to convert the audio into features (such as MFCC features); the features then enter a wake-up engine that decides whether the wake-up word is hit, and when it is hit, wake-up is judged successful. As another example, the first voice may be checked with a pre-built wake-up detection model, which may be a neural network model. The specific implementation of wake-up detection is well known in the art and is not described further here; a minimal sketch follows.
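The sketch below shows the feature-extraction-plus-wake-engine variant described above, assuming a librosa-style MFCC front end and a hypothetical pre-trained keyword-spotting model; neither the model nor the threshold comes from the disclosure itself.

```python
# A minimal sketch of wake-word detection. `wake_model` stands in for any trained
# keyword-spotting model exposing a predict() method; it is an assumption made for
# illustration only.
import numpy as np
import librosa

def detect_wake_word(audio: np.ndarray, sr: int, wake_model, threshold: float = 0.5) -> bool:
    # Convert the waveform into MFCC features, as described above.
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)   # shape (13, n_frames)
    # The wake engine scores how likely it is that the wake word was spoken.
    score = wake_model.predict(mfcc[np.newaxis, ...])        # hypothetical model API
    return float(score) > threshold
```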
In step S120, in case of successful wake-up, a first voiceprint feature of the first voice is acquired.
The first voiceprint feature may be the voiceprint feature of the voice portion of the first voice corresponding to the wake-up word, that is, the voiceprint feature of the voice portion detected as the wake-up word. The first voiceprint feature can be used to characterize the awakener, i.e., the person who woke the device.
In step S130, a second voice subsequent to the first voice is received.
In step S140, a speech recognition result of a speech portion in the second speech matching the first voiceprint feature is determined.
In step S150, a service is provided to the user based on the voice recognition result.
For the second voice received after the first voice, the service is provided to the user according to the speech recognition result of the voice part in the second voice that matches the first voiceprint feature. Thus, when the second voice contains voice input from several speakers, the voice part matching the awakener's voiceprint feature can be selected from the second voice according to the first voiceprint feature, and the corresponding service can be provided to the user based on the speech recognition result of the selected voice part.
As an example of the present disclosure, the text content of the second voice may first be recognized by a speech recognition technique, and each recognized character or word may then be associated with the corresponding voice segment in the second voice. A second voiceprint feature can thus be obtained for the voice segment corresponding to each character or word in the text content, and each such second voiceprint feature can be compared with the first voiceprint feature to obtain their similarity. Voice segments whose second voiceprint feature has a similarity to the first voiceprint feature greater than a predetermined threshold are determined to be voice parts matching the first voiceprint feature, and the speech recognition result is obtained from the characters or words corresponding to the determined voice segments, as sketched below.
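The sketch below illustrates this per-word filtering step, assuming the recognizer returns each character or word together with the time span of its audio segment, and that `embed_speaker` is a hypothetical speaker-embedding (voiceprint) model. Cosine similarity and the 0.7 threshold are illustrative choices, not values fixed by the disclosure.

```python
# Keep only the words whose per-segment voiceprint matches the awakener's voiceprint.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def filter_by_voiceprint(words, audio, sr, first_voiceprint, embed_speaker, threshold=0.7):
    """words: list of (text, start_sec, end_sec) tuples produced by the recognizer."""
    kept = []
    for text, start, end in words:
        segment = audio[int(start * sr):int(end * sr)]
        second_voiceprint = embed_speaker(segment, sr)     # per-word voiceprint
        if cosine_similarity(first_voiceprint, second_voiceprint) > threshold:
            kept.append(text)                              # word belongs to the awakener
    return " ".join(kept)                                  # retained recognition result
```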
As another example of the present disclosure, the speech recognition result of the voice part in the second voice that does not match the first voiceprint feature may instead be removed, so that the speech recognition result finally retained is that of the voice part matching the first voiceprint feature. For example, the text content of the second voice may be recognized by a speech recognition technique and each recognized character or word associated with its corresponding voice segment in the second voice. A second voiceprint feature is then obtained for the voice segment corresponding to each character or word, and the characters or words whose voice segments have a second voiceprint feature with a similarity to the first voiceprint feature smaller than a predetermined threshold are removed. The speech recognition result of the voice part matching the first voiceprint feature is then obtained from the characters or words that remain.
In the present disclosure, the text content of the second speech may be recognized using speech recognition techniques. The specific implementation means of the voice recognition technology is the mature technology in the field. The following is merely an exemplary description of the process of recognizing the text content of the second speech using speech recognition technology.
Generally, the second voice may be divided into multiple frames of audio, where the length of each frame can be set according to the actual situation. Feature extraction is then performed on each frame, for example by computing MFCC features. The frames are then mapped to acoustic states according to the extracted features, the states are combined into phonemes, and the phonemes are combined into characters or words, so that the text content corresponding to the second voice is finally obtained. A framing sketch is given below.
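A minimal framing-and-feature-extraction sketch follows, assuming 25 ms frames with a 10 ms hop; the concrete frame length is left open in the text above ("set according to the actual situation"), so these values are illustrative.

```python
# Split the audio into short overlapping frames and compute MFCC features per frame.
import numpy as np
import librosa

def frame_features(audio: np.ndarray, sr: int) -> np.ndarray:
    frame_len = int(0.025 * sr)    # 25 ms analysis window (assumed value)
    hop_len = int(0.010 * sr)      # 10 ms hop between frames (assumed value)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13,
                                n_fft=frame_len, hop_length=hop_len)
    return mfcc.T                  # shape (n_frames, 13), one feature row per frame
```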
As an example, the second voice may be divided into windows of equal length and features computed in each window. The computed features enter the acoustic model, which outputs the probability distribution over phonemes, and the result finally enters the decoder to obtain the decoding result, i.e., the text content corresponding to the second voice. During decoding, a Viterbi search can be performed over a WFST model to obtain the final decoding result. WFST (Weighted Finite-State Transducer) networks are used in large-scale speech recognition and incorporate the HMM model, the pronunciation dictionary and the n-gram language model. The Viterbi algorithm is used to find the most likely hidden state sequence.
When decoding with a WFST model, the model is composed of a cascade of four types of WFST networks: 1. the HMM acoustic model WFST (H for short); 2. the context-dependency WFST (C for short); 3. the pronunciation dictionary WFST (L for short); and 4. the language model WFST (G for short). H covers the acoustic model states, C is the context mapping of phones, L is the pronunciation dictionary and G is the language model; together they are composed into the HCLG network. The acoustic model predicts state probabilities, and searching over the HCLG network then yields the probabilities of words and of the corresponding sentences. In a dynamic decoder, the WFST model is split into two parts, an HCL network and a G network: the model states pass through the HCL network to obtain the probability sequence of words, which then passes through the G network to obtain the sentence probability. In an end-to-end acoustic model, a TLG structure is used instead, where T represents a token network and plays the role of the original HC part; in dynamic decoding the TLG structure is likewise split into a TL network and a G network. WFST models and TLG networks are well established in the art and are not described further here; a textbook Viterbi search is sketched below for reference.
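The following sketch shows a textbook Viterbi search over a small HMM, included only to make the decoding step concrete; a production HCLG/TLG decoder operates on far larger WFST graphs and is not reproduced here.

```python
# Standard Viterbi decoding: find the most likely hidden state sequence given
# per-frame emission log-likelihoods, a transition matrix and initial probabilities.
import numpy as np

def viterbi(log_init: np.ndarray, log_trans: np.ndarray, log_emit: np.ndarray):
    """log_init: (S,), log_trans: (S, S), log_emit: (T, S) frame log-likelihoods."""
    T, S = log_emit.shape
    score = log_init + log_emit[0]                 # best score ending in each state at t=0
    back = np.zeros((T, S), dtype=int)             # backpointers
    for t in range(1, T):
        cand = score[:, None] + log_trans          # (S, S): previous state -> next state
        back[t] = np.argmax(cand, axis=0)
        score = np.max(cand, axis=0) + log_emit[t]
    path = [int(np.argmax(score))]                 # best final state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))        # follow backpointers
    return list(reversed(path))                    # most likely state sequence
```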
Fig. 2 is a schematic flow chart illustrating a voice interaction process in an application scenario of a smart sound box.
Referring to fig. 2, when a person speaks to the sound box (1), the wake-up flow is run first to detect the wake-up word. If wake-up is detected, audio continues to be received and recognized. Before recognition, the features of the audio corresponding to the detected wake-up word are extracted (2) and the corresponding voiceprint is calculated (4). The calculated voiceprint is then matched against the pre-registered voiceprint models (5), for example by calculating a similarity score against each registered voiceprint model (6), where each voiceprint model corresponds to one user (i.e., a potential awakener). If there is a voiceprint model whose similarity to the calculated voiceprint is higher than a predetermined threshold, the user corresponding to that voiceprint model is the awakener. If no such voiceprint model exists, the calculated voiceprint is registered (3) as a new voiceprint model, i.e., a new user is registered, and the newly registered user is the awakener. After the corresponding awakener is found (7), the awakener information is retained for comparison during the subsequent speech recognition (18).
After wake-up is completed, the user's speech signal continues to be received (8); acoustic model scores are calculated by the acoustic model (9) and fed into the TL/HCL network for Viterbi search (10) to obtain the probability of each character or word; the features corresponding to each extracted character or word (12) are used to calculate a voiceprint (13); the calculated voiceprints are matched against the registered voiceprint models (14), for example by calculating a similarity score against each registered voiceprint model (15), so that the voiceprint model corresponding to each character or word is found, i.e., the speaker corresponding to each character or word (16).
The speaker corresponding to each character or word is then compared (18) with the awakener obtained during the earlier wake-up detection, and the words that do not belong to that awakener are filtered out (17), so that all content not uttered by the awakener is discarded. The remaining words continue into the G network for final decoding (19) to obtain the sentence probability (20), and the voice interaction service is finally provided to the user based on the resulting sentence. A high-level sketch of this flow is given below.
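The sketch below strings the pieces together at a high level, reusing the helpers sketched earlier (detect_wake_word, filter_by_voiceprint) and assuming hypothetical recognize_words and rescore_sentence functions; the real system of fig. 2 interleaves these steps inside the decoder rather than running them strictly in sequence.

```python
# High-level sketch of the fig. 2 flow under the stated assumptions; `models` is a
# dictionary of hypothetical components, not an API defined by the disclosure.
def voice_interaction(first_audio, second_audio, sr, models):
    if not detect_wake_word(first_audio, sr, models["wake"]):
        return None                                      # wake-up failed, keep listening
    # Voiceprint of the awakener (taken over the whole first utterance for simplicity).
    first_vp = models["embed"](first_audio, sr)
    words = models["recognize_words"](second_audio, sr)  # [(text, start_sec, end_sec), ...]
    command = filter_by_voiceprint(words, second_audio, sr,
                                   first_vp, models["embed"])
    return models["rescore_sentence"](command)           # final sentence used by the service
```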
In one embodiment of the present disclosure, during wake-up detection the first voiceprint feature may further be compared with the acoustic features in a voiceprint feature library, where the acoustic features in the library are those of registered users. In this way it can also be determined whether the user who uttered the first voice is a registered user. Optionally, wake-up may be regarded as truly successful only when the wake-up word is present in the first voice and the user who uttered it is a registered user. Optionally, if no acoustic feature matching the first voiceprint feature exists in the voiceprint feature library, a new user may be registered with the first voiceprint feature and the first voiceprint feature saved in the library, as sketched below.
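The sketch below shows one way such a voiceprint feature library could behave, using a simple in-memory dictionary and cosine similarity; the data layout, the user-id scheme and the 0.7 threshold are assumptions made for illustration.

```python
# Minimal in-memory voiceprint library: identify a registered user or enroll a new one.
import numpy as np

class VoiceprintLibrary:
    def __init__(self, threshold: float = 0.7):
        self.users: dict[str, np.ndarray] = {}   # user id -> stored voiceprint
        self.threshold = threshold

    def identify_or_register(self, voiceprint: np.ndarray) -> str:
        for user_id, stored in self.users.items():
            sim = float(np.dot(voiceprint, stored) /
                        (np.linalg.norm(voiceprint) * np.linalg.norm(stored) + 1e-9))
            if sim > self.threshold:
                return user_id                    # matched an already-registered user
        new_id = f"user_{len(self.users) + 1}"    # no match: register as a new user
        self.users[new_id] = voiceprint
        return new_id
```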
FIG. 3 shows a schematic flow chart diagram of a voice interaction method according to another embodiment of the present disclosure. The method shown in fig. 3 may be performed by a device supporting voice interaction services, for example, a smart speaker.
The voice interaction method described above with reference to fig. 1 supports a scenario in which the user first speaks the wake-up word to wake the device and then issues a voice instruction telling the device to perform the corresponding operation. Taking the smart sound box as an example, the user interacts with it by speaking in the pattern "wake-up word + voice interaction instruction".
In the embodiment described below with reference to fig. 3, the user may either speak the wake-up word first and then issue the voice instruction, or issue the voice instruction first and then speak the wake-up word. That is, the user can interact in the pattern "wake-up word + voice interaction instruction" or in the pattern "voice interaction instruction + wake-up word". For example, if the wake-up word of the smart sound box is "small A", the user may instruct it to play music by saying "small A, play a song for me" or by saying "play a song for me, small A".
Referring to fig. 3, in step S210, wake-up detection is performed for received voice.
Speech as referred to herein may be speech detected from the detection of speech activity to the end of speech activity. The received speech may include speech of one or more persons.
Wake-up detection on the received voice mainly checks whether a specific wake-up word exists in the voice. If the wake-up word is detected, wake-up is judged successful, and the subsequent speech recognition process can then recognize the target user's voice interaction instruction from the voice. If no wake-up word is detected, wake-up fails, and the wake-up detection procedure continues for subsequently received speech. For details of wake-up detection, see the description above; they are not repeated here.
In step S220, if the wake-up is successful, a first voiceprint feature of a first voice portion corresponding to the wake-up word in the voice is obtained.
The first voiceprint feature is again the voiceprint feature of the voice segment detected as the wake-up word. It can be used to characterize the awakener.
In step S230, a speech recognition result of a second speech portion in the speech matching the first voiceprint feature is determined.
At step 240, a service is provided to the user based on the speech recognition result.
When the voice contains the voices of several speakers, the voice part matching the awakener's voiceprint can be selected from the voice according to the first voiceprint feature, and the corresponding service can be provided to the user based on the speech recognition result of the selected voice part.
As an example of the present disclosure, the text content of the voice may first be recognized by a speech recognition technique, and each recognized character or word may then be associated with the corresponding voice segment in the voice. A second voiceprint feature can thus be obtained for the voice segment corresponding to each character or word in the text content, and each such second voiceprint feature can be compared with the first voiceprint feature to obtain their similarity. Finally, the speech recognition result is obtained from the characters or words whose voice segments have a second voiceprint feature with a similarity to the first voiceprint feature greater than a predetermined threshold.
As another example of the present disclosure, the speech recognition result of the voice part that does not match the first voiceprint feature may instead be removed, so that the speech recognition result finally retained is that of the voice part matching the first voiceprint feature. For example, the text content of the voice may be recognized and each recognized character or word associated with its corresponding voice segment. A second voiceprint feature is then obtained for the voice segment corresponding to each character or word, and the characters or words whose voice segments have a second voiceprint feature with a similarity to the first voiceprint feature smaller than a predetermined threshold are removed. The speech recognition result of the second voice part matching the first voiceprint feature is then obtained from the characters or words that remain.
The implementation process of recognizing the text content of the speech by using the speech recognition technology can refer to the above related description, and is not described herein again.
Fig. 4 shows a schematic block diagram of the structure of an electronic device according to an embodiment of the present disclosure. The electronic device 400 shown in fig. 4 may be any electronic device supporting voice interaction services, such as a smart speaker.
Referring to fig. 4, the electronic device 400 includes a voice receiving means 410, a wake-up detecting means 420, an acoustic feature acquiring means 430, a voice recognition result determining means 440, and a service means 450.
In one embodiment of the present disclosure, the voice receiving device 410 is used for receiving the voice of the user. The wake-up detection means 420 is used for wake-up detection for the received voice. The acoustic feature obtaining device 430 is configured to obtain a first voiceprint feature of a first voice portion corresponding to a wakeup word in the voice if the wakeup is successful. The speech recognition result determining means 440 is configured to determine a speech recognition result of a second speech portion in the speech that matches the first voiceprint feature. The service device 450 is used for providing services for the user based on the voice recognition result.
The specific implementation manner of the electronic device 400 according to the exemplary embodiment of the present disclosure may be implemented by referring to the related specific implementation manner described above in conjunction with fig. 3, and is not described herein again.
In another embodiment of the present disclosure, the voice receiving device 410 may receive the first voice, and the wake-up detecting device 420 may be configured to perform wake-up detection on the received first voice. In case the wake-up is successful, the acoustic feature obtaining device 430 may obtain a first voiceprint feature of the first voice. The voice receiving apparatus 410 may also continue to receive a second voice after the first voice. The speech recognition result determining means 440 may determine a speech recognition result of a speech portion of the second speech matching the first voiceprint feature. The service device 450 may provide a service to the user based on a speech recognition result of a speech portion of the second speech that matches the first voiceprint feature.
The specific implementation manner of the electronic device 400 according to the exemplary embodiment of the present disclosure may be implemented by referring to the related specific implementation manner described above in conjunction with fig. 1, and is not described herein again.
Fig. 5 shows a schematic block diagram of the structure of a voice interaction device according to an embodiment of the present disclosure. Wherein the functional blocks of the voice interaction device can be implemented by hardware, software, or a combination of hardware and software that implement the principles of the present disclosure. It will be appreciated by those skilled in the art that the functional blocks described in fig. 5 may be combined or divided into sub-blocks to implement the principles of the invention described above. Thus, the description herein may support any possible combination, or division, or further definition of the functional modules described herein.
In the following, functional modules that the voice interaction apparatus can have and operations that each functional module can perform are briefly described, and details related thereto may be referred to the above description, and are not repeated here.
Referring to fig. 5, the voice interaction apparatus 500 includes a wake-up detection module 510, an acquisition module 520, a reception module 530, a determination module 540, and a service module 550.
In one embodiment of the disclosure, the wake-up detection module 510 is configured to perform wake-up detection on the received first voice. The obtaining module 520 is configured to obtain a first acoustic feature of the first voice if wake-up succeeds. The receiving module 530 is configured to receive a second voice after the first voice. The determining module 540 is configured to determine a speech recognition result of the voice part in the second voice that matches the first acoustic feature. The service module 550 is configured to provide a service to the user based on the speech recognition result.
Optionally, the determining module 540 may include a recognition module, a second acoustic feature obtaining module, an audio segment determining module and a recognition result obtaining module. The recognition module is configured to recognize the text content of the second voice. The second acoustic feature obtaining module is configured to obtain a second acoustic feature of the voice segment in the second voice corresponding to each character or word in the text content. The audio segment determining module is configured to determine the voice segments whose second acoustic feature has a similarity to the first acoustic feature greater than a predetermined threshold. The recognition result obtaining module is configured to obtain the speech recognition result based on the characters or words corresponding to the determined voice segments.
Optionally, the voice interaction apparatus 500 may further include a removing module, configured to remove a voice recognition result of a voice portion of the second voice that does not match the first acoustic feature.
In another embodiment of the present disclosure, the wake detection module 510 is configured to perform wake detection on the voice received by the receiving module 530. The obtaining module 520 is configured to obtain a first acoustic feature of a first voice portion corresponding to a wakeup word in a voice if the wakeup is successful. The determining module 540 is configured to determine a speech recognition result of a second speech portion in the speech that matches the first acoustic feature. The service module 550 is used for providing services for the user based on the voice recognition result.
Alternatively, the determination module 540 may include a recognition module, a second acoustic feature acquisition module, an audio piece determination module, and a recognition result acquisition module. The recognition module is used for recognizing the text content of the voice. The second acoustic feature obtaining module is used for obtaining a second acoustic feature of the voice segment in the voice corresponding to each word or phrase in the text content. The audio segment determining module is used for determining a voice segment of which the similarity between the second acoustic feature and the first acoustic feature is greater than a preset threshold value, and the recognition result obtaining module is used for obtaining a voice recognition result based on a word or a word corresponding to the determined voice segment.
Optionally, the voice interaction apparatus 500 may further include a removal module for removing a voice recognition result of a voice portion of the voice that does not match the first acoustic feature.
As an example, the voice interaction apparatus 500 may further include a comparison module and a registration module. The comparison module is used for comparing the first acoustic features with acoustic features in an acoustic feature library, wherein the acoustic features in the acoustic feature library are acoustic features of registered users. The registration module is used for registering the first acoustic feature as a new user under the condition that the acoustic feature matched with the first acoustic feature does not exist in the acoustic feature library, and storing the first acoustic feature into the acoustic feature library.
The specific implementation manner of the voice interaction apparatus 500 according to the exemplary embodiment of the present disclosure may be implemented by referring to the related specific implementation manner described above in conjunction with fig. 1 to 3, and is not described herein again.
FIG. 6 is a schematic structural diagram of a computing device that can be used to implement the voice interaction method according to an embodiment of the present disclosure.
Referring to fig. 6, computing device 600 includes memory 610 and processor 620.
The processor 620 may be a multi-core processor or may include a plurality of processors. In some embodiments, the processor 620 may include a general-purpose main processor and one or more special-purpose coprocessors, such as a Graphics Processing Unit (GPU) or a Digital Signal Processor (DSP). In some embodiments, the processor 620 may be implemented using custom circuits, such as an Application-Specific Integrated Circuit (ASIC) or a Field-Programmable Gate Array (FPGA).
The memory 610 may include various types of storage units, such as system memory, Read-Only Memory (ROM) and permanent storage. The ROM may store static data or instructions required by the processor 620 or other modules of the computer. The permanent storage may be a readable and writable storage device, and may be a non-volatile device that does not lose stored instructions and data even after the computer is powered off. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is used as the permanent storage; in other embodiments, the permanent storage may be a removable storage device (e.g., a floppy disk or an optical drive). The system memory may be a readable and writable memory device or a volatile readable and writable memory device, such as dynamic random-access memory, and may store instructions and data needed by some or all of the processors at runtime. In addition, the memory 610 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory) and magnetic and/or optical disks. In some embodiments, the memory 610 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM or dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density optical disc, a flash memory card (e.g., SD card, mini SD card, Micro-SD card), or a magnetic floppy disk. Computer-readable storage media do not contain carrier waves or transitory electronic signals transmitted by wireless or wired means.
The memory 610 has stored thereon executable code that, when processed by the processor 620, causes the processor 620 to perform the voice interaction methods described above.
The voice interaction method, apparatus and device according to the present disclosure have been described in detail above with reference to the accompanying drawings.
Furthermore, the method according to the present disclosure may also be implemented as a computer program or computer program product comprising computer program code instructions for performing the above-mentioned steps defined in the above-mentioned method of the present disclosure.
Alternatively, the present disclosure may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the various steps of the above-described method according to the present disclosure.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (19)

1. A method of voice interaction, comprising:
performing wake-up detection on the received first voice;
acquiring a first voiceprint feature of the first voice under the condition of successful awakening;
receiving a second voice subsequent to the first voice;
determining a voice recognition result of a voice part matched with the first voiceprint feature in the second voice;
and providing services for the user based on the voice recognition result.
2. The method of claim 1, wherein the step of determining the speech recognition result of the voice part in the second voice that matches the first voiceprint feature comprises:
recognizing the text content of the second voice;
acquiring second voiceprint characteristics of a voice fragment in the second voice corresponding to each character or word in the text content;
determining a voice segment of which the similarity of the second voiceprint feature and the first voiceprint feature is greater than a preset threshold;
and obtaining the voice recognition result based on the characters or words corresponding to the determined voice segments.
3. The voice interaction method of claim 1, further comprising:
and removing the voice recognition result of the voice part which is not matched with the first voiceprint characteristic in the second voice.
4. The method according to claim 3, wherein the step of removing the speech recognition result of the speech part of the second speech not matching the first voiceprint feature comprises:
recognizing the text content of the second voice;
acquiring second voiceprint characteristics of a voice fragment in the second voice corresponding to each character or word in the text content;
and removing the characters or words corresponding to the voice segments with the similarity between the second voiceprint feature and the first voiceprint feature smaller than a preset threshold value.
5. The voice interaction method of claim 1,
the first voiceprint feature is a voiceprint feature of a voice portion corresponding to the wake-up word in the first voice.
6. The voice interaction method of claim 1, further comprising:
comparing the first voiceprint feature with acoustic features in a voiceprint feature library, wherein the acoustic features in the voiceprint feature library are acoustic features of registered users;
and under the condition that no acoustic feature matching the first voiceprint feature exists in the voiceprint feature library, registering the first voiceprint feature as a new user and saving the first voiceprint feature in the voiceprint feature library.
7. A method of voice interaction, comprising:
performing wake-up detection on the received voice;
under the condition of successful awakening, acquiring a first voiceprint feature of a first voice part corresponding to an awakening word in the voice;
determining a speech recognition result of a second speech portion in the speech that matches the first voiceprint feature;
and providing services for the user based on the voice recognition result.
8. The method of claim 7, wherein the step of determining the result of the speech recognition of the second portion of speech in the speech that matches the first voiceprint feature comprises:
recognizing text content of the voice;
acquiring a second voiceprint characteristic of a voice fragment in the voice corresponding to each word or phrase in the text content;
determining a voice segment of which the similarity of the second voiceprint feature and the first voiceprint feature is greater than a preset threshold;
and obtaining the voice recognition result based on the characters or words corresponding to the determined voice segments.
9. The voice interaction method of claim 7, further comprising:
and removing a voice recognition result of a third voice part which does not match with the first voiceprint characteristic in the voice.
10. The method according to claim 9, wherein the step of removing the voice recognition result of the third voice portion of the voice not matching the first voiceprint feature comprises:
recognizing text content of the voice;
acquiring a second voiceprint characteristic of a voice fragment in the voice corresponding to each word or phrase in the text content;
and removing the characters or words corresponding to the voice segments with the similarity between the second voiceprint feature and the first voiceprint feature smaller than a preset threshold value.
11. The voice interaction method of claim 7, further comprising:
comparing the first voiceprint feature with acoustic features in a voiceprint feature library, wherein the acoustic features in the voiceprint feature library are acoustic features of registered users;
and under the condition that no acoustic feature matching the first voiceprint feature exists in the voiceprint feature library, registering the first voiceprint feature as a new user and saving the first voiceprint feature in the voiceprint feature library.
12. A method of voice interaction, comprising:
performing wake-up detection on the received first voice;
acquiring a first acoustic feature of the first voice under the condition of successful awakening;
receiving a second voice subsequent to the first voice;
determining a speech recognition result of a speech portion of the second speech that matches the first acoustic feature;
and providing services for the user based on the voice recognition result.
13. A method of voice interaction, comprising:
performing wake-up detection on the received voice;
under the condition that awakening is successful, acquiring first acoustic features of a first voice part corresponding to awakening words in the voice;
determining a speech recognition result of a second speech portion in the speech that matches the first acoustic feature;
and providing services for the user based on the voice recognition result.
14. An electronic device for providing a voice interaction service, comprising:
a voice receiving device for receiving a voice of a user;
a wake-up detection device for performing wake-up detection on the received voice;
an acoustic feature acquisition device for acquiring a first voiceprint feature of a first voice part corresponding to a wake-up word in the voice under the condition of successful awakening;
a speech recognition result determining device for determining a speech recognition result of a second voice part in the voice that matches the first voiceprint feature;
and a service device for providing a service for the user based on the speech recognition result.
15. The electronic device of claim 14, wherein the electronic device is a smart speaker.
16. A voice interaction apparatus, comprising:
a wake-up detection module for performing wake-up detection on the received first voice;
an acquisition module for acquiring a first acoustic feature of the first voice under the condition of successful awakening;
a receiving module for receiving a second voice after the first voice;
a determination module for determining a speech recognition result of a voice part in the second voice that matches the first acoustic feature;
and a service module for providing a service for the user based on the speech recognition result.
17. A voice interaction apparatus, comprising:
a wake-up detection module for performing wake-up detection on the received voice;
an acquisition module for acquiring a first acoustic feature of a first voice part corresponding to a wake-up word in the voice under the condition of successful awakening;
a determination module for determining a speech recognition result of a second voice part in the voice that matches the first acoustic feature;
and a service module for providing a service for the user based on the speech recognition result.
18. A computing device, comprising:
a processor; and
a memory having executable code stored thereon, which, when executed by the processor, causes the processor to perform the method of any one of claims 1 to 13.
19. A non-transitory machine-readable storage medium having executable code stored thereon, which, when executed by a processor of an electronic device, causes the processor to perform the method of any one of claims 1 to 13.
CN201910196765.XA 2019-03-15 2019-03-15 Voice interaction method, device, equipment and storage medium Pending CN111768769A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910196765.XA CN111768769A (en) 2019-03-15 2019-03-15 Voice interaction method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910196765.XA CN111768769A (en) 2019-03-15 2019-03-15 Voice interaction method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111768769A true CN111768769A (en) 2020-10-13

Family

ID=72717998

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910196765.XA Pending CN111768769A (en) 2019-03-15 2019-03-15 Voice interaction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111768769A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9098467B1 (en) * 2012-12-19 2015-08-04 Rawles Llc Accepting voice commands based on user identity
CN105488227A (en) * 2015-12-29 2016-04-13 惠州Tcl移动通信有限公司 Electronic device and method for processing audio file based on voiceprint features through same
CN106653021A (en) * 2016-12-27 2017-05-10 上海智臻智能网络科技股份有限公司 Voice wake-up control method and device and terminal
CN108766446A (en) * 2018-04-18 2018-11-06 上海问之信息科技有限公司 Method for recognizing sound-groove, device, storage medium and speaker
CN108766441A (en) * 2018-05-29 2018-11-06 广东声将军科技有限公司 A kind of sound control method and device based on offline Application on Voiceprint Recognition and speech recognition
CN108922538A (en) * 2018-05-29 2018-11-30 平安科技(深圳)有限公司 Conferencing information recording method, device, computer equipment and storage medium
CN108962260A (en) * 2018-06-25 2018-12-07 福来宝电子(深圳)有限公司 A kind of more human lives enable audio recognition method, system and storage medium
CN108847216A (en) * 2018-06-26 2018-11-20 联想(北京)有限公司 Method of speech processing and electronic equipment, storage medium
CN109192213A (en) * 2018-08-21 2019-01-11 平安科技(深圳)有限公司 The real-time transfer method of court's trial voice, device, computer equipment and storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112562685A (en) * 2020-12-10 2021-03-26 上海雷盎云智能技术有限公司 Voice interaction method and device for service robot
CN113921016A (en) * 2021-10-15 2022-01-11 阿波罗智联(北京)科技有限公司 Voice processing method, device, electronic equipment and storage medium
EP4099320A3 (en) * 2021-10-15 2023-07-19 Apollo Intelligent Connectivity (Beijing) Technology Co., Ltd. Method and apparatus of processing speech, electronic device, storage medium, and program product
CN115312068A (en) * 2022-07-14 2022-11-08 荣耀终端有限公司 Voice control method, device and storage medium
CN117153166A (en) * 2022-07-18 2023-12-01 荣耀终端有限公司 Voice wakeup method, equipment and storage medium

Similar Documents

Publication Publication Date Title
US8606581B1 (en) Multi-pass speech recognition
WO2018149209A1 (en) Voice recognition method, electronic device, and computer storage medium
CN111768769A (en) Voice interaction method, device, equipment and storage medium
US11817094B2 (en) Automatic speech recognition with filler model processing
CN107016994B (en) Voice recognition method and device
WO2018188586A1 (en) Method and device for user registration, and electronic device
KR100655491B1 (en) Two stage utterance verification method and device of speech recognition system
US9466289B2 (en) Keyword detection with international phonetic alphabet by foreground model and background model
WO2017071182A1 (en) Voice wakeup method, apparatus and system
US20190005961A1 (en) Method and device for processing voice message, terminal and storage medium
EP3210205B1 (en) Sound sample verification for generating sound detection model
US7805304B2 (en) Speech recognition apparatus for determining final word from recognition candidate word sequence corresponding to voice data
JP2018536905A (en) Utterance recognition method and apparatus
US10733986B2 (en) Apparatus, method for voice recognition, and non-transitory computer-readable storage medium
US20080189106A1 (en) Multi-Stage Speech Recognition System
KR20180024807A (en) Method and apparatus for speech recognition based on speaker recognition
CN110827795A (en) Voice input end judgment method, device, equipment, system and storage medium
CN109036471B (en) Voice endpoint detection method and device
US11386887B1 (en) Natural language processing using context
CN115910043A (en) Voice recognition method and device and vehicle
US20090012792A1 (en) Speech recognition system
CN114385800A (en) Voice conversation method and device
US20240029743A1 (en) Intermediate data for inter-device speech processing
US20170270923A1 (en) Voice processing device and voice processing method
US11157696B1 (en) Language agnostic phonetic entity resolution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20201013