CN116564316B - Voice man-machine interaction method and device

Info

Publication number: CN116564316B (application CN202310843070.2A; earlier publication CN116564316A)
Authority: CN (China)
Legal status: Active (granted)
Prior art keywords: information, current, instruction, historical, interaction
Inventors: 钟雨崎, 艾国, 杨作兴
Assignee: Beijing Bianfeng Information Technology Co., Ltd.

Classifications

    • G10L17/22 — Speaker identification or verification techniques; interactive procedures; man-machine interfaces
    • G10L15/08 — Speech recognition; speech classification or search
    • G10L15/22 — Speech recognition; procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L17/18 — Speaker identification or verification techniques; artificial neural networks; connectionist approaches
    • G10L2015/088 — Word spotting
    • G10L2015/223 — Execution procedure of a spoken command
    • Y02P90/02 — Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Abstract

The application discloses a voice man-machine interaction method and device. The method comprises: acquiring a current voice signal and performing voice detection; in the case that voice is detected, performing voice content recognition on the detected voice; in the case that the recognized voice content includes an instruction word in the currently allowed instruction word set, generating a current instruction based on that instruction word, wherein the currently allowed instruction word set is determined according to the current running state of the controlled device; checking the current instruction by utilizing the historical instruction information in the historical interaction information that is the same as the current instruction information, wherein the historical interaction information comprises instruction information generated in previous voice man-machine interactions; determining whether to execute the current instruction according to the check result; and recording the current instruction information as the interaction information of the current interaction. The application realizes reliable interaction without any wake-up word.

Description

Voice man-machine interaction method and device
Technical Field
The application relates to the field of intelligent home, in particular to a voice man-machine interaction method and device.
Background
With the development of voice recognition and keyword recognition technologies, electronic intelligent devices now generally have the capability of voice man-machine interaction; however, current voice man-machine interaction requires wake-up and confirmation before a voice instruction can be executed.
Referring to fig. 1, fig. 1 is a schematic flow chart of the prior-art process in which an instruction is executed only after a wake-up word is recognized. The intelligent device detects whether voice is received; in the case that voice is detected, it recognizes the wake-up word in the detected voice; if the wake-up word is correct, it obtains the instruction content contained in the voice; in the case that the instruction content is correctly recognized, it executes the instruction; if any step fails, no instruction is executed.
Therefore, in the existing voice man-machine interaction process, both the wake-up word and the instruction content must be recognized before interaction can proceed. For example, existing smart speakers on the market all need to be called with a wake-up word, such as "XX Genie" or "XX Classmate", and interaction must follow this format; once the wake-up word is incorrect, no further response is produced. This man-machine interaction mode is cumbersome, and the user experience is poor.
Disclosure of Invention
The application provides a voice man-machine interaction method which can accurately realize the expected man-machine interaction even without a wake-up word.
The first aspect of the application provides a voice man-machine interaction method, which comprises the following steps:
acquiring a current voice signal and performing voice detection, and performing voice content recognition on the detected voice in the case that voice is detected, wherein the current voice signal does not comprise any wake-up word,
in the case that the recognized speech content includes instruction words in a currently allowed instruction word set, generating a current instruction based on the included instruction words, wherein the currently allowed instruction word set is determined according to a current operation state of the controlled device,
and checking the current instruction by utilizing the historical instruction information which is the same as the current instruction information in the historical interaction information, wherein the historical interaction information comprises: instruction information generated in the process of voice man-machine interaction in the past,
and determining whether to execute the current instruction according to the checking result.
Preferably, the verifying the current instruction by using the historical instruction information which is the same as the current instruction information in the historical interaction information includes:
searching the historical instruction information which is the same as the current instruction information in the historical interaction information by taking the current instruction information as a searching basis,
checking the current instruction information by utilizing the historical execution time length of the searched historical instruction information to obtain a first checking result;
and determining whether to execute the current instruction according to the checking result, including:
determining whether to execute the current instruction according to the first checking result, and recording the current instruction information as the interactive information;
the recording the current instruction information as the current interaction information comprises the following steps:
recording the current instruction information and the execution result of the current instruction as the interactive information,
wherein,
the execution result of the current instruction at least comprises: the execution duration in the case where the current instruction is executed;
and the execution time length is determined according to the time interval between the instruction executed by the current interaction and the instruction executed by the last interaction.
Preferably, the verifying the current instruction by using the historical instruction information which is the same as the current instruction information in the historical interaction information includes:
checking the current state information of the current instruction information by utilizing the historical running state information corresponding to the searched historical instruction information to obtain a second checking result;
and determining whether to execute the current instruction according to the checking result, including:
determining whether to execute the current instruction according to the voting results of the first checking result and the second checking result,
and recording the current running state as the interaction information of the current interaction.
Preferably, in the case of detecting the voice, the method further comprises:
performing sound source localization on the current voice signal to obtain current sound source localization information; and/or
Acquiring current voiceprint information of a current voice signal;
the verifying the current instruction by using the historical instruction information which is the same as the current instruction information in the historical interaction information comprises the following steps:
checking the current instruction information by utilizing the historical sound source positioning information corresponding to the searched historical instruction information to obtain a third checking result; and/or
checking the current instruction information by utilizing the historical voiceprint information corresponding to the searched historical instruction information to obtain a fourth checking result;
and determining whether to execute the current instruction according to the checking result, including:
determining whether to execute the current instruction according to the voting result of each checking result,
and recording the current sound source positioning information and/or the current voiceprint information as the interactive information.
Preferably, the verifying the current instruction information by using the historical running state information corresponding to the searched historical instruction information includes:
calculating the proportion of the historical operation state information which is the same as the current state information in the searched historical instruction information in all the historical operation states, wherein the larger the proportion value is, the larger the confidence of the current instruction information is;
the verifying the current instruction information by using the execution duration of the searched historical instruction information comprises the following steps:
counting the average value of the execution time of the searched historical instruction information, wherein the larger the average value is, the larger the confidence of the current instruction information is;
the verifying the current instruction information by using the historical sound source positioning information corresponding to the searched historical instruction information comprises the following steps:
carrying out similarity calculation on the current sound source positioning information and each historical sound source positioning information, and calculating an average value of each similarity, wherein the larger the average value is, the larger the confidence of the current instruction information is;
the verifying the current instruction information by using the historical voiceprint information corresponding to the searched historical instruction information comprises the following steps:
and carrying out Euclidean distance calculation on the current voiceprint information and each piece of historical voiceprint information, and calculating the average value of each Euclidean distance, wherein the smaller the average value is, the larger the confidence of the current instruction information is.
Preferably, the determining whether to execute the current instruction according to the voting result of each checking result includes:
if the average value of the execution time length is larger than the set first threshold value, the first voting result given to the first check result is valid,
if the average value of the similarity is larger than the set second threshold value, the third voting result given to the third checking result is valid, and/or if the average value of the Euclidean distances is smaller than the set third threshold value, the fourth voting result given to the fourth checking result is valid,
and counting each effective voting result, and triggering to execute the current instruction under the condition that the number of the effective voting results is larger than a set number threshold value.
Preferably, the determining whether to execute the current instruction according to the voting result of the first checking result and the second checking result includes:
if the average value of the execution time length is larger than the set first threshold value, the first voting result given to the first check result is valid,
if the proportion value of the current state in all the historical operating states is larger than the set second threshold value, the second voting result given to the second checking result is valid,
and counting each effective voting result, and triggering to execute the current instruction under the condition that the number of the effective voting results is larger than a set number threshold value.
Preferably, the method further comprises:
checking whether the time interval information between the instruction executed by the current interaction and the instruction executed by the last interaction is larger than a set interval threshold value,
if yes, taking the time interval information between the instruction executed by the current interaction and the instruction executed by the last interaction as the execution duration of the last instruction and marking the recorded last interaction information as a positive sample; otherwise, taking that time interval information as the execution duration of the last instruction and marking the recorded last interaction information as a negative sample, or deleting the last interaction information;
training a neural network model for information verification by using the recorded positive samples and negative samples to obtain a trained neural network model; or respectively training at least one of a first neural network model for sound source positioning information verification, a second neural network model for voiceprint information verification, a third neural network model for running state information verification and a fourth neural network model for verification of the current instruction information itself, to obtain each trained neural network model;
The verifying the current instruction by using the historical instruction information which is the same as the current instruction information in the historical interaction information comprises the following steps:
and verifying at least one of the current sound source positioning information, the current voiceprint information, the current running state information and the current instruction information by the trained neural network model.
A second aspect of the present application provides a voice man-machine interaction device, the interaction device comprising:
a detection module for acquiring a current voice signal and performing voice detection, performing voice content recognition on the detected voice in the case that voice is detected, and generating a current instruction based on the included instruction word in the case that the recognized voice content includes an instruction word in the currently allowed instruction word set, wherein the current voice signal does not include any wake-up word and the currently allowed instruction word set is determined according to the current running state of the controlled device,
the verification module is used for verifying the current instruction by utilizing the historical instruction information which is the same as the current instruction information in the historical interaction information, wherein the historical interaction information comprises: instruction information generated in the process of voice man-machine interaction in the past,
and the determining module is used for determining whether to execute the current instruction according to the checking result.
A third aspect of the application provides an electronic device comprising a memory storing a computer program and a processor configured to perform the steps of any of the voice human-machine interaction methods.
According to the voice man-machine interaction method provided by the application, because the recognized voice content is matched against the currently allowed instruction word set determined according to the current running state of the controlled device, instructions can be executed without any wake-up word; and because the current instruction is checked using the historical instruction information that is the same as the current instruction information in the historical interaction information, false interactions are suppressed. Therefore, the expected man-machine interaction can be accurately realized even without a wake-up word, which greatly improves the user interaction experience and the reliability of interaction.
Drawings
Fig. 1 is a flow chart of the prior-art process in which an instruction is executed only after a wake-up word is recognized.
Fig. 2 is a schematic flow chart of the voice man-machine interaction method of the present application.
Fig. 3 is a schematic flow chart of a voice man-machine interaction method according to the first embodiment of the application.
Fig. 4 is a flow chart of a voice man-machine interaction method according to a second embodiment of the application.
Fig. 5 is a schematic diagram of a voice man-machine interaction device according to an embodiment of the application.
Fig. 6 is another schematic diagram of a voice man-machine interaction device according to an embodiment of the application.
Fig. 7 is another schematic diagram of a voice man-machine interaction device according to an embodiment of the application.
Fig. 8 is another schematic diagram of a voice man-machine interaction device according to an embodiment of the application.
Detailed Description
The present application will be described in further detail with reference to the accompanying drawings, in order to make the objects, technical means and advantages of the present application more apparent.
Referring to fig. 2, fig. 2 is a schematic flow chart of a voice man-machine interaction method according to the present application. The method, performed on the controlled device side or on the central control side, includes the following steps:
step 201, obtaining a current voice signal and performing voice detection, and performing voice content recognition on the detected voice under the condition that the voice is detected, wherein the current voice signal does not comprise any wake-up word,
step 202, in the case that the recognized speech content includes instruction words of the currently allowed instruction word set, generating a current instruction based on the included instruction words,
wherein,
the currently allowed instruction word set is determined according to the current running state of the controlled device; it is a subset of a pre-established instruction word set, and the pre-established instruction word set can be updated according to the historical interaction information.
The historical interaction information includes: instruction information generated in previous voice man-machine interactions, which is historical instruction information relative to the current interaction.
Step 203, checking the current instruction by using the same historical instruction information as the current instruction information in the historical interaction information,
step 204, determining whether to execute the current instruction according to the checking result, and recording the current instruction information as the current interaction information so as to form the history interaction information.
According to the embodiment of the application, voice signals irrelevant to the controlled device can be filtered out through the currently allowed instruction word set determined according to the current running state of the controlled device, so that the device can be activated without any wake-up word. The current instruction is then checked using the historical instruction information that is the same as the current instruction information in the historical interaction information, which effectively suppresses false interaction instructions, improves the reliability of voice man-machine interaction, and realizes voice man-machine interaction without wake-up words.
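To make the control flow concrete, the following is a minimal Python sketch of steps 201-204 under stated assumptions: the recognizer is replaced by already-recognized text, and verify() is only a placeholder for the history checks that the embodiments below make concrete; all function and field names are illustrative, not the patented implementation.

```python
def handle(text: str, allowed: set[str], history: list[dict]) -> str:
    """One pass of steps 201-204 over already-recognized voice content."""
    hit = next((w for w in allowed if w in text), None)        # step 202
    if hit is None:
        return "ignored"             # no allowed instruction word in the content
    same = [h for h in history if h["instruction"] == hit]     # step 203
    if same and not verify(hit, same):
        return "suppressed"          # history check failed: likely false interaction
    history.append({"instruction": hit})                       # step 204: record
    return "executed"

def verify(instruction: str, same_history: list[dict]) -> bool:
    return True   # placeholder; concrete checks appear in the embodiments below
```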
For the sake of understanding the present application, the smart home system is taken as an example, and it should be understood that the embodiments of the present application are not limited to smart home systems, and are equally applicable to smart devices alone or systems with central control, such as vehicle central control.
Example 1
Referring to fig. 3, fig. 3 is a schematic flow chart of a voice man-machine interaction method according to a first embodiment of the application. On the central controller side, the method comprises the following steps:
step 301, determining a currently allowed instruction word set according to the current running state.
In a smart home system, as an example, a central controller obtains the current operating state of each smart device,
for example: the current smart home system is connected to smart devices A, B and C, none of which is running, and the system is not performing any task; the allowed instruction word set then includes: opening A, opening B, opening C and other relevant initial instruction words of the system, while the antonyms of these instruction words, such as closing A, closing B and closing C, are not included.
Also for example: the current smart home system is connected to smart devices A, B and C, smart device A is running, smart devices B and C are not running, and the system is not performing any task; the allowed instruction word set then includes: closing A, opening B, opening C and other relevant initial instruction words of the system, while the antonyms of these instruction words, such as opening A, closing B and closing C, are not included.
As another example, for a single smart device, the controller of the smart device obtains the current operating state of the smart device,
for example, if the smart device is currently in a closed state, the supportable instruction word set includes: instruction words associated with the closed state, such as turning the device on; if the smart device is currently running and a function is in operation, the supportable instruction word set includes: instruction words associated with that functional running state.
Taking a smart speaker as an example, if the smart speaker is outputting audio, the supportable instruction word set includes: instruction words for controlling the volume, e.g., increase sound, decrease sound, and instruction words for controlling the sound effect mode, e.g., stereo surround, subwoofer, etc.
It should be appreciated that the allowed instruction word set may include fuzzy instruction words with similar meanings; for example, instruction words similar to increasing sound may also include, but are not limited to: louder, a bit louder, too quiet, can't hear it, etc.
The allowed instruction word set is a subset of the system instruction word set, which may be pre-generated and maintained and updated according to the executed history instructions recorded during the history interactions.
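As an illustration of how a state-dependent word set could be derived, the sketch below encodes the smart-home examples above; the device-to-state mapping and the fixed system word table are assumptions made only for illustration.

```python
# Full system instruction word table: one entry per (device, state) pair.
SYSTEM_WORD_SET = {
    ("A", "off"): "open A", ("A", "on"): "close A",
    ("B", "off"): "open B", ("B", "on"): "close B",
    ("C", "off"): "open C", ("C", "on"): "close C",
}

def allowed_word_set(states: dict[str, str]) -> set[str]:
    # For each device, only the transition away from its current state is
    # allowed; the antonym instruction is excluded automatically.
    return {SYSTEM_WORD_SET[(dev, st)] for dev, st in states.items()}

# Device A running, B and C stopped -> {'close A', 'open B', 'open C'}
print(allowed_word_set({"A": "on", "B": "off", "C": "off"}))
```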
Step 302, a current voice signal is obtained, voice detection is performed on the current voice signal, and in the case that voice is detected, voice content recognition is performed on the detected voice.
As one example, a Conformer neural network structure may be employed to implement voice content recognition.
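As a hedged sketch of that choice, torchaudio ships a Conformer encoder that could serve as the acoustic backbone; the hyperparameters below are assumed, and the decoder that turns encoder output into instruction-word text is omitted.

```python
import torch
import torchaudio

# Conformer encoder over 80-dim log-mel features (hyperparameters assumed).
encoder = torchaudio.models.Conformer(
    input_dim=80, num_heads=4, ffn_dim=256, num_layers=8,
    depthwise_conv_kernel_size=31,
)
feats = torch.randn(1, 200, 80)      # (batch, frames, feature dim)
lengths = torch.tensor([200])
encoded, encoded_lengths = encoder(feats, lengths)
```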
It should be understood that step 301 and step 302 may not be in strict order, e.g., step 301 may follow step 302.
Step 303, according to the current allowed instruction word set, judging whether the recognized voice content contains the instruction word in the current allowed instruction word set,
if yes, generating current instruction information based on the instruction word, creating identification information for the current interaction, and recording the current instruction information and its identification information as the interaction information of the current interaction;
if the recognized voice content does not contain any instruction word in the currently allowed instruction word set, this indicates that the recognized voice content is erroneous or irrelevant, and the process ends.
Step 304, searching whether the history instruction information which is the same as the current instruction information exists in the history interaction information based on the current instruction information,
if so, step 305 is performed to check the current instruction,
Otherwise, the current instruction is executed, and step 307 is executed,
step 305, using the historical instruction information, checking the current instruction,
as an example, according to the current instruction information, the execution duration of each piece of historical instruction information identical to the current instruction information is counted, and the average value of the execution duration of each piece of historical instruction information is calculated to obtain a first check value, wherein the larger the average value is, the higher the confidence of the current instruction information is.
The execution duration can be determined according to the time interval between the instruction executed by the current interaction and the instruction executed by the last interaction.
Step 306, determining whether to execute the current instruction according to the first checking result.
As an example, if the first check value is greater than the set first threshold value, triggering the execution of the current instruction, and recording time interval information between the instruction executed by the current interaction and the instruction executed by the last interaction;
otherwise, the execution of the current instruction is not triggered, and the current interaction is ended.
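A minimal sketch of this duration-based check (steps 305-306); the threshold is an assumed value, since the text only speaks of a "set first threshold".

```python
from statistics import mean

FIRST_THRESHOLD_S = 60.0   # assumed value for the "set first threshold"

def first_check(durations_s: list[float]) -> bool:
    # Step 305: a large average execution duration means the same instruction
    # usually stayed in effect, so the current instruction is trustworthy.
    return mean(durations_s) > FIRST_THRESHOLD_S

print(first_check([300.0, 180.0, 240.0]))   # True: instruction usually "stuck"
print(first_check([4.0, 6.0, 3.0]))         # False: usually corrected at once
```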
Step 307, according to the recorded time interval information between the instruction executed by the current interaction and the instruction executed by the last interaction, judging whether the time interval is larger than the set interval threshold,
if yes, it indicates that the last instruction was expected by the user; the time interval information between the instruction executed by the current interaction and the instruction executed by the last interaction is taken as the execution duration of the last instruction, the recorded last interaction information is further marked as a positive sample, and the positive sample is stored as historical interaction information;
otherwise, the instruction generated by the last interaction was not expected by the user; the time interval information between the current interaction and the last interaction is taken as the execution duration of the last instruction and the interaction information recorded by the last interaction is marked as a negative sample, or the interaction information recorded by the last interaction is deleted.
Through the sample mark, the historical interaction information data and the system instruction word set can be maintained. As an example, historical interaction information data, a system instruction word set, may be managed according to sample tags, e.g., negative sample data is reported for manual maintenance.
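The labeling rule of step 307 can be sketched as follows; the threshold value and the record fields are assumptions for illustration.

```python
INTERVAL_THRESHOLD_S = 15.0   # assumed value for the "set interval threshold"

def label_previous(interval_s: float, previous: dict, store: list[dict]) -> None:
    # Step 307: the gap until the next instruction doubles as the previous
    # instruction's execution duration and decides its sample label.
    previous["execution_seconds"] = interval_s
    if interval_s > INTERVAL_THRESHOLD_S:
        previous["label"] = "positive"   # left in effect: the user wanted it
    else:
        previous["label"] = "negative"   # corrected at once (or delete instead)
    store.append(previous)
```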
According to this embodiment, voice man-machine interaction without a wake-up word can be realized through the instruction word set allowed by the current running state. By counting the execution durations of the pieces of historical instruction information identical to the current instruction information and checking the current instruction information against their average, the probability of false interaction can be suppressed and the reliability of interaction improved.
It should be understood that another implementation of step 305 is:
and 305', checking the current instruction information through the trained neural network model.
As an example, a neural network model for performing information verification is trained using the recorded positive and negative sample data, a trained neural network model is obtained, and execution of a current instruction is triggered according to a first verification result output by the neural network model.
Example two
In order to reduce false interactions in wake-up-word-free interaction and improve reliability, this embodiment further determines whether to execute the instruction based on sound source localization information, voiceprint information, running state information and instruction execution duration.
Referring to fig. 4, fig. 4 is a schematic flow chart of a voice man-machine interaction method according to a second embodiment of the application. The method comprises the following steps:
step 401, determining a currently allowed instruction word set according to the current running state.
Step 402, obtaining a current voice signal, performing voice detection on the current voice signal, performing voice content recognition on the detected voice in case of detecting voice,
further, in order to reduce false interactions in wake-up-word-free interaction and improve reliability, sound source localization is also performed to obtain current sound source localization information, thereby obtaining the position of the detected voice source relative to the instructed device.
In this step, because the intelligent devices are located differently, for example, the television is usually located in a living room, the water heater is usually located in a bathroom, and the user usually is near the instructed device when sending voice instructions to the devices, so that sound source localization is performed, which is beneficial to improving the reliability of the instructions and inhibiting the false interaction. As one example, a TDNN-LSTM neural network architecture may be employed to achieve sound source localization.
Preferably, voiceprint information is also obtained to distinguish different users. The users sending voice instructions to these devices may differ: for example, the voice instructions of family members are valid while those of non-family users, such as unrelated passers-by, are invalid. Voice instructions of invalid users can be filtered out by voiceprint information, increasing the confidence that a voice instruction comes from a target user and thereby suppressing false interactions. As one example, a Resnet50 neural network architecture may be employed to obtain voiceprint information.
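A minimal sketch of such voiceprint filtering under stated assumptions: embed() is a random placeholder standing in for a ResNet50-style voiceprint encoder, the enrollment list holds the household members' embeddings, and the acceptance radius is an assumed value.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(wav: np.ndarray) -> np.ndarray:
    # Placeholder for a ResNet50-style voiceprint encoder.
    return rng.standard_normal(128)

ENROLLED = [embed(np.zeros(16000)) for _ in range(3)]   # family voiceprints
MAX_DIST = 12.0                                         # assumed radius

def is_known_speaker(wav: np.ndarray) -> bool:
    e = embed(wav)
    return min(float(np.linalg.norm(e - m)) for m in ENROLLED) < MAX_DIST
```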
Step 403, according to the current allowed instruction word set, judging whether the recognized voice content contains the instruction word in the current allowed instruction word set,
If yes, generating current instruction information based on the instruction word, creating identification information for current interaction, and recording current voiceprint information, current sound source positioning information, current instruction information, time interval information between current interaction and last interaction, current running state information and created identification thereof;
if the recognized voice content does not contain any instruction word in the currently allowed instruction word set, this indicates that the recognized voice content is erroneous or irrelevant, and the process ends.
Step 404, searching whether the history instruction information which is the same as the current instruction information exists in the history interaction information based on the current instruction information,
if so, step 405 is performed,
otherwise, the current instruction is executed, and step 407 is executed,
step 405, based on the searched historical interaction information, checking the current sound source positioning information, the current voiceprint information, the current running state information and the current instruction information with the historical sound source positioning information, the historical voiceprint information, the historical running state information and the historical instruction information in the searched historical interaction information respectively,
as an example, the current sound source localization information is compared with the historical sound source localization information: cosine similarity is calculated between the current sound source localization information and each piece of historical sound source localization information, and the average value of these cosine similarities is calculated to obtain a third check value; the higher the average cosine similarity, the more reliable the current sound source localization information and the higher the confidence of the current instruction.
Comparing the current voiceprint information with the historical voiceprint information, for example, performing Euclidean distance calculation on the current voiceprint information and each historical voiceprint information, and calculating the average value of each Euclidean distance to obtain a fourth check value, wherein the smaller the average value of the Euclidean distances is, the more reliable the current voiceprint information is, and the higher the confidence of the current instruction is.
Comparing the current running state with the historical running state information, for example, calculating the proportion of the historical running state information identical to the current running state information among all the historical running states, yields a second check value; the larger the proportion value, the more credible the current running state and the higher the confidence of the current instruction. For example, taking an air-conditioning device whose current state is cooling, with historical running states of cooling, heating and dehumidifying, the proportion of the cooling state among all the historical running states is calculated.
According to the current instruction information, counting the execution time length of each piece of historical instruction information which is the same as the current instruction information, and calculating the average value of the execution time length of each piece of historical instruction information to obtain a first check value, wherein the larger the average value is, the higher the confidence of the current instruction information is.
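Putting the four comparisons together, step 405 might compute the check values as in the sketch below; the field names ("source", "voiceprint", "state", "execution_seconds") are assumptions for illustration.

```python
import numpy as np

def check_values(current: dict, matches: list[dict]) -> dict:
    # matches: history entries whose instruction equals the current one.
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return {
        # third check: mean cosine similarity of sound source localization
        "source": np.mean([cos(current["source"], m["source"]) for m in matches]),
        # fourth check: mean Euclidean distance of voiceprints (smaller is better)
        "voiceprint": np.mean([np.linalg.norm(current["voiceprint"] - m["voiceprint"])
                               for m in matches]),
        # second check: share of matching history in the same running state
        "state": np.mean([m["state"] == current["state"] for m in matches]),
        # first check: mean execution duration of the matching history
        "duration": np.mean([m["execution_seconds"] for m in matches]),
    }
```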
Step 406, determining whether to trigger the execution of the current instruction according to the voting result of each checking result.
As an example, the first voting result is given valid if the first check value is greater than a set first threshold value, the second voting result is given valid if the second check value is greater than a set second threshold value, the third voting result is given valid if the third check value is greater than a set third threshold value, and the fourth voting result is given valid if the fourth check value is smaller than a set fourth threshold value (a smaller average Euclidean distance means higher confidence);
and counting all effective voting results, triggering the execution of the current instruction under the condition that the counted effective voting results are larger than a set threshold value, recording time interval information between the instruction executed by the current interaction and the instruction executed by the last interaction, otherwise, not triggering the execution of the current instruction, and ending the current interaction.
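The voting rule of step 406 then reduces to a count of valid votes, sketched below with all threshold values assumed; note that the voiceprint vote uses "smaller than", since a smaller average Euclidean distance indicates higher confidence.

```python
THRESHOLDS = {"duration": 60.0, "state": 0.5, "source": 0.8, "voiceprint": 12.0}

def should_execute(v: dict, vote_threshold: int = 2) -> bool:
    votes = [
        v["duration"] > THRESHOLDS["duration"],     # first voting result
        v["state"] > THRESHOLDS["state"],           # second voting result
        v["source"] > THRESHOLDS["source"],         # third voting result
        v["voiceprint"] < THRESHOLDS["voiceprint"], # fourth: small distance wins
    ]
    return sum(votes) > vote_threshold   # more valid votes than the set number
```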
Step 407, according to the recorded time interval information between the current interaction and the last interaction, judging whether the time interval is greater than the set interval threshold,
if so, the last executed instruction was expected by the user; further, the interaction information recorded by the last interaction is marked as a positive sample, and the execution result of the last executed instruction is recorded, for example, the time interval information between the current interaction and the last interaction is taken as the execution duration of the last executed instruction,
otherwise, the instruction generated by the last interaction was not expected by the user; the interaction information recorded by the last interaction is marked as a negative sample and the time interval information between the current interaction and the last interaction is recorded as the execution duration of the last executed instruction, or the interaction information recorded by the last interaction is deleted.
For example: after the instruction of turning on the television is executed, the user finds it was a wrong instruction and immediately instructs the television to turn off to correct it. In this interaction, the time interval between the turn-on instruction and the turn-off instruction is relatively short, which means that the turn-on instruction was a false operation and should be suppressed when the same situation is encountered later.
It should be understood that the verification of the current instruction by using the historical sound source positioning information and/or the historical voiceprint information can be used in the initial stage of interaction, and when the voice man-machine interaction reaches the set times or duration or the correctness of the voice man-machine interaction reaches the expectation, the verification of the current instruction by using the historical sound source positioning information and/or the historical voiceprint information can be omitted.
The embodiment checks the current instruction through various data in the history interaction information, is favorable for inhibiting error interaction and improves the reliability and accuracy of interaction.
It should be understood that another implementation manner of the steps 405 to 406 is as follows:
and step 405', checking the current sound source positioning information, the current voiceprint information, the current running state information and the current instruction information through the trained neural network model.
Step 406', determining whether to trigger execution of the current instruction according to the check result.
As an example, a neural network model for information verification is trained by using the recorded positive sample and negative sample data, so as to obtain a trained neural network model, and the model can verify the current sound source positioning information, the current voiceprint information, the current running state information and the current instruction information at the same time, and trigger the execution of the current instruction according to the verification result output by the neural network model.
As another example, training a first neural network model for sound source localization information verification, a second neural network model for voiceprint information verification, a third neural network model for running state information verification, and a fourth neural network model for current instruction information self verification by using the recorded positive sample and negative sample data respectively to obtain each trained neural network model; and determining whether to trigger the execution of the current instruction according to the verification result output by each trained neural network model or the weighted result of the verification result output by each trained neural network model.
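A hedged sketch of this per-signal variant: one small binary classifier per verification signal, which would be trained with a binary cross-entropy loss on the recorded positive and negative samples (training loop not shown); the feature dimensions and weights are assumptions.

```python
import torch
from torch import nn

def make_verifier(in_dim: int) -> nn.Module:
    # Binary classifier emitting a confidence in [0, 1] for one signal.
    return nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(),
                         nn.Linear(32, 1), nn.Sigmoid())

VERIFIERS = {
    "source": make_verifier(3),        # sound source localization features
    "voiceprint": make_verifier(128),  # voiceprint embedding
    "state": make_verifier(8),         # one-hot running state
    "instruction": make_verifier(16),  # instruction features, e.g. duration stats
}
WEIGHTS = {"source": 0.2, "voiceprint": 0.3, "state": 0.2, "instruction": 0.3}

def weighted_score(features: dict[str, torch.Tensor]) -> float:
    # Weighted combination of the four verification outputs (step 406').
    return sum(WEIGHTS[k] * VERIFIERS[k](features[k]).item() for k in VERIFIERS)
```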
As one example, the neural network model may be a classifier.

Referring to fig. 5, fig. 5 is a schematic diagram of a voice man-machine interaction device according to an embodiment of the application. The device comprises:
a detection module for acquiring current voice signal and detecting voice, and performing voice content recognition on the detected voice under the condition that the voice is detected, and generating current instruction based on the included instruction word under the condition that the recognized voice content includes the instruction word in the current allowed instruction word set, wherein the current allowed instruction word set is determined according to the current running state of the controlled equipment,
the verification module is used for verifying the current instruction by utilizing the historical instruction information which is the same as the current instruction information in the historical interaction information, wherein the historical interaction information comprises: instruction information generated in the process of voice man-machine interaction in the past,
and the determining module is used for determining whether to execute the current instruction according to the checking result.
As an example, the apparatus further comprises:
and the recording module is used for recording the interaction information of each interaction.
As an example, the detection module includes:
a voice recognition sub-module for performing voice content recognition on the detected voice,
an operating state sub-module for obtaining the current operating state of the controlled equipment,
an instruction detection sub-module for detecting whether instruction words in the currently allowed instruction word set are included in the recognized voice content,
the detection module further comprises:
a sound source positioning sub-module for performing sound source positioning on the current voice signal to obtain sound source positioning information, and/or
a voiceprint sub-module for obtaining voiceprint information of the current voice signal,
the instruction detection sub-module is used for inputting the sound source positioning information of the sound source positioning sub-module, the voiceprint information of the voiceprint module sub-module and the running state information of the running state sub-module to the verification module under the condition that the recognized voice content comprises the instruction words in the current allowed instruction word set,
the recording module includes:
a sound source positioning information recording sub-module for recording sound source positioning information and its identification information in every interactive process,
a voiceprint information recording sub-module for recording voiceprint information and identification information thereof in each interactive process,
an execution time length recording sub-module for recording the execution time length of the instruction and the identification information thereof,
an operation state recording sub-module for recording the operation state of the controlled equipment and the identification information thereof in each interaction process,
an execution instruction recording sub-module for recording the instruction information and the identification information thereof executed in each interactive process,
the verification module comprises:
a searching sub-module for searching the historical instruction information which is the same as the current instruction information in the historical interaction information by taking the current instruction information as a searching basis,
an execution time length information verification sub-module for verifying the current instruction information by utilizing the searched historical execution time length of the historical instruction information to obtain a first verification result,
an operation state information checking sub-module for checking the current state information of the current instruction information by utilizing the searched history operation state information corresponding to the history instruction information to obtain a second checking result,
a sound source positioning information verification sub-module for verifying the sound source positioning information of the current instruction information by utilizing the historical sound source positioning information corresponding to the searched historical instruction information to obtain a third verification result,
a voiceprint information verification sub-module for verifying the voiceprint information of the current instruction information by utilizing the historical voiceprint information corresponding to the searched historical instruction information to obtain a fourth verification result,
The verification module further comprises:
and the voting submodule is used for determining voting results of all the verification results.
Referring to fig. 6, fig. 6 is another schematic diagram of a voice man-machine interaction device according to an embodiment of the present application, in which a dashed line indicates use of the voice man-machine interaction device in model training. In this embodiment, the verification module is configured to verify the current instruction information, the running state, the sound source positioning information, and the voiceprint information by using the trained neural network model for information verification.
The neural network model is trained with the positive and negative sample data recorded by the recording module. The training may be performed periodically, e.g., at set intervals, or aperiodically, e.g., triggered by a set event.
Referring to fig. 7, fig. 7 is another schematic diagram of a voice man-machine interaction device according to an embodiment of the present application, in which a dashed line indicates use in model training. In this embodiment, the verification module includes:
a first neural network model sub-module for verifying sound source localization information of current instruction information,
a second neural network model sub-module for verifying voiceprint information of the current instruction information,
a third neural network model sub-module for checking the current state of the current instruction information,
The fourth neural network model submodule is used for checking the current instruction information;
and the determining module is used for determining whether to execute the current instruction according to the verification result or the weighted result of the verification result output by each neural network model.
The neural network models can respectively train by utilizing positive and negative sample data recorded by each recording module, for example, train the first neural network model sub-module by utilizing sound source positioning positive and negative sample data recorded by the sound source positioning information recording sub-module, train the second neural network model sub-module by utilizing sound print positive and negative sample data recorded by the sound print information recording sub-module, train the third neural network model sub-module by utilizing state positive and negative sample data recorded by the running state recording sub-module, and train the fourth neural network model sub-module by utilizing execution duration positive and negative sample data recorded by the execution duration recording sub-module.
Each training may be performed with a period set separately, or may be performed aperiodically.
Referring to fig. 8, fig. 8 is another schematic diagram of a voice man-machine interaction device according to an embodiment of the application. The device comprises a memory and a processor, wherein the memory stores a computer program, and the processor is configured to execute the computer program to realize the steps of the voice man-machine interaction method according to the embodiment of the application.
The memory may include Random Access Memory (RAM) or Non-Volatile Memory (NVM), such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc.; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
The embodiment of the application also provides a computer readable storage medium, wherein the storage medium stores a computer program, and the computer program realizes the steps of the voice man-machine interaction method according to the embodiment of the application when being executed by a processor.
For the apparatus/network-side device/storage medium embodiments, since they are substantially similar to the method embodiments, the description is relatively brief; for relevant points, refer to the description of the method embodiments.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention; any modification, equivalent replacement, improvement or the like made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (10)

1. A voice man-machine interaction method is characterized by comprising the following steps:
acquiring a current voice signal and performing voice detection, and performing voice content recognition on the detected voice in the case that voice is detected, wherein the current voice signal does not comprise any wake-up word,
in the case that the recognized speech content includes instruction words in a currently allowed instruction word set, generating a current instruction based on the included instruction words, wherein the currently allowed instruction word set is determined according to a current operation state of the controlled device,
and checking the current instruction by utilizing the historical instruction information which is the same as the current instruction information in the historical interaction information, wherein the historical interaction information comprises: instruction information generated in the process of voice man-machine interaction in the past,
determining whether to execute the current instruction according to the first checking result;
wherein,
the verifying the current instruction by using the historical instruction information which is the same as the current instruction information in the historical interaction information comprises the following steps:
searching the historical instruction information which is the same as the current instruction information in the historical interaction information by taking the current instruction information as a searching basis,
and checking the current instruction information by utilizing the historical execution time length of the searched historical instruction information to obtain a first checking result.
2. The voice man-machine interaction method of claim 1, wherein,
the determining whether to execute the current instruction according to the first checking result comprises the following steps:
determining whether to execute the current instruction according to the first checking result, and recording the current instruction information as the interactive information;
the recording the current instruction information as the current interaction information comprises the following steps:
recording the current instruction information and the execution result of the current instruction as the interactive information,
wherein,
the execution result of the current instruction at least comprises: the execution duration in the case where the current instruction is executed;
and the execution time length is determined according to the time interval between the instruction executed by the current interaction and the instruction executed by the last interaction.
3. The voice man-machine interaction method of claim 2, wherein verifying the current instruction using the historical instruction information that is identical to the current instruction information further comprises:
verifying the current state information in the current instruction information using the historical operating state information corresponding to the found historical instruction information, to obtain a second verification result;
and wherein determining whether to execute the current instruction according to the first verification result further comprises:
determining whether to execute the current instruction according to a voting result over the first verification result and the second verification result,
and recording the current operating state as the current interaction information.
4. The voice man-machine interaction method of claim 2 or 3, further comprising, when voice is detected:
performing sound source localization on the current voice signal to obtain current sound source localization information; and/or
acquiring current voiceprint information of the current voice signal;
wherein verifying the current instruction using the historical instruction information that is identical to the current instruction information further comprises:
verifying the current instruction information using the historical sound source localization information corresponding to the found historical instruction information, to obtain a third verification result; and/or
verifying the current instruction information using the historical voiceprint information corresponding to the found historical instruction information, to obtain a fourth verification result;
and wherein determining whether to execute the current instruction according to the first verification result further comprises:
determining whether to execute the current instruction according to a voting result over the respective verification results,
and recording the current sound source localization information and/or the current voiceprint information as the current interaction information.
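As a toy illustration of obtaining the "current sound source localization information" of claim 4, the sketch below estimates a direction of arrival from two microphone channels via cross-correlation (TDOA). Real devices typically use larger microphone arrays and more robust estimators; the two-channel setup and all parameters here are assumptions for the example.

```python
import numpy as np

def estimate_direction(mic_a: np.ndarray, mic_b: np.ndarray,
                       sample_rate: int, mic_distance_m: float) -> float:
    """Estimate a source angle (degrees) from the time difference of
    arrival between two microphone channels."""
    corr = np.correlate(mic_a, mic_b, mode="full")
    lag = int(np.argmax(corr)) - (len(mic_b) - 1)  # samples of delay
    tdoa = lag / sample_rate                       # seconds of delay
    speed_of_sound = 343.0                         # m/s at room temperature
    # Clip to the valid arcsin domain to guard against noisy estimates.
    ratio = np.clip(tdoa * speed_of_sound / mic_distance_m, -1.0, 1.0)
    return float(np.degrees(np.arcsin(ratio)))
```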
5. The voice man-machine interaction method of claim 4, wherein:
verifying the current instruction information using the historical operating state information corresponding to the found historical instruction information comprises: calculating the proportion of historical operating state information identical to the current state information among all historical operating states of the found historical instruction information, wherein a larger proportion indicates a higher confidence in the current instruction information;
verifying the current instruction information using the execution durations of the found historical instruction information comprises: computing the mean of the execution durations of the found historical instruction information, wherein a larger mean indicates a higher confidence in the current instruction information;
verifying the current instruction information using the historical sound source localization information corresponding to the found historical instruction information comprises: computing the similarity between the current sound source localization information and each piece of historical sound source localization information, and computing the mean of the similarities, wherein a larger mean indicates a higher confidence in the current instruction information;
and verifying the current instruction information using the historical voiceprint information corresponding to the found historical instruction information comprises: computing the Euclidean distance between the current voiceprint information and each piece of historical voiceprint information, and computing the mean of the Euclidean distances, wherein a smaller mean indicates a higher confidence in the current instruction information.
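The four confidence measures of claim 5 reduce to simple statistics. A minimal sketch follows, with cosine similarity standing in for the unspecified "similarity" of the sound-source check (an assumption) and NumPy vectors standing in for the voiceprint embeddings.

```python
import numpy as np

def state_ratio(current_state: str, past_states: list[str]) -> float:
    # Second check: share of past runs of this instruction issued while the
    # device was in the same operating state; higher means more confident.
    if not past_states:
        return 0.0
    return past_states.count(current_state) / len(past_states)

def mean_duration(durations: list[float]) -> float:
    # First check: mean past execution duration; higher means more confident.
    return float(np.mean(durations))

def mean_localization_similarity(current_loc: np.ndarray,
                                 past_locs: list[np.ndarray]) -> float:
    # Third check: mean cosine similarity between the current localization
    # vector and past ones; higher means more confident.
    sims = [np.dot(current_loc, p) /
            (np.linalg.norm(current_loc) * np.linalg.norm(p))
            for p in past_locs]
    return float(np.mean(sims))

def mean_voiceprint_distance(current_vp: np.ndarray,
                             past_vps: list[np.ndarray]) -> float:
    # Fourth check: mean Euclidean distance to past voiceprint embeddings;
    # LOWER means more confident (same speaker => nearby embeddings).
    return float(np.mean([np.linalg.norm(current_vp - p) for p in past_vps]))
```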
6. The voice man-machine interaction method of claim 4, wherein determining whether to execute the current instruction according to the voting result over the respective verification results comprises:
granting a valid first vote to the first verification result if the mean of the execution durations is greater than a set first threshold,
granting a valid third vote to the third verification result if the mean of the similarities is greater than a set second threshold, and/or granting a valid fourth vote to the fourth verification result if the mean of the Euclidean distances is smaller than a set third threshold,
and counting the valid votes, and triggering execution of the current instruction when the number of valid votes is greater than a set count threshold.
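A compact sketch of the vote aggregation in claim 6: each check that clears its threshold casts one valid vote, and the instruction runs only when the vote count exceeds the count threshold. All threshold values below are illustrative placeholders.

```python
def should_execute(duration_mean: float, similarity_mean: float,
                   distance_mean: float,
                   t_duration: float = 30.0,
                   t_similarity: float = 0.8,
                   t_distance: float = 1.0,
                   vote_threshold: int = 1) -> bool:
    """Aggregate the per-check votes and decide whether to execute."""
    votes = 0
    votes += duration_mean > t_duration      # first check: long past executions
    votes += similarity_mean > t_similarity  # third check: familiar location
    votes += distance_mean < t_distance      # fourth check: familiar voice
    return votes > vote_threshold
```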
7. The voice man-machine interaction method of claim 2, wherein determining whether to execute the current instruction according to the voting result over the first verification result and the second verification result comprises:
granting a valid first vote to the first verification result if the mean of the execution durations is greater than a set first threshold,
granting a valid second vote to the second verification result if the proportion of the current state among all historical operating states is greater than a set second threshold,
and counting the valid votes, and triggering execution of the current instruction when the number of valid votes is greater than a set count threshold.
8. The voice man-machine interaction method of claim 1, further comprising:
checking whether the time interval between the instruction executed in the current interaction and the instruction executed in the previous interaction is greater than a set interval threshold;
if so, taking the time interval as the execution duration of the previous instruction and labeling the recorded previous interaction information as a positive sample;
otherwise, taking the time interval as the execution duration of the previous instruction and labeling the recorded previous interaction information as a negative sample, or deleting the previous interaction information;
training a neural network model for information verification using the recorded positive and negative samples to obtain a trained neural network model, or separately training at least one of a first neural network model for sound source localization verification, a second neural network model for voiceprint verification, a third neural network model for operating state verification, and a fourth neural network model for self-verification of the current instruction information, to obtain the respective trained neural network models;
wherein verifying the current instruction using the historical instruction information that is identical to the current instruction information comprises:
verifying at least one of the current sound source localization information, the current voiceprint information, the current operating state information, and the current instruction information with the corresponding trained neural network model.
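To make the training step of claim 8 concrete, the sketch below labels samples by the inter-instruction interval and trains a small verification network. The PyTorch model shape, the 20-second threshold, and the feature layout are assumptions for the example, not the patent's specification.

```python
import torch
import torch.nn as nn

def label_sample(interval_s: float, threshold_s: float = 20.0) -> int:
    # Claim 8's labeling rule: a long gap before the next instruction means
    # the previous instruction was accepted (positive sample); a short gap
    # suggests the user immediately corrected it (negative sample).
    return 1 if interval_s > threshold_s else 0

class Verifier(nn.Module):
    """Maps a feature vector (e.g. concatenated sound-source, voiceprint,
    and operating-state features) to a validity logit."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(),
                                 nn.Linear(32, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def train(model: Verifier, feats: torch.Tensor, labels: torch.Tensor,
          epochs: int = 10) -> None:
    # Standard binary-classification loop over the labeled interaction records.
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(feats), labels.float())
        loss.backward()
        opt.step()
```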
9. A voice man-machine interaction device, comprising:
a detection module configured to acquire a current voice signal and perform voice detection, perform voice content recognition on the detected voice when voice is detected, and generate a current instruction based on the included instruction word when the recognized voice content includes an instruction word from the currently allowed instruction word set, wherein the current voice signal does not include any wake-up word, and the currently allowed instruction word set is determined according to the current operating state of the controlled device;
a verification module configured to verify the current instruction using the historical instruction information in the historical interaction information that is identical to the current instruction information, wherein the historical interaction information comprises instruction information generated during past voice man-machine interactions;
and a determination module configured to determine whether to execute the current instruction according to the verification result;
wherein the verification module is configured to: search the historical interaction information for historical instruction information identical to the current instruction information, with the current instruction information as the search key,
and verify the current instruction information using the historical execution durations of the found historical instruction information, to obtain a first verification result.
10. An electronic device, comprising a memory storing a computer program and a processor configured to execute the computer program to perform the steps of the voice man-machine interaction method of any one of claims 1 to 8.
CN202310843070.2A 2023-07-11 2023-07-11 Voice man-machine interaction method and device Active CN116564316B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310843070.2A CN116564316B (en) 2023-07-11 2023-07-11 Voice man-machine interaction method and device

Publications (2)

Publication Number Publication Date
CN116564316A (en) 2023-08-08
CN116564316B (en) 2023-11-03

Family

ID=87490190

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310843070.2A Active CN116564316B (en) 2023-07-11 2023-07-11 Voice man-machine interaction method and device

Country Status (1)

Country Link
CN (1) CN116564316B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110054899A1 (en) * 2007-03-07 2011-03-03 Phillips Michael S Command and control utilizing content information in a mobile voice-to-speech application

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111949240A (en) * 2019-05-16 2020-11-17 阿里巴巴集团控股有限公司 Interaction method, storage medium, service program, and device
CN111312230A (en) * 2019-11-27 2020-06-19 南京创维信息技术研究院有限公司 Voice interaction monitoring method and device for voice conversation platform
CN112164400A (en) * 2020-09-18 2021-01-01 广州小鹏汽车科技有限公司 Voice interaction method, server and computer-readable storage medium
CN112164401A (en) * 2020-09-18 2021-01-01 广州小鹏汽车科技有限公司 Voice interaction method, server and computer-readable storage medium
WO2022057152A1 (en) * 2020-09-18 2022-03-24 广州橙行智动汽车科技有限公司 Voice interaction method, server, and computer-readable storage medium
CN112201246A (en) * 2020-11-19 2021-01-08 深圳市欧瑞博科技股份有限公司 Intelligent control method and device based on voice, electronic equipment and storage medium
CN115731923A (en) * 2021-08-26 2023-03-03 华为技术有限公司 Command word response method, control equipment and device
CN113656679A (en) * 2021-08-27 2021-11-16 支付宝(杭州)信息技术有限公司 User searching method and device
CN114172997A (en) * 2021-11-11 2022-03-11 Oppo广东移动通信有限公司 Voice interaction method and device, electronic equipment and computer readable storage medium
CN114155854A (en) * 2021-12-13 2022-03-08 海信视像科技股份有限公司 Voice data processing method and device

Also Published As

Publication number Publication date
CN116564316A (en) 2023-08-08

Similar Documents

Publication Publication Date Title
US10013985B2 (en) Systems and methods for audio command recognition with speaker authentication
KR101699720B1 (en) Apparatus for voice command recognition and method thereof
CN108538293B (en) Voice awakening method and device and intelligent device
US20160180838A1 (en) User specified keyword spotting using long short term memory neural network feature extractor
US11430449B2 (en) Voice-controlled management of user profiles
EP3682443B1 (en) Voice-controlled management of user profiles
CN110675862A (en) Corpus acquisition method, electronic device and storage medium
CN109410952A (en) A kind of voice awakening method, apparatus and system
CN111161728B (en) Awakening method, awakening device, awakening equipment and awakening medium of intelligent equipment
JP7516571B2 (en) Hotword threshold auto-tuning
CN110544468B (en) Application awakening method and device, storage medium and electronic equipment
US10950221B2 (en) Keyword confirmation method and apparatus
US20240013784A1 (en) Speaker recognition adaptation
CN117636872A (en) Audio processing method, device, electronic equipment and readable storage medium
CN110718217B (en) Control method, terminal and computer readable storage medium
CN116648743A (en) Adapting hotword recognition based on personalized negation
CN110580897B (en) Audio verification method and device, storage medium and electronic equipment
CN112185425B (en) Audio signal processing method, device, equipment and storage medium
CN118020100A (en) Voice data processing method and device
CN116564316B (en) Voice man-machine interaction method and device
CN115881126B (en) Switch control method and device based on voice recognition and switch equipment
US20230113883A1 (en) Digital Signal Processor-Based Continued Conversation
CN112037772B (en) Response obligation detection method, system and device based on multiple modes
CN110334244B (en) Data processing method and device and electronic equipment
CN111078890B (en) Raw word collection method and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant