CN114678019A - Intelligent device interaction method and device, storage medium and electronic device - Google Patents

Intelligent device interaction method and device, storage medium and electronic device

Info

Publication number
CN114678019A
CN114678019A (application CN202210178420.3A)
Authority
CN
China
Prior art keywords
awakening
content
voice
data
lip
Prior art date
Legal status
Pending
Application number
CN202210178420.3A
Other languages
Chinese (zh)
Inventor
廖柏锠
廖加威
任晓华
黄晓琳
赵慧斌
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210178420.3A
Publication of CN114678019A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/24 Speech recognition using non-acoustical features
    • G10L15/25 Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G10L2015/088 Word spotting
    • G10L2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The disclosure provides an intelligent device interaction method, an intelligent device interaction device, a storage medium and an electronic device, and relates to the technical field of intelligent recognition, in particular to the technical field of voice recognition and image recognition. The specific implementation scheme is as follows: detecting an environment volume value of an environment where a target device is located; if the environment volume value reaches a preset threshold value, executing auxiliary identification in the process of identifying the awakening keyword to obtain auxiliary identification data, wherein the awakening keyword is used for awakening the target equipment; and if the target equipment cannot be awakened by adopting the awakening keyword, awakening the target equipment according to the auxiliary identification data.

Description

Intelligent device interaction method and device, storage medium and electronic device
Technical Field
The present disclosure relates to the field of intelligent recognition technologies, and in particular, to the field of voice recognition and image recognition technologies, and in particular, to an intelligent device interaction method and apparatus, a storage medium, and an electronic device.
Background
At present, voice wake-up is used in many fields to wake smart devices, but in some cases, for example in a noisy public area, nearby noise may prevent the voice interaction information from being recognized, so the smart device cannot be woken up. In addition, it is also difficult for users with speech impairments or strong regional accents to wake the smart device effectively.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The disclosure provides an intelligent device interaction method and apparatus, a storage medium, and an electronic device.
According to an aspect of the present disclosure, there is provided an intelligent device interaction method, including: detecting an environmental volume value of an environment where a target device is located; if the environment volume value is detected to reach a preset threshold value, executing auxiliary identification in the process of identifying an awakening keyword to obtain auxiliary identification data, wherein the awakening keyword is used for awakening the target equipment; and if the target equipment cannot be awakened by adopting the awakening keyword, awakening the target equipment according to the auxiliary identification data.
According to another aspect of the present disclosure, there is provided a smart device interaction apparatus, including: the detection module is used for detecting the environmental volume value of the environment where the target equipment is located; the acquisition module is used for executing auxiliary identification in the process of identifying the awakening keyword to obtain auxiliary identification data if the environment volume value is detected to reach a preset threshold value, wherein the awakening keyword is used for awakening the target equipment; and the awakening module is used for awakening the target equipment according to the auxiliary identification data if the target equipment cannot be awakened by adopting the awakening keyword.
According to another aspect of the present disclosure, in the above apparatus, the auxiliary identification at least includes: lip language identification; and the acquisition module includes: a first obtaining sub-module, configured to obtain the awakening keyword by identifying the voice content output by the awakening object; a second obtaining sub-module, configured to identify a facial region of the awakening object to obtain facial feature information, where the facial feature information at least includes: lip-shape features when labial sounds are produced; and a third obtaining submodule, configured to perform the lip language recognition according to the lip-shape features to obtain the auxiliary identification data.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform any one of the smart device interaction methods.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform any one of the smart device interaction methods described above.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements any of the above-described smart device interaction methods.
According to another aspect of the present disclosure, a smart device interaction product is provided, comprising an electronic device as described above.
The embodiment of the disclosure can improve the awakening efficiency of the intelligent device.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
Fig. 1 is a flowchart of a smart device interaction method according to a first embodiment of the present disclosure;
FIG. 2 is a flow chart of an alternative smart device interaction method according to a first embodiment of the present disclosure;
FIG. 3 is a flow chart of another alternative smart device interaction method according to a first embodiment of the present disclosure;
FIG. 4 is a flow chart of another alternative smart device interaction method according to the first embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of an intelligent device interaction apparatus according to a second embodiment of the present disclosure;
Fig. 6 is a block diagram of an electronic device for implementing the smart device interaction method according to the first embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of embodiments of the present disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in other sequences than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
In the prior art, fixed voice wake-up words, such as "Xiaodu Xiaodu" or "Hey Siri", are mainly used to wake the smart device, and the smart device can also be woken by tapping a specific button on the screen. However, when fixed voice wake-up words are used, the content spoken by the user is recognized only acoustically during sound pickup, and the accuracy needs to be improved. In a noisy environment, or when the user's pronunciation is inaccurate (for example, accented), the sound-pickup effect degrades, and when multiple people speak at the same time, the voice content of a single user cannot be accurately recognized, so the smart device cannot be woken up.
In accordance with an embodiment of the present disclosure, an embodiment of an intelligent device interaction method is provided. It is noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system, such as one executing a set of computer-executable instructions, and that although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in a different order.
Fig. 1 is a flowchart of a smart device interaction method according to a first embodiment of the present disclosure, as shown in fig. 1, the method includes the following steps:
step S102, detecting an ambient volume value of the environment where the target device is located;
step S104, if the ambient volume value is detected to reach a preset threshold, performing auxiliary identification in the process of identifying the awakening keyword to obtain auxiliary identification data, where the awakening keyword is used for awakening the target device;
and step S106, if the target device cannot be awakened by using the awakening keyword, awakening the target device according to the auxiliary identification data.
It can be understood that whether the target device is currently in a noisy environment is determined by detecting whether the ambient volume value reaches the preset threshold.
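To make the threshold check concrete, here is a minimal Python sketch (an illustration, not part of the patent's disclosure) of steps S102 and S104, assuming 16-bit PCM microphone frames and a hypothetical preset threshold:

```python
import numpy as np

# Hypothetical preset threshold on a 20*log10(RMS) scale; the patent does not
# specify a concrete value or unit.
AMBIENT_DB_THRESHOLD = 65.0

def ambient_volume_db(pcm_frame: np.ndarray) -> float:
    """Step S102: estimate the ambient volume of one 16-bit PCM frame in dB."""
    rms = np.sqrt(np.mean(pcm_frame.astype(np.float64) ** 2))
    return 20.0 * np.log10(max(rms, 1.0))  # floor avoids log10(0) on silence

def noisy_environment(pcm_frame: np.ndarray) -> bool:
    """Step S104 trigger: enable auxiliary (lip) recognition when noisy."""
    return ambient_volume_db(pcm_frame) >= AMBIENT_DB_THRESHOLD
```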
Optionally, the target device may be, but is not limited to, a smart device equipped with a voice wake-up apparatus, such as an intelligent robot (e.g., an indoor and outdoor delivery robot, a multi-compartment robot, a meal delivery robot, etc.), a smart phone, a smart tablet, or a smart car.
Optionally, the awakening keyword is used to wake up the target device, for example, "Xiaodu Xiaodu" or "Hey Siri"; the auxiliary recognition may be, but is not limited to, lip language recognition; and the auxiliary identification data may be, but is not limited to, lip language identification data.
It should be noted that when the ambient volume value of the environment where the target device is located is detected to reach the preset threshold, the target device is currently in a noisy environment, and the awakening keyword cannot be accurately identified by voice recognition alone. Therefore, auxiliary identification, such as lip language recognition, is further performed in the process of identifying the awakening keyword to obtain auxiliary identification data, and the target device is awakened according to the auxiliary identification data, so that the awakening keyword is accurately identified, the wake-up rate of the target device is improved, and the user experience is improved.
In the embodiment of the disclosure, the ambient volume value of the environment where the target device is located is detected; if the ambient volume value is detected to reach a preset threshold, auxiliary identification is performed in the process of identifying the awakening keyword to obtain auxiliary identification data, where the awakening keyword is used for awakening the target device; and if the target device cannot be awakened using the awakening keyword, the target device is awakened according to the auxiliary identification data. This achieves the purpose of awakening the target device through multiple interaction modes, thereby improving the recognition accuracy of the interaction information and the awakening efficiency, and solving the technical problems of low recognition accuracy and poor awakening effect in prior-art methods that wake the smart device through voice interaction alone.
As an alternative embodiment, fig. 2 is a flowchart of an alternative intelligent device interaction method according to the first embodiment of the present disclosure. As shown in fig. 2, the auxiliary identification at least includes lip language recognition, and performing auxiliary identification in the process of identifying the awakening keyword to obtain the auxiliary identification data includes:
step S202, obtaining the awakening keyword by identifying the voice content output by the awakening object;
step S204, identifying the facial area of the awakening object to obtain facial feature information;
step S206, performing the lip language recognition according to the lip-shape features to obtain the auxiliary identification data.
Optionally, the facial feature information at least includes lip-shape features produced when labial sounds are uttered. For example, when the awakening keyword is "Xiaodu Xiaodu", the user's mouth shape changes from a flat, spread shape to a rounded "o" shape during pronunciation, repeated twice; this change is used as a basis for awakening the target device.
It should be noted that, in the prior art, the target device is awakened only by obtaining the awakening keyword through voice recognition, and because the recognition accuracy of the awakening keyword is low, the target device is difficult to awaken. In contrast, obtaining the awakening keyword by recognizing the voice content output by the awakening object, identifying the facial area of the awakening object to obtain facial feature information, and performing lip language recognition according to the lip-shape features to obtain the auxiliary identification data expands the ways of awakening the target device, thereby increasing the possibility of awakening it.
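As a rough sketch of how steps S202 to S206 might fit together (the `asr`, `face_detector`, and `lip_reader` interfaces are assumptions, not components named by the patent):

```python
def assisted_wake_attempt(audio, video_frames, asr, face_detector, lip_reader,
                          wake_keyword="Xiaodu Xiaodu"):
    """Combine voice recognition with lip language recognition for wake-up."""
    heard_text = asr.transcribe(audio)                                  # step S202
    lip_regions = [face_detector.lip_region(f) for f in video_frames]   # step S204
    lip_text = lip_reader.decode(lip_regions)                           # step S206
    keyword = wake_keyword.lower()
    # Fallback of step S106: the auxiliary identification data (lip_text)
    # can wake the device when the keyword is not recognized from audio alone.
    return keyword in heard_text.lower() or keyword in lip_text.lower()
```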
In an optional embodiment, the waking up the target device according to the auxiliary identification data includes:
step S302, determining a wake-up auxiliary word for waking up the target device according to the auxiliary identification data;
step S304, awakening the target device by using the wake-up auxiliary word.
It should be noted that, in the prior art, the target device is awakened only by obtaining the awakening keyword through voice recognition, and because the recognition accuracy of the awakening keyword is low, the target device is difficult to awaken. Awakening the target device with the wake-up auxiliary word expands the ways of awakening the target device and further increases the possibility of awakening it.
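Continuing the sketch, steps S302 to S304 might map the lip-reading output onto a configured wake-up auxiliary word (the word list and the `device.wake` interface are assumptions):

```python
def try_auxiliary_wake(lip_text: str, device, auxiliary_words=("xiaodu xiaodu",)):
    """Step S302: derive a wake-up auxiliary word from the auxiliary
    identification data; step S304: wake the device with it."""
    for word in auxiliary_words:
        if word in lip_text.lower():
            device.wake(trigger=word)  # hypothetical device interface
            return True
    return False
```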
As an alternative embodiment, fig. 3 is a flowchart of another alternative intelligent device interaction method according to the first embodiment of the present disclosure, and as shown in fig. 3, after waking up the target device according to the auxiliary identification data, the method further includes:
step S402, performing sound reception processing on first voice content output by the awakening object by using the target device to obtain first sound reception data;
step S404, identifying a first lip-shape feature of the awakening object when the awakening object outputs the first voice content to obtain first lip language data;
step S406, respectively identifying a first text content corresponding to the first sound reception data and a second text content corresponding to the first lip language data to obtain a first recognition result;
step S408, determining whether to continuously receive the first voice content output by the awakening object according to the first recognition result.
Optionally, a directional microphone in the target device is used to perform sound reception processing on the first voice content output by the awakening object to obtain the first sound reception data.
Optionally, the second text content is the text content obtained from the first lip language data and matched against the first text content. After the target device is awakened, the directional microphone in the target device receives sound from the awakening object, and whether the lip-shape content is consistent with the received content is detected synchronously.
It should be noted that different characters have different lip-shape features, and different mouth shapes correspond to different labial and labiodental sounds. In the process of converting speech into text, feature matching is performed on the mouth shapes of labial and labiodental sounds, which differ across character pronunciations; for example, the syllable "bo" in "boshi" (doctor) is a labial sound. Through double comparison of the lip-shape features and the voice recognition result, the accuracy of voice recognition can be further improved.
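The "double comparison" can be sketched as a plain text-similarity check; the patent does not specify a matching rule, so the similarity measure and threshold below are assumptions:

```python
from difflib import SequenceMatcher

def texts_consistent(voice_text: str, lip_text: str, threshold: float = 0.8) -> bool:
    """Steps S406/S408 sketch: compare the text recognized from the received
    sound with the text recognized from the lip-shape features."""
    return SequenceMatcher(None, voice_text, lip_text).ratio() >= threshold
```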
In an optional embodiment, the determining whether to continuously receive the first voice content output by the awakened object according to the first recognition result includes:
step S502, if the first recognition result indicates that the first text content is consistent with the second text content, determining to continuously receive the first voice content until the voice recognition process is ended;
step S504, if the first recognition result indicates that the first text content and the second text content are inconsistent, determining that the first voice content does not need to be continuously received, and searching for other speaking objects except the awakening object within a predetermined range.
Optionally, if the first recognition result indicates that the first text content and the second text content are consistent, it is determined that the voice recognition result for the awakening object is accurate, and the voice of the awakening object continues to be received.
Optionally, in the process of continuously receiving the first voice content, lip language data corresponding to the first voice content may be simultaneously obtained and continuously matched, so as to achieve the purpose of accurately obtaining the voice interaction signal sent by the awakening object.
It should be noted that when the first recognition result indicates that the first text content and the second text content are inconsistent, there is a deviation between the first text content obtained through voice recognition and the second text content obtained through lip recognition. In this case, the specific meaning that the awakening object wants to express cannot be accurately determined, so continuous sound reception of the first voice content is unnecessary, and other speaking objects except the awakening object should be searched for within a predetermined range using related devices (such as an image acquisition device).
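Steps S502 to S504 then branch on that comparison. Reusing `texts_consistent` from the sketch above, with `find_other_speakers` standing in for a hypothetical image-acquisition routine:

```python
def handle_first_recognition(voice_text, lip_text, find_other_speakers):
    """Keep receiving sound while the two channels agree (step S502);
    otherwise stop and look for other speaking objects in range (step S504)."""
    if texts_consistent(voice_text, lip_text):
        return "continue_reception"
    others = find_other_speakers()  # e.g. backed by an image acquisition device
    return ("switch_to", others) if others else "no_speaker_found"
```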
As an alternative embodiment, fig. 4 is a flowchart of another alternative smart device interaction method according to the first embodiment of the disclosure, and as shown in fig. 4, after searching for other speaking objects except the awakening object within a predetermined range, the method further includes:
step S602, performing sound reception processing on second voice content output by the other speaking objects to obtain second sound reception data;
step S604, identifying a second mouth-shape characteristic of the other speaking objects when the other speaking objects output the second voice content to obtain second lip language data;
step S606, respectively identifying a third text content corresponding to the second sound reception data and a fourth text content corresponding to the second lip language data to obtain a second recognition result;
step S608, if the second recognition result indicates that the third text content and the fourth text content are consistent, determining to continuously receive the second voice content output by the other speaking objects until the voice recognition process ends.
Optionally, when it is determined that the first text content and the second text content output by the awakening object are inconsistent, other speaking objects except the awakening object are searched for within a predetermined range, and second sound reception data and second lip language data output by the other speaking objects are obtained. When the text contents corresponding to the second sound reception data and the second lip language data (i.e., the third text content and the fourth text content) are consistent, the second voice content output by the other speaking objects continues to be received until the voice recognition process ends, so that the content the speaker wants to express is accurately recognized and the accuracy of voice interaction is improved.
In an optional embodiment, the method further comprises:
step S702, if, after a predetermined time period, both the first recognition result and the second recognition result still indicate inconsistency, displaying a plurality of text contents to be selected on a display interface;
step S704, in response to a click operation of the awakening object or the other speaking objects, selecting the correct expression content from the plurality of text contents to be selected;
step S706, controlling the target device to perform a feedback operation according to the correct expression content.
Optionally, the text content to be selected includes: the first text content, the second text content, the third text content, and the fourth text content.
Optionally, the target device is controlled to perform the feedback operation for the correct expression content by means of voice broadcast or text prompt.
It should be noted that if, after a predetermined time period, neither the first recognition result nor the second recognition result indicates consistency, multiple text contents to be selected (i.e., the first text content, the second text content, the third text content, and the fourth text content) are displayed on a display interface. The user may select the correctly expressed content from the displayed candidates by tapping the display interface, and after obtaining the user's input, the target device feeds back the user's input result by voice broadcast or text prompt, ensuring that the user obtains the desired interaction result and improving the user experience.
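Steps S702 to S706 amount to a disambiguation fallback; in the sketch below, `display` and `announce` are hypothetical device interfaces, and `candidates` holds the four recognized text contents:

```python
def resolve_by_user_choice(candidates, display, announce):
    """Show the candidate texts (step S702), take the user's tap (step S704),
    and feed the choice back by voice broadcast or text prompt (step S706)."""
    display.show_options(candidates)
    chosen = display.wait_for_tap()
    announce(f"Selected: {chosen}")
    return chosen
```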
It should be noted that, for optional or preferred implementations of this embodiment, reference may be made to the related description of the intelligent device interaction method above, and details are not repeated here. In the technical scheme of the disclosure, the acquisition, storage, and application of the personal information of related users all comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
Example 2
According to an embodiment of the present disclosure, an apparatus embodiment for implementing the intelligent device interaction method is further provided, and fig. 5 is a schematic structural diagram of an intelligent device interaction apparatus according to a second embodiment of the present disclosure, as shown in fig. 5, the intelligent device interaction apparatus includes: detection module 40, acquisition module 42, wake-up module 44, wherein:
the detection module 40 is configured to detect an ambient volume value of an environment where the target device is located;
the obtaining module 42 is configured to, if it is detected that the ambient sound volume value reaches a preset threshold, perform auxiliary identification in a process of identifying a wake-up keyword to obtain auxiliary identification data, where the wake-up keyword is used to wake up the target device;
the wake-up module 44 is configured to wake up the target device according to the auxiliary identification data if the target device cannot be woken up by using the wake-up keyword.
In the embodiment of the present disclosure, the detection module 40 is configured to detect an ambient volume value of the environment where the target device is located; the obtaining module 42 is configured to, if it is detected that the ambient volume value reaches a preset threshold, perform auxiliary identification in the process of identifying the awakening keyword to obtain auxiliary identification data, where the awakening keyword is used to wake up the target device; and the awakening module 44 is configured to awaken the target device according to the auxiliary identification data if the target device cannot be awakened using the awakening keyword. This achieves the purpose of awakening the target device through multiple interaction modes, thereby improving the recognition accuracy of the interaction information and the awakening efficiency, and solving the technical problems of low recognition accuracy and poor awakening effect in prior-art methods that wake the smart device through voice interaction alone. It should be noted that the above modules may be implemented by software or hardware; in the latter case, the modules may be located in the same processor, or distributed across different processors in any combination.
It should be noted that the detection module 40, the obtaining module 42, and the awakening module 44 correspond to steps S102 to S106 in embodiment 1; the examples and application scenarios implemented by these modules are the same as those of the corresponding steps, but are not limited to the disclosure of embodiment 1. It should also be noted that the above modules may run in a computer terminal as part of the apparatus.
Optionally, the auxiliary identification at least includes lip language recognition, and the acquisition module includes: a first obtaining sub-module, configured to obtain the awakening keyword by identifying the voice content output by the awakening object; a second obtaining sub-module, configured to identify a facial region of the awakening object to obtain facial feature information, where the facial feature information at least includes lip-shape features when labial sounds are produced; and a third obtaining submodule, configured to perform the lip language recognition according to the lip-shape features to obtain the auxiliary identification data.
Optionally, the wake-up module includes: a first determining module, configured to determine a wake-up auxiliary word for waking up the target device according to the auxiliary identification data; and the first awakening submodule is used for awakening the target equipment by adopting the awakening auxiliary word.
Optionally, the apparatus further comprises: a fourth obtaining submodule, configured to perform sound reception processing on the first voice content output by the awakening object by using the target device to obtain first sound reception data; a fifth obtaining submodule, configured to identify a first lip-shape feature of the awakening object when the first voice content is output, to obtain first lip language data; a sixth obtaining submodule, configured to respectively identify a first text content corresponding to the first sound reception data and a second text content corresponding to the first lip language data, to obtain a first recognition result; and a second determining module, configured to determine whether to continuously receive the first voice content output by the awakening object according to the first recognition result.
Optionally, the second determining module includes: a third determining module, configured to determine to continuously receive the first voice content until the voice recognition process ends if the first recognition result indicates that the first text content and the second text content are consistent; and a fourth determining module, configured to determine that continuous sound reception of the first voice content is not required, and to search for other speaking objects except the awakening object within a predetermined range, if the first recognition result indicates that the first text content and the second text content are inconsistent.
Optionally, the apparatus further comprises: a seventh obtaining submodule, configured to perform sound reception processing on the second voice content output by the other speaking objects to obtain second sound reception data; an eighth obtaining submodule, configured to identify a second mouth-shape characteristic of the other speaking objects when the second voice content is output, to obtain second lip language data; a ninth obtaining sub-module, configured to respectively identify a third text content corresponding to the second sound reception data and a fourth text content corresponding to the second lip language data, to obtain a second recognition result; and a fifth determining module, configured to determine to continuously receive the second voice content output by the other speaking objects until the voice recognition process ends if the second recognition result indicates that the third text content and the fourth text content are consistent.
Optionally, the apparatus further comprises: a sixth determining module, configured to display a plurality of text contents to be selected on a display interface if, after a predetermined time period, both the first recognition result and the second recognition result still indicate inconsistency, where the text contents to be selected include: the first text content, the second text content, the third text content, and the fourth text content; a selecting module, configured to select the correct expression content from the plurality of text contents to be selected in response to a click operation of the awakening object or the other speaking objects; and a control module, configured to control the target device to perform a feedback operation according to the correct expression content.
It should be noted that, for alternative or preferred implementations of this embodiment, reference may be made to the relevant description in embodiment 1, and details are not repeated here. In the technical scheme of the disclosure, the acquisition, storage, and application of the personal information of related users all comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
Example 3
The present disclosure also provides an electronic device, a readable storage medium, a computer program product, and a smart device interaction product, according to embodiments of the present disclosure.
Fig. 6 illustrates a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 802 or a computer program loaded from a storage unit 808 into a random access memory (RAM) 803. The RAM 803 can also store various programs and data required for the operation of the device 800. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
Computing unit 801 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 801 performs the various methods and processes described above, such as the intelligent device interaction method. For example, in some embodiments, the intelligent device interaction method can be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the intelligent device interaction method described above can be performed. Alternatively, in other embodiments, the computing unit 801 may be configured by any other suitable means (for example, by means of firmware) to perform the intelligent device interaction method.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (18)

1. An intelligent device interaction method comprises the following steps:
detecting an environment volume value of an environment where a target device is located;
if the environment volume value is detected to reach a preset threshold value, executing auxiliary identification in the process of identifying an awakening keyword to obtain auxiliary identification data, wherein the awakening keyword is used for awakening the target equipment;
and if the target equipment cannot be awakened by adopting the awakening keyword, awakening the target equipment according to the auxiliary identification data.
2. The method of claim 1, wherein the auxiliary identification at least comprises: lip language identification; and the step of performing auxiliary identification in the process of identifying the awakening keyword to obtain the auxiliary identification data comprises the following steps:
obtaining the awakening keyword by identifying the voice content output by the awakening object;
identifying the facial area of the awakening object to obtain facial feature information, wherein the facial feature information at least comprises: lip-shape features when labial sounds are emitted;
and performing the lip language recognition according to the lip-shape features to obtain the auxiliary identification data.
3. The method of claim 1, wherein the waking up the target device in accordance with the secondary identification data comprises:
determining a wake-up auxiliary word for waking up the target device according to the auxiliary identification data;
and awakening the target device by using the wake-up auxiliary word.
4. The method of claim 1, wherein after waking the target device in accordance with the secondary identification data, the method further comprises:
performing sound reception processing on first voice content output by the awakening object by using the target device to obtain first sound reception data;
identifying a first lip-shape feature of the awakening object when the awakening object outputs the first voice content to obtain first lip language data;
respectively identifying a first text content corresponding to the first sound reception data and a second text content corresponding to the first lip language data to obtain a first recognition result;
and determining whether to continuously receive the first voice content output by the awakening object according to the first recognition result.
5. The method of claim 4, wherein determining whether to continuously receive the first voice content output by the awakening object according to the first recognition result comprises:
if the first recognition result indicates that the first text content and the second text content are consistent, determining to continuously receive the first voice content until the voice recognition process is ended;
and if the first recognition result indicates that the first text content and the second text content are inconsistent, determining that continuous sound reception of the first voice content is not needed, and searching for other speaking objects except the awakening object within a predetermined range.
6. The method of claim 5, wherein after searching for other speaking objects except the awakening object within a predetermined range, the method further comprises:
performing sound reception processing on second voice content output by the other speaking objects to obtain second sound reception data;
identifying a second mouth shape characteristic of the other speaking objects when the other speaking objects output the second voice content to obtain second lip language data;
respectively identifying a third text content corresponding to the second sound reception data and a fourth text content corresponding to the second lip language data to obtain a second recognition result;
and if the second recognition result indicates that the third text content and the fourth text content are consistent, determining to continuously receive the second voice content output by the other speaking objects until the voice recognition process is ended.
7. The method of claim 6, wherein the method further comprises:
if, after a predetermined time period, both the first recognition result and the second recognition result still indicate inconsistency, displaying a plurality of text contents to be selected on a display interface, wherein the text contents to be selected comprise: the first text content, the second text content, the third text content, and the fourth text content;
in response to a click operation of the awakening object or the other speaking objects, selecting the correct expression content from the plurality of text contents to be selected;
And controlling the target equipment to execute feedback operation according to the correct expression content.
8. An intelligent device interaction apparatus, comprising:
the detection module is used for detecting the environmental volume value of the environment where the target equipment is located;
the acquisition module is used for executing auxiliary identification in the process of identifying the awakening keyword to obtain auxiliary identification data if the environmental volume value is detected to reach a preset threshold value, wherein the awakening keyword is used for awakening the target equipment;
and the awakening module is used for awakening the target equipment according to the auxiliary identification data if the target equipment cannot be awakened by adopting the awakening keyword.
9. The apparatus of claim 8, wherein the auxiliary identification at least comprises: lip language identification; and the acquisition module comprises:
a first obtaining sub-module, configured to obtain the awakening keyword by identifying the voice content output by the awakening object;
a second obtaining sub-module, configured to identify a facial region of the awakening object to obtain facial feature information, wherein the facial feature information at least comprises: lip-shape features when labial sounds are produced;
and a third obtaining submodule, configured to perform the lip language recognition according to the lip-shape features to obtain the auxiliary identification data.
10. The apparatus of claim 8, wherein the wake-up module comprises:
a first determining module, configured to determine, according to the auxiliary identification data, an awakening auxiliary word for awakening the target device;
and the first awakening submodule is used for awakening the target equipment by adopting the awakening auxiliary word.
11. The apparatus of claim 8, wherein the apparatus further comprises:
a fourth obtaining submodule, configured to perform sound reception processing on the first voice content output by the awakening object by using the target device to obtain first sound reception data;
a fifth obtaining submodule, configured to identify a first lip-shape feature of the awakening object when the first voice content is output, to obtain first lip language data;
a sixth obtaining submodule, configured to respectively identify a first text content corresponding to the first sound reception data and a second text content corresponding to the first lip language data, to obtain a first recognition result;
and the second determining module is used for determining whether to continuously receive the first voice content output by the awakening object according to the first recognition result.
12. The apparatus of claim 11, wherein the second determining means comprises:
A third determining module, configured to determine to continuously receive the first voice content until a voice recognition process is ended if the first recognition result indicates that the first text content and the second text content are consistent;
a fourth determining module, configured to determine that continuous sound reception of the first voice content is not required, and to search for other speaking objects except the awakening object within a predetermined range, if the first recognition result indicates that the first text content and the second text content are inconsistent.
13. The apparatus of claim 12, wherein the apparatus further comprises:
a seventh obtaining submodule, configured to perform sound reception processing on the second voice content output by the other speaking objects to obtain second sound reception data;
an eighth obtaining submodule, configured to identify a second mouth-shape characteristic of the other speaking objects when the second voice content is output, to obtain second lip language data;
a ninth obtaining sub-module, configured to respectively identify a third text content corresponding to the second sound reception data and a fourth text content corresponding to the second lip language data, to obtain a second recognition result;
and a fifth determining module, configured to determine to continuously receive the second voice content output by the other speaking object until the voice recognition process is ended if the second recognition result indicates that the third text content and the fourth text content are consistent.
14. The apparatus of claim 13, wherein the apparatus further comprises:
a sixth determining module, configured to display a plurality of text contents to be selected on a display interface if, after a predetermined time period, both the first recognition result and the second recognition result still indicate inconsistency, wherein the text contents to be selected comprise: the first text content, the second text content, the third text content, and the fourth text content;
a selecting module, configured to select the correct expression content from the plurality of text contents to be selected in response to a click operation of the awakening object or the other speaking objects;
and the control module is used for controlling the target equipment to execute feedback operation according to the correct expression content.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the smart device interaction method of any of claims 1-7.
16. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the smart device interaction method of any one of claims 1-7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the smart device interaction method of any one of claims 1-7.
18. A smart device interaction product comprising the electronic device of claim 15.
CN202210178420.3A (filed 2022-02-24, priority 2022-02-24): Intelligent device interaction method and device, storage medium and electronic device. Published as CN114678019A; status: Pending.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202210178420.3A | 2022-02-24 | 2022-02-24 | Intelligent device interaction method and device, storage medium and electronic device

Publications (1)

Publication Number | Publication Date
CN114678019A | 2022-06-28

Family

ID=82072523

Family Applications (1)

Application Number | Title | Priority Date | Filing Date | Status
CN202210178420.3A | Intelligent device interaction method and device, storage medium and electronic device | 2022-02-24 | 2022-02-24 | Pending

Country Status (1)

Country | Link
CN | CN114678019A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination