CN114299940A - Display device and voice interaction method

Info

Publication number: CN114299940A
Application number: CN202110577525.1A
Authority: CN (China)
Original language: Chinese (zh)
Inventor: 王峰
Applicant and current assignee: Hisense Visual Technology Co Ltd
Legal status: Pending

Classification

  • User Interface Of Digital Computer (AREA)

Abstract

The embodiment of the application provides a display device and a voice interaction method. The display device comprises a display used for presenting a user interface and a controller connected with the display, the controller being configured to: acquire user identity information of a target person and acquire a voice real-time instruction, wherein the target person comprises a person who sends a voice awakening instruction or a registered user; detect face information in an image acquired by a camera; if the face information is the face information of the target person, carry out face tracking and lip movement detection on the target person, and if the face of the target person has lip movement and the voice real-time instruction comprises the voice of the target person, respond to the voice real-time instruction; and if the face of the target person does not have lip movement, or the voice real-time instruction does not comprise the voice of the target person, do not respond to the voice real-time instruction. The technical problem of poor voice interaction experience is thereby solved.

Description

Display device and voice interaction method
Technical Field
The present application relates to the field of voice interaction technologies, and in particular, to a display device and a voice interaction method.
Background
With the rise of smart homes, controlling home devices such as smart televisions through voice interaction has become an increasingly popular control mode. The wake-up rate and the speech recognition accuracy are two important indicators that affect the user experience of voice interaction. In the early stage of voice interaction technology, voice interaction was usually near-field interaction: in a near-field voice interaction scene, the distance between the user and the device is small, the influence of noise interference is small, and the wake-up rate and the speech recognition accuracy are high. However, when people watch television they are usually far from the television, so near-field interaction cannot meet their needs, and far-field voice interaction technology has been developed to improve the convenience of voice interaction. In a far-field voice interaction scene, the distance between the user and the device is large, the influence of noise interference is also large, and the wake-up rate and the speech recognition accuracy decrease accordingly, resulting in a poor voice interaction experience.
Disclosure of Invention
In order to solve the technical problem of poor voice interaction experience, the application provides a display device and a voice interaction method.
In a first aspect, the present application provides a display device comprising:
a display for presenting a user interface;
the camera is used for collecting images;
a controller connected with the display, the controller configured to:
collecting a voice awakening instruction;
responding to the voice awakening instruction, acquiring user identity information of a target person, and acquiring a voice real-time instruction, wherein the target person comprises a person who sends the awakening instruction or a registered user;
detecting face information in an image acquired by a camera;
if the face information of the target person is detected, carrying out face tracking and lip movement detection on the target person, and if the face of the target person has lip movement and the voice real-time instruction comprises the voice of the target person, responding to the voice real-time instruction;
and if the human face of the target person does not have lip movement or the voice real-time instruction does not comprise the voice of the target person, not responding to the voice real-time instruction.
In some embodiments, detecting face information in an image captured by the camera includes:
carrying out sound source positioning on the voice awakening instruction to obtain an awakening sound source position;
and rotating the camera towards the awakening sound source position, detecting face information in an image acquired by the camera in the rotating process, and controlling the camera to stop rotating if the face information of the target person is detected.
In some embodiments, carrying out face tracking and lip movement detection on the target person comprises the following steps:
acquiring a real-time coordinate range of the face of a target person in an image shot by the camera;
controlling the camera to rotate according to the variation trend of the real-time coordinate range, so that the face of the target person is positioned in a preset area in the image acquired by the camera;
and carrying out lip movement detection on the image of the face of the target person.
In a second aspect, the present application provides a voice interaction method, including:
collecting a voice awakening instruction;
responding to the voice awakening instruction, acquiring user identity information of a target person, and acquiring a voice real-time instruction, wherein the target person comprises a person who sends the awakening instruction or a registered user;
detecting face information in an image acquired by a camera;
if the face information of the target person is detected, carrying out face tracking and lip movement detection on the target person, and if the face of the target person has lip movement and the voice real-time instruction comprises the voice of the target person, responding to the voice real-time instruction;
and if the human face of the target person does not have lip movement or the voice real-time instruction does not comprise the voice of the target person, not responding to the voice real-time instruction.
The display device and the voice interaction method provided by the application have the following beneficial effects:
After receiving the voice awakening instruction, the display device acquires the user identity information of the awakening source and performs face tracking on the target person corresponding to the user identity information, so that interference from non-target persons can be eliminated. After the face of the target person is tracked, the display device responds to the voice real-time instruction only if the target person shows lip movement and the received voice real-time instruction comprises the voice of the target person; if the target person shows no lip movement, or the voice real-time instruction does not comprise the voice of the target person, the voice real-time instruction is not responded to. This reduces the probability of collecting noise, improves the speech recognition accuracy, and improves the user experience.
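The decision flow described above can be illustrated with a short sketch. The following Python fragment is a minimal, simplified rendering of that logic for readability only; the detection callables it relies on are hypothetical placeholders for the steps described in this application, not an actual implementation.

    def handle_voice_session(camera, audio_stream, target_identity, detectors):
        """Decision flow after wake-up: respond only when the target person is
        visibly speaking and the captured audio contains the target's voice.
        `detectors` bundles hypothetical detection callables (assumed interface)."""
        frame = camera.capture()
        face = detectors.find_target_face(frame, target_identity)
        if face is None:
            return None                      # target not located: possible false wake-up
        detectors.track_face(camera, face)   # keep the target's face in the field of view
        instruction = audio_stream.read()    # the "voice real-time instruction"
        if detectors.has_lip_movement(face) and detectors.contains_target_voice(
                instruction, target_identity):
            return detectors.respond(instruction)   # semantic recognition and response
        return None                          # no lip movement or no target voice: do not respond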
Drawings
In order to more clearly explain the technical solution of the present application, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious to those skilled in the art that other drawings can be obtained according to the drawings without any creative effort.
Fig. 1 is a schematic diagram illustrating an operation scenario between a display device and a control apparatus according to some embodiments;
Fig. 2 is a block diagram illustrating the hardware configuration of the control apparatus 100 according to some embodiments;
Fig. 3 is a block diagram illustrating the hardware configuration of the display device 200 according to some embodiments;
Fig. 4 is a schematic diagram illustrating the software configuration in the display device 200 according to some embodiments;
Fig. 5 is a schematic diagram illustrating the principle of voice interaction according to some embodiments;
Fig. 6 is a schematic diagram illustrating a voice interaction scenario according to some embodiments;
Fig. 7 is a signal processing schematic diagram of voice interaction according to some embodiments;
Fig. 8 is a timing diagram of voice interaction according to some embodiments;
Fig. 9 is an overall flow diagram of a voice interaction method according to some embodiments;
Fig. 10 is a flow diagram illustrating a method of processing a voice wake-up instruction according to some embodiments;
Fig. 11 is a flow diagram illustrating a processing method when a single person's face is detected during voice interaction according to some embodiments;
Fig. 12 is a flow diagram illustrating a processing method when faces of multiple persons are detected during voice interaction according to some embodiments.
Detailed Description
To make the purpose and embodiments of the present application clearer, the exemplary embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. It is obvious that the described exemplary embodiments are only a part of the embodiments of the present application, and not all of them.
It should be noted that the brief descriptions of the terms in the present application are only for the convenience of understanding the embodiments described below, and are not intended to limit the embodiments of the present application. These terms should be understood in their ordinary and customary meaning unless otherwise indicated.
The terms "first," "second," "third," and the like in the description and claims of this application and in the above-described drawings are used for distinguishing between similar or analogous objects or entities and not necessarily for describing a particular sequential or chronological order, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
The terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to all elements expressly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
The term "module" refers to any known or later developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware and/or software code that is capable of performing the functionality associated with that element.
Fig. 1 is a schematic diagram of an operation scenario between a display device and a control apparatus according to an embodiment. As shown in fig. 1, a user may operate the display apparatus 200 through the smart device 300 or the control device 100.
In some embodiments, the control apparatus 100 may be a remote controller, and the remote controller may communicate with the display device 200 through infrared protocol communication, Bluetooth protocol communication, or other short-distance communication methods, so as to control the display device 200 in a wireless or wired manner. The user may input user instructions through keys on the remote controller, voice input, control panel input, etc., to control the display device 200.
In some embodiments, the smart device 300 (e.g., mobile terminal, tablet, computer, laptop, etc.) may also be used to control the display device 200. For example, the display device 200 is controlled using an application program running on the smart device.
In some embodiments, the display device 200 may also be controlled in a manner other than through the control apparatus 100 and the smart device 300. For example, a voice command from the user may be received directly by a module configured inside the display device 200 for obtaining voice commands, or may be received by a voice control device provided outside the display device 200.
In some embodiments, the display device 200 is also in data communication with a server 400. The display device 200 may be communicatively connected through a Local Area Network (LAN), a Wireless Local Area Network (WLAN), or other networks. The server 400 may provide various contents and interactions to the display device 200. The server 400 may be one cluster or a plurality of clusters, and may include one or more types of servers.
Fig. 2 exemplarily shows a block diagram of a configuration of the control apparatus 100 according to an exemplary embodiment. As shown in fig. 2, the control device 100 includes a controller 110, a communication interface 130, a user input/output interface 140, a memory, and a power supply. The control apparatus 100 may receive an input operation instruction from a user and convert the operation instruction into an instruction recognizable and responsive by the display device 200, serving as an interaction intermediary between the user and the display device 200.
Fig. 3 shows a hardware configuration block diagram of the display apparatus 200 according to an exemplary embodiment.
In some embodiments, the display apparatus 200 includes at least one of a tuner demodulator 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, a memory, a power supply, a user interface.
In some embodiments the controller comprises a processor, a video processor, an audio processor, a graphics processor, a RAM, a ROM, a first interface to an nth interface for input/output.
In some embodiments, the display 260 includes a display screen component for presenting pictures and a driving component for driving image display, and is used for receiving image signals output by the controller and displaying video content, image content, menu manipulation interfaces, and user manipulation UI interfaces.
In some embodiments, the display 260 may be a liquid crystal display, an OLED display, or a projection display, and may also be a projection device with a projection screen.
In some embodiments, communicator 220 is a component for communicating with external devices or servers according to various communication protocol types. For example: the communicator may include at least one of a Wifi module, a bluetooth module, a wired ethernet module, and other network communication protocol chips or near field communication protocol chips, and an infrared receiver. The display apparatus 200 may establish transmission and reception of control signals and data signals with the external control apparatus 100 or the server 400 through the communicator 220.
In some embodiments, the user interface may be configured to receive control signals from the control apparatus 100 (e.g., an infrared remote controller, etc.).
In some embodiments, the detector 230 is used to collect signals of the external environment or interaction with the outside. For example, detector 230 includes a light receiver, a sensor for collecting ambient light intensity; alternatively, the detector 230 includes an image collector, such as a camera, which may be used to collect external environment scenes, attributes of the user, or user interaction gestures, or the detector 230 includes a sound collector, such as a microphone, which is used to receive external sounds.
In some embodiments, the external device interface 240 may include, but is not limited to, the following: high Definition Multimedia Interface (HDMI), analog or data high definition component input interface (component), composite video input interface (CVBS), USB input interface (USB), RGB port, and the like. The interface may be a composite input/output interface formed by the plurality of interfaces.
In some embodiments, the tuner demodulator 210 receives broadcast television signals through wired or wireless reception, and demodulates audio/video signals, as well as EPG data signals, from a plurality of wireless or wired broadcast television signals.
In some embodiments, the controller 250 and the tuner demodulator 210 may be located in separate devices; that is, the tuner demodulator 210 may also be located in a device external to the main device where the controller 250 is located, such as an external set-top box.
In some embodiments, the controller 250 controls the operation of the display device and responds to user operations through various software control programs stored in memory. The controller 250 controls the overall operation of the display apparatus 200. For example: in response to receiving a user command for selecting a UI object to be displayed on the display 260, the controller 250 may perform an operation related to the object selected by the user command.
In some embodiments, the object may be any one of selectable objects, such as a hyperlink, an icon, or other actionable control. The operations related to the selected object are: displaying an operation connected to a hyperlink page, document, image, or the like, or performing an operation of a program corresponding to the icon.
In some embodiments the controller comprises at least one of a Central Processing Unit (CPU), a video processor, an audio processor, a Graphics Processing Unit (GPU), a RAM Random Access Memory (RAM), a ROM (Read-Only Memory), a first to nth interface for input/output, a communication Bus (Bus), and the like.
The CPU processor is used for executing operating system and application program instructions stored in the memory, and for executing various application programs, data, and contents according to various interactive instructions received from external input, so as to finally display and play various audio and video contents. The CPU processor may include a plurality of processors, for example a main processor and one or more sub-processors.
In some embodiments, the graphics processor is used for generating various graphics objects, such as icons, operation menus, and graphics for displaying user input instructions. The graphics processor comprises an arithmetic unit, which performs operations by receiving the various interactive instructions input by the user and displays various objects according to their display attributes, and a renderer, which renders the various objects obtained by the arithmetic unit; the rendered objects are then displayed on the display.
In some embodiments, the video processor is configured to receive an external video signal and perform video processing such as decompression, decoding, scaling, noise reduction, frame rate conversion, resolution conversion, and image synthesis according to the standard codec protocol of the input signal, so as to obtain a signal that can be displayed or played directly on the display device 200.
In some embodiments, the video processor includes a demultiplexing module, a video decoding module, an image synthesis module, a frame rate conversion module, a display formatting module, and the like. The demultiplexing module is used for demultiplexing the input audio and video data stream. The video decoding module is used for processing the demultiplexed video signal, including decoding, scaling, and the like. The image synthesis module is used for superimposing and mixing the GUI signal, generated by the graphics generator according to user input or by the graphics generator itself, with the scaled video image, so as to generate an image signal for display. The frame rate conversion module is used for converting the frame rate of the input video. The display formatting module is used for converting the received video output signal after frame rate conversion into a signal that conforms to the display format, such as an output RGB data signal.
In some embodiments, the audio processor is configured to receive an external audio signal, decompress and decode the received audio signal according to a standard codec protocol of the input signal, and perform noise reduction, digital-to-analog conversion, and amplification processing to obtain an audio signal that can be played in the speaker.
In some embodiments, a user may enter user commands on a Graphical User Interface (GUI) displayed on display 260, and the user input interface receives the user input commands through the Graphical User Interface (GUI). Alternatively, the user may input the user command by inputting a specific sound or gesture, and the user input interface receives the user input command by recognizing the sound or gesture through the sensor.
In some embodiments, a "user interface" is a media interface for interaction and information exchange between an application or operating system and a user that enables conversion between an internal form of information and a form that is acceptable to the user. A commonly used presentation form of the User Interface is a Graphical User Interface (GUI), which refers to a User Interface related to computer operations and displayed in a graphical manner. It may be an interface element such as an icon, a window, a control, etc. displayed in the display screen of the electronic device, where the control may include a visual interface element such as an icon, a button, a menu, a tab, a text box, a dialog box, a status bar, a navigation bar, a Widget, etc.
In some embodiments, a system of a display device may include a Kernel (Kernel), a command parser (shell), a file system, and an application program. The kernel, shell, and file system together make up the basic operating system structure that allows users to manage files, run programs, and use the system. After power-on, the kernel is started, kernel space is activated, hardware is abstracted, hardware parameters are initialized, and virtual memory, a scheduler, signals and interprocess communication (IPC) are operated and maintained. And after the kernel is started, loading the Shell and the user application program. The application program is compiled into machine code after being started, and a process is formed.
Referring to fig. 4, in some embodiments, the system is divided into four layers, which are an Application (Applications) layer (abbreviated as "Application layer"), an Application Framework (Application Framework) layer (abbreviated as "Framework layer"), an Android runtime (Android runtime) and system library layer (abbreviated as "system runtime library layer"), and a kernel layer from top to bottom.
In some embodiments, at least one application program runs in the application program layer, and the application programs may be windows (windows) programs carried by an operating system, system setting programs, clock programs or the like; or an application developed by a third party developer. In particular implementations, the application packages in the application layer are not limited to the above examples.
The framework layer provides an Application Programming Interface (API) and a programming framework for the applications in the application layer. The application framework layer includes a number of predefined functions and acts as a processing center that decides the actions of the applications in the application layer. Through the API interface, an application program can access the resources in the system and obtain the services of the system during execution.
As shown in fig. 4, in the embodiment of the present application, the application framework layer includes a manager (Managers), a Content Provider (Content Provider), and the like, where the manager includes at least one of the following modules: an Activity Manager (Activity Manager) is used for interacting with all activities running in the system; the Location Manager (Location Manager) is used for providing the system service or application with the access of the system Location service; a Package Manager (Package Manager) for retrieving various information related to an application Package currently installed on the device; a Notification Manager (Notification Manager) for controlling display and clearing of Notification messages; a Window Manager (Window Manager) is used to manage the icons, windows, toolbars, wallpapers, and desktop components on a user interface.
In some embodiments, the activity manager is used to manage the lifecycle of the various applications as well as general navigational fallback functions, such as controlling exit, opening, fallback, etc. of the applications. The window manager is used for managing all window programs, such as obtaining the size of a display screen, judging whether a status bar exists, locking the screen, intercepting the screen, controlling the change of the display window (for example, reducing the display window, displaying a shake, displaying a distortion deformation, and the like), and the like.
In some embodiments, the system runtime library layer provides support for the upper layer, i.e., the framework layer. When the framework layer is used, the Android operating system runs the C/C++ libraries included in the system runtime library layer to implement the functions to be implemented by the framework layer.
In some embodiments, the kernel layer is a layer between hardware and software. As shown in fig. 4, the kernel layer includes at least one of the following drivers: audio driver, display driver, Bluetooth driver, camera driver, Wi-Fi driver, USB driver, HDMI driver, sensor driver (such as fingerprint sensor, temperature sensor, pressure sensor, etc.), power driver, and the like.
The hardware or software architecture in some embodiments may be based on the description in the above embodiments, and in some embodiments may be based on other hardware or software architectures similar to the above embodiments, as long as the technical solution of the present application can be implemented.
For clarity of explanation of the embodiments of the present application, a speech recognition network architecture provided by the embodiments of the present application is described below with reference to fig. 5.
Referring to fig. 5, fig. 5 is a schematic diagram of a voice recognition network architecture according to an embodiment of the present application. In fig. 5, the smart device is configured to receive input information and output a processing result of the information. The voice recognition service device is an electronic device with the voice recognition service deployed, the semantic service device is an electronic device with the semantic service deployed, and the business service device is an electronic device with the business service deployed. The electronic device may include a server, a computer, and the like. The voice recognition service, the semantic service (also referred to as a semantic engine), and the business service are web services that can be deployed on the electronic device, wherein the voice recognition service is used for recognizing audio as text, the semantic service is used for semantic parsing of the text, and the business service is used for providing specific services, such as a weather query service (e.g., Moji Weather) or a music query service (e.g., QQ Music). In one embodiment, in the architecture shown in fig. 5, there may be multiple entity service devices deployed with different business services, and one or more functional services may also be aggregated in one or more entity service devices.
In some embodiments, the following describes an example of a process for processing information input to the smart device based on the architecture shown in fig. 5. Taking the information input to the smart device as a query statement input by voice as an example, the process may include the following three processes:
[ Speech recognition ]
The intelligent device can upload the audio of the query sentence to the voice recognition service device after receiving the query sentence input by voice, so that the voice recognition service device can recognize the audio as a text through the voice recognition service and then return the text to the intelligent device. In one embodiment, before uploading the audio of the query statement to the speech recognition service device, the smart device may perform denoising processing on the audio of the query statement, where the denoising processing may include removing echo and environmental noise.
[ semantic understanding ]
The intelligent device uploads the text of the query sentence identified by the voice identification service to the semantic service device, and the semantic service device performs semantic analysis on the text through semantic service to obtain the service field, intention and the like of the text.
[ semantic response ]
And the semantic service equipment issues a query instruction to corresponding business service equipment according to the semantic analysis result of the text of the query statement so as to obtain the query result given by the business service. The intelligent device can obtain the query result from the semantic service device and output the query result. As an embodiment, the semantic service device may further send a semantic parsing result of the query statement to the intelligent device, so that the intelligent device outputs a feedback statement in the semantic parsing result.
It should be noted that the architecture shown in fig. 5 is only an example, and is not intended to limit the scope of the present application. In the embodiment of the present application, other architectures may also be adopted to implement similar functions, for example: all or part of the three processes can be completed by the intelligent terminal, and are not described herein.
In some embodiments, the intelligent device shown in fig. 5 may be a display device, such as a smart television, the functions of the speech recognition service device may be implemented by cooperation of a sound collector and a controller provided on the display device, and the functions of the semantic service device and the business service device may be implemented by the controller of the display device or by a server of the display device.
In some embodiments, a query statement or other interactive statement that a user enters a display device through speech may be referred to as a voice instruction.
In some embodiments, the display device obtains, from the semantic service device, a query result given by the business service, and the display device may analyze the query result to generate response data of the voice instruction, and then control the display device to execute a corresponding action according to the response data.
In some embodiments, the semantic analysis result of the voice instruction is acquired by the display device from the semantic service device, and the display device may analyze the semantic analysis result to generate response data, and then control the display device to execute a corresponding action according to the response data.
In some embodiments, a voice control button may be disposed on the remote controller of the display device, and after the user presses the voice control button on the remote controller, the controller of the display device may control the display of the display device to display the voice interaction interface and control the sound collector, such as a microphone, to collect sound around the display device. At this time, the user may input a voice instruction to the display device.
In some embodiments, the display device may support a voice wake-up function, and the sound collector of the display device may be in a state of continuously collecting sound. After the user speaks the awakening word, the display device performs voice recognition on the voice instruction input by the user, and after the voice instruction is recognized to be the awakening word, the display of the display device can be controlled to display a voice interaction interface, and at the moment, the user can continue to input the voice instruction to the display device. The wake-up word may be referred to as a voice wake-up command, and the voice command that the user continues to input may be referred to as a voice real-time command.
In some embodiments, after a user inputs a voice instruction, in a process that the display device acquires response data of the voice instruction or the display device responds according to the response data, the sound collector of the display device can keep a sound collection state, the user can press a voice control button on the remote controller at any time to re-input the voice instruction or speak a wakeup word, at this time, the display device can end a last voice interaction process, and a new voice interaction process is started according to the voice instruction newly input by the user, so that the real-time performance of voice interaction is guaranteed.
In some embodiments, when the current interface of the display device is a voice interaction interface, the display device performs voice recognition on a voice instruction input by a user to obtain a text corresponding to the voice instruction, the display device itself or a server of the display device performs semantic understanding on the text to obtain a user intention, processes the user intention to obtain a semantic analysis result, and generates response data according to the semantic analysis result.
For example, in a voice interaction mode in which the display device starts a voice conversation according to the received voice wake-up instruction, the user who sends the voice wake-up instruction may be referred to as a target person.
In some embodiments, the target person may also be a registered user on the display device, and when the user registers on the display device, voiceprint information and a face image may be entered on the display device.
In some voice interaction scenarios, due to environmental noise interference, sound interference of a non-target person, and the like, the display device may be awoken by mistake, or the intention of the target person cannot be accurately identified according to the collected audio, which may seriously affect the voice interaction experience of the user.
In order to solve the above problem, embodiments of the present application show a voice interaction scheme, which can effectively improve voice interaction experience in a complex voice interaction scene by combining a method of audio signal processing and video signal processing.
In some embodiments, to collect the video signals required during the voice interaction process, the display device may be provided with a camera, or be connected with an external camera.
Taking the case in which the display device is provided with a camera as an example, see fig. 6, a schematic diagram of a voice interaction scene according to some embodiments. As shown in fig. 6, in some embodiments, the display device 200 may be provided with a camera 201, and the camera 201 may capture images. If the camera is fixed on the display device 200, only images within a certain field of view can be captured, such as the image within area A in fig. 6, where the field angle of area A may be α, and α is smaller than 180 degrees. The camera 201 can photograph the user when the user stands in area A, but cannot photograph the user when the user stands in area B, which is to the left of area A, or in area C, which is to the right of area A.
In some embodiments, to expand the field of view of the camera 201, the camera 201 may be provided with a pan-tilt or another structure that can adjust the field angle of the camera. The pan-tilt can adjust the field of view of the camera 201, and the controller of the display device can be connected with the camera and control, through the pan-tilt, the dynamic change of the field of view of the camera, so that the range covered by rotating the camera 201 can reach 0-180 degrees and users within the 0-180 degree range can be photographed, where 0 degrees represents that the user stands on the left side of the display device 200 in the same plane as the display device 200, and 180 degrees represents that the user stands on the right side of the display device 200 in the same plane as the display device 200. Therefore, by rotating the camera, users in areas B and C can also be photographed, achieving the effect that as long as a user stands in front of the display device 200, the user can be photographed by rotating the camera 201.
In some embodiments, the display device performs image acquisition through an external camera, which may be a camera provided with a pan-tilt, thereby enabling image acquisition of a dynamic field of view.
In some embodiments, the camera 201 itself has no pan/tilt, and may be mounted on a pan/tilt communicatively connected to the display device, and the image acquisition of the dynamic field of view may also be implemented by controlling the pan/tilt through the display device.
In some embodiments, the method for combining audio signal processing and video signal processing by a display device can be seen in fig. 7, which is a signal processing schematic diagram of a voice interaction process according to some embodiments.
As shown in fig. 7, in some embodiments, the display device derives an audio signal from the audio collected by the microphone, and the processing of the audio signal includes sound source localization, user attribute recognition, voiceprint recognition, speech recognition, and noise reduction enhancement.
Sound source localization may include determining an angle between the audio source and the display device. Referring to fig. 6, if α is 90 degrees, the wake-up angle is between 0-45 degrees when the user is located in area B, between 45-135 degrees when the user is located in area A, and between 135-180 degrees when the user is located in area C. Sound source localization can be achieved by various algorithms, such as the time difference of arrival method, beamforming, etc. To realize sound source localization, the display device can be provided with a microphone array, which comprises a plurality of microphones arranged at different positions of the display device, each microphone being connected with the controller of the display device; the display device collects multi-channel audio signals through the plurality of microphones of the microphone array and comprehensively analyzes the multi-channel audio signals to obtain the wake-up angle. For example, in the time difference of arrival method, the display device may calculate the position of the audio source relative to the display device according to the time differences of the audio signals received by the plurality of microphones and the relative position relationship between the microphones. In the beamforming method, the display device can filter, weight, and superpose the audio signals collected by each microphone to form sound pressure distribution beams, and obtain the position of the audio source relative to the display device according to the distribution characteristics of the sound pressure beams. The angle between the audio source and the display device is then obtained from the position of the audio source relative to the display device.
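As an illustration of the time difference of arrival method mentioned above, the sketch below estimates a wake-up angle from two microphone channels by cross-correlation. It is a simplified two-microphone, far-field example in Python; the microphone spacing, sampling rate, and the mapping to a 0-180 degree angle are assumptions for illustration, not parameters defined by this application.

    import numpy as np

    SPEED_OF_SOUND = 343.0  # m/s at room temperature (assumed)

    def estimate_wake_angle(left, right, sample_rate=16000, mic_distance=0.1):
        # Cross-correlate the two channels to find the relative delay in samples.
        corr = np.correlate(left, right, mode="full")
        lag = int(np.argmax(corr)) - (len(right) - 1)
        tdoa = lag / sample_rate                      # time difference of arrival, seconds
        # Far-field model: sin(theta) = c * tdoa / d, clipped to the valid range.
        sin_theta = np.clip(SPEED_OF_SOUND * tdoa / mic_distance, -1.0, 1.0)
        return float(np.degrees(np.arcsin(sin_theta))) + 90.0   # mapped to 0-180 degrees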
The user attributes determined by the user attribute identification may include the user's gender and the user's age, wherein the age may be an age group such as 1-10 years, 11-20 years, 20-40 years, 40-60 years, and so forth. User attribute recognition may be implemented based on a pre-trained model. By collecting a large number of audio samples with different user attributes, a model capable of predicting the user attributes can be trained based on a neural network, and the user attributes can be obtained after audio signals are input into the model.
Noise reduction enhancement may include enhancing the speech of a targeted person in an audio signal and noise reducing the audio of a non-targeted person. Noise reduction enhancement may be achieved through speech directional enhancement techniques. The voice enhancement beam is dynamically adjusted to an enhancement beam centered on the target person, so that the voice enhancement is performed on the target person, and the sound outside the beam is suppressed.
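The directional enhancement described here is commonly implemented with a delay-and-sum beamformer, which time-aligns the microphone channels toward the target direction before averaging them. The sketch below is a generic far-field example of that technique, not the enhancement algorithm of this application; the array geometry and parameters are assumptions.

    import numpy as np

    def delay_and_sum(channels, mic_positions, angle_deg, sample_rate=16000, c=343.0):
        # channels: array of shape (n_mics, n_samples); mic_positions: (n_mics, 2) in meters.
        source_dir = np.array([np.cos(np.radians(angle_deg)), np.sin(np.radians(angle_deg))])
        # Microphones farther from the source receive the wavefront later and must be
        # advanced by more samples before the channels are summed.
        delays = -(mic_positions @ source_dir) / c
        delays -= delays.min()
        shifts = np.round(delays * sample_rate).astype(int)
        length = channels.shape[1] - int(shifts.max())
        aligned = np.stack([ch[s:s + length] for ch, s in zip(channels, shifts)])
        return aligned.mean(axis=0)   # enhanced single-channel signal steered toward angle_deg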
It can be seen that processing the audio signals enables determining the user's location, the user's identity, and the audio content. After the user's position is determined, the camera can be controlled to rotate so as to quickly locate the target person. After the user's identity is determined, the target person can be distinguished from other people during voice interaction. Determining the audio content reveals the user's intention.
As shown in fig. 7, in some embodiments, the display device obtains a video signal from an image captured by the camera, and the processing of the video signal includes face detection and tracking, face recognition, lip motion detection, and lip language recognition.
During face detection and tracking, the camera can be controlled to rotate to ensure that the target person always remains within the field of view of the camera.
Lip movement detection can be performed on the face captured by the camera to judge whether the lips are changing: if the lips change, it can be determined that the person is speaking, and if the lips do not change, it can be determined that the person is not speaking. If a person is speaking, face recognition technology may be combined to determine whether the person speaking is the target person; if the persons speaking include the target person, it may be determined that the received audio signal includes the voice of the target person, and if the persons speaking do not include the target person, it may be determined that the received audio signal does not include the voice of the target person.
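One common way to implement the lip-change judgement described here is to track the mouth opening of the detected face across a short window of frames and report movement when it fluctuates beyond a threshold. The Python sketch below assumes that facial landmarks for the lips and mouth corners are already available from a face detector; the landmark names and the threshold value are illustrative assumptions.

    import numpy as np

    def mouth_aspect_ratio(landmarks):
        # landmarks: dict mapping point names to (x, y) pixel coordinates (assumed layout).
        vertical = np.linalg.norm(np.subtract(landmarks["upper_lip"], landmarks["lower_lip"]))
        horizontal = np.linalg.norm(np.subtract(landmarks["mouth_left"], landmarks["mouth_right"]))
        return vertical / max(horizontal, 1e-6)

    def has_lip_movement(landmark_sequence, threshold=0.04):
        # Lip movement is reported when the mouth opening fluctuates noticeably
        # over consecutive frames (threshold chosen for illustration only).
        ratios = [mouth_aspect_ratio(lm) for lm in landmark_sequence]
        return (max(ratios) - min(ratios)) > threshold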
Lip language recognition recognizes the speech content of the person who is speaking, and this content can be compared with the audio content obtained by performing voice recognition on the audio signal. If the two are consistent or approximately consistent, the audio content of the audio signal can be considered to originate from a person in the image captured by the camera; conversely, if they differ greatly, the audio content can be considered to originate from a person or environment outside the image captured by the camera. In some embodiments, lip language recognition may not be performed, and only lip movement detection is used to determine whether the audio content of the audio signal comes from a person in the image captured by the camera, which can reduce the resource occupation of video signal processing and improve processing efficiency.
It can be seen that the processing of the video signal enables tracking of the user's location and determination of whether the user is speaking.
As shown in fig. 7, in some embodiments, the display apparatus may further acquire application scene information after obtaining the processing result of the audio signal and the processing result of the video signal. The application scene information may include interaction control information preset by foreground running application, and the interaction control information may include an audio acquisition control parameter and a video acquisition control parameter, for example, a value of the audio acquisition control parameter is 1 to indicate that audio data acquisition is currently possible, a value of the audio acquisition control parameter is 0 to indicate that audio data acquisition is currently unavailable, a value of the video acquisition control parameter is 1 to indicate that video data acquisition is currently possible, and a value of the video acquisition control parameter is 0 to indicate that video data acquisition is currently unavailable.
For example, when the foreground running application is a video chat application, the audio acquisition control parameter and the video acquisition control parameter are both 1, and the fusion decision engine may perform comprehensive analysis on the processing result of the audio signal and the processing result of the video signal according to that the audio acquisition control parameter and the video acquisition control parameter are both 1, so as to obtain the recognition result.
When the foreground running application is an online teaching application, if the display device is a student end, the situation that the student is not allowed to speak may exist at some moments, and at the moment, the audio acquisition control parameter may be 0. The fusion decision engine may determine whether to adopt the processing result of the audio signal and the processing result of the video signal according to the audio acquisition control parameter and the video acquisition control parameter to obtain an identification result, or determine to control the display device to acquire audio data and video data.
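The audio and video acquisition control parameters in the two examples above behave as binary flags attached to the foreground application. A minimal sketch of how such application scene information might be represented (the structure and field names are assumptions for illustration):

    from dataclasses import dataclass

    @dataclass
    class SceneInfo:
        # Interaction control information preset by the foreground application.
        audio_capture: int = 1   # 1: audio data may currently be collected, 0: it may not
        video_capture: int = 1   # 1: video data may currently be collected, 0: it may not

    # Example: a video chat application allows both modalities, while a student-side
    # online teaching application may temporarily disable audio collection.
    video_chat_scene = SceneInfo(audio_capture=1, video_capture=1)
    online_class_scene = SceneInfo(audio_capture=0, video_capture=1)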
In some embodiments, the processing result of the audio signal, the processing result of the video signal and the application scene information are input into a feature fusion decision engine, and the recognition result of multiple modes can be output through the feature fusion decision engine. The multi-modal recognition result may include the voice content of the audio signal, the character that is speaking, the character that is not speaking, and the correspondence between the character and the voice content. According to the corresponding relation, whether the target person speaks or not can be determined, and if the target person speaks, the speaking content can be determined, so that the accuracy of voice recognition is improved, and the probability of mistaken awakening is reduced.
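Conceptually, the fusion decision engine combines the audio-side result, the video-side result, and the application scene flags into a single decision about whether and how to respond. The fragment below is a deliberately simplified illustration of that combination, reusing the SceneInfo sketch above; the result dictionaries and their keys are hypothetical.

    def fuse(audio_result, video_result, scene):
        # Respond only when the scene allows both modalities, the target person is
        # seen speaking, and the audio is attributed to the target person's voice.
        if not (scene.audio_capture and scene.video_capture):
            return None
        target_is_speaking = video_result.get("target_lip_movement", False)
        audio_from_target = audio_result.get("contains_target_voice", False)
        if target_is_speaking and audio_from_target:
            return audio_result.get("text")   # speech content attributed to the target person
        return None                           # otherwise: do not respond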
In some embodiments, the audio signal processing and the video signal processing shown in fig. 7 can be implemented locally by the display device, or the display device sends the audio signal and the video signal to the server, and the server processes the audio signal and the video signal at the cloud end and returns the processing result to the display device, or the display device locally implements part of functions, and the server implements part of functions.
In some embodiments, the voiceprint information and the face information of the user are private information. The display device may be configured to display option controls for privacy functions such as voiceprint recognition and face recognition when the user first enters the start-up navigation or uses the voice assistant function for the first time, and to display a prompt asking for confirmation of the use of the camera and far-field voice. After viewing the prompt, the user may click the option control to enable these functions and improve the voice interaction effect, or choose not to trigger the option control to improve privacy security.
To further illustrate the signal processing of the voice interaction process shown in FIG. 7, FIG. 8 illustrates a timing diagram of the voice interaction process according to some embodiments. It should be noted that the timing diagram is only an exemplary timing diagram of the signal processing procedure shown in fig. 7, and in an actual embodiment, the signal processing procedure shown in fig. 7 may also include other timings.
Referring to fig. 8, in the voice interaction process, the display device may interact with the user and the server, respectively, to provide the user with a voice control service.
In fig. 8, the user may be a target person who wants to control the display device, and the microphone of the display device may capture voice or ambient noise of other users in addition to the target person.
In some embodiments, the wake-up word input by the user to the display device may be a voice wake-up instruction, and after receiving the voice wake-up instruction, the display device may perform sound source positioning according to the voice wake-up instruction, calculate a wake-up sound source position, and obtain wake-up sound source position information, where the wake-up sound source position information may include a wake-up angle.
In some embodiments, after the display device calculates the wake-up angle, the camera may be turned on, and the camera may be turned toward the wake-up sound source according to the wake-up angle, and shooting may be continued during the turning process to obtain an image in the dynamic field of view. The awakening sound source is the target person. Of course, if no user inputs a wake-up word to the display device, i.e., the display device is mistakenly woken up, the wake-up sound source may not actually exist.
In some embodiments, after the voice wake-up instruction is collected, in order to avoid the case that the voice wake-up instruction is a falsely triggered wake-up, the display device may obtain the user identity information associated with the voice wake-up instruction and then detect the target person in the images collected within the dynamic field of view. If the target person can be located in the collected images, it is determined that the wake-up is not a false wake-up; if the target person cannot be located, it is determined that the wake-up is a false wake-up.
In some embodiments, in order to obtain the user identity information, the display device may generate a wake word recognition request including a wake word voice, and send the wake word recognition request to the server. And after receiving the awakening word recognition request, the server extracts awakening word voice from the request, and performs user attribute recognition and voiceprint recognition on the awakening word voice to obtain user identity information of the target person.
In some embodiments, the server may obtain the voiceprint features when performing the user attribute recognition and the voiceprint recognition, then match the voiceprint features with the voiceprint features in the database, and obtain the user identity information of the voice wake-up instruction according to a matching result, where the voiceprint features in the database may be obtained according to audio data previously entered by the user. The user identity information may include voiceprint tagging U1, gender US1, and age UA1 of the target person, wherein gender US1 and age UA1 may be user attributes, and age UA1 may represent an age group, such as 1-10 years, 11-20 years, 20-40 years, 40-60 years, and so on.
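Voiceprint matching of the kind described here is typically done by comparing a speaker embedding extracted from the wake-word audio with the embeddings stored in the database. The sketch below uses cosine similarity over an in-memory list of records labelled with the tag/gender/age fields mentioned above; the embedding extraction, record layout, and threshold are assumptions, not the matching procedure actually used by the server.

    import numpy as np

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def match_voiceprint(query_embedding, database, threshold=0.75):
        # database: list of dicts such as {"tag": "U1", "gender": "US1",
        # "age_group": "UA1", "embedding": np.ndarray}; threshold is illustrative.
        best_record, best_score = None, threshold
        for record in database:
            score = cosine_similarity(query_embedding, record["embedding"])
            if score > best_score:
                best_record, best_score = record, score
        return best_record   # None if no stored voiceprint is similar enough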
In some embodiments, the user attribute recognition and the voiceprint recognition may also be implemented locally by the display device, the user may input an audio on the display device in advance, and the display device may compare the voice wake-up instruction with the audio input by the user in advance through the voiceprint recognition, so as to obtain the user identity information corresponding to the voice wake-up instruction.
In some embodiments, the user identity information may also include a face image of the target person. To facilitate subsequent face recognition, the face image may be stored in the server and/or the display device, and the display device may acquire the face image when acquiring the user identity information corresponding to the voice wake-up instruction.
In some embodiments, in the process of requesting the user identity information from the server, the display device may not adjust the view field of the camera at first, directly control the camera to acquire an image, and perform face recognition on the image acquired by the camera locally. At this time, if the target person is just standing in the current field of view of the camera, the target person can be shot, and if the user is outside the current field of view of the camera, the target person cannot be shot. In order to reduce the resource consumption of the display device, only the age characteristic and the gender characteristic of the human face can be detected during the human face recognition. After the user identity information of the server is received, the gender US1 and the age UA1 in the user identity information can be extracted and compared with the result of face recognition, if the face cannot be detected, or the detected age or gender of the face is not matched with the user identity information, the view field of the camera is adjusted, and the image is shot again for face detection. Of course, if the user identity information includes a face image, during face recognition, it may also be determined whether a face in an image acquired by the camera is consistent with the face image in the user identity information, and compared with only recognizing age and gender, accuracy of target person recognition may be improved, but speed of target person recognition may be reduced.
In some embodiments, if the display device cannot identify the target person from the image captured by the camera in the current field of view, after obtaining the wake-up angle, the display device may adjust the field of view of the camera according to the wake-up angle to capture the target person.
In some embodiments, the display device may adjust the field of view of the camera to cover the wake-up angle. For example, if the wake-up angle is 30 degrees and the current field of view of the display device is 40-140 degrees, the camera can be rotated to the left by more than 10 degrees, so that the target person can be located in the captured image. In the rotating process, the camera can continuously shoot and perform face recognition, if the face matched with the user identity information is recognized, the target person is positioned, and at the moment, the camera can stop rotating.
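The adjustment in this example (wake-up angle 30 degrees, current field of view 40-140 degrees, rotate left by more than 10 degrees) amounts to shifting the camera's angular window until it covers the wake-up angle. A minimal sketch of that computation, where the field-of-view width and the safety margin are assumed values:

    def rotation_to_cover(wake_angle, fov_center, fov_width=100.0, margin=5.0):
        # Signed rotation in degrees (negative = rotate left) needed so the camera's
        # field of view covers the wake-up angle, with a small margin so the target
        # does not sit exactly on the edge of the frame.
        left = fov_center - fov_width / 2
        right = fov_center + fov_width / 2
        if wake_angle < left:
            return wake_angle - left - margin    # rotate left
        if wake_angle > right:
            return wake_angle - right + margin   # rotate right
        return 0.0                               # already covered: no rotation needed

    # Example from the paragraph above: wake-up angle 30, field of view 40-140 degrees.
    print(rotation_to_cover(30, fov_center=90))  # -15.0, i.e. rotate left by 15 degrees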
In some embodiments, after the display device adjusts the field of view of the camera according to the wake-up angle, a face matching the user identity information may still not be recognized from the captured image, and it may be determined that the wake-up angle is calculated incorrectly or that false wake-up occurs. In order to solve the problem of wrong calculation of the awakening angle, the display equipment can control the camera to rotate so as to cover the maximum view field, namely the view field of 0-180 degrees, and if the target person still cannot be identified in the view field of 0-180 degrees, the voice awakening instruction is mistakenly awakened when the voice awakening instruction is confirmed. Or the display device can re-measure the angle of the target person relative to the display device according to the voice real-time instruction received after the voice awakening instruction, then adjust the view field of the camera according to the re-measured angle, and determine that the mistaken awakening occurs if the target person cannot be identified after the view field is adjusted, at this time, the display device can quit the voice interaction process.
In some embodiments, after the display device identifies the target person in the image captured by the camera, it may perform face tracking on the target person. Because the target person may move around, the camera field of view is dynamically adjusted through face tracking so that the target person always stays within it. During face tracking, a preset area can be defined in the image captured by the camera, the real-time coordinate range of the target person's face in the captured image is obtained, and the camera is controlled to rotate according to the variation trend of that coordinate range. If the real-time coordinate range lies within the preset area, the camera need not be rotated; if a boundary of the real-time coordinate range coincides with a boundary of the preset area, the camera is rotated following the variation trend of the coordinate range so that the face returns to the preset area. For example, if the real-time coordinate range is translating leftward, the camera is controlled to rotate leftward.
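The tracking rule of this paragraph might be sketched as follows; the bounding-box format, the preset area and the fixed rotation step are assumptions for illustration only.

```python
# Sketch of the face-tracking rule: rotate only when the face's real-time
# coordinate range reaches the boundary of the preset area. Rectangles are
# (x_min, y_min, x_max, y_max); the fixed rotation step is an illustrative
# assumption.
def tracking_rotation(face_box, preset_box, step_deg: float = 2.0) -> float:
    """Return a small pan command (negative = rotate left), or 0 if the face
    is still inside the preset area."""
    fx0, _, fx1, _ = face_box
    px0, _, px1, _ = preset_box
    if fx0 <= px0:          # real-time coordinate range drifting leftward
        return -step_deg    # rotate the camera leftward as well
    if fx1 >= px1:          # drifting rightward
        return step_deg
    return 0.0
```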
In some embodiments, the voice real-time instruction received by the display device after the voice wake-up instruction may include any one or more of the target person's voice, other people's voices and ambient noise. Lip movement detection is performed on the target person in the image captured by the camera to determine whether the target person is speaking. If the target person is speaking, semantic recognition can be performed on the voice real-time instruction; if not, semantic recognition can be skipped.
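A minimal sketch of this gating logic, assuming hypothetical helpers for the lip detector and the cloud recognition request:

```python
# Sketch of gating semantic recognition on lip movement of the target person.
# Both helpers are stubs standing in for the lip detector and the cloud
# recognition request; they are assumptions, not part of the disclosure.
def detect_lip_movement(frame, face_box) -> bool:
    # Placeholder: a real detector would compare mouth landmarks across
    # consecutive frames inside face_box.
    return False

def request_semantic_recognition(audio_bytes: bytes) -> dict:
    # Placeholder: a real implementation would send the audio to the server.
    return {"content": ""}

def handle_realtime_instruction(frame, target_face_box, realtime_audio: bytes):
    if detect_lip_movement(frame, target_face_box):
        # The target person appears to be speaking: recognize the instruction.
        return request_semantic_recognition(realtime_audio)
    # No lip movement: the audio is not sent for semantic recognition.
    return None
```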
In some embodiments, the display device may generate a voice recognition request containing the real-time voice and send the request to the server. After receiving the voice recognition request, the server performs semantic recognition on the real-time voice in the request and returns the semantic recognition result to the display device.
In some embodiments, the server may first perform voiceprint recognition on the real-time voice and, only if the voiceprint recognition result shows that the real-time voice includes the speech of the target person, perform semantic recognition and return the semantic recognition result to the display device. Alternatively, the server may perform voiceprint recognition and semantic recognition simultaneously and return both the voiceprint recognition result and the semantic recognition result to the display device.
In some embodiments, when performing semantic recognition, the server may perform noise reduction on the voice real-time instruction to improve the semantic recognition accuracy.
In some embodiments, if people other than the target person appear in the image captured by the camera, the noise reduction processing may be implemented by a voice separation mechanism: single-channel voices are separated from the mixed voice, voiceprint recognition is performed on each channel to determine whether it belongs to the target speaker, semantic recognition is performed on the target speaker's voice, and the other channels are discarded. Using the voiceprint features in the target person's identity information together with noise-source modeling, the voice separation mechanism enhances the target voice and suppresses all voices other than the target speaker's, which optimizes target voice recognition in this scenario.
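The separation-and-filter step might be organized roughly as below; separate_sources(), voiceprint_similarity() and the similarity threshold are hypothetical stand-ins for whatever separation and speaker-verification models the device actually uses.

```python
# Sketch of keeping only the separated channel that matches the target
# person's voiceprint. separate_sources(), voiceprint_similarity() and the
# 0.8 threshold are hypothetical stand-ins, not the actual models.
from typing import List

def separate_sources(mixed_audio: bytes) -> List[bytes]:
    # Placeholder for a source-separation model returning one channel per speaker.
    return [mixed_audio]

def voiceprint_similarity(channel: bytes, target_voiceprint: bytes) -> float:
    # Placeholder for a speaker-verification model returning a score in [0, 1].
    return 0.0

def keep_target_speech(mixed_audio: bytes, target_voiceprint: bytes,
                       threshold: float = 0.8) -> List[bytes]:
    """Return only the separated channels attributed to the target speaker;
    channels from other speakers are discarded before semantic recognition."""
    channels = separate_sources(mixed_audio)
    return [c for c in channels
            if voiceprint_similarity(c, target_voiceprint) >= threshold]
```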
In some embodiments, after receiving the recognition result from the server, the display device may respond according to the recognition result. For example, the recognition result R may indicate that the voiceprint U of the real-time voice is U1, the voiceprint of the target person, and that the speech content is a request to turn up the volume; the display device can then turn up its own volume according to the recognition result.
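As a small, hedged illustration of acting on such a result, the sketch below maps a recognition result to a volume adjustment; the result fields and the substring match on the speech content are assumptions, not the actual protocol of the disclosure.

```python
# Sketch of acting on a recognition result. The result fields and the
# substring match on the speech content are illustrative assumptions.
def respond(result: dict, current_volume: int) -> int:
    content = result.get("content", "")
    if result.get("is_target_voice") and "volume up" in content:
        return min(current_volume + 10, 100)    # turn the volume up
    return current_volume                       # otherwise leave it unchanged

print(respond({"is_target_voice": True, "content": "turn the volume up"}, 30))  # 40
```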
In some embodiments, the specific flows of the voice interaction method of the present application are shown in fig. 9 to fig. 12; the technical solution of the present application is described below with reference to these flows.
Referring to fig. 9, an overall flow diagram of a voice interaction method according to some embodiments is shown. After the user inputs the voice awakening instruction to the display device, as shown in fig. 9, the display device can acquire the voice awakening instruction of the user, calculate the awakening angle according to the voice awakening instruction, rotate the camera according to the awakening angle, and enable the camera to face the awakening angle, so that the acquisition range of the camera covers the awakening angle.
After the display device controls the camera to face the wake-up angle, face detection and face recognition can be performed on the image captured by the camera to determine whether the target person appears in the image.
If the target person cannot be detected in the image captured by the camera, the wake-up angle is recalculated from the audio recorded in real time to achieve real-time sound source localization; the camera is then rotated according to the localization result so that it faces the recalculated wake-up angle, faces are detected in the image captured by the camera, face recognition is performed on the detected faces, and it is determined whether any of them is the target person.
If the target person is detected in the image captured by the camera, the display device controls the camera to track the target person's face and performs lip movement detection and recognition on the captured image. The captured image may contain the faces of both the target person and other people, or only the face of the target person.
If lip movement of the target person is detected in the image captured by the camera, the real-time recording, i.e. the voice real-time instruction collected by the audio input device of the display device, can be obtained. The display device may apply beamforming to the real-time recording, enhancing the voice beam of the target person and suppressing the other voice beams; the beam of the target person can be determined from the target person's position in the image.
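One simple way to pick the beam direction from the image is sketched below, assuming an undistorted camera and a linear pixel-to-angle mapping; both are illustrative simplifications that a real device would replace with a calibrated mapping.

```python
# Sketch of choosing the beam direction from the target person's position in
# the image, assuming an undistorted camera and a linear pixel-to-angle
# mapping; real devices would calibrate this mapping.
def beam_angle_from_face(face_center_x: float, image_width: float,
                         cam_center_angle: float, cam_fov_deg: float) -> float:
    """Map the horizontal pixel position of the face to an angle in the same
    reference frame as the wake-up angle, toward which the beam is steered."""
    offset = (face_center_x / image_width - 0.5) * cam_fov_deg
    return cam_center_angle + offset

# Example: face centred at pixel 960 in a 1280-pixel-wide frame, camera aimed
# at 90 degrees with a 60-degree field of view -> steer the beam toward 105 degrees.
print(beam_angle_from_face(960, 1280, 90.0, 60.0))   # 105.0
```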
If lip movement of a non-target person is detected in the image captured by the camera, the recording may include one or more of noise and non-target voice. The display device can suppress the noise and the non-target voice in the real-time recording to improve the recognition accuracy of the next segment of real-time recording; of course, the display device can also simply discard the noise or the non-target voice.
Before face detection and recognition, the display device can determine the target person in advance according to the voice wake-up instruction; the target person may be the user who issued the voice wake-up instruction. Referring to fig. 10, a flowchart of a method for processing a voice wake-up instruction according to some embodiments is shown. As shown in fig. 10, when the display device collects the wake-up audio and determines through wake-word detection that it contains the voice wake-up instruction, the display device can calculate the wake-up angle from the wake-up audio, store the wake-up audio under a preset path, and upload the stored wake-up audio to the cloud so that the server in the cloud processes it. The server processes the wake-up audio with voiceprint recognition and user attribute recognition: voiceprint recognition determines which user's voiceprint the wake-up audio corresponds to, and user attribute recognition obtains the gender and age of that user. Finally, the wake-up angle calculated locally by the display device, together with the voiceprint and user attributes recognized by the cloud server, are taken as the recognition result of the wake-up audio.
As can be seen from fig. 10, after the voice wake-up instruction is collected, the display device calculates the wake-up angle from the voice wake-up instruction and then controls the camera to rotate toward the wake-up angle, which increases the probability of detecting the target person in the captured image.
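A rough sketch of the wake-up audio flow of fig. 10 follows; the helper bodies are placeholders and the preset path is an assumption, not part of the disclosure.

```python
# Rough sketch of the wake-up audio flow of fig. 10. The helper bodies are
# placeholders and the preset path is an assumption, not part of the disclosure.
WAKE_AUDIO_PATH = "/tmp/wake_audio.pcm"      # hypothetical preset path

def contains_wake_word(chunk: bytes) -> bool:
    return True                               # placeholder: local wake-word detection

def estimate_wake_angle(chunk: bytes) -> float:
    return 30.0                               # placeholder: local sound source localization

def upload_for_identification(path: str) -> dict:
    # Placeholder: upload the stored audio so the cloud can run voiceprint
    # recognition and user attribute (gender/age) recognition.
    return {"voiceprint": "U1", "gender": "female", "age": 28}

def on_wake_audio(chunk: bytes):
    if not contains_wake_word(chunk):
        return None                           # not a wake-up instruction
    wake_angle = estimate_wake_angle(chunk)   # calculated locally
    with open(WAKE_AUDIO_PATH, "wb") as f:    # store under the preset path
        f.write(chunk)
    identity = upload_for_identification(WAKE_AUDIO_PATH)
    # The local wake-up angle plus the cloud voiceprint and user attributes
    # together form the recognition result of the wake-up audio.
    return {"wake_angle": wake_angle, **identity}
```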
In some embodiments, the image captured by the camera may contain a single face or multiple faces. To analyze the two cases separately, fig. 11 shows a flowchart of the processing method when a single face is detected during voice interaction according to some embodiments, and fig. 12 shows a flowchart of the processing method when multiple faces are detected during voice interaction.
Referring to fig. 11, if the display device detects a single face within the wake-up angle, it may collect a real-time recording and perform lip movement detection on the detected face. Processing of the real-time recording includes real-time sound source localization and cloud recognition: real-time sound source localization computes the real-time sound source angle of the recording, and cloud recognition includes speech recognition and voiceprint recognition. When requesting voiceprint recognition from the server in the cloud, the display device can upload the voiceprint of the voice wake-up instruction together with the real-time recording, so that after extracting the voiceprint of the real-time recording, the server can determine whether it belongs to the same user as the voiceprint of the voice wake-up instruction and include that determination in the voiceprint recognition result. From the speech recognition result and the voiceprint recognition result returned by the server, together with the locally obtained real-time sound source localization result, the display device obtains the real-time sound source angle, voiceprint, user attributes, speech content and other information corresponding to the real-time recording.
The display device can check from the voiceprint recognition result whether the real-time recording is the voice of the target speaker. If it is not, the camera is rotated again toward the real-time sound source angle determined by real-time sound source localization, and face detection is then performed on the image captured by the camera. If it is the voice of the target speaker, face tracking is performed on the face detected by the camera and the lip movement detection result of that face is obtained. If the result indicates that lip movement has occurred, voice interaction is performed according to the real-time recording.
Referring to fig. 12, if the display device detects multiple faces within the wake-up angle, it may collect a real-time recording and perform lip movement detection on the detected faces. Processing of the real-time recording includes real-time sound source localization and cloud recognition: real-time sound source localization computes the real-time sound source angle of the recording, and cloud recognition includes speech recognition and voiceprint recognition. When requesting voiceprint recognition from the server in the cloud, the display device can upload the voiceprint of the voice wake-up instruction together with the real-time recording, so that after extracting the voiceprints in the real-time recording, the server can determine whether the recording contains a voiceprint belonging to the same user as the voiceprint of the voice wake-up instruction, i.e. whether it contains the target person's voiceprint, and include that determination in the voiceprint recognition result. From the speech recognition result and the voiceprint recognition result returned by the server, together with the locally obtained real-time sound source localization result, the display device obtains the real-time sound source angle, voiceprint, user attributes, speech content and other information corresponding to the real-time recording.
The display device can check from the voiceprint recognition result whether the real-time recording contains the voice of the target speaker. If it does not, the camera is rotated again toward the real-time sound source angle determined by real-time sound source localization, and face detection is then performed on the image captured by the camera. If the real-time recording contains the voice of the target speaker, the target person is located among the faces detected by the camera, face tracking is performed on the target person, and the lip movement detection result of the target person's face is obtained. If the result indicates that lip movement has occurred, voice interaction is performed according to the real-time recording.
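Condensing figs. 11 and 12, the decision flow might be sketched as below; every helper is a hypothetical placeholder standing in for the corresponding detector or cloud result.

```python
# Condensed sketch of the decision flow of figs. 11 and 12. Every helper is a
# hypothetical placeholder for the corresponding detector or cloud result.
def voiceprint_matches_target(voiceprint_result: dict) -> bool:
    return voiceprint_result.get("is_target", False)    # placeholder

def locate_target_face(faces: list, identity: dict):
    return faces[0] if faces else None                   # placeholder

def lips_moving(face) -> bool:
    return False                                         # placeholder

def decide(faces: list, identity: dict, voiceprint_result: dict,
           realtime_sound_angle: float):
    """Return ("respond", face) to act on the recording, ("rotate", angle) to
    re-aim the camera at the real-time sound source, or ("ignore", None)."""
    if not voiceprint_matches_target(voiceprint_result):
        return ("rotate", realtime_sound_angle)
    target_face = locate_target_face(faces, identity)
    if target_face is not None and lips_moving(target_face):
        return ("respond", target_face)
    return ("ignore", None)
```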
In the above embodiments, the camera of the display device, or a camera connected to it, can obtain a larger field of view by rotating, enabling face detection, tracking and lip movement detection over a larger range; this raises the probability of successfully locating the target person during voice interaction and thus improves the wake-up rate and speech recognition accuracy. In some embodiments, however, the camera has no pan-tilt and cannot rotate. Face detection, tracking and lip movement detection can still be performed on images within the camera's fixed field of view: face tracking then consists of obtaining the real-time coordinate range of the target person's face, which may be the coordinate range of a rectangular region containing the face, and lip movement detection is performed only on the image within that range. Face tracking thereby narrows the region examined for lip movement and improves the response speed of the voice interaction.
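For the fixed-camera case, restricting lip movement detection to the tracked face region could look roughly like the sketch below; frames are assumed to be numpy arrays, and the pixel-difference criterion is only an illustrative stand-in for a real lip detector.

```python
# Sketch of restricting lip movement detection to the tracked face region when
# the camera cannot rotate. Frames are assumed to be numpy arrays; the raw
# pixel-difference criterion is only a stand-in for a real lip detector.
import numpy as np

def crop_to_face(frame: np.ndarray, face_box) -> np.ndarray:
    """face_box is (x_min, y_min, x_max, y_max) in pixel coordinates."""
    x0, y0, x1, y1 = face_box
    return frame[y0:y1, x0:x1]

def lip_movement_in_region(prev_frame: np.ndarray, cur_frame: np.ndarray,
                           face_box, threshold: float = 10.0) -> bool:
    prev_crop = crop_to_face(prev_frame, face_box).astype("int32")
    cur_crop = crop_to_face(cur_frame, face_box).astype("int32")
    # A real detector would track mouth landmarks; the mean absolute pixel
    # difference and the threshold here are illustrative assumptions.
    return float(np.abs(cur_crop - prev_crop).mean()) > threshold
```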
By integrating sound source localization, face tracking, voiceprint recognition, lip movement detection, noise reduction and related technologies, the above embodiments can effectively improve the voice interaction experience in complex voice interaction scenarios.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.
The foregoing description, for purposes of explanation, has been presented in conjunction with specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and the practical application, to thereby enable others skilled in the art to best utilize the embodiments and various embodiments with various modifications as are suited to the particular use contemplated.

Claims (10)

1. A display device, comprising:
a display for presenting a user interface;
the camera is used for collecting images;
a controller connected with the display, the controller configured to:
collecting a voice awakening instruction;
responding to the voice awakening instruction, acquiring user identity information of a target person, and acquiring a voice real-time instruction, wherein the target person comprises a person who sends the awakening instruction or a registered user;
detecting face information in an image acquired by the camera;
if the face information of the target person is detected, carrying out face tracking and lip movement detection on the target person, and if the face of the target person has lip movement and the voice real-time instruction comprises the voice of the target person, responding to the voice real-time instruction;
and if the human face of the target person does not have lip movement or the voice real-time instruction does not comprise the voice of the target person, not responding to the voice real-time instruction.
2. The display device according to claim 1, wherein detecting face information in the image captured by the camera comprises:
carrying out sound source positioning on the voice awakening instruction to obtain an awakening sound source position;
and rotating the camera towards the awakening sound source position, detecting face information in an image acquired by the camera in the rotating process, and controlling the camera to stop rotating if the face information of the target person is detected.
3. The display device according to claim 1, wherein performing face tracking and lip movement detection on the target person comprises:
acquiring a real-time coordinate range of the face of a target person in an image shot by the camera;
and carrying out lip movement detection on the image in the real-time coordinate range.
4. The display device according to claim 1, wherein performing face tracking and lip movement detection on the target person comprises:
acquiring a real-time coordinate range of the face of a target person in an image shot by the camera;
controlling the camera to rotate according to the variation trend of the real-time coordinate range, so that the face of the target person is positioned in a preset area in the image acquired by the camera;
and carrying out lip movement detection on the image of the face of the target person.
5. The display device of claim 1, wherein the controller is further configured to:
if the face of the target person cannot be detected, carrying out sound source positioning on the voice real-time instruction to obtain a real-time sound source position;
and rotating a camera towards the real-time sound source position, and detecting the face of the target person corresponding to the user identity information in the image acquired by the camera.
6. The display device according to claim 1, wherein obtaining the user identity information corresponding to the voice wake-up instruction comprises:
and carrying out voiceprint recognition on the voice awakening instruction to obtain user identity information, wherein the user identity information comprises voiceprint information of the target person.
7. The display device of claim 1, wherein responding to the voice real-time instruction comprises:
if the image shot by the camera only comprises a single human face of the target person, performing directional voice enhancement on the voice corresponding to the awakening sound source position;
and responding to the voice real-time instruction after the directional voice enhancement.
8. The display device of claim 1, wherein responding to the voice real-time instruction comprises:
if the voice real-time instruction comprises multi-path voice, separating the single-path voice of the target person from the voice real-time instruction;
and responding according to the single-channel voice of the target person.
9. The display device of claim 1, wherein the controller is further configured to:
after the voice real-time instruction is collected, the voice real-time instruction is sent to a server, so that the server performs voiceprint recognition, voice recognition and semantic recognition on the voice real-time instruction to obtain a recognition result;
and receiving the recognition result of the server to the voice real-time instruction.
10. A method of voice interaction, comprising:
collecting a voice awakening instruction;
responding to the voice awakening instruction, acquiring user identity information of a target person, and acquiring a voice real-time instruction, wherein the target person comprises a person who sends the awakening instruction or a registered user;
detecting face information in an image acquired by a camera;
if the face information of the target person is detected, carrying out face tracking and lip movement detection on the target person, and if the face of the target person has lip movement and the voice real-time instruction comprises the voice of the target person, responding to the voice real-time instruction;
and if the human face of the target person does not have lip movement or the voice real-time instruction does not comprise the voice of the target person, not responding to the voice real-time instruction.
CN202110577525.1A 2021-05-26 2021-05-26 Display device and voice interaction method Pending CN114299940A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110577525.1A CN114299940A (en) 2021-05-26 2021-05-26 Display device and voice interaction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110577525.1A CN114299940A (en) 2021-05-26 2021-05-26 Display device and voice interaction method

Publications (1)

Publication Number Publication Date
CN114299940A true CN114299940A (en) 2022-04-08

Family

ID=80964422

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110577525.1A Pending CN114299940A (en) 2021-05-26 2021-05-26 Display device and voice interaction method

Country Status (1)

Country Link
CN (1) CN114299940A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115022716A (en) * 2022-05-09 2022-09-06 海信视像科技股份有限公司 Display apparatus, sound field adjusting method, storage medium, and program product
CN115333890A (en) * 2022-10-09 2022-11-11 珠海进田电子科技有限公司 Household electrical appliance control type intelligent line controller based on artificial intelligence
CN115333890B (en) * 2022-10-09 2023-08-04 珠海进田电子科技有限公司 Household appliance control type intelligent line controller based on artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination