CN111951787A - Voice output method, device, storage medium and electronic equipment - Google Patents

Voice output method, device, storage medium and electronic equipment

Info

Publication number
CN111951787A
Authority
CN
China
Prior art keywords
target
voice
event
scene
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010761629.3A
Other languages
Chinese (zh)
Inventor
胡可鑫
魏晨
雷宗
秦斌
王刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee
Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaomi Pinecone Electronic Co Ltd filed Critical Beijing Xiaomi Pinecone Electronic Co Ltd
Priority to CN202010761629.3A priority Critical patent/CN111951787A/en
Publication of CN111951787A publication Critical patent/CN111951787A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01: Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017: Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training

Abstract

The present disclosure relates to a voice output method, apparatus, storage medium, and electronic device, the method comprising: monitoring the action behavior of a target object through an image acquisition unit to acquire image information containing the action behavior; determining a target scene of the action behavior and a target event corresponding to the action behavior according to a multi-modal recognition model trained in advance and the image information; after the target scene and the target event are determined, controlling the voice assistant system to be started; determining a target voice from a voice library of the voice assistant system according to the target scene and the target event; and outputting the target voice, wherein the voice text corresponding to the target voice is a feedback content text for the action behavior. The voice assistant can thus be awakened in response to the action behavior of the target object, the action behavior of the user can be recognized, and a feedback voice for that action behavior can be output to actively interact with the user, improving the intelligence of the intelligent voice assistant.

Description

Voice output method, device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to a method and an apparatus for outputting speech, a storage medium, and an electronic device.
Background
After Apple's intelligent voice assistant Siri pioneered this category of product, voice assistant systems from various technology companies have sprung up in quick succession. The voice assistant system in a mobile terminal or a smart home appliance can receive a user's voice instruction and, according to the interaction logic preset in the system, hold a voice conversation with the user or assist the user in controlling the mobile terminal or the smart home appliance. In the related art, the user usually needs to speak a fixed wake-up word set by the manufacturer to start the voice assistant system, and can only interact with the voice assistant by outputting voice after the system has been woken up.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a voice output method, apparatus, storage medium, and electronic device.
According to a first aspect of embodiments of the present disclosure, there is provided a voice output method applied to an electronic device, wherein a voice assistant system is arranged in the electronic device, and the method includes the following steps:
monitoring the action behavior of a target object through an image acquisition unit to acquire image information containing the action behavior;
determining a target scene of the action behavior and a target event corresponding to the action behavior according to a multi-modal recognition model trained in advance and the image information;
after the target scene and the target event are determined, controlling the voice assistant system to be started;
determining a target voice from a voice library of the voice assistant system according to the target scene and the target event;
and outputting the target voice, wherein the voice text corresponding to the target voice is a feedback content text for the action behavior.
Optionally, the image information includes a video of a preset duration, and the multi-modal recognition model includes a scene recognition model and an event recognition model, wherein the determining of the target scene of the action behavior and the target event corresponding to the action behavior according to the multi-modal recognition model trained in advance and the image information includes the following steps:
acquiring a first image and a plurality of second images from the image information, wherein the first image is a background image of the action behavior, and the second images are human figures used for representing the action behavior;
taking the first image as an input of the scene recognition model to obtain a target scene label which is output by the scene recognition model and used for representing the target scene;
and taking the plurality of second images as the input of the event recognition model to obtain a target event label which is output by the event recognition model and used for representing the target event.
Optionally, the acquiring a first image and a plurality of second images from the image information includes:
acquiring a video image of each frame in the image information;
dividing each video image into a non-portrait part and a portrait part through a preset image recognition algorithm;
splicing a plurality of non-portrait parts in the image information through a preset image splicing algorithm to obtain the first image;
and taking a plurality of portrait parts in the image information as the plurality of second images.
Optionally, before the action behavior of the target object is monitored by the information acquisition device to obtain the image information including the action behavior, the method further includes:
the multi-modal recognition method includes the steps that a preset classification model is trained through first training data and second training data respectively to obtain the multi-modal recognition model, the first training data comprise a plurality of background images and scene labels corresponding to the background images, and the second training data comprise a plurality of groups of portraits used for representing different action behaviors and event labels corresponding to the portraits.
Optionally, the voice library corresponds to a tag association table used for representing association relations among a scene tag, an event tag, and a voice tag, and determining a target voice from the voice library of the voice assistant system according to the target scene and the target event includes:
after the voice assistant system is started, determining a target voice tag from the tag association table according to the target scene tag and the target event tag;
and acquiring the voice corresponding to the target voice label as the target voice.
According to a second aspect of the embodiments of the present disclosure, there is provided a voice output apparatus applied to an electronic device, in which a voice assistant system is provided, the apparatus including:
the behavior monitoring module is configured to monitor the action behavior of the target object through the image acquisition unit so as to acquire image information containing the action behavior;
the behavior recognition module is configured to determine a target scene where the action behavior occurs and a target event corresponding to the action behavior according to a multi-modal recognition model trained in advance and the image information;
the system starting module is configured to control the voice assistant system to start after the target scene and the target event are determined;
a voice determination module configured to determine a target voice from a voice library of the voice assistant system according to the target scenario and the target event;
and the voice output module is configured to output the target voice, and a voice text corresponding to the target voice is a feedback content text aiming at the action behavior.
Optionally, the image information includes a video with a preset duration, and the multi-modal recognition model includes: a scene recognition model and an event recognition model, the behavior recognition module configured to:
acquiring a first image and a plurality of second images from the image information, wherein the first image is a background image of the action behavior, and the second images are human figures used for representing the action behavior;
taking the first image as an input of the scene recognition model to obtain a target scene label which is output by the scene recognition model and used for representing the target scene;
and taking the plurality of second images as the input of the event recognition model to obtain a target event label which is output by the event recognition model and used for representing the target event.
Optionally, the behavior recognition module is configured to:
acquiring a video image of each frame in the image information;
dividing each video image into a non-portrait part and a portrait part through a preset image recognition algorithm;
splicing a plurality of non-portrait parts in the image information through a preset image splicing algorithm to obtain the first image;
and taking a plurality of portrait parts in the image information as the plurality of second images.
Optionally, the apparatus further comprises:
the model training module is configured to train a preset classification model through first training data and second training data respectively to obtain the multi-modal recognition model, the first training data include a plurality of background images and scene labels corresponding to the background images, and the second training data include a plurality of groups of human figures used for representing different action behaviors and event labels corresponding to the human figure images.
Optionally, the voice library corresponds to a tag association table for representing an association relationship among a scene tag, an event tag, and a voice tag, and the voice determination module is configured to:
controlling the voice assistant system to be started under the condition that the target scene and the target event are determined;
after the voice assistant system is started, determining a target voice tag from the tag association table according to the target scene tag and the target event tag;
and acquiring the voice corresponding to the target voice label as the target voice.
According to a third aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the speech output method provided by the first aspect of the present disclosure.
According to a fourth aspect of the embodiments of the present disclosure, there is provided an electronic device, in which a voice assistant system is disposed; the electronic device includes: the second aspect of the present disclosure provides a voice output device.
According to the technical scheme provided by the embodiment of the disclosure, the action behavior of the target object can be monitored through the image acquisition unit so as to acquire the image information containing the action behavior; a target scene of the action behavior and a target event corresponding to the action behavior are determined according to a multi-modal recognition model trained in advance and the image information; after the target scene and the target event are determined, the voice assistant system is controlled to be started; a target voice is determined from a voice library of the voice assistant system according to the target scene and the target event; and the target voice is output, wherein the voice text corresponding to the target voice is a feedback content text for the action behavior. The voice assistant can thus be awakened in response to the action behavior of the target object, the action behavior of the user can be recognized, and a feedback voice for that action behavior can be output to actively interact with the user, improving the intelligence of the intelligent voice assistant.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a flow diagram illustrating a method of speech output according to an exemplary embodiment;
FIG. 2 is a flow chart of a method of determining scenes and events according to the method shown in FIG. 1;
FIG. 3 is a flow chart of another speech output method according to that shown in FIG. 1;
FIG. 4 is a flow chart of a method of determining interactive speech according to the method shown in FIG. 1;
FIG. 5 is a block diagram illustrating a speech output device according to an exemplary embodiment;
FIG. 6 is a block diagram of another speech output device according to that shown in FIG. 5;
FIG. 7 is a block diagram illustrating an apparatus for speech output according to an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Before introducing the voice output method provided by the present disclosure, a target application scenario related to each embodiment in the present disclosure is first introduced, where the target application scenario includes an electronic device, the electronic device is an electronic device provided with a camera or connected to the camera, and the electronic device may be, for example, a Personal computer, a notebook computer, a smart phone, a tablet computer, a smart television, a smart watch, a PDA (Personal Digital Assistant), and other electronic devices. The electronic equipment is internally provided with a voice assistant system based on a full-knowledge function, and the voice assistant system comprises a multi-modal perception layer and a multi-modal cognition layer.
Illustratively, the multi-modal perception layer is a knowledge acquisition module for acquiring knowledge from four dimensions: the user portrait, the user's life data, objective events, and general life knowledge. The user-related knowledge mainly includes the user portrait and the user life data (the subjective dimensions), where the user portrait includes the user's identity information, the user's interest tags, and the like. The user life data is determined based on logs of the user's past use of the electronic device, and may include a history of the user's use of electronic device functions, e.g., alarm settings, calendar entries, express delivery queries, schedules, and travel plans. The objective events and the general life knowledge are the objective dimensions, where the objective events may include important messages, weather forecasts, holidays, and the like. The general life knowledge may include, for example, the fact that ticket-grabbing rushes usually start about two months before a long holiday, seasonal differences in health care, the changing of the solar terms, news of major events in different geographical locations, and the like. The multi-modal cognitive layer is used for analyzing the knowledge of these different dimensions, converting it into potential requirements of the user and further into voice topics, and taking these voice topics as candidates for the voice subsequently output to the user.
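As a rough illustration of how this layered design might be organized in code, the sketch below groups the four knowledge dimensions of the perception layer into a single structure and lets a cognitive-layer function turn them into candidate voice topics. All class, field, and topic names are assumptions made for illustration; the patent does not prescribe any particular data structures.

```python
from dataclasses import dataclass, field

@dataclass
class PerceptionKnowledge:
    """Knowledge gathered by the multi-modal perception layer (field names are illustrative)."""
    user_portrait: dict = field(default_factory=dict)      # identity information, interest tags
    user_life_data: dict = field(default_factory=dict)     # alarms, calendar, schedules, trips
    objective_events: list = field(default_factory=list)   # important messages, weather, holidays
    life_common_sense: list = field(default_factory=list)  # seasonal tips, solar terms, local news

def to_voice_topics(knowledge: PerceptionKnowledge) -> list:
    """Cognitive layer: convert multi-dimensional knowledge into candidate voice topics."""
    topics = []
    if knowledge.objective_events:
        topics.append("event_reminder")
    if knowledge.user_portrait.get("interest_tags"):
        topics.append("interest_content")
    if knowledge.life_common_sense:
        topics.append("knowledge_related")
    return topics
```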
Fig. 1 is a flowchart illustrating a voice output method according to an exemplary embodiment, and the method is applied to the electronic device described in the application scenario, as shown in fig. 1, and includes the following steps:
in step 101, the motion behavior of the target object is monitored by the image capturing unit to obtain image information including the motion behavior.
For example, the image capturing unit may be regarded as a device in the multi-modal perception layer, configured to capture video information of the user as the knowledge corresponding to the full-knowledge function. The image acquisition unit may be, for example, a camera on the smart television in the user's living room. With the user's permission, the camera can stay on for long periods and monitor all action behaviors of the user in front of the smart television. When it is determined that the user appears in the camera's shooting picture, the step of acquiring the image information starts. The step of acquiring the image information may include: from the moment the user enters the shooting picture, capturing a segment of video every preset duration and storing it as the image information, and stopping the step of acquiring the image information once the user leaves the shooting picture.
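A minimal sketch of this monitoring-and-segmenting loop is given below. The ten-second segment length, the `person_present` helper, and the use of OpenCV for camera access are assumptions made for illustration only; the patent merely requires that recording starts when the user enters the shooting picture and is cut into segments of a preset duration.

```python
import time
import cv2  # OpenCV, used here only to read frames from the camera

SEGMENT_SECONDS = 10  # assumed value for the "preset duration" of one segment

def person_present(frame) -> bool:
    """Placeholder for any person detector (HOG, CNN, etc.); not specified by the patent."""
    raise NotImplementedError

def monitor(camera_index: int = 0):
    """Yield one list of frames (one piece of "image information") per captured segment."""
    cap = cv2.VideoCapture(camera_index)
    segment, segment_start = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if person_present(frame):
            if segment_start is None:              # user just entered the shooting picture
                segment, segment_start = [], time.time()
            segment.append(frame)
            if time.time() - segment_start >= SEGMENT_SECONDS:
                yield segment                      # one segment of image information
                segment, segment_start = [], time.time()
        elif segment_start is not None:            # user left the shooting picture
            if segment:
                yield segment
            segment, segment_start = [], None
```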
For example, the action behavior may be a non-instruction action that is unrelated to controlling the electronic device and is made unconsciously by the user, such as the user falling while walking, a guardian patting a baby on the back because the baby is choking on milk, or a driver closing their eyes and nodding off while driving.
In step 102, a target scene where the action behavior occurs and a target event corresponding to the action behavior are determined according to the multi-modal recognition model trained in advance and the image information.
For example, the multi-modal recognition model is part of the multi-modal cognitive layer. In step 101, at least one segment of video is obtained as the image information through the multi-modal perception layer, and in step 102, the scene (i.e., the target scene) and the event (i.e., the target event) corresponding to each piece of image information need to be recognized and analyzed. Based on the action behaviors described above, the scene of an action behavior may include a family scene, a driving scene, and the like, and the event corresponding to an action behavior may include a fall injury, a baby choking on milk, fatigue driving, and so on. The image information is classified through the multi-modal recognition model, so that the target scene and the target event corresponding to the action behavior can be recognized. It should be noted that the target scene and the target event are each one of a plurality of preset specific scenes and specific events; not every piece of image information has a corresponding target scene and target event, and among the pieces of image information acquired in step 101 there may be image information that cannot be recognized by the multi-modal recognition model, or that is directly recognized by it as an invalid scene or an invalid event. For example, a segment of image information that only shows the user walking through the living room does not relate to any preset specific scene or specific event.
In step 103, the voice assistant system is controlled to be turned on in case of determining the target scene and the target event.
In step 104, a target voice is determined from the voice library of the voice assistant system according to the target scenario and the target event.
Illustratively, it should be noted that, in the embodiment of the present disclosure, the detection of the target scenario corresponding to the action behavior and the occurrence of the target event are taken as conditions for waking up the voice assistant system. If no action related to a specific scene or a specific event is detected in step 102, ignoring the piece of image information; if the action behavior related to the specific scene or the specific event is detected in step 102, and the target scene and the target event corresponding to the action behavior are determined, the voice assistant system is awakened, a proper voice is determined from a voice library of the voice assistant system according to the target scene and the target event as a target voice, and the target voice is played.
Specifically, taking the electronic device as the smart television in the user's home as an example, the smart television detects the user in its camera's shooting picture at 10 a.m., and generates image information A from 10:00 to 10:10, image information B from 10:10 to 10:20, and image information C from 10:20 to 10:30. Image information A and image information B only contain the action behavior of the user sitting on the sofa opposite the television, while image information C contains the action behavior of the user walking past the tea table and tripping over sundries on the floor. The multi-modal recognition model determines that image information A and image information B relate to a specific scene (i.e., the living room) but not to any specific event, whereas image information C relates to a specific scene, the living room scene, and to a specific event, a fall injury event. Therefore, image information A and image information B can be ignored and the voice assistant is not woken up; for image information C, the target scene is determined to be a home scene and the target event a fall injury event, and the voice assistant system is woken up at 10:30. At this moment, the information of the home scene and the fall injury event is sent to the full-knowledge module of the multi-modal cognitive layer in the voice assistant system, the target voice corresponding to both the living room scene and the fall injury event is screened out from the voice library contained in the full-knowledge module, and the target voice is played through the smart television's loudspeaker to actively communicate with the user, even though the user has issued no wake-up word or instruction voice.
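The wake-up logic of steps 102 to 105 can be sketched as follows. The tag values, the `assistant` object, and the model interfaces are hypothetical stand-ins used only to show the control flow: the assistant is woken only when both a specific scene and a specific event are recognized.

```python
VALID_SCENE_TAGS = {"home", "driving"}                                 # hypothetical preset scenes
VALID_EVENT_TAGS = {"fall_injury", "milk_choking", "fatigue_driving"}  # hypothetical preset events

def handle_segment(segment, scene_model, event_model, assistant):
    scene_tag = scene_model.predict(segment)      # step 102: target scene label
    event_tag = event_model.predict(segment)      # step 102: target event label
    if scene_tag not in VALID_SCENE_TAGS or event_tag not in VALID_EVENT_TAGS:
        return                                    # invalid or empty label: ignore, do not wake up
    assistant.wake()                              # step 103: start the voice assistant system
    voice = assistant.voice_library.lookup(scene_tag, event_tag)  # step 104: pick the target voice
    assistant.play(voice)                         # step 105: output the target voice
```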
In step 105, the target speech is output.
And the voice text corresponding to the target voice is a feedback content text aiming at the action behavior.
For example, if the action behavior is a person falling, the target scene is a home scene and the target event is a fall injury event, and the feedback content text corresponding to the target voice may be, for example, "whether to dial 120" or "whether to dial a pre-stored family member's phone number". If the action behavior is patting a baby's back, the target scene is a home scene and the target event is a milk choking event, and the feedback content text corresponding to the target voice may be, for example, "whether the baby is choking on milk" or "whether to query the handling procedure for a baby choking on milk". If the action behavior is frequently blinking and rubbing the eyes, the target scene is a driving scene and the target event is a fatigue driving event, and the feedback content text corresponding to the target voice may be "you are in a fatigued driving state" or "whether to query the nearest highway rest area".
In summary, the technical solution provided by the embodiments of the present disclosure can monitor the action behavior of the target object through the image acquisition unit to obtain the image information including the action behavior; determine a target scene of the action behavior and a target event corresponding to the action behavior according to a multi-modal recognition model trained in advance and the image information; control the voice assistant system to be started after the target scene and the target event are determined; determine a target voice from a voice library of the voice assistant system according to the target scene and the target event; and output the target voice, wherein the voice text corresponding to the target voice is a feedback content text for the action behavior. The voice assistant can thus be awakened in response to the action behavior of the target object, the action behavior of the user can be recognized, and a feedback voice for that action behavior can be output to actively interact with the user, improving the intelligence of the intelligent voice assistant.
Fig. 2 is a flow chart of a method of determining scenes and events according to fig. 1, and as shown in fig. 2, this step 102 may include:
in step 1021, a first image and a plurality of second images are obtained from the video information.
The first image is a background image of the action behavior, and the second image is a portrait for representing the action behavior.
Illustratively, the scene corresponding to the action behavior is determined from the background image in the image information, while the event corresponding to the action behavior is determined from the action itself made by the person, and the person's action is represented by the images containing the portrait. Therefore, before the multi-modal recognition model detects the scene and the event, it is necessary to extract from the image information both the images containing a person (the portrait parts) and the images containing no person (the non-portrait parts). The background image without any portrait reflects the characteristics of the space in front of the camera, from which the scene corresponding to the background image is determined. For judging the event, a plurality of images containing the portrait are required to reflect the whole process of the action, so a first image and a plurality of second images need to be obtained from the image information. Specifically, this step 1021 may include: acquiring the video image of each frame in the image information; dividing each video image into a non-portrait part and a portrait part through a preset image recognition algorithm; splicing the plurality of non-portrait parts in the image information through a preset image splicing algorithm to obtain the first image; and taking the plurality of portrait parts in the image information as the plurality of second images. It should be noted that, based on the above manner of acquiring the image information (recording starts when the user enters the shooting picture), the plurality of portrait parts in the image information are in fact taken from every frame of the video corresponding to the image information; in actual operation, an action behavior with a relatively large movement amplitude can be determined without examining every single frame, so the plurality of second images here may be only a subset of all the video frame images in the video.
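A minimal sketch of this splitting and splicing step follows. The `segment_person_mask` callback stands in for the "preset image recognition algorithm", and OpenCV's panorama stitcher stands in for the "preset image splicing algorithm"; both substitutions are assumptions made for illustration only.

```python
import cv2
import numpy as np

def split_frames(frames, segment_person_mask):
    """Divide each video frame into a portrait part and a non-portrait part."""
    portraits, backgrounds = [], []
    for frame in frames:
        mask = segment_person_mask(frame)                      # boolean HxW mask, True on the person
        portraits.append(np.where(mask[..., None], frame, 0))      # portrait part (a second image)
        backgrounds.append(np.where(mask[..., None], 0, frame))    # non-portrait part
    return backgrounds, portraits

def build_first_image(backgrounds):
    """Splice the non-portrait parts into one background image (the first image)."""
    stitcher = cv2.Stitcher_create()
    status, first_image = stitcher.stitch(backgrounds)
    return first_image if status == 0 else backgrounds[0]     # 0 == Stitcher OK; fall back otherwise
```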
In step 1022, the first image is used as an input of the scene recognition model to obtain an object scene tag output by the scene recognition model for characterizing the object scene.
In step 1023, the plurality of second images are used as the input of the event recognition model to obtain the target event tag for representing the target event output by the event recognition model.
Fig. 3 is a flow chart of another speech output method according to fig. 1, as shown in fig. 3, before the step 101, the method may further include:
in step 106, a preset classification model is trained through the first training data and the second training data, respectively, to obtain the multi-modal recognition model.
The first training data comprise a plurality of background images and scene labels corresponding to the background images, and the second training data comprise a plurality of groups of portraits used for representing different action behaviors and event labels corresponding to the portraits.
For example, the same or different classification models (the preset classification model, which may be, for example, a decision tree model, a support vector machine model, or a neural network model) may be trained with different training data (the first training data and the second training data) to obtain the scene recognition model and the event recognition model respectively. It should be noted that a scene label or an event label in the first training data and the second training data may be an empty label that has no specific meaning or cannot be used as a basis for determining the output voice. In steps 1022 and 1023, if either the scene recognition model or the event recognition model outputs an empty tag, the wake-up of the voice assistant system is not triggered. This means that the voice assistant system does not respond to the user's actions in such a scenario.
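One possible way to train the two sub-models is sketched here with the support vector machine option that the paragraph above mentions; the flattening-based features, the equal-length portrait groups, and the scikit-learn API are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.svm import SVC

def train_multimodal_model(first_training_data, second_training_data):
    """Train the scene recognition model and the event recognition model separately."""
    bg_images, scene_labels = first_training_data          # background images + scene labels
    portrait_groups, event_labels = second_training_data   # groups of portraits + event labels

    # Scene recognition model: one flattened background image per sample.
    scene_model = SVC()
    scene_model.fit([img.ravel() for img in bg_images], scene_labels)

    # Event recognition model: each group of portraits (one action behavior) is one sample.
    # Assumes every group has the same number of equally sized frames.
    event_features = [np.concatenate([img.ravel() for img in group]) for group in portrait_groups]
    event_model = SVC()
    event_model.fit(event_features, event_labels)

    # An "empty" label can simply be one more class, so that unrecognizable input
    # does not trigger the voice assistant.
    return scene_model, event_model
```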
Fig. 4 is a flow chart of a method of determining interactive speech according to fig. 1, and as shown in fig. 4, the step 104 may include:
in step 1041, after the voice assistant system is turned on, a target voice tag is determined from the tag association table according to the target scene tag and the target time tag.
Illustratively, the speech library includes topics of multiple dimensions, such as an Internet of Things (IoT) control topic, a knowledge-related topic, an event reminder topic, an emotional concern topic, and an interest content topic. The voice library used in the embodiment of the present disclosure is included in, for example, knowledge-related topics, and the voice library corresponds to a tag association table for representing association relations among scene tags, event tags, and voice tags. After the target scene tag and the target event tag are determined through the above steps 1022 and 1023, the tag association table may be directly queried through the two tags to obtain the target voice tags corresponding to the two tags, and further, the voice with the target voice tag is taken as the target voice.
In step 1042, a voice corresponding to the target voice tag is obtained as the target voice.
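The tag association table can be pictured as a simple mapping from a (scene tag, event tag) pair to a voice tag, as in the sketch below; the concrete tag names and voice identifiers are hypothetical.

```python
# Hypothetical tag association table: (scene tag, event tag) -> voice tag.
TAG_ASSOCIATION_TABLE = {
    ("home", "fall_injury"):        "voice_offer_emergency_call",
    ("home", "milk_choking"):       "voice_milk_choking_guidance",
    ("driving", "fatigue_driving"): "voice_suggest_rest_area",
}

def determine_target_voice(voice_library, target_scene_tag, target_event_tag):
    """Steps 1041 and 1042: look up the target voice tag, then fetch the voice carrying it."""
    voice_tag = TAG_ASSOCIATION_TABLE.get((target_scene_tag, target_event_tag))
    if voice_tag is None:
        return None                        # no matching entry, nothing is played
    return voice_library[voice_tag]        # the voice with the target voice tag is the target voice
```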
In summary, the technical solution provided by the embodiments of the present disclosure can monitor the action behavior of the target object through the image acquisition unit to obtain the image information including the action behavior; determine a target scene of the action behavior and a target event corresponding to the action behavior according to a multi-modal recognition model trained in advance and the image information; control the voice assistant system to be started after the target scene and the target event are determined; determine a target voice from a voice library of the voice assistant system according to the target scene and the target event; and output the target voice, wherein the voice text corresponding to the target voice is a feedback content text for the action behavior. The voice assistant can thus be awakened in response to the action behavior of the target object, the action behavior of the user can be recognized, and a feedback voice for that action behavior can be output to actively interact with the user, improving the intelligence of the intelligent voice assistant.
Fig. 5 is a block diagram of a speech output apparatus according to an exemplary embodiment, and as shown in fig. 5, the speech output apparatus 500 is applied to the electronic device described in the application scenario, and includes:
a behavior monitoring module 510 configured to monitor an action behavior of a target object through an image acquisition unit to obtain image information including the action behavior;
a behavior recognition module 520 configured to determine a target scene where the action behavior occurs and a target event corresponding to the action behavior according to a pre-trained multi-modal recognition model and the image information;
a system start module 530 configured to control the voice assistant system to start after determining the target scene and the target event;
a voice determination module 540 configured to determine a target voice from a voice library of the voice assistant system according to the target scenario and the target event;
and a voice output module 550 configured to output the target voice, where a voice text corresponding to the target voice is a feedback content text for the action behavior.
Optionally, the image information includes a video with a preset duration, and the multi-modal recognition model includes: a scene recognition model and an event recognition model, the behavior recognition module 520 configured to:
acquiring a first image and a plurality of second images from the image information, wherein the first image is a background image of the action behavior, and the second images are human figures used for representing the action behavior;
taking the first image as an input of the scene recognition model to obtain a target scene label which is output by the scene recognition model and used for representing the target scene;
and taking the plurality of second images as the input of the event recognition model to obtain a target event label which is output by the event recognition model and used for representing the target event.
Optionally, the behavior recognizing module 520 is configured to:
acquiring a video image of each frame in the image information;
dividing each video image into a non-portrait part and a portrait part through a preset image recognition algorithm;
splicing a plurality of non-portrait parts in the image information through a preset image splicing algorithm to obtain the first image;
and taking a plurality of portrait parts in the image information as the plurality of second images.
Fig. 6 is a block diagram of another speech output apparatus according to fig. 5, and as shown in fig. 6, the apparatus 500 may further include:
the model training module 560 is configured to train a preset classification model through first training data and second training data respectively to obtain the multi-modal recognition model, where the first training data includes a plurality of background images and scene labels corresponding to each background image, and the second training data includes a plurality of groups of human figures used for representing different action behaviors and event labels corresponding to each group of human figure images.
Optionally, the voice library corresponds to a tag association table for characterizing association relations among a scene tag, an event tag, and a voice tag, and the voice determination module 540 is configured to:
controlling the voice assistant system to be started under the condition that the target scene and the target event are determined;
after the voice assistant system is started, determining a target voice tag from the tag association table according to the target scene tag and the target event tag;
and acquiring the voice corresponding to the target voice label as the target voice.
In summary, the technical solution provided by the embodiments of the present disclosure can monitor the action behavior of the target object through the image acquisition unit to obtain the image information including the action behavior; determine a target scene of the action behavior and a target event corresponding to the action behavior according to a multi-modal recognition model trained in advance and the image information; control the voice assistant system to be started after the target scene and the target event are determined; determine a target voice from a voice library of the voice assistant system according to the target scene and the target event; and output the target voice, wherein the voice text corresponding to the target voice is a feedback content text for the action behavior. The voice assistant is awakened in response to the action behavior of the target object, the action behavior of the user is recognized, and a feedback voice for that action behavior is output to actively interact with the user, improving the intelligence of the intelligent voice assistant.
Fig. 7 is a block diagram illustrating an apparatus 700 for speech output according to an example embodiment. For example, the apparatus 700 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 7, apparatus 700 may include one or more of the following components: a processing component 702, a memory 704, a power component 706, a multimedia component 707, an audio component 710, an input/output (I/O) interface 712, a sensor component 714, and a communication component 716.
The processing component 702 generally controls overall operation of the device 700, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 702 may include one or more processors 720 to execute instructions to perform all or a portion of the steps of the speech output method described above. Further, the processing component 702 may include one or more modules that facilitate interaction between the processing component 702 and other components. For example, the processing component 702 may include a multimedia module to facilitate interaction between the multimedia component 707 and the processing component 702.
The memory 704 is configured to store various types of data to support operations at the apparatus 700. Examples of such data include instructions for any application or method operating on device 700, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 704 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power component 706 provides power to the various components of the device 700. The power components 706 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the apparatus 700.
The multimedia component 707 includes a screen that provides an output interface between the device 700 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 707 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 700 is in an operation mode, such as a photographing mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 710 is configured to output and/or input audio signals. For example, audio component 710 includes a Microphone (MIC) configured to receive external audio signals when apparatus 700 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 704 or transmitted via the communication component 716. In some embodiments, audio component 710 also includes a speaker for outputting audio signals.
The I/O interface 712 provides an interface between the processing component 702 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 714 includes one or more sensors for providing status assessments of various aspects of the apparatus 700. For example, the sensor assembly 714 may detect the open/closed state of the device 700 and the relative positioning of components, such as the display and keypad of the device 700; the sensor assembly 714 may also detect a change in the position of the device 700 or of a component of the device 700, the presence or absence of user contact with the device 700, the orientation or acceleration/deceleration of the device 700, and a change in the temperature of the device 700. The sensor assembly 714 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 714 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 714 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 716 is configured to facilitate wired or wireless communication between the apparatus 700 and other devices. The apparatus 700 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 716 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 716 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 700 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described voice output methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 704 comprising instructions, executable by the processor 720 of the device 700 to perform the speech output method described above is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In another exemplary embodiment, a computer program product is also provided, which comprises a computer program executable by a programmable apparatus, the computer program having code portions for performing the above-mentioned speech output method when executed by the programmable apparatus.
The device for voice output provided by the embodiment of the disclosure can respond to the action behavior of the target object to wake up the voice assistant, recognize the action behavior of the user, and then output the feedback voice aiming at the action behavior to actively interact with the user, thereby improving the intelligent degree of the intelligent voice assistant.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (12)

1. A voice output method is applied to an electronic device, wherein a voice assistant system is arranged in the electronic device, and the method comprises the following steps:
monitoring the action behavior of a target object through an image acquisition unit to acquire image information containing the action behavior;
determining a target scene of the action behavior and a target event corresponding to the action behavior according to a multi-modal recognition model trained in advance and the image information;
after the target scene and the target event are determined, controlling the voice assistant system to be started;
determining a target voice from a voice library of the voice assistant system according to the target scene and the target event;
and outputting the target voice, wherein the voice text corresponding to the target voice is a feedback content text for the action behavior.
2. The method of claim 1, wherein the image information comprises a video of a preset duration, and the multi-modal recognition model comprises a scene recognition model and an event recognition model, wherein the determining of the target scene of the action behavior and the target event corresponding to the action behavior according to the multi-modal recognition model trained in advance and the image information comprises:
acquiring a first image and a plurality of second images from the image information, wherein the first image is a background image of the action behavior, and the second images are human figures used for representing the action behavior;
taking the first image as an input of the scene recognition model to obtain a target scene label which is output by the scene recognition model and used for representing the target scene;
and taking the plurality of second images as the input of the event recognition model to obtain a target event label which is output by the event recognition model and used for representing the target event.
3. The method of claim 2, wherein said obtaining a first image and a plurality of second images from said image information comprises:
acquiring a video image of each frame in the image information;
dividing each video image into a non-portrait part and a portrait part through a preset image recognition algorithm;
splicing a plurality of non-portrait parts in the image information through a preset image splicing algorithm to obtain the first image;
and taking a plurality of portrait parts in the image information as the plurality of second images.
4. The method according to claim 2, wherein before the action behavior of the target object is monitored by the information acquisition device to obtain the image information including the action behavior, the method further comprises:
the multi-modal recognition method includes the steps that a preset classification model is trained through first training data and second training data respectively to obtain the multi-modal recognition model, the first training data comprise a plurality of background images and scene labels corresponding to the background images, and the second training data comprise a plurality of groups of portraits used for representing different action behaviors and event labels corresponding to the portraits.
5. The method according to claim 2, wherein the voice library corresponds to a tag association table for representing an association relationship among a scene tag, an event tag and a voice tag, and the determining the target voice from the voice library of the voice assistant system according to the target scene and the target event comprises:
after the voice assistant system is started, determining a target voice tag from the tag association table according to the target scene tag and the target event tag;
and acquiring the voice corresponding to the target voice label as the target voice.
6. A voice output device is applied to an electronic device, wherein a voice assistant system is arranged in the electronic device, and the device comprises:
the behavior monitoring module is configured to monitor the action behavior of the target object through the image acquisition unit so as to acquire image information containing the action behavior;
the behavior recognition module is configured to determine a target scene where the action behavior occurs and a target event corresponding to the action behavior according to a multi-modal recognition model trained in advance and the image information;
the system starting module is configured to control the voice assistant system to start after the target scene and the target event are determined;
a voice determination module configured to determine a target voice from a voice library of the voice assistant system according to the target scenario and the target event;
and the voice output module is configured to output the target voice, and a voice text corresponding to the target voice is a feedback content text aiming at the action behavior.
7. The apparatus of claim 6, wherein the image information comprises a video with a preset duration, and the multi-modal recognition model comprises: a scene recognition model and an event recognition model, the behavior recognition module configured to:
acquiring a first image and a plurality of second images from the image information, wherein the first image is a background image of the action behavior, and the second images are human figures used for representing the action behavior;
taking the first image as an input of the scene recognition model to obtain a target scene label which is output by the scene recognition model and used for representing the target scene;
and taking the plurality of second images as the input of the event recognition model to obtain a target event label which is output by the event recognition model and used for representing the target event.
8. The apparatus of claim 6, wherein the behavior recognition module is configured to:
acquiring a video image of each frame in the image information;
dividing each video image into a non-portrait part and a portrait part through a preset image recognition algorithm;
splicing a plurality of non-portrait parts in the image information through a preset image splicing algorithm to obtain the first image;
and taking a plurality of portrait parts in the image information as the plurality of second images.
9. The apparatus of claim 6, further comprising:
the model training module is configured to train a preset classification model through first training data and second training data respectively to obtain the multi-modal recognition model, the first training data include a plurality of background images and scene labels corresponding to the background images, and the second training data include a plurality of groups of human figures used for representing different action behaviors and event labels corresponding to the human figure images.
10. The apparatus of claim 6, wherein the voice library corresponds to a tag association table for characterizing association relationships among scenario tags, event tags, and voice tags, and the voice determination module is configured to:
controlling the voice assistant system to be started under the condition that the target scene and the target event are determined;
after the voice assistant system is started, determining a target voice tag from the tag association table according to the target scene tag and the target event tag;
and acquiring the voice corresponding to the target voice label as the target voice.
11. A computer-readable storage medium, on which computer program instructions are stored, which program instructions, when executed by a processor, carry out the steps of the method according to any one of claims 1 to 5.
12. An electronic device, wherein a voice assistant system is arranged in the electronic device;
the electronic device includes: the speech output device of any one of claims 6-10.
CN202010761629.3A 2020-07-31 2020-07-31 Voice output method, device, storage medium and electronic equipment Pending CN111951787A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010761629.3A CN111951787A (en) 2020-07-31 2020-07-31 Voice output method, device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010761629.3A CN111951787A (en) 2020-07-31 2020-07-31 Voice output method, device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN111951787A true CN111951787A (en) 2020-11-17

Family

ID=73339094

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010761629.3A Pending CN111951787A (en) 2020-07-31 2020-07-31 Voice output method, device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN111951787A (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130282747A1 (en) * 2012-04-23 2013-10-24 Sri International Classification, search, and retrieval of complex video events
CN102799191A (en) * 2012-08-07 2012-11-28 北京国铁华晨通信信息技术有限公司 Method and system for controlling pan/tilt/zoom based on motion recognition technology
CN104463903A (en) * 2014-06-24 2015-03-25 中海网络科技股份有限公司 Pedestrian image real-time detection method based on target behavior analysis
CN106060470A (en) * 2016-06-24 2016-10-26 邵文超 Video monitoring method and system
CN108363706A (en) * 2017-01-25 2018-08-03 北京搜狗科技发展有限公司 The method and apparatus of human-computer dialogue interaction, the device interacted for human-computer dialogue
US20180232662A1 (en) * 2017-02-14 2018-08-16 Microsoft Technology Licensing, Llc Parsers for deriving user intents
CN108133708A (en) * 2017-12-04 2018-06-08 维沃移动通信有限公司 A kind of control method of voice assistant, device and mobile terminal
CN108600511A (en) * 2018-03-22 2018-09-28 上海摩软通讯技术有限公司 The control system and method for intelligent sound assistant's equipment
US20200094397A1 (en) * 2018-09-20 2020-03-26 Sony Corporation Autonomous robot
CN109410936A (en) * 2018-11-14 2019-03-01 广东美的制冷设备有限公司 Air-conditioning equipment sound control method and device based on scene
CN109658928A (en) * 2018-12-06 2019-04-19 山东大学 A kind of home-services robot cloud multi-modal dialog method, apparatus and system
CN109683709A (en) * 2018-12-17 2019-04-26 苏州思必驰信息科技有限公司 Man-machine interaction method and system based on Emotion identification
CN110415695A (en) * 2019-07-25 2019-11-05 华为技术有限公司 A kind of voice awakening method and electronic equipment

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112882394A (en) * 2021-01-12 2021-06-01 北京小米松果电子有限公司 Device control method, control apparatus, and readable storage medium
CN113450795A (en) * 2021-06-28 2021-09-28 深圳七号家园信息技术有限公司 Image recognition method and system with voice awakening function

Similar Documents

Publication Publication Date Title
CN107919123B (en) Multi-voice assistant control method, device and computer readable storage medium
CN109920418B (en) Method and device for adjusting awakening sensitivity
EP3217254A1 (en) Electronic device and operation method thereof
CN105933539B (en) audio playing control method and device and terminal
EP3933570A1 (en) Method and apparatus for controlling a voice assistant, and computer-readable storage medium
CN108564943B (en) Voice interaction method and system
CN106101629A (en) The method and device of output image
CN105117699A (en) User behavior monitoring method and device
CN107666536B (en) Method and device for searching terminal
CN107666540B (en) Terminal control method, device and storage medium
CN111063354B (en) Man-machine interaction method and device
CN109766473B (en) Information interaction method and device, electronic equipment and storage medium
CN106453528A (en) Method and device for pushing message
CN111984347A (en) Interaction processing method, device, equipment and storage medium
CN111951787A (en) Voice output method, device, storage medium and electronic equipment
CN107203306A (en) Head portrait processing method and processing device
CN111080231A (en) Automatic reminding method, automatic reminding device and computer readable storage medium
CN112784151B (en) Method and related device for determining recommended information
CN105786561B (en) Method and device for calling process
US10810439B2 (en) Video identification method and device
CN110636377A (en) Video processing method, device, storage medium, terminal and server
CN107896277B (en) Method and device for setting alarm clock reminding mode and computer readable storage medium
CN107454359B (en) Method and device for playing video
CN114189719A (en) Video information extraction method and device, electronic equipment and storage medium
CN113035189A (en) Document demonstration control method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination