CN117275473A - Multi-screen voice control method, device, equipment and computer readable storage medium

Multi-screen voice control method, device, equipment and computer readable storage medium

Info

Publication number
CN117275473A
CN117275473A
Authority
CN
China
Prior art keywords
real
user
screen
voice
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210679889.5A
Other languages
Chinese (zh)
Inventor
冯贝 (Feng Bei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Rockwell Technology Co Ltd
Original Assignee
Beijing Rockwell Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Rockwell Technology Co Ltd filed Critical Beijing Rockwell Technology Co Ltd
Priority to CN202210679889.5A priority Critical patent/CN117275473A/en
Publication of CN117275473A publication Critical patent/CN117275473A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G06F 3/0481 Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/14 Digital output to display device; Cooperation and interconnection of the display device with other functional units
    • G06F 3/1423 Digital output to display device; Cooperation and interconnection of the display device with other functional units controlling a plurality of local displays, e.g. CRT and flat panel display
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/24 Speech recognition using non-acoustical features
    • G10L 15/25 Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 Execution procedure of a spoken command

Abstract

The present disclosure relates to a multi-screen voice control method, apparatus, device, and computer-readable storage medium. After receiving a real-time voice and a real-time image of the space to which the voice belongs, the method uses the real-time image to detect, among a plurality of screens, the target screen at which the user who uttered the real-time voice is looking. Once the target screen is determined, a target control instruction matching the real-time voice is queried in the target control instruction set corresponding to the interactive interface currently displayed on the target screen, and the target screen is controlled to execute the queried control instruction. In this way, when a user voice-controls any screen in the "what you see is what you can say" (see-and-say) manner, the screen the user actually wants to control can be determined among the plurality of screens, reducing misoperations such as no screen responding or the responding screen differing from the one the user intends to control, and improving the user experience.

Description

Multi-screen voice control method, device, equipment and computer readable storage medium
Technical Field
The present disclosure relates to the field of speech recognition technology, and in particular, to a multi-screen speech control method, apparatus, device, and computer readable storage medium.
Background
"See-and-say", i.e. "what you see is what you can say", is a man-machine interaction mode in which voice control instructions replace touch operations, key operations, and the like: the user directly speaks text information shown on the system screen, and the effect is the same as operating that text region by hand.
However, when one voice control device is configured with a plurality of screens in one space and a user issues a voice control instruction, the device may be unable to reliably distinguish which screen the instruction is meant to control. The resulting misoperation degrades the user experience.
Disclosure of Invention
In order to solve the technical problems described above, the present disclosure provides a multi-screen voice control method, apparatus, device and computer-readable storage medium.
In a first aspect, an embodiment of the present disclosure provides a multi-screen voice control method, including:
receiving a real-time voice and a real-time image of the space to which the real-time voice belongs;
detecting, among a plurality of screens and based on the real-time image, a target screen at which the user who uttered the real-time voice is looking;
querying, in a target control instruction set corresponding to the target screen, a target control instruction matching the real-time voice, where the target control instruction set includes control instructions generated according to control data of the interactive interface being displayed by the target screen; and
if the target control instruction is found, controlling the target screen to execute the target control operation corresponding to the target control instruction.
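For orientation, the following is a minimal sketch of how these four steps might compose in code; every name below is illustrative and not taken from the disclosure:

```python
from typing import Optional

# Illustrative skeleton of the claimed method; the three helpers stand in
# for the detection, query, and execution logic detailed in the embodiments.
def detect_viewed_screen(real_time_image, screens) -> Optional[str]:
    ...  # S120: gaze-based detection of the target screen

def match_instruction(real_time_voice, instruction_set) -> Optional[str]:
    ...  # S130: speech recognition plus matching against the instruction set

def execute_on_screen(screen_id: str, instruction: str) -> None:
    ...  # S140: perform the corresponding target control operation

def multi_screen_voice_control(voice, image, screens, instruction_sets) -> None:
    target = detect_viewed_screen(image, screens)              # S120
    if target is None:
        return                                                 # no screen is being looked at
    instruction = match_instruction(voice, instruction_sets.get(target, []))
    if instruction is not None:                                # S130 found a match
        execute_on_screen(target, instruction)                 # S140
```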
In a second aspect, embodiments of the present disclosure provide a multi-screen voice control apparatus, including:
a receiving module configured to receive a real-time voice and a real-time image of the space to which the real-time voice belongs;
a detection module configured to detect, among a plurality of screens and based on the real-time image, a target screen at which the user who uttered the real-time voice is looking;
a query module configured to query, in a target control instruction set corresponding to the target screen, a target control instruction matching the real-time voice, where the target control instruction set includes control instructions generated according to control data of the interactive interface being displayed by the target screen; and
a control module configured to control the target screen to execute the target control operation corresponding to the target control instruction if the target control instruction is found.
In a third aspect, embodiments of the present disclosure provide a multi-screen voice control apparatus, including:
a memory;
a processor;
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to carry out the method of the first aspect.
In a fourth aspect, embodiments of the present disclosure provide a computer-readable storage medium having stored thereon a computer program for execution by a processor to perform the method of the first aspect.
In a fifth aspect, the embodiments of the present disclosure further provide a computer program product comprising a computer program or instructions which, when executed by a processor, implement the multi-screen voice control method described above.
The multi-screen voice control method, apparatus, device, and computer-readable storage medium provided by the embodiments of the present disclosure can, after receiving a real-time voice and a real-time image of the space to which the real-time voice belongs, use the real-time image to detect, among a plurality of screens, the target screen at which the user who uttered the real-time voice is looking. Once the target screen is determined, a target control instruction matching the real-time voice is queried in the target control instruction set corresponding to the interactive interface currently displayed on the target screen, and the target screen is controlled to execute the queried control instruction. Thus, when a user voice-controls any screen in the see-and-say manner, the screen the user actually wants to control can be determined among the plurality of screens, reducing misoperations such as no screen responding or the responding screen differing from the one the user intends to control, and improving the user experience.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments of the present disclosure or the solutions in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below; it will be apparent to those skilled in the art that other drawings can be derived from these drawings without inventive effort.
Fig. 1 is a flow chart of a multi-screen voice control method according to an embodiment of the disclosure;
FIG. 2 is a flowchart of another multi-screen voice control method according to an embodiment of the disclosure;
FIG. 3 is a flowchart of another method for multi-screen voice control according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of a multi-screen voice control device according to an embodiment of the disclosure;
Fig. 5 is a schematic structural diagram of a multi-screen voice control device according to an embodiment of the present disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, a further description of aspects of the present disclosure will be provided below. It should be noted that, without conflict, the embodiments of the present disclosure and features in the embodiments may be combined with each other.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced otherwise than as described herein; it will be apparent that the embodiments in the specification are only some, but not all, embodiments of the disclosure.
In the related art, when a plurality of screens are present in one space, the device cannot distinguish, after a user issues a voice control instruction, which screen the user wants to control.
Take the case where the space in which the voice control device is located is the cockpit of a vehicle. At present, most voice control devices in vehicle cockpits support the see-and-say man-machine interaction mode. When the voice control device is configured with one screen, for example a central control screen corresponding to the main driving position area, a voice control instruction issued by any user in the cockpit can simply be responded to by that screen, which executes the corresponding operation. When the voice control device is configured with a plurality of screens, however, for example a central control screen corresponding to the main driving position area, a co-driving screen corresponding to the co-driving position area, and a rear-row screen corresponding to the rear-row position area, the device cannot determine which screen the user needs to control. It may then fail to make any screen respond to the instruction, or make a screen respond that differs from the one the user actually wants to control. Either outcome is a misoperation of the voice control device and degrades the user experience.
In view of the foregoing, embodiments of the present disclosure provide a multi-screen voice control method, apparatus, device, and computer-readable storage medium. The multi-screen voice control method will be first described with reference to specific embodiments. Fig. 1 is a flowchart of a multi-screen voice control method according to an embodiment of the disclosure.
In the embodiment of the present disclosure, the multi-screen voice control method may be performed by a voice control apparatus, wherein the voice control apparatus may be an electronic apparatus configured with a plurality of screens and having a voice control function.
As shown in fig. 1, the multi-screen voice control method mainly includes the following steps:
s110, receiving real-time voice and real-time images in the space where the real-time voice belongs.
In the embodiment of the disclosure, when a user wants to voice-control one of a plurality of screens in a space, the user utters a real-time voice. The voice control device receives the real-time voice uttered in the space and, at the same time, acquires a real-time image of the space.
In the embodiment of the present disclosure, the space to which the real-time voice belongs may specifically be a space in which the voice control apparatus is installed. Alternatively, the space may include a cabin of a vehicle, a room of a house, etc., without limitation.
The real-time speech may be speech uttered by the user in real time; it may be an audio signal containing a user control requirement. If the control requirement conforms to the see-and-say interaction form, the voice control device can control the function or functions being displayed on the screen according to that requirement.
In the embodiment of the disclosure, the voice control device may acquire real-time voice through the audio acquisition device.
In some embodiments, an audio collection device may be installed in the space, where the audio collection device may collect all real-time voices in the space to which the voices belong in real time, and the voice control device may receive the real-time voices collected by the audio collection device.
In other embodiments, a plurality of audio collection devices may be installed in the space, each audio collection device may have a collection range, the collection ranges of all the audio collection devices may overlap to cover the range of the complete space, each audio collection device collects real-time voices in the collection range of each audio collection device, and the voice control device may receive real-time voices collected by all the audio collection devices.
The audio capturing device may include, among other things, a microphone, a recorder, etc., without limitation.
The real-time image may be an image in space at the moment when the user utters real-time speech.
In the embodiment of the disclosure, the voice control device may acquire the real-time image through the image acquisition device.
In some embodiments, an image acquisition device may be installed in the space, the image acquisition device may acquire images of all spatial ranges in the space, and the voice control device may receive real-time images acquired by the image acquisition device.
In other embodiments, a plurality of image capturing devices may be installed in the space, each image capturing device may have a capturing range, the capturing ranges of all the image capturing devices may overlap to cover the range of the complete space, each image capturing device may respectively capture real-time images in the capturing ranges thereof, and the voice control device may receive the real-time images captured by all the image capturing devices.
The image capturing device may be a camera, a video camera, or the like, which is not limited herein.
S120, detecting, among the plurality of screens and based on the real-time image, the target screen at which the user who uttered the real-time voice is looking.
In the embodiment of the present disclosure, the voice control apparatus may detect a target screen, from which a user who utters real-time voice looks, among a plurality of screens according to the real-time image in response to the received real-time voice and real-time image.
In a see-and-say scenario, the user generally looks at the screen to be controlled while uttering the real-time voice, to confirm that the screen can satisfy the control requirement, and the real-time image captures the user's state at that moment. The voice control device can therefore perform image detection on the real-time image, work out which of the plurality of screens the user is looking at, and take that screen as the target screen the user wants to control.
Taking the space as the cockpit of a vehicle as an example, when the voice control device is configured with a central control screen corresponding to the main driving position area, a co-driving screen corresponding to the co-driving position area, and a rear-row screen corresponding to the rear-row position area, the device can perform image detection on the in-cabin real-time image collected by the camera, determine which of the central control screen, the co-driving screen, and the rear-row screen the user is looking at, and take that screen as the target screen the user wants to control.
S130, querying, in a target control instruction set corresponding to the target screen, a target control instruction matching the real-time voice, where the target control instruction set includes control instructions generated according to control data of the interactive interface being displayed by the target screen.
In the embodiment of the disclosure, after determining a target screen, the voice control device determines a target control instruction set corresponding to the target screen in one or more control instruction sets, and queries a target control instruction matched with real-time voice.
In a see-and-say scenario the user performs control operations on one or more functions in the interactive interface being displayed by the target screen. After determining the target screen, the voice control device therefore first needs to determine the target control instruction set corresponding to that screen, and only then can it query that set for the target control instruction matching the real-time voice uttered by the user.
The interactive interface that the target screen is displaying may be a home screen page of the voice control device or may be a page of an application program configured in the voice control device. At least one page control is included in a home screen page of the voice control device or a page of an application program configured in the voice control device, and the page control may be a control, such as a button, an option, an icon or a link, in an interface, which can be controlled by a user, and is not limited herein.
The control instruction may be a control instruction generated by the voice control device according to static control data and/or dynamic control data in the interactive interface control data being displayed by a screen configured in the voice control device.
Taking a screen in a cockpit of a vehicle as an example, when the voice control device is configured with a central control screen corresponding to a main driving position area, a co-driving screen corresponding to a co-driving position area and a rear-row screen corresponding to a rear-row position area, the voice control device can generate a control instruction according to static control data and/or dynamic control data in an interactive interface which is being displayed by each screen.
The static control data may be control text corresponding to a static control in the interactive interface being displayed. The static control may be an interface control that is always fixedly displayed, i.e., the static control does not change with user preference or settings.
Taking the central control screen corresponding to the main driving position area in the cockpit of a vehicle as an example, when the interactive interface being displayed by the central control screen is the home screen of the voice control device, the static controls may be built-in icons such as "settings" and "file management"; when the interactive interface being displayed is an application page configured in the voice control device, the static control may be, for example, the "close" button on the settings main page that is entered by clicking the "settings" icon.
The dynamic control data may be control text corresponding to the dynamic control in the interactive interface being displayed. The dynamic control may be an interface control that can be updated dynamically, or changed with user preference or settings.
Taking the central control screen corresponding to the main driving position area in the cockpit of a vehicle as an example, when the interactive interface being displayed by the central control screen is the home screen of the voice control device, the dynamic controls may be configured icons such as "music" and "map"; when the interactive interface being displayed is an application page configured in the voice control device, the dynamic control may be, for example, the "online songs" list on the music main page that is entered by clicking the "music" application icon.
The voice control device generates the control instruction set corresponding to each of one or more screens according to the static control data and/or dynamic control data in the interactive interface that screen is displaying.
In some embodiments, after determining the target screen, the voice control device may generate a control instruction set according to static control data and/or dynamic control data included in the interactive interface being displayed by the target screen, and take the control instruction set as a target control instruction set.
In other embodiments, before determining the target screen, the voice control device may pre-generate the control instruction set according to static control data and/or dynamic control data included in the interactive interface being displayed by all the screens, and at the same time, the voice control device may correspond the pre-generated control instruction set to the identification information of each screen. Therefore, after the voice control device determines the target screen, a control instruction set corresponding to the target screen is found from a control instruction set generated in advance according to the identification information of the target screen, and the control instruction set corresponding to the target screen is taken as the target control instruction set.
The identification information of the screen may be a screen unique number, which is used to distinguish between different screens.
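A minimal sketch of this pre-generated lookup, under the assumption that the identification information is a string ID and that instructions reduce to plain control-text strings (the disclosure fixes neither):

```python
# Instruction sets keyed by screen identification information; a screen's
# set is regenerated whenever its displayed interactive interface changes.
instruction_sets: dict[str, set[str]] = {}

def rebuild_instruction_set(screen_id: str,
                            static_controls: list[str],
                            dynamic_controls: list[str]) -> None:
    # Control texts come from the static and/or dynamic controls of the
    # interface the screen is currently displaying.
    instruction_sets[screen_id] = set(static_controls) | set(dynamic_controls)

def instruction_set_for(screen_id: str) -> set[str]:
    # After the target screen is determined, its set is found by its ID.
    return instruction_sets.get(screen_id, set())
```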
Optionally, querying the target control instruction set corresponding to the target screen for a target control instruction matching the real-time voice may proceed as follows: convert the real-time voice into a voice text, and query the target control instruction set for the target control instruction matching that voice text.
Specifically, the user's real-time voice can be fed into an offline Automatic Speech Recognition (ASR) engine to obtain the target voice text output by the engine, and the control instructions of the target control instruction set are then searched for those matching the target voice text.
A target control instruction matches the target voice text when, for example, the voice text contains any verb and any control-text word of the instruction, or when a verb in the voice text is identical to a verb of the instruction and the similarity between a noun in the voice text and some control-text word of the instruction is greater than or equal to a preset similarity threshold.
Thus, the user's voice control intent can be determined by querying the target control instruction set for a target control instruction matching the user's control voice.
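A sketch of the two matching rules above; the string-similarity measure and the 0.8 threshold are stand-ins, since the disclosure only requires some similarity function and a preset threshold:

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

SIM_THRESHOLD = 0.8  # assumed value for the preset similarity threshold

@dataclass
class ControlInstruction:
    verbs: list[str]          # e.g. ["open", "play"]
    control_words: list[str]  # control texts from the displayed interface

def similarity(a: str, b: str) -> float:
    # Stand-in string similarity; a production system might instead compare
    # embeddings or pronunciations.
    return SequenceMatcher(None, a, b).ratio()

def matches(voice_text: str, voice_verbs: list[str], voice_nouns: list[str],
            inst: ControlInstruction) -> bool:
    # Rule 1: the voice text contains any instruction verb and any control word.
    if (any(v in voice_text for v in inst.verbs)
            and any(w in voice_text for w in inst.control_words)):
        return True
    # Rule 2: an identical verb, plus a noun similar enough to a control word.
    if any(v in voice_verbs for v in inst.verbs):
        return any(similarity(n, w) >= SIM_THRESHOLD
                   for n in voice_nouns for w in inst.control_words)
    return False
```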
S140, if the target control instruction is found, controlling the target screen to execute the target control operation corresponding to the target control instruction.
In the embodiment of the disclosure, if the voice control device finds a target control instruction matching the user's control voice, it executes the corresponding target control operation; if no matching target control instruction is found, it continues to listen and waits for the next user control voice.
Optionally, controlling the target screen to execute the target control operation corresponding to the queried target control instruction may mean executing that operation on the displayed interactive-interface control associated with the instruction.
Each control instruction is generated according to the control data of a control in the interactive interface being displayed, so each control instruction can be used to trigger the target control operation on exactly that displayed control, i.e. the control whose data the instruction was generated from.
Further, the target control operation may be a control operation implemented in accordance with a target control manner indicated by the target control instruction.
Specifically, after the target control instruction is queried, the voice control device performs, in the target control manner indicated by the instruction, the control operation on the displayed interactive-interface control from whose control data the instruction was generated.
In the embodiment of the present disclosure, optionally, after step S140, the voice control device may enter a new interactive interface, or may remain in the interactive interface being displayed.
In some embodiments, where the voice control device remains in the interactive interface being displayed, the voice control device may continue to implement voice control of the target interactive interface based on the target control instruction set without regenerating the control instruction set.
In other embodiments, in the case that the voice control device enters a new interactive interface, the voice control device needs to regenerate a control instruction set corresponding to the new interactive interface, so as to realize voice control of the user on the target interactive interface based on the regenerated control instruction set.
The multi-screen voice control method, apparatus, device, and computer-readable storage medium provided by the embodiments of the present disclosure can, after receiving a real-time voice and a real-time image of the space to which the real-time voice belongs, use the real-time image to detect, among a plurality of screens, the target screen at which the user who uttered the real-time voice is looking. Once the target screen is determined, a target control instruction matching the real-time voice is queried in the target control instruction set corresponding to the interactive interface currently displayed on the target screen, and the target screen is controlled to execute the queried control instruction. Thus, when a user voice-controls any screen in the see-and-say manner, the screen the user actually wants to control can be determined among the plurality of screens, reducing misoperations such as no screen responding or the responding screen differing from the one the user intends to control, and improving the user experience.
Fig. 2 is a flowchart of another multi-screen voice control method according to an embodiment of the disclosure.
In an embodiment of the present disclosure, the multi-screen voice control method is performed by the voice control apparatus described above.
As shown in fig. 2, the multi-screen voice control method includes the steps of:
s210, receiving real-time voice and real-time images in the space where the real-time voice belongs.
In the embodiment of the present disclosure, this step is the same as step S110, and will not be described here again.
S220, identifying, in the real-time image and based on the real-time voice, the user image corresponding to the user who uttered the real-time voice.
In the embodiment of the disclosure, the voice control device, in response to the received real-time voice and real-time image, first determines from the real-time voice which user uttered it, and then determines that user's image in the real-time image.
The user image may be the image content containing the user who uttered the real-time voice, captured at the moment the voice was uttered.
In some embodiments, S220 may specifically include: determining the sound source position of real-time voice; identifying image content corresponding to the sound source position in the real-time image; the image content is taken as a user image.
The voice control device uses a sound source positioning method to determine the sound source position of the real-time voice, namely the user position, according to the real-time voice sent by the user.
In some embodiments, one audio collection device is installed in the space and can collect, in real time, all real-time voices at the various area positions in the space; the correspondence between the audio collection device and the direction and distance of each area position is preset. Sound source localization may then work as follows: the sound signal of the real-time voice reaches the audio collection device with a delay that depends on the area position where it was uttered, so the device's measured signals can be processed algorithmically to obtain the direction of arrival (including azimuth and pitch angle) and the distance of the sound relative to the audio collection device, and the sound source position is then determined according to the preset direction and distance correspondences between the audio collection device and the area positions.
Taking the space as the cockpit of a vehicle as an example, one microphone is arranged in the cockpit, and it can collect, in real time, the voices of the main driving position area, the co-driving position area, the rear-left position area, and the rear-right position area.
For example, when a user in the main driving position area utters a real-time voice, the cockpit microphone collects it and sends it to the voice control device. The device derives the direction and distance of the voice signal relative to the microphone and, from the preset correspondence between the microphone and the main driving position area, determines that the speaker is in the main driving position area; the main driving position area is then determined to be the sound source position.
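The disclosure does not name a localization algorithm. Assuming the single audio collection device is a small microphone array, one common way to realize the delay-based measurement described above is GCC-PHAT over a microphone pair; a minimal sketch:

```python
import numpy as np

def gcc_phat(sig: np.ndarray, ref: np.ndarray, fs: int,
             max_tau: float) -> float:
    """Estimate the arrival-time delay of sig relative to ref for one
    microphone pair of the array, using the phase transform (GCC-PHAT)."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n=n)  # PHAT weighting
    max_shift = int(fs * max_tau)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (int(np.argmax(np.abs(cc))) - max_shift) / fs

def direction_of_arrival(tau: float, mic_distance_m: float,
                         speed_of_sound: float = 343.0) -> float:
    # Convert the delay into an azimuth angle for the pair; the seat area
    # (e.g. main driving position area) is then looked up from the preset
    # direction/distance correspondences mentioned above.
    return float(np.degrees(np.arcsin(
        np.clip(tau * speed_of_sound / mic_distance_m, -1.0, 1.0))))
```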
In other embodiments, a plurality of audio collection devices are installed in the space. Each has its own collection range, one range corresponding to one position area, and the correspondence between each device and its position area is preset. Sound source localization may then work as follows: each audio collection device collects the sound signals of real-time voices uttered in its own position area while shielding those from other areas, and sends the collected signals to the voice control device; upon receiving a signal, the voice control device takes the position area corresponding to the sending device as the sound source position.
Taking the space as the cabin of the vehicle as an example, 4 microphones, namely a microphone corresponding to a main driving position area, a microphone corresponding to a co-driving position area, a microphone corresponding to a rear left position area and a microphone corresponding to a rear right position area, are arranged in the cabin of the vehicle. Each microphone is responsible for collecting real-time voice uttered by a user in a respective corresponding location area.
For example, when a user in the main driving position area utters a real-time voice, the microphone installed in that area collects it and sends it to the voice control device. The device, according to the preset correspondence between the microphone and its position area, determines that the speaker is in the main driving position area, which is then determined to be the sound source position.
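In this per-zone arrangement, localization reduces to a table lookup; a sketch with assumed device identifiers:

```python
from typing import Optional

# Assumed microphone identifiers mapped to the seat areas they cover.
MIC_TO_ZONE = {
    "mic_main_driving": "main_driving_position_area",
    "mic_co_driving": "co_driving_position_area",
    "mic_rear_left": "rear_left_position_area",
    "mic_rear_right": "rear_right_position_area",
}

def sound_source_zone(active_mic_id: str) -> Optional[str]:
    # The area of the microphone that picked up the voice is taken as the
    # sound source position.
    return MIC_TO_ZONE.get(active_mic_id)
```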
Further, after the voice control device determines the sound source position, the image content corresponding to the sound source position is identified in the real-time image.
In a see-and-say scenario the screen the user is looking at must be determined from the user image. After determining the sound source position, the voice control device therefore first identifies, in the real-time image, the image content corresponding to that position, and then determines the user image from that content.
In some embodiments, one image collection device is installed in the space and collects real-time images of the entire space; the voice control device, in response to the received real-time image, identifies in it the image content corresponding to the sound source position. When one image collection device and one audio collection device are installed with the same collection angle, collection range, and so on, the image position and image depth in the real-time image can be matched against the sound source position to identify the user located there, and the image content of that user can then be extracted from the real-time image.
Taking the space as the cabin of the vehicle as an example, the cabin of the vehicle is fitted with a camera, for example mounted on the windshield of the cabin. When a user of the main driving position area emits real-time sound, the voice control equipment determines the main driving position area as a sound source position, and then identifies image content corresponding to the main driving position area in a real-time image acquired by the camera.
In other embodiments, a plurality of image collection devices are installed in the space, each with its own collection range corresponding to one position area, and the correspondence between each device and its position area is preset. When the voice control device determines the sound source position, it controls the image collection device covering that position to transmit the real-time image it has collected, receives that image, and takes it as the image content corresponding to the sound source position.
Taking the space as the cockpit of a vehicle as an example, four cameras are installed in the cockpit: one corresponding to the main driving position area, one to the co-driving position area, one to the rear-left position area, and one to the rear-right position area, each collecting images of users in its own area. When a user in the main driving position area speaks, the voice control device determines the main driving position area to be the sound source position, controls the corresponding camera, according to the correspondence between image collection devices and position areas, to send its collected real-time image, and takes that image as the image content corresponding to the sound source position.
Further, after the voice control device identifies the image content, the identified image content is used as the user image.
In other embodiments, S220 may specifically include: determining the sound source position of real-time voice; identifying image content corresponding to the sound source position in the real-time image; a user image with a sound producing action is identified in the image content.
Specifically, the method for determining the sound source position of the real-time voice and identifying the image content corresponding to the sound source position in the real-time image by the voice control device is the same as the method for determining the sound source position of the real-time voice and identifying the image content corresponding to the sound source position in the real-time image described above, and will not be described herein.
In some embodiments, when the voice control device identifies a plurality of users in the image content corresponding to the sound source position, it cannot determine which user spoke and thus cannot determine the user image from that content alone. It is therefore necessary to find, in the real-time image, the user exhibiting a vocalizing action and determine the user image accordingly.
Take a vehicle cockpit carrying several occupants whose rear-row users sit close together. When a rear-row user utters a real-time voice, the voice control device receives it, and sound source localization can narrow the source down to the rear-row area (for example, to the rear left). But when the device identifies the image content corresponding to that sound source position in the real-time image, it may find several users there (for example, two), so it cannot determine which user image belongs to the speaker. It must therefore identify, within the image content corresponding to the sound source position, the user image showing a vocalizing action; only then can the speaker be determined and, in turn, the screen that user is looking at.
In other embodiments, even when the voice control device identifies only one user in the image content corresponding to the sound source position, it may still verify the vocalizing action in that content to make the user image determination more reliable, which is not limited herein.
In still other embodiments, S220 may specifically include: determining audio features of real-time speech; determining user characteristics for making real-time speech according to the audio characteristics; user images matching the user features are identified in the real-time image.
Specifically, the audio features may be perceptual features of the audio that distinguish one speaker's voice from another's, for example the gender of the user who uttered the real-time speech, or whether that user is a child. From these sound features the user features can be determined, and the user image matching those features can then be identified in the real-time image.
Taking the cockpit of a vehicle as an example, the voice control device, in response to the received real-time voice and real-time image, recognizes from the image that a man and a woman are in the vehicle. By analyzing the audio signal of the real-time voice it finds characteristics such as a low vibration frequency, a deep pitch, and long carry, and concludes that the speaker is a man; it therefore identifies the man's image content in the real-time image and takes it as the user image.
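As a rough illustration of this audio-feature idea, the sketch below estimates the fundamental frequency of a voiced frame by autocorrelation and applies assumed thresholds; a production system would more likely use a trained speaker, gender, or age classifier:

```python
import numpy as np

def estimate_f0(frame: np.ndarray, fs: int,
                fmin: float = 60.0, fmax: float = 400.0) -> float:
    # Autocorrelation pitch estimate over one voiced frame; the frame must
    # be longer than fs / fmin samples.
    frame = frame - frame.mean()
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + int(np.argmax(corr[lo:hi]))
    return fs / lag

def coarse_voice_class(f0_hz: float) -> str:
    # Threshold values are assumptions, not taken from the disclosure.
    if f0_hz < 165:
        return "likely adult male"    # lower vibration frequency, deeper voice
    if f0_hz < 255:
        return "likely adult female"
    return "likely child"
```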
It should be noted that, the above description of identifying the user image corresponding to the user making the real-time voice in the real-time image is only a few possible solutions, and in other embodiments, other possible methods may be adopted, or any combination of the foregoing possible implementations may be adopted, which is not limited herein.
S230, determining a target screen to which the user looks in a plurality of screens based on the user image.
In the embodiment of the present disclosure, the voice control device determines the target screen among the plurality of screens according to the screen at which the user shown in the user image is looking.
In a see-and-say scenario the user generally looks at the screen to be controlled while uttering the real-time voice, to confirm that the screen can satisfy the control requirement, and looking at a screen usually involves turning the head, rotating the eyes, and the like. The voice control device can therefore determine the screen the user is looking at from the head-rotation and eye-rotation content shown in the user image.
Specifically, before the voice control device can determine the screen from the head-rotation and eye-rotation content in the user image, it must preset, for each sound source position, the range of user head rotation angles and the user eye rotation direction corresponding to each screen. Optionally, in some examples, these correspondences may be integrated into an algorithm and/or a model.
Taking the space as the cockpit of a vehicle as an example, three screens are arranged in the cockpit: a central control screen corresponding to the main driving position area, a co-driving screen corresponding to the co-driving position area, and a rear-row screen corresponding to the rear-row position area. For the sound source position "main driving position area", the voice control device presets the head-rotation-angle range and eye rotation direction corresponding to each of the central control screen, the co-driving screen, and the rear-row screen. For example, when the head rotation angle of a user in the main driving position area is 0 to 20 degrees with the eye rotation direction to the right, that user is looking at the central control screen; when the head rotation angle is 30 to 70 degrees with the eye rotation direction to the right, that user is looking at the co-driving screen, and so on. The voice control device presets the same correspondences in turn for the co-driving position area, the rear-left position area, and the rear-right position area.
The voice control device calculates the user's head rotation angle value and eye rotation direction from the head-rotation and eye-rotation content shown in the user image.
It then checks which preset angle range the calculated head rotation falls into and which preset direction the eye rotation matches for the given sound source position, and thereby determines the screen at which the user is looking.
Taking the space as the cockpit of a vehicle as an example: the voice control device first determines from the real-time voice that the sound source position is the main driving position area, determines the corresponding user image from that position, and determines from the user image that the user's head rotation angle is 10 degrees with the eye rotation direction to the right. According to the preset correspondences for the main driving position area, the screen the user is looking at is the central control screen, so the central control screen is determined to be the target screen.
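A sketch of this preset correspondence as a lookup table; the two angle ranges are the ones from the example above, and all identifiers are assumptions:

```python
from typing import Optional

# (sound source area, min head angle, max head angle, eye direction) -> screen
GAZE_TABLE = [
    ("main_driving_position_area", 0, 20, "right", "central_control_screen"),
    ("main_driving_position_area", 30, 70, "right", "co_driving_screen"),
    # ... entries for the co-driving and rear-row areas are preset likewise
]

def screen_for(zone: str, head_angle_deg: float,
               eye_dir: str) -> Optional[str]:
    for z, lo, hi, direction, screen in GAZE_TABLE:
        if z == zone and direction == eye_dir and lo <= head_angle_deg <= hi:
            return screen
    return None  # the gaze does not fall on any configured screen
```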
In some embodiments, only the head-rotation-angle ranges for the different sound source positions may be preset, or only the user eye rotation directions, which is not limited herein.
S240, inquiring a target control instruction matched with the real-time voice in a target control instruction set corresponding to the target screen, wherein the target control instruction set comprises control instructions generated according to control data of the interactive interface being displayed by the target screen.
In the embodiment of the present disclosure, this step is the same as the step S130 described above, and will not be described herein.
S250, if the target control instruction is queried, controlling the target screen to execute target control operation corresponding to the target control instruction.
In the embodiment of the present disclosure, this step is the same as the step S140, and will not be described herein.
In the embodiments of the present disclosure, the sound source position of the real-time voice is determined from the voice itself, and/or the user with the vocalizing action and/or the audio features are identified, so that the user image can be determined from the sound source position, the vocalizing action, or the user features. The target screen can then be determined from the user image more accurately, which reduces the cases where no screen responds or the responding screen is not the one the user actually wants to control, and improves the user experience.
Fig. 3 is a flowchart of another multi-screen voice control method according to an embodiment of the present disclosure, where the multi-screen voice control method is performed by the voice control device.
As shown in fig. 3, the multi-screen voice control method mainly includes the following steps:
s310, receiving real-time voice and real-time images in the space where the real-time voice belongs.
In the embodiment of the present disclosure, this step is the same as the step S110 described above, and will not be described herein.
S320, based on the real-time voice, identifying a user image corresponding to the user which sends the real-time voice in the real-time image.
Specifically, in the embodiment of the disclosure, the voice control device, in response to the received real-time voice and real-time image, determines the sound source position of the user who uttered the real-time voice, and determines, according to that position, the user image corresponding to that user.
In some embodiments, when multiple audio capture devices and multiple image capture devices are installed in a space, each audio capture device has a capture range, one capture range corresponding to each location area in the space.
The correspondence between each audio collection device and each image collection device is preset. The voice control device receives the real-time voice collected by each audio collection device and the real-time image collected by each image collection device, and determines the user image according to the preset correspondence between audio collection devices and image collection devices.
Taking a space as a cabin of a vehicle as an example, a microphone and a camera corresponding to a main driving position area, a microphone and a camera corresponding to a co-driving position area, a microphone and a camera corresponding to a rear left side area and a microphone and a camera corresponding to a rear right side area are installed in the cabin of the vehicle, and the microphones and the cameras in the same position area have corresponding relations.
When a user utters a real-time voice, the microphone of the corresponding position area receives it and sends it to the voice control device. The device then finds, according to the preset correspondence between microphones and cameras, the camera paired with that microphone, and receives the image shot by that camera as the user image.
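A sketch of this preset microphone-to-camera pairing, with assumed device identifiers and an assumed frame-capture callable:

```python
from typing import Any, Callable, Optional

# Microphone-to-camera pairing per seat area (identifiers are assumptions).
MIC_TO_CAMERA = {
    "mic_main_driving": "cam_main_driving",
    "mic_co_driving": "cam_co_driving",
    "mic_rear_left": "cam_rear_left",
    "mic_rear_right": "cam_rear_right",
}

def user_image_for(active_mic_id: str,
                   capture_frame: Callable[[str], Any]) -> Optional[Any]:
    # The frame from the camera paired with the active microphone is taken
    # directly as the user image.
    cam_id = MIC_TO_CAMERA.get(active_mic_id)
    return capture_frame(cam_id) if cam_id is not None else None
```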
S330, based on the user image, the viewing direction of the user is identified.
Specifically, after determining the user image, the voice control device recognizes the head-rotation and eye-rotation content shown in it, and determines the user's head rotation angle value and eye rotation direction accordingly.
S340, determining the position of the user according to the user image.
Specifically, after the voice control device determines the user image, the position of the user in the user image in the real-time image is identified in the real-time image, so that the position of the user in the space where the real-time voice belongs is determined.
Taking the space as the cockpit of a vehicle as an example, besides the image collection devices installed for the individual position areas, one more image collection device is installed in the cockpit to collect real-time images of the entire space. The position of the user who uttered the real-time voice appears in this full-space image, so the user's position in the real-time image can be determined from where the user appears, and the user's position in the space is then determined from the position in the image.
For example, if the user who uttered the real-time voice appears in the real-time image in the main driving position area, the user's position in the space is determined to be the main driving position area.
S350, determining a target screen in the multiple screens according to the position of the user and the viewing direction of the user.
Specifically, the voice control apparatus determines a target screen from among a plurality of screens according to a position where a user is located and a viewing direction of the user.
The voice control apparatus may determine the viewing direction of the user from the user head rotation image and the user eye rotation image displayed in the user image.
Before the voice control device can map the head-rotation and eye-rotation content shown in the user image to one of the plurality of screens, it must preset, for each position the user may occupy, the range of user head rotation angles and the user eye rotation direction corresponding to each screen.
Taking the space as the cockpit of a vehicle as an example, three screens are arranged in the cockpit: a central control screen corresponding to the main driving position area, a co-driving screen corresponding to the co-driving position area, and a rear-row screen corresponding to the rear-row position area. The user's position in the cockpit may be the main driving position area, the co-driving position area, the rear-left position area, or the rear-right position area. For the main driving position area, the voice control device presets the head-rotation-angle range and eye rotation direction corresponding to each of the central control screen, the co-driving screen, and the rear-row screen: for example, a head rotation angle of 0 to 20 degrees with the eye rotation direction to the right means the user is looking at the central control screen, while a head rotation angle of 30 to 70 degrees with the eye rotation direction to the right means the user is looking at the co-driving screen, and so on. The co-driving position area, the rear-left position area, and the rear-right position area are handled in turn in the same way.
The voice control apparatus calculates the user's head rotation angle and eyeball rotation direction from the head rotation and eyeball rotation shown in the user image, and then determines the target screen in combination with the position of the user.
Taking the space as the cabin of a vehicle as an example, suppose the voice control apparatus determines that the user's head rotation angle is 10 degrees, the eyeball rotation direction is to the right, and the user is in the main driving position area. According to the preset range values of the head rotation angle and the correspondence between the eyeball rotation direction and each screen, the central control screen corresponding to the main driving position area can be determined, and that screen is taken as the target screen.
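For illustration only, the preset correspondence and lookup described above might be organized as in the following minimal Python sketch. Only the two main-driving entries (0 to 20 degrees and 30 to 70 degrees, eyes to the right) come from the example in this disclosure; every other value and every name is an assumed placeholder.

```python
from typing import Optional

# Preset table: seat zone -> list of (head-angle range in degrees,
# eyeball rotation direction, screen). Only the first two main-driving
# rows reflect the example above; the third range is assumed.
GAZE_TABLE = {
    "main_driving": [
        ((0, 20), "right", "central_control_screen"),
        ((30, 70), "right", "co_driver_screen"),
        ((80, 150), "right", "rear_row_screen"),  # assumed range
    ],
    # co_driving, rear_left and rear_right are preset in the same way
}

def find_target_screen(zone: str, head_angle: float, eye_dir: str) -> Optional[str]:
    """Return the screen matched by the preset table, or None if no row matches."""
    for (low, high), direction, screen in GAZE_TABLE.get(zone, []):
        if low <= head_angle <= high and direction == eye_dir:
            return screen
    return None

# The worked example: a main-driving user, 10-degree head rotation, eyes right.
assert find_target_screen("main_driving", 10, "right") == "central_control_screen"
```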
S360, querying, in a target control instruction set corresponding to the target screen, a target control instruction matched with the real-time voice, wherein the target control instruction set includes control instructions generated according to control data of the interactive interface being displayed by the target screen.
In the embodiment of the present disclosure, this step is the same as step S130 described above and will not be repeated here.
S370, if the target control instruction is queried, controlling the target screen to execute the target control operation corresponding to the target control instruction.
In the embodiment of the present disclosure, this step is the same as step S140 described above and will not be repeated here.
According to the embodiments of the present disclosure, the user image of the user who utters the real-time voice is identified in the real-time image, and the viewing direction and position of the user are determined from the user image, so that the user can interact with the screen to be controlled in a "what you see is what you say" manner, which improves the user experience.
Fig. 4 is a schematic structural diagram of a multi-screen voice control apparatus according to an embodiment of the present disclosure. The multi-screen voice control apparatus provided in the embodiment of the present disclosure may execute the processing flow provided in the embodiments of the multi-screen voice control method. As shown in Fig. 4, the multi-screen voice control apparatus 40 includes:
the receiving module 41 is configured to receive real-time voice and a real-time image in a space to which the real-time voice belongs.
The detection module 42 is configured to detect, based on the real-time image, the target screen viewed by the user who utters the real-time voice among a plurality of screens.
The query module 43 is configured to query a target control instruction set corresponding to the target screen for a target control instruction matched with the real-time voice, where the target control instruction set includes control instructions generated according to control data of an interactive interface being displayed by the target screen.
The control module 44 is configured to control the target screen to execute the target control operation corresponding to the target control instruction if the target control instruction is queried.
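As a rough, non-authoritative sketch of how the four modules in Fig. 4 fit together (all class and method names here are illustrative assumptions, not the disclosed implementation):

```python
# Illustrative pipeline tying the four modules together; each module is
# assumed to expose one method, which the disclosure does not specify.
class MultiScreenVoiceControlDevice:
    def __init__(self, receiver, detector, querier, controller):
        self.receiver = receiver      # receiving module 41
        self.detector = detector      # detection module 42
        self.querier = querier        # query module 43
        self.controller = controller  # control module 44

    def handle_utterance(self):
        voice, image = self.receiver.receive()        # real-time voice and image
        screen = self.detector.detect(voice, image)   # target screen being viewed
        command = self.querier.query(screen, voice)   # match against the screen's set
        if command is not None:                       # act only if a match was found
            self.controller.execute(screen, command)
```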
Optionally, the detection module 42 further includes a first recognition unit and a determination unit, where the first recognition unit is configured to recognize, in the real-time image and based on the real-time voice, a user image corresponding to the user who utters the real-time voice; and the determination unit is configured to determine, based on the user image, the target screen to which the user looks among the plurality of screens.
Optionally, the first recognition unit includes a first determination subunit and a first recognition subunit, where the first determination subunit is configured to determine a sound source position of the real-time voice; and the first recognition subunit is configured to identify, in the real-time image, the image content corresponding to the sound source position and take the image content as the user image.
Optionally, the detection module 42 further includes a second recognition unit, where the second recognition unit is configured to recognize, in the real-time image, a user image with a sound-producing action.
Optionally, the second recognition unit includes a second determination subunit and a second recognition subunit, where the second determination subunit is configured to determine the sound source position of the real-time voice; and the second recognition subunit is configured to identify, in the real-time image, the image content corresponding to the sound source position and to recognize, in the image content, a user image with a sound-producing action.
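For illustration, the two subunits could be composed as below; localize_sound_source, crop_region, detect_face_images, and has_mouth_movement are hypothetical stand-ins for whatever sound-source localization and vision components are used, and are not APIs named in the disclosure.

```python
# Placeholder stubs for the underlying localization and vision components.
def localize_sound_source(audio): ...
def crop_region(frame, position): ...
def detect_face_images(region): return []
def has_mouth_movement(face_image): ...

def recognize_speaking_user(voice_audio, frame):
    """Sketch of the second recognition unit's pipeline."""
    position = localize_sound_source(voice_audio)  # second determination subunit
    region = crop_region(frame, position)          # image content at the sound source
    for candidate in detect_face_images(region):   # user images within that content
        if has_mouth_movement(candidate):          # the sound-producing action
            return candidate
    return None
```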
Optionally, the determination unit includes a third recognition subunit and a third determination subunit, where the third recognition subunit is configured to recognize a viewing direction of the user based on the user image; and the third determination subunit is configured to determine, among the plurality of screens, the target screen according to the viewing direction of the user.
Optionally, when querying, in the target control instruction set corresponding to the target screen, the target control instruction matched with the real-time voice, the query module 43 is specifically configured to: convert the real-time voice into voice text; and query, in the target control instruction set, the target control instruction matched with the voice text.
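A minimal sketch of those two steps, assuming a placeholder speech_to_text() standing in for any ASR engine, and an exact-text lookup (real matching could equally be fuzzy or semantic):

```python
from typing import Optional

def speech_to_text(audio) -> str:
    ...  # placeholder: delegate to any ASR engine

def query_target_instruction(voice_audio, instruction_set: dict) -> Optional[str]:
    """instruction_set maps voice text to a control instruction generated from
    the control data of the interface the target screen is displaying."""
    text = speech_to_text(voice_audio)  # e.g. "play the next song"
    return instruction_set.get(text)    # None means no matching instruction
```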
Optionally, the control instructions include control instructions generated from static control data and/or dynamic control data in the interface control data.
The multi-screen voice control apparatus of the embodiment shown in Fig. 4 may be used to execute the technical solutions of the above method embodiments; the implementation principles and technical effects are similar and are not repeated here.
Fig. 5 is a schematic structural diagram of a multi-screen voice control apparatus according to an embodiment of the present disclosure. The multi-screen voice control apparatus provided in the embodiment of the present disclosure may execute the processing flow provided in the embodiments of the multi-screen voice control method. As shown in Fig. 5, the apparatus 50 includes: a memory 51, a processor 52, a computer program, and a communication interface 53, where the computer program is stored in the memory 51 and configured to be executed by the processor 52 to perform the multi-screen voice control method described above.
In addition, the embodiment of the present disclosure also provides a computer-readable storage medium having stored thereon a computer program that is executed by a processor to implement the multi-screen voice control method described in the above embodiment.
Furthermore, the embodiments of the present disclosure also provide a computer program product, including a computer program or instructions which, when executed by a processor, implement the multi-screen voice control method described above.
It should be noted that in this document, relational terms such as "first" and "second" and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing is merely a specific embodiment of the disclosure to enable one skilled in the art to understand or practice the disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown and described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (11)

1. A multi-screen voice control method, the method comprising:
receiving real-time voice and a real-time image in a space to which the real-time voice belongs;
detecting a target screen to which a user who utters the real-time voice looks in a plurality of screens based on the real-time image;
querying, in a target control instruction set corresponding to the target screen, a target control instruction matched with the real-time voice, wherein the target control instruction set comprises control instructions generated according to control data of an interactive interface being displayed by the target screen;
and if the target control instruction is queried, controlling the target screen to execute target control operation corresponding to the target control instruction.
2. The method of claim 1, wherein the detecting a target screen from among a plurality of screens to which a user who uttered the real-time voice looks based on the real-time image comprises:
identifying a user image corresponding to a user who sends the real-time voice in the real-time image based on the real-time voice;
based on the user image, a target screen to which the user looks is determined among a plurality of screens.
3. The method of claim 2, wherein the identifying, in the real-time image, a user image corresponding to a user who uttered the real-time voice based on the real-time voice, comprises:
determining a sound source position of the real-time voice;
identifying image content corresponding to the sound source position in the real-time image;
and taking the image content as the user image.
4. The method of claim 1, wherein the detecting a target screen from among a plurality of screens to which a user who uttered the real-time voice looks based on the real-time image comprises:
identifying a user image with sound production action in the real-time image;
based on the user image, a target screen to which the user looks is determined among a plurality of screens.
5. The method of claim 4, wherein identifying a user image with a sound action in the real-time image comprises:
determining a sound source position of the real-time voice;
identifying image content corresponding to the sound source position in the real-time image;
a user image with a sound producing action is identified in the image content.
6. The method of claim 2 or 4, wherein the determining, based on the user image, a target screen to which the user looks among a plurality of screens comprises:
identifying a viewing direction of the user based on the user image;
determining the position of the user according to the user image;
and determining the target screen in a plurality of screens according to the position of the user and the viewing direction of the user.
7. The method according to claim 1, wherein the querying, in the target control instruction set corresponding to the target screen, the target control instruction matched with the real-time voice comprises:
converting the real-time voice into voice text;
and querying, in the target control instruction set, the target control instruction matched with the voice text.
8. The method of claim 1, wherein the control instruction comprises:
a control instruction generated according to static control data and/or dynamic control data in the interface control data.
9. A multi-screen voice control apparatus, the apparatus comprising:
the receiving module is used for receiving the real-time voice and the real-time image in the space to which the real-time voice belongs;
the detection module is used for detecting, based on the real-time image, a target screen viewed by the user who utters the real-time voice among a plurality of screens;
the query module is used for querying, in a target control instruction set corresponding to the target screen, a target control instruction matched with the real-time voice, wherein the target control instruction set comprises control instructions generated according to control data of an interactive interface being displayed by the target screen;
and the control module is used for controlling the target screen to execute target control operation corresponding to the target control instruction if the target control instruction is queried.
10. A multi-screen voice control apparatus, comprising:
a processor;
a memory for storing executable instructions;
wherein the processor is configured to read the executable instructions from the memory and execute the executable instructions to implement the multi-screen voice control method of any one of claims 1-8.
11. A computer readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, causes the processor to implement the multi-screen voice control method of any one of claims 1-8.
CN202210679889.5A 2022-06-15 2022-06-15 Multi-screen voice control method, device, equipment and computer readable storage medium Pending CN117275473A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210679889.5A CN117275473A (en) 2022-06-15 2022-06-15 Multi-screen voice control method, device, equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN117275473A true CN117275473A (en) 2023-12-22

Family

ID=89216589

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210679889.5A Pending CN117275473A (en) 2022-06-15 2022-06-15 Multi-screen voice control method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN117275473A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination