CN112652302B - Voice control method, device, terminal and storage medium

Info

Publication number
CN112652302B
Authority
CN
China
Prior art keywords
component
wake
interface
word
voice
Prior art date
Legal status
Active
Application number
CN201910972320.6A
Other languages
Chinese (zh)
Other versions
CN112652302A (en)
Inventor
陈泽钦
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201910972320.6A
Publication of CN112652302A
Application granted
Publication of CN112652302B


Classifications

    • G10L15/22 (Speech recognition): procedures used during a speech recognition process, e.g. man-machine dialogue
    • G06F16/953 (Retrieval from the web): querying, e.g. by the use of web search engines
    • G06F16/954 (Retrieval from the web): navigation, e.g. using categorised browsing
    • G06F3/0481 (Interaction techniques based on graphical user interfaces [GUI]): based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
    • G06F3/0488 (Interaction techniques based on graphical user interfaces [GUI]): using a touch-screen or digitiser, e.g. input of commands through traced gestures
    • G10L15/26 (Speech recognition): speech to text systems
    • G10L2015/223 (Speech recognition): execution procedure of a spoken command


Abstract

The embodiment of the invention discloses a voice control method, apparatus, terminal and storage medium, applied to the applet engine of an applet operation platform run by a vehicle-mounted central control device. The method comprises the following steps: when a first voice control instruction for a first interface displayed by a first applet is detected, determining the voice content of the first voice control instruction; querying a wake-up word set of the first interface according to the voice content and determining the target wake-up word corresponding to the voice content; querying a response event set of the first interface according to the target wake-up word and determining the target response event corresponding to the target wake-up word, where the response event set is generated by calling the video user interface operation module and comprises a second correspondence between component wake-up words and response events; and executing the target response event. According to the embodiment of the invention, the corresponding component can be quickly located from a voice control instruction and its bound response event triggered, realizing a fast voice control function.

Description

Voice control method, device, terminal and storage medium
Technical Field
The present invention relates to the field of speech technology, and in particular, to the field of speech processing technology, and more particularly, to a speech control method, a speech control device, a terminal, and a computer storage medium.
Background
With the increasing popularity of applets, users encounter them in all kinds of APPs, vehicle-mounted systems and smart home systems. An applet page is formed by a large number of components (view, text, button, ...), with which users can interact through gesture operations (such as clicking). The inventors found in practice that existing voice control methods lead to 1) long time consumption for voice recognition and 2) high occupation of performance resources; from voice input to determining the target component and initiating a response often takes 2-3 seconds, which seriously affects the interactive experience of voice control.
Disclosure of Invention
The embodiment of the invention provides a voice control method, a voice control device, a terminal and a computer storage medium, which can quickly locate the corresponding component according to a voice control instruction and trigger its bound response event, thereby realizing a fast voice control function.
In one aspect, an embodiment of the present invention provides a voice control method, which is applied to an applet engine of an applet operation platform run by a vehicle-mounted central control device, where the applet engine is provided with a video user interface operation module and a first applet runs on the applet operation platform. The voice control method includes:
Determining the voice content of a first voice control instruction for a first interface displayed by the first applet when the first voice control instruction is detected, wherein the first interface comprises at least one component;
Inquiring a wake-up word set of the first interface according to the voice content, and determining a target wake-up word corresponding to the voice content, wherein the wake-up word set comprises a first corresponding relation between a reference component and wake-up words, the reference component is a component in which a response event is bound in the at least one component, and the reference component only supports a touch operation function in original control logic of the first interface;
Inquiring a response event set of the first interface according to the target wake-up word, and determining a target response event corresponding to the target wake-up word, wherein the response event set is generated by calling the video user interface operation module and comprises a second corresponding relation between the wake-up word of the component and the response event;
And executing the target response event.
The at least one component comprises at least one foreground component and/or at least one background component, the foreground component is a component displayed on the first interface, and the background component is a component not displayed on the first interface.
Wherein a single reference component in the first correspondence corresponds to one or more wake words.
Wherein the determining the voice content of the first voice control instruction includes:
Acquiring a preset voice model;
And determining the voice content of the first voice control instruction according to the preset voice model.
Wherein, before the detecting the first voice control instruction for the first interface displayed by the first applet, the method further comprises:
The following operations are executed through the video user interface running module in the process of loading and rendering the first interface by the first applet:
Acquiring at least one component of the first interface;
Traversing the at least one component;
Determining one or more reference components of the at least one component to which the responsive event is respectively bound;
generating at least one reference component from the one or more components;
generating a wake-up word subset of each reference component according to the description information of each reference component in the at least one reference component, and obtaining the wake-up word set.
Wherein the generating at least one reference component from the one or more components comprises:
and determining that each component in the one or more components is a reference component, and obtaining at least one reference component.
Wherein the generating at least one reference component from the one or more components comprises:
determining a usage record for each of the one or more components;
determining the number of times of use of each component according to the use record of each component;
and determining the components with the use times larger than the preset use times in the one or more components as reference components to obtain at least one reference component.
Wherein the generating a wake-up word subset of each reference component according to the description information of each reference component comprises:
Identifying descriptive information for each of the reference components;
performing word segmentation processing on the description information to obtain a wake-up word subset corresponding to each component;
and establishing a corresponding relation between the wake-up word subset corresponding to each component and the corresponding component to obtain the wake-up word set.
In another aspect, an embodiment of the invention provides a voice control apparatus, which is applied to a vehicle-mounted central control device and comprises:
A determining unit configured to determine, when a first voice control instruction for a first interface displayed by the first applet is detected, a voice content of the first voice control instruction, the first interface including at least one component;
The determining unit is further configured to query, according to the voice content, a wake-up word set of the first interface, and determine a target wake-up word corresponding to the voice content, where the wake-up word set includes a first correspondence between a reference component and wake-up words, the reference component is a component in which a response event is bound in the at least one component, and the reference component only supports a touch operation function in original control logic of the first interface;
The determining unit is further configured to query a response event set of the first interface according to the target wake-up word, determine a target response event corresponding to the target wake-up word, where the response event set is generated by calling the video user interface operation module, and the response event set includes a second correspondence between the wake-up word of the component and the response event;
and the execution unit is used for executing the target response event.
The at least one component comprises at least one foreground component and/or at least one background component, the foreground component is a component displayed on the first interface, and the background component is a component not displayed on the first interface.
Wherein a single reference component in the first correspondence corresponds to one or more wake words.
Wherein, in determining the voice content of the first voice control instruction, the determining unit is specifically configured to:
Acquiring a preset voice model;
And determining the voice content of the first voice control instruction according to the preset voice model.
The device further comprises a generating unit, which, before the first voice control instruction for the first interface displayed by the first applet is detected, is specifically configured to:
The following operations are executed through the video user interface running module in the process of loading and rendering the first interface by the first applet:
Acquiring at least one component of the first interface;
Traversing the at least one component;
Determining one or more reference components of the at least one component to which the responsive event is respectively bound;
generating at least one reference component from the one or more components;
generating a wake-up word subset of each reference component according to the description information of each reference component in the at least one reference component, and obtaining the wake-up word set.
Wherein the generating unit is specifically configured to generate at least one reference component according to the one or more components:
and determining that each component in the one or more components is a reference component, and obtaining at least one reference component.
Wherein the generating unit is specifically configured to generate at least one reference component according to the one or more components:
determining a usage record for each of the one or more components;
determining the number of times of use of each component according to the use record of each component;
and determining the components with the use times larger than the preset use times in the one or more components as reference components to obtain at least one reference component.
The generating unit is specifically configured to generate a wake-up word subset of each reference component according to the description information of each reference component in the at least one reference component to obtain the wake-up word set:
Identifying descriptive information for each of the reference components;
performing word segmentation processing on the description information to obtain a wake-up word subset corresponding to each component;
and establishing a corresponding relation between the wake-up word subset corresponding to each component and the corresponding component to obtain the wake-up word set.
In still another aspect, an embodiment of the present invention provides a terminal, where the terminal includes an input device and an output device, and the terminal further includes:
A processor adapted to implement one or more instructions; and
A computer storage medium storing one or more instructions adapted to be loaded by the processor and to perform the steps of:
Determining the voice content of a first voice control instruction for a first interface displayed by the first applet when the first voice control instruction is detected, wherein the first interface comprises at least one component;
Inquiring a wake-up word set of the first interface according to the voice content, and determining a target wake-up word corresponding to the voice content, wherein the wake-up word set comprises a first corresponding relation between a reference component and wake-up words, the reference component is a component in which a response event is bound in the at least one component, and the reference component only supports a touch operation function in original control logic of the first interface;
Inquiring a response event set of the first interface according to the target wake-up word, and determining a target response event corresponding to the target wake-up word, wherein the response event set is generated by calling the video user interface operation module and comprises a second corresponding relation between the wake-up word of the component and the response event;
And executing the target response event.
In yet another aspect, embodiments of the present invention provide a computer storage medium storing one or more instructions adapted to be loaded by a processor and to perform the steps of:
Determining the voice content of a first voice control instruction for a first interface displayed by the first applet when the first voice control instruction is detected, wherein the first interface comprises at least one component;
Inquiring a wake-up word set of the first interface according to the voice content, and determining a target wake-up word corresponding to the voice content, wherein the wake-up word set comprises a first corresponding relation between a reference component and wake-up words, the reference component is a component in which a response event is bound in the at least one component, and the reference component only supports a touch operation function in original control logic of the first interface;
Inquiring a response event set of the first interface according to the target wake-up word, and determining a target response event corresponding to the target wake-up word, wherein the response event set is generated by calling the video user interface operation module and comprises a second corresponding relation between the wake-up word of the component and the response event;
And executing the target response event.
In the embodiment of the invention, when a first voice control instruction for a first interface displayed by the first applet is detected, the voice content of the first voice control instruction is determined, the first interface including at least one component. Next, the wake-up word set of the first interface is queried according to the voice content, and the target wake-up word corresponding to the voice content is determined; the wake-up word set includes a first correspondence between reference components and wake-up words, where a reference component is a component of the at least one component to which a response event is bound, and the reference component only supports a touch operation function in the original control logic of the first interface. Then, the response event set of the first interface is queried according to the target wake-up word, and the target response event corresponding to the target wake-up word is determined; the response event set is generated by calling the video user interface operation module and includes a second correspondence between component wake-up words and response events. Finally, the target response event is executed. In this way, the wake-up word can be located quickly from the voice control instruction, the component corresponding to the wake-up word queried, and the bound response event triggered, thereby realizing a fast voice control function, reducing the time consumed by voice recognition, keeping the occupied performance resources low, and improving the efficiency and intelligence of the voice control method.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of a communication system according to an embodiment of the present invention;
Fig. 2a is a schematic flow chart of a voice control method according to an embodiment of the present invention;
Fig. 2b is a schematic illustration of an interface provided by an embodiment of the present invention;
Fig. 3a is a schematic diagram of a software architecture according to an embodiment of the present invention;
Fig. 3b is a response flow chart provided by an embodiment of the present invention;
Fig. 4 is a flowchart of a voice control method according to another embodiment of the present invention;
Fig. 5 is a flowchart of a voice control method according to another embodiment of the present invention;
Fig. 6a is a schematic illustration of an interface provided by an embodiment of the present invention;
Fig. 6b is a schematic illustration of another interface provided by an embodiment of the present invention;
Fig. 6c is a schematic diagram of a voice input provided by an embodiment of the present invention;
Fig. 6d is a schematic diagram of an interface of a playback page according to an embodiment of the present invention;
Fig. 7 is an interface schematic diagram of a vehicle-mounted central control device according to an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of a voice control apparatus according to an embodiment of the present invention;
Fig. 9 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
The embodiment of the invention provides a voice control scheme which can quickly locate the corresponding component according to a voice control instruction and trigger its bound response event, thereby realizing a fast voice control function. The voice control scheme can be applied to vehicle-mounted central control devices and other devices, where the other devices may include, but are not limited to: smart phones, tablet computers, laptop computers, desktop computers, and the like. The vehicle-mounted central control device can connect to a server and communicate with it interactively. The voice control scheme can be executed in the corresponding applet of the vehicle-mounted central control device according to the actual service requirement. For example, an applet may run in the vehicle-mounted central control device to respond to the user's voice commands.
The following describes the voice control scheme of an embodiment of the present invention, taking as an example its application to the communication system shown in fig. 1, where a vehicle-mounted applet is invoked to execute the scheme. As shown in fig. 1, after detecting the user's voice input, the vehicle-mounted central control device can match the voice control instruction against the content of a component in the current page, so that the interactive operation of that component can be responded to quickly; the content of the current page is obtained through interactive communication between the vehicle-mounted central control device and the server.
Based on the above description, the embodiment of the invention provides a voice control method, which is applied to an applet engine of an applet operation platform operated by a vehicle-mounted central control device, wherein the applet engine is provided with a video user interface operation module, and the applet operation platform is operated with a first applet. Referring to fig. 2a, the voice control method may include the following steps S201 to S204:
S201, when a first voice control instruction for a first interface displayed by the first applet is detected, determining the voice content of the first voice control instruction, where the first interface comprises at least one component.
The voice content may include interface content displayed on the current first interface or content not displayed on the current first interface, which is not limited herein.
When a first voice control instruction for the first interface is detected, the voice content may be acquired on a first interface that has been preloaded, or on a first interface that has not been preloaded.
S202, inquiring a wake-up word set of the first interface according to the voice content, and determining a target wake-up word corresponding to the voice content, wherein the wake-up word set comprises a first corresponding relation between a reference component and wake-up words, the reference component is a component with response events bound in at least one component, and the reference component only supports touch operation functions in original control logic of the first interface.
If, for example, the first interface currently displayed by the vehicle-mounted central control device is as shown in fig. 2b, the corresponding wake-up word set in fig. 2b is: {moon, legend, lightning, legend, cloud, fly, me heart, fly, heart, corresponding, today, very fun, me, just laugh, sun, fly}. The first correspondence between the reference components and the wake-up words may then be: moon legend - "moon legend"; lightning legend - "lightning legend"; cloud fluttering - "cloud fluttering"; the heart flies - "the heart flies"; heart corresponding - "heart corresponding"; very fun today - "very fun today"; I just laugh - "I just laugh"; sun fly - "sun fly".
The original control logic refers to the initial control policy of the first interface as pushed by the server, or the initial control policy of the original APP to which the first interface belongs. When the first interface is hosted by the applet platform, the applet platform gives the components in the first interface voice control capability. The touch operation function is used to receive a user's touch operation instruction, determine the corresponding component according to the touch operation instruction, and execute the response event corresponding to that component.
For example, as shown in fig. 2b above, the vehicle-mounted central control device detects a touch instruction for "moon legend" and enters the display interface of "moon legend" according to the touch instruction.
The first correspondence may be one-to-one, one-to-many, or many-to-many, which is not limited herein.
Because the components bound to response events have already been collected, the corresponding response event can be executed rapidly without traversing the components again.
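As a minimal sketch of the data involved in step S202 (the patent does not specify concrete structures, so every name and type below is a hypothetical illustration), the first correspondence could be held as a map keyed by wake-up word:

```typescript
// Hypothetical sketch only: the patent does not prescribe these structures.
interface ReferenceComponent {
  id: string;         // component identifier
  text: string;       // description/text content shown in the component
  eventName: string;  // name of the bound response event, e.g. "tap"
}

// First correspondence: a single reference component may own one or more
// wake-up words, so the map is keyed by wake-up word.
const wakeWordSet = new Map<string, ReferenceComponent>();

function registerReferenceComponent(component: ReferenceComponent,
                                    wakeWords: string[]): void {
  for (const word of wakeWords) {
    wakeWordSet.set(word, component);
  }
}

// S202: query the wake-up word set with the recognized voice content.
function findTargetWakeWord(voiceContent: string): string | undefined {
  return wakeWordSet.has(voiceContent) ? voiceContent : undefined;
}
```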
S203, inquiring a response event set of the first interface according to the target wake-up word, and determining a target response event corresponding to the target wake-up word, wherein the response event set is generated by calling the video user interface operation module, and comprises a second corresponding relation between the wake-up word of the component and the response event.
S204, executing the target response event.
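Putting S201-S204 together, a hedged end-to-end sketch might look like the following; it reuses findTargetWakeWord and the types from the sketch above, and assumes the voice content has already been recognized by the preset voice model:

```typescript
// Second correspondence (S203): wake-up word of a component -> response event.
interface ResponseEvent {
  type: string;          // event type, e.g. "tap"
  handler: () => void;   // response function bound to the component
}

const responseEventSet = new Map<string, ResponseEvent>();

// S201-S204: from detected voice content to executed response event.
function onVoiceControlInstruction(voiceContent: string): boolean {
  const targetWakeWord = findTargetWakeWord(voiceContent);   // S202
  if (targetWakeWord === undefined) return false;
  const targetEvent = responseEventSet.get(targetWakeWord);  // S203
  if (targetEvent === undefined) return false;
  targetEvent.handler();                                     // S204
  return true;
}
```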
In the embodiment of the invention, when a first voice control instruction for a first interface displayed by the first applet is detected, the voice content of the first voice control instruction is determined, the first interface including at least one component. Next, the wake-up word set of the first interface is queried according to the voice content, and the target wake-up word corresponding to the voice content is determined; the wake-up word set includes a first correspondence between reference components and wake-up words, where a reference component is a component of the at least one component to which a response event is bound, and the reference component only supports a touch operation function in the original control logic of the first interface. Then, the response event set of the first interface is queried according to the target wake-up word, and the target response event corresponding to the target wake-up word is determined; the response event set is generated by calling the video user interface operation module and includes a second correspondence between component wake-up words and response events. Finally, the target response event is executed. In this way, the wake-up word can be located quickly from the voice control instruction, the component corresponding to the wake-up word queried, and the bound response event triggered, thereby realizing a fast voice control function, reducing the time consumed by voice recognition, keeping the occupied performance resources low, and improving the efficiency and intelligence of the voice control method.
Fig. 3a is a software architecture diagram according to an embodiment of the present invention. The above method can be implemented by the software architecture shown in fig. 3a, which mainly involves three modules, namely: a video user interface runtime module (VUI Runtime), a voice processing module (SKILL HANDLER), and a voice wake-up-free module. By integrating the three modules in the applet framework, voice control of visible content can be realized for any applet. The modules involved in the voice control scheme are described as follows:
VUI Runtime module: responsible for maintaining and updating the active components in the applet's current visible area;
Voice wake-up-free module: collects the latest wake-up words and stores them in a list; it acoustically matches the input voice against that list and outputs the wake-up word matched by the input voice signal;
SKILL HANDLER module: decides from the node list the event matching the wake-up word, takes out the event type and the response function, issues the corresponding event, and finally triggers the operation bound to the component.
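As a hedged illustration of how the three modules might be wired together (none of these interfaces or signatures appear in the patent; they are assumptions for readability):

```typescript
// Hypothetical interfaces for the three modules of fig. 3a.
interface VuiNode {
  text: string;       // text content in the component
  eventName: string;  // name of the event that is bound
  rect: { x: number; y: number; width: number; height: number }; // display position/size
}

interface VuiRuntime {
  // Maintain and update active components in the current visible area.
  collectVisibleNodes(): VuiNode[];
}

interface VoiceWakeFreeModule {
  updateWakeWords(words: string[]): void;           // latest wake-up word list
  match(signal: Float32Array): string | undefined;  // acoustic matching
}

interface SkillHandler {
  syncNodeList(nodes: VuiNode[]): void;  // VUI NodeList sync
  dispatch(wakeWord: string): void;      // issue the matched event
}
```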
The response flow corresponding to this software architecture is shown in fig. 3b, which is a response flow chart provided by an embodiment of the present invention. The specific flow through the three modules is as follows:
When the applet starts, the VUI Runtime module searches the visible area of the page, binds the components that have response events, and generates a voice operation node list (VUI NodeList); it also creates listeners such as a page change listener (DOM observer) and a page scroll listener (scroll observer), so that whenever the page content changes or the user scrolls the page, the components in the current visible area are searched again and the voice operation node list is updated. The text content of the components in the node list is segmented into a number of wake-up words, which are passed to the voice wake-up-free module, while the voice operation node list is synchronized to the SKILL HANDLER module.
The VUI NodeList stores, for the current visible area of the page, a list of information on all nodes (components) that support interactive operation, including: the text content in the component, the name of the bound event, the position/size at which the component is displayed, and so on.
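A sketch of how the VUI Runtime could rebuild the node list on page change or scroll, assuming a DOM-like environment (MutationObserver, scroll events) and reusing the interfaces from the previous sketch; segment() is a stand-in for a real word-segmentation routine, which the patent does not name:

```typescript
// Hypothetical: re-search the visible area and refresh both downstream
// modules whenever the page content changes or the user scrolls.
function watchVisibleArea(runtime: VuiRuntime,
                          wakeFree: VoiceWakeFreeModule,
                          skillHandler: SkillHandler): void {
  const refresh = (): void => {
    const nodes = runtime.collectVisibleNodes();            // current VUI NodeList
    const wakeWords = nodes.flatMap(n => segment(n.text));  // word segmentation
    wakeFree.updateWakeWords(wakeWords);                    // to wake-up-free module
    skillHandler.syncNodeList(nodes);                       // to SKILL HANDLER
  };
  new MutationObserver(refresh).observe(document.body,
    { childList: true, subtree: true, characterData: true });
  window.addEventListener('scroll', refresh);
  refresh();
}

// Placeholder segmenter; a real applet would use a proper Chinese
// word-segmentation library.
function segment(text: string): string[] {
  return text.split(/\s+/).filter(w => w.length > 0);
}
```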
The voice wake-up-free module receives the wake-up words passed in by the VUI Runtime module and stores them as a list; when user voice input is detected, it acoustically matches the input voice and outputs the target wake-up word matched by the input voice signal to the SKILL HANDLER module.
The wake-up word is obtained by wake-up technology, which judges whether to wake up directly by matching the signal against an acoustic model; it is an offline algorithm and is faster than full voice recognition.
The SKILL HANDLER module receives the voice operation node list from the VUI Runtime module and the target wake-up word from the voice wake-up-free module; it then decides from the VUI NodeList the target voice node matching the wake-up word, takes out the event type and the response function, issues the corresponding event, and finally triggers the operation bound to the component.
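A minimal SKILL HANDLER sketch under the same assumptions; the matching policy (substring match on node text) and the event bus are illustrative guesses, not the patent's specification:

```typescript
// Assumed event facility, not defined by the patent.
declare const eventBus: { emit(eventName: string, payload: unknown): void };

class SimpleSkillHandler implements SkillHandler {
  private nodeList: VuiNode[] = [];

  syncNodeList(nodes: VuiNode[]): void {
    this.nodeList = nodes;
  }

  // Decide the target voice node matching the wake-up word, take out the
  // bound event name, and issue the corresponding event.
  dispatch(wakeWord: string): void {
    const target = this.nodeList.find(n => n.text.includes(wakeWord));
    if (target !== undefined) {
      eventBus.emit(target.eventName, target);
    }
  }
}
```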
Fig. 4 is a flow chart of another voice control method according to an embodiment of the invention. The voice control method is applied to an applet engine of an applet operation platform run by the vehicle-mounted central control device, where the applet engine is provided with a video user interface operation module and a first applet runs on the applet operation platform. As shown in fig. 4, the voice control method may include the following steps S401 to S409:
S401, acquiring at least one component of the first interface through the video user interface operation module during loading and rendering of the first interface by the first applet.
S402, traversing the at least one component.
S403, determining one or more reference components with response events respectively bound in the at least one component.
S404, generating at least one reference component according to the one or more components.
Accordingly, step S404 may include the following step S11:
And s11, determining that each component in the one or more components is a reference component, and obtaining at least one reference component.
In this example, every component is taken as a reference component, so that the reference components are obtained comprehensively, avoiding lost or omitted components and improving the accuracy of the voice control method.
Accordingly, step S404 may include the following steps S21-S23:
s21, determining a usage record for each of the one or more components.
S22, determining the number of times of use of each component according to the use record of each component.
And s23, determining the components with the use times larger than the preset use times in the one or more components as reference components, and obtaining at least one reference component.
The preset number of uses can be set by the manufacturer at the factory or by the user, and is not limited herein.
Optionally, when determining the usage record of each of the one or more components, the usage record may be collected periodically over a usage window; the window may be set to a preset time period, such as a week or a month, which is not limited herein.
Optionally, the at least one component obtained by screening on the number of uses may be marked as a frequently used component, i.e. loaded together with the interface at the next startup after the next initialization.
In this example, the high-frequency components are determined from the usage record of each component; that is, the components the user is most likely to wake are obtained according to the user's preferences, which improves the intelligence of reference component generation.
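Steps s21-s23 could be sketched as a simple filter over usage records (the record store and the preset threshold are assumptions; ReferenceComponent is reused from the earlier sketch):

```typescript
// Hypothetical usage-count screening for steps s21-s23.
interface UsageRecord {
  componentId: string;
  useTimestamps: number[];  // one entry per use within the observation window
}

const PRESET_USE_COUNT = 5; // assumed value; set by manufacturer or user

function selectFrequentComponents(components: ReferenceComponent[],
                                  records: Map<string, UsageRecord>): ReferenceComponent[] {
  return components.filter(component => {
    const record = records.get(component.id);            // s21: usage record
    const useCount = record?.useTimestamps.length ?? 0;  // s22: number of uses
    return useCount > PRESET_USE_COUNT;                  // s23: keep high-frequency components
  });
}
```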
S405, generating a wake-up word subset of each reference component according to the description information of each reference component in the at least one reference component, and obtaining the wake-up word set.
Accordingly, step S405 may include the following steps s31-s33:
s31, identifying the description information of each component in the reference components.
S32, performing word segmentation processing on the description information to obtain a wake-up word subset corresponding to each component.
S33, establishing a corresponding relation between the wake-up word subset corresponding to each component and the corresponding component to obtain the wake-up word set.
The description information may be the text content, the name of the bound event, the position and size at which the component is displayed, and the like, and is not limited herein.
The word segmentation may be performed according to morpheme composition, such as reduplication, affixation and compounding; or according to segmentation units, dictionary entries and grammatical words; it is not limited herein.
In this example, the components are traversed in advance during loading and rendering of the first interface, which helps improve the speed and accuracy of the voice control method.
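Steps s31-s33 might then look like the following sketch, reusing ReferenceComponent and the placeholder segment() from earlier; the real segmentation rules (morpheme composition, dictionary entries, and so on) are left abstract:

```typescript
// Hypothetical wake-up word set generation (s31-s33): identify each
// reference component's description, segment it into a wake-up word subset,
// and map every wake-up word back to its component.
function buildWakeWordSet(referenceComponents: ReferenceComponent[]): Map<string, ReferenceComponent> {
  const set = new Map<string, ReferenceComponent>();
  for (const component of referenceComponents) {
    const description = component.text;   // s31: description information
    const subset = segment(description);  // s32: word segmentation
    for (const word of subset) {
      set.set(word, component);           // s33: wake-up word -> component
    }
  }
  return set;
}
```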
S406, when a first voice control instruction of a first interface displayed for the first applet is detected, determining voice content of the first voice control instruction, wherein the first interface comprises at least one component.
S407, inquiring a wake-up word set of the first interface according to the voice content, and determining a target wake-up word corresponding to the voice content, wherein the wake-up word set comprises a first corresponding relation between a reference component and wake-up words, the reference component is a component with a response event bound in at least one component, and the reference component only supports a touch operation function in original control logic of the first interface.
S408, inquiring a response event set of the first interface according to the target wake-up word, and determining a target response event corresponding to the target wake-up word, wherein the response event set is generated by calling the video user interface operation module, and comprises a second corresponding relation between the wake-up word of the component and the response event.
S409, executing the target response event.
In the embodiment of the invention, when a first voice control instruction for a first interface displayed by the first applet is detected, the voice content of the first voice control instruction is determined, the first interface including at least one component. Next, the wake-up word set of the first interface is queried according to the voice content, and the target wake-up word corresponding to the voice content is determined; the wake-up word set includes a first correspondence between reference components and wake-up words, where a reference component is a component of the at least one component to which a response event is bound, and the reference component only supports a touch operation function in the original control logic of the first interface. Then, the response event set of the first interface is queried according to the target wake-up word, and the target response event corresponding to the target wake-up word is determined; the response event set is generated by calling the video user interface operation module and includes a second correspondence between component wake-up words and response events. Finally, the target response event is executed. In this way, the wake-up word can be located quickly from the voice control instruction, the component corresponding to the wake-up word queried, and the bound response event triggered, thereby realizing a fast voice control function, reducing the time consumed by voice recognition, keeping the occupied performance resources low, and improving the efficiency and intelligence of the voice control method.
In one embodiment, the at least one component includes at least one foreground component and/or at least one background component, the foreground component being a component displayed at the first interface and the background component being a component not displayed at the first interface.
The background component may be a general-purpose component; for example, the general-purpose component may be "previous page", "next page", "back", "return" or "close applet", which are not limited herein.
In this example, a default wake-up word can be quickly located according to the voice control instruction, the general-purpose component corresponding to the default wake-up word queried, and the bound response event triggered, thereby realizing a fast voice control function, reducing the time consumed by voice recognition, keeping the occupied performance resources low, and improving the efficiency and intelligence of the voice control method.
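For illustration (hypothetical names and APIs), the general-purpose background components could be pre-registered with default wake-up words alongside the per-page ones:

```typescript
// Assumed platform navigation APIs, not defined by the patent.
declare function navigateBack(): void;
declare function turnPage(delta: number): void;
declare function closeApplet(): void;

// Hypothetical default wake-up words for background (not displayed) components.
const DEFAULT_WAKE_WORDS: Record<string, () => void> = {
  'previous page': () => turnPage(-1),
  'next page':     () => turnPage(+1),
  'back':          () => navigateBack(),
  'return':        () => navigateBack(),
  'close applet':  () => closeApplet(),
};
```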
In one embodiment, a single reference component in the first correspondence corresponds to one or more wake words.
Optionally, the text content of each of the reference components is identified and segmented to obtain the wake-up words corresponding to each component; a correspondence between each component's wake-up words and the component is then established to obtain the first correspondence.
In this example, owing to the correspondence between reference components and wake-up words, the corresponding event can be executed accurately once the target wake-up word is obtained, which improves the timeliness and accuracy of the voice control method.
In one embodiment, the determining the voice content of the first voice control instruction includes: acquiring a preset voice model; and determining the voice content of the first voice control instruction according to the preset voice model.
The preset voice model usually consists of an acoustic model and a language model, which compute the speech-to-syllable probability and the syllable-to-word probability, respectively.
In this example, the voice content input by the user is obtained accurately through the preset voice model, avoiding misrecognition of the voice content and the erroneous operations that would follow, thereby improving the accuracy of the voice control method.
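As a toy illustration of the acoustic-plus-language decomposition described above (deliberately simplified; real models score whole candidate lattices rather than single best candidates):

```typescript
// Hypothetical two-stage preset voice model: the acoustic model maps speech
// to syllable-sequence probabilities, the language model maps syllables to
// word probabilities. Assumes both candidate lists are non-empty.
interface PresetVoiceModel {
  acoustic(signal: Float32Array): Array<{ syllables: string; p: number }>;
  language(syllables: string): Array<{ word: string; p: number }>;
}

function recognizeVoiceContent(model: PresetVoiceModel,
                               signal: Float32Array): string {
  const bestSyllables = model.acoustic(signal)
    .reduce((a, b) => (b.p > a.p ? b : a));   // most probable syllable sequence
  const bestWord = model.language(bestSyllables.syllables)
    .reduce((a, b) => (b.p > a.p ? b : a));   // most probable word
  return bestWord.word;
}
```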
The following describes the voice control method in detail by taking an applet, which is used in the vehicle-mounted central control device, as an example.
The applets here may include, but are not limited to: game applets, video applets, audio applets and the like, where a video applet refers to one through which video is played on the vehicle-mounted central control device, and an audio applet refers to one through which audio is played on the vehicle-mounted central control device.
In the embodiment of the invention, the interactive components in the applet page are identified automatically; when the user inputs text content by voice, if that content matches the content of a component in the current page, the interactive operation of the component can be responded to quickly. The specific flow is shown in fig. 5:
When the vehicle-mounted central control device detects an opening operation for the first applet, it establishes a network communication connection with the server. While loading and rendering the current interface of the first applet, it obtains the interface content from the server, identifies the current first applet interface, traverses all components and extracts the text content in each component; it then obtains the voice input by the user, finds the component matching the voice text, and jumps to the interface of that component.
Fig. 6a is a schematic view of the interface while it is being rendered;
Fig. 6b is a schematic diagram of the interface after loading is complete. As shown in fig. 6b, the components in the current interface are {moon legend, lightning legend, cloud floating, o champion, i love to read books, o sea, i just laugh, sun fly, close, return, next page, play}; the current wake-up words are then obtained as {moon, legend, lightning, cloud, flutter, o, champion, me, love, reading, laugh, sunlight, fly, sea, close, return, next page, play}. In the current interface, "search" displays the applet's internal search box and switches it to the input state, and "view more" jumps to the hot movie list page;
When the user's voice is obtained, the wake-up word matching the voice text is found. As shown in fig. 6c, the voice input by the user is obtained as "champion"; the corresponding wake-up word is found to be "champion" and the corresponding component is "champion to dash over", i.e. that component is selected and the interface jumps to the play page of "champion to dash over", as shown in fig. 6d, which is an interface schematic diagram of the play page.
If the user's voice is obtained as "close the current interface", the corresponding wake-up word is found to be "close", which is a default wake-up word, and the interface jumps to the interface shown in fig. 7, which may be an interface diagram of the vehicle-mounted central control device after the applet is closed.
In the embodiment of the invention, when a first voice control instruction for a first interface displayed by the first applet is detected, the voice content of the first voice control instruction is determined, the first interface including at least one component. Next, the wake-up word set of the first interface is queried according to the voice content, and the target wake-up word corresponding to the voice content is determined; the wake-up word set includes a first correspondence between reference components and wake-up words, where a reference component is a component of the at least one component to which a response event is bound, and the reference component only supports a touch operation function in the original control logic of the first interface. Then, the response event set of the first interface is queried according to the target wake-up word, and the target response event corresponding to the target wake-up word is determined; the response event set is generated by calling the video user interface operation module and includes a second correspondence between component wake-up words and response events. Finally, the target response event is executed. In this way, the wake-up word can be located quickly from the voice control instruction, the component corresponding to the wake-up word queried, and the bound response event triggered, thereby realizing a fast voice control function, reducing the time consumed by voice recognition, keeping the occupied performance resources low, and improving the efficiency and intelligence of the voice control method.
Based on the above description of the embodiments of the voice control method, the embodiments of the present invention also disclose a voice control apparatus, which may be a computer program (including program code) running in a terminal. The voice control apparatus may perform the method shown in fig. 2a or fig. 4. Referring to fig. 8, the voice control apparatus may operate the following units:
A determining unit 101, configured to determine, when a first voice control instruction for a first interface displayed by the first applet is detected, a voice content of the first voice control instruction, the first interface including at least one component;
The determining unit 101 is further configured to query, according to the voice content, a wake word set of the first interface, determine a target wake word corresponding to the voice content, where the wake word set includes a first correspondence between a reference component and wake words, the reference component is a component in which a response event is bound in the at least one component, and the reference component only supports a touch operation function in original control logic of the first interface;
The determining unit 101 is further configured to query, according to the target wake word, a response event set of the first interface, where the response event set is generated by calling the video user interface operation module, and the response event set includes a second correspondence between a wake word of a component and a response event;
and the execution unit 102 is used for executing the target response event.
In one embodiment, the at least one component includes at least one foreground component and/or at least one background component, the foreground component being a component displayed at the first interface and the background component being a component not displayed at the first interface.
In yet another embodiment, a single reference component in the first correspondence corresponds to one or more wake words.
In still another embodiment, the determining unit 101 is specifically configured to, when determining the voice content of the first voice control instruction: acquiring a preset voice model; and determining the voice content of the first voice control instruction according to the preset voice model.
In yet another embodiment, the voice control device further includes a generating unit 103, specifically configured to, when the first voice control instruction for the first interface displayed by the first applet is detected,: the following operations are executed through the video user interface running module in the process of loading and rendering the first interface by the first applet: acquiring at least one component of the first interface; traversing the at least one component; determining one or more reference components of the at least one component to which the responsive event is respectively bound; generating at least one reference component from the one or more components; generating a wake-up word subset of each reference component according to the description information of each reference component in the at least one reference component, and obtaining the wake-up word set.
In yet another embodiment, the generating unit 103 is specifically configured to, when generating the at least one reference component according to the one or more components: and determining that each component in the one or more components is a reference component, and obtaining at least one reference component.
In yet another embodiment, the generating unit 103 is specifically configured to, when generating the at least one reference component according to the one or more components: determining a usage record for each of the one or more components; determining the number of times of use of each component according to the use record of each component; and determining the components with the use times larger than the preset use times in the one or more components as reference components to obtain at least one reference component.
In yet another embodiment, the execution unit 102 is specifically configured to, when the wake word subset of each reference component is generated according to the description information of each reference component in the at least one reference component, obtain the wake word set: identifying descriptive information for each of the reference components; performing word segmentation processing on the description information to obtain a wake-up word subset corresponding to each component; and establishing a corresponding relation between the wake-up word subset corresponding to each component and the corresponding component to obtain the wake-up word set.
According to one embodiment of the invention, the steps involved in the methods of fig. 2a and fig. 4 may be performed by the units of the voice control apparatus of fig. 8. For example, steps S201, S202 and S203 shown in fig. 2a may be performed by the determining unit 101 shown in fig. 8, and step S204 by the executing unit 102 shown in fig. 8; as another example, steps S401 to S405 shown in fig. 4 may be performed by the generating unit 103 shown in fig. 8, steps S406 to S408 by the determining unit 101 shown in fig. 8, and step S409 by the executing unit 102 shown in fig. 8.
According to another embodiment of the present invention, the units of the voice control apparatus shown in fig. 8 may be separately or wholly combined into one or several other units, or one (or more) of them may be further split into several functionally smaller units, which can achieve the same operation without affecting the technical effects of the embodiments of the present invention. The above units are divided on the basis of logical function; in practical applications, the function of one unit may be implemented by several units, or the functions of several units may be implemented by one unit. In other embodiments of the present invention, the voice control apparatus may likewise include other units, and in practical applications these functions may also be implemented with the assistance of other units and through the cooperation of several units.
According to another embodiment of the present invention, a voice control apparatus as shown in fig. 8 may be constructed, and the voice control method of the embodiments of the present invention implemented, by running a computer program (including program code) capable of executing the steps of the methods shown in fig. 2a or fig. 4 on a general-purpose computing device, such as a computer, comprising processing elements such as a central processing unit (CPU), a random access storage medium (RAM) and a read-only storage medium (ROM), as well as storage elements. The computer program may be recorded on, for example, a computer-readable recording medium, and loaded into and executed by the above computing device via the computer-readable recording medium.
In the embodiment of the invention, when a first voice control instruction for a first interface displayed by the first applet is detected, the voice content of the first voice control instruction is determined, the first interface including at least one component. Next, the wake-up word set of the first interface is queried according to the voice content, and the target wake-up word corresponding to the voice content is determined; the wake-up word set includes a first correspondence between reference components and wake-up words, where a reference component is a component of the at least one component to which a response event is bound, and the reference component only supports a touch operation function in the original control logic of the first interface. Then, the response event set of the first interface is queried according to the target wake-up word, and the target response event corresponding to the target wake-up word is determined; the response event set is generated by calling the video user interface operation module and includes a second correspondence between component wake-up words and response events. Finally, the target response event is executed. In this way, the wake-up word can be located quickly from the voice control instruction, the component corresponding to the wake-up word queried, and the bound response event triggered, thereby realizing a fast voice control function, reducing the time consumed by voice recognition, keeping the occupied performance resources low, and improving the efficiency and intelligence of the voice control method.
Based on the description of the method embodiments and the apparatus embodiments, an embodiment of the invention further provides a terminal. Referring to fig. 9, the terminal includes at least a processor 201, an input device 202, an output device 203, and a computer storage medium 204, which may be connected within the terminal by a bus or by other means.
The computer storage medium 204 may be stored in a memory of the terminal and is used for storing a computer program comprising program instructions, the processor 201 being used for executing the program instructions stored in the computer storage medium 204. The processor 201 (or CPU, Central Processing Unit) is the computing and control core of the terminal and is adapted to implement one or more instructions, in particular to load and execute one or more instructions so as to implement the corresponding method flow or function. In one embodiment, the processor 201 according to the embodiment of the present invention may be configured to perform a series of voice control processing, including: determining, when a first voice control instruction for a first interface displayed by the first applet is detected, the voice content of the first voice control instruction, the first interface comprising at least one component; querying a wake-up word set of the first interface according to the voice content and determining a target wake-up word corresponding to the voice content, the wake-up word set comprising a first correspondence between reference components and wake-up words, where a reference component is a component of the at least one component to which a response event is bound and which supports only a touch operation function in the original control logic of the first interface; querying a response event set of the first interface according to the target wake-up word and determining a target response event corresponding to the target wake-up word, the response event set being generated by calling the video user interface operation module and comprising a second correspondence between the wake-up words of the components and the response events; executing the target response event; and so on.
An embodiment of the invention further provides a computer storage medium (memory), which is a memory device in the terminal used for storing programs and data. It will be appreciated that the computer storage medium here may include both a storage medium built into the terminal and an extended storage medium supported by the terminal. The computer storage medium provides storage space that stores the operating system of the terminal. One or more instructions adapted to be loaded and executed by the processor 201, which may be one or more computer programs (including program code), are also stored in this storage space. The computer storage medium here may be a high-speed RAM memory or a non-volatile memory, such as at least one magnetic disk memory; optionally, it may also be at least one computer storage medium located remotely from the aforementioned processor.
In one embodiment, one or more instructions stored in the computer storage medium may be loaded and executed by the processor 201 to implement the corresponding steps of the methods described in the embodiments above; in a particular implementation, the one or more instructions in the computer storage medium are loaded by the processor 201 to perform the following steps:
determining, when a first voice control instruction for a first interface displayed by the first applet is detected, the voice content of the first voice control instruction, the first interface comprising at least one component;
querying a wake-up word set of the first interface according to the voice content, and determining a target wake-up word corresponding to the voice content, wherein the wake-up word set comprises a first correspondence between reference components and wake-up words, a reference component being a component of the at least one component to which a response event is bound, the reference component supporting only a touch operation function in the original control logic of the first interface;
querying a response event set of the first interface according to the target wake-up word, and determining a target response event corresponding to the target wake-up word, wherein the response event set is generated by calling the video user interface operation module and comprises a second correspondence between the wake-up words of the components and the response events;
and executing the target response event.
In one embodiment, the at least one component includes at least one foreground component and/or at least one background component, a foreground component being a component displayed on the first interface and a background component being a component not displayed on the first interface.
In yet another embodiment, a single reference component in the first correspondence corresponds to one or more wake-up words.
In yet another embodiment, when determining the voice content of the first voice control instruction, the one or more instructions may further be loaded and executed by the processor 201 specifically to: acquire a preset voice model; and determine the voice content of the first voice control instruction according to the preset voice model.
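As a minimal sketch, assuming the applet engine exposes a preset on-device speech model, the recognition step could look as follows; `PresetVoiceModel` and `loadPresetVoiceModel` are hypothetical names, not part of the patent.

```typescript
// Illustrative only: the model interface and loader below are assumed names.
interface PresetVoiceModel {
  transcribe(audio: ArrayBuffer): string; // maps raw audio to recognized text
}

// Assumed to be provided by the applet engine's runtime environment.
declare function loadPresetVoiceModel(): PresetVoiceModel;

function determineVoiceContent(instructionAudio: ArrayBuffer): string {
  const model = loadPresetVoiceModel();      // acquire the preset voice model
  return model.transcribe(instructionAudio); // determine the instruction's voice content
}
```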
In yet another embodiment, before the first voice control instruction for the first interface displayed by the first applet is detected, the one or more instructions may further be loaded and executed by the processor 201 to perform the following operations through the video user interface operation module while the first applet loads and renders the first interface: acquiring the at least one component of the first interface; traversing the at least one component; determining one or more components of the at least one component to which a response event is respectively bound; generating at least one reference component from the one or more components; and generating a wake-up word subset for each reference component of the at least one reference component according to the description information of that reference component, obtaining the wake-up word set.
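A minimal sketch of this traversal pass, under the assumption that the interface exposes a flat component list; the `Component` shape and all names are illustrative, not taken from the patent:

```typescript
// Hypothetical component shape; the patent does not prescribe one.
type ResponseEvent = () => void;

interface Component {
  id: string;
  description: string;           // description information, e.g. a button's label text
  responseEvent?: ResponseEvent; // response event bound by the interface, if any
}

// Traverse the at least one component of the first interface and keep the
// one or more components to which a response event is bound; these are the
// candidates from which the reference components are generated.
function collectBoundComponents(components: Component[]): Component[] {
  return components.filter((c) => c.responseEvent !== undefined);
}
```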
In yet another embodiment, when generating the at least one reference component from the one or more components, the one or more instructions may further be loaded and executed by the processor 201 specifically to: determine that each component of the one or more components is a reference component, obtaining the at least one reference component.
In yet another embodiment, when generating the at least one reference component from the one or more components, the one or more instructions may further be loaded and executed by the processor 201 specifically to: determine a usage record for each of the one or more components; determine the number of times each component has been used according to its usage record; and determine those of the one or more components used more than a preset number of times to be reference components, obtaining the at least one reference component.
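Under the assumption that a usage record reduces to a per-component use counter, this filtering step could be sketched as follows; the record shape and names are illustrative:

```typescript
// Minimal component shape for this sketch.
interface Component {
  id: string;
}

// Select as reference components only those candidates whose number of times
// of use exceeds the preset use count.
function selectReferenceComponents(
  candidates: Component[],          // components with a bound response event
  usageCounts: Map<string, number>, // component id -> number of times used
  presetUseCount: number            // the preset threshold
): Component[] {
  return candidates.filter(
    (c) => (usageCounts.get(c.id) ?? 0) > presetUseCount
  );
}
```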
In yet another embodiment, when generating the wake-up word subset of each reference component according to the description information of each reference component of the at least one reference component to obtain the wake-up word set, the one or more instructions may further be loaded and executed by the processor 201 specifically to: identify the description information of each reference component; perform word segmentation on the description information to obtain the wake-up word subset corresponding to each reference component; and establish a correspondence between the wake-up word subset corresponding to each reference component and that component, obtaining the wake-up word set.
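For illustration, and assuming a word-segmentation helper `segment` (for example a Chinese tokenizer) is available, the subset construction could be sketched as:

```typescript
// Build the wake-up word set from the reference components' descriptions.
// `segment` is an assumed tokenizer; the Map-based shape is illustrative.
interface ReferenceComponent {
  id: string;
  description: string; // description information of the reference component
}

function buildWakeWordSet(
  referenceComponents: ReferenceComponent[],
  segment: (text: string) => string[]
): Map<string, string> {
  const wakeWordSet = new Map<string, string>(); // wake-up word -> component id
  for (const component of referenceComponents) {
    // The wake-up word subset of this component is its segmented description.
    for (const word of segment(component.description)) {
      wakeWordSet.set(word, component.id);
    }
  }
  return wakeWordSet;
}
```

For a button described as "play music", `segment` might yield ["play", "music"], so either word would locate the button and trigger its bound response event.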
In the embodiment of the present invention, when a first voice control instruction for a first interface displayed by the first applet is detected, the voice content of the first voice control instruction is determined, the first interface comprising at least one component. Secondly, a wake-up word set of the first interface is queried according to the voice content to determine a target wake-up word corresponding to the voice content, the wake-up word set comprising a first correspondence between reference components and wake-up words, where a reference component is a component of the at least one component to which a response event is bound and which supports only a touch operation function in the original control logic of the first interface. Then, a response event set of the first interface is queried according to the target wake-up word to determine a target response event corresponding to the target wake-up word, the response event set being generated by calling the video user interface operation module and comprising a second correspondence between the wake-up words of the components and the response events. Finally, the target response event is executed. In this way, the wake-up word can be quickly located from the voice control instruction, the component corresponding to the wake-up word queried, and the bound response event triggered, realizing a fast voice control function that reduces the time consumed by voice recognition and occupies few performance resources, thereby improving the efficiency and intelligence of the voice control method.
The foregoing disclosure is illustrative of the present invention and is not to be construed as limiting the scope of the invention, which is defined by the appended claims.

Claims (11)

1. A voice control method, applied to an applet engine of an applet operating platform run by a vehicle-mounted central control device, the applet engine being provided with a video user interface operation module for maintaining and updating the active components in the current visible area of the applet operating platform, the applet operating platform running a first applet, the method comprising:
determining, when a first voice control instruction for a first interface displayed by the first applet is detected, the voice content of the first voice control instruction, wherein the first interface comprises at least one component, and the voice content comprises interface content displayed on the current first interface or content not displayed on the current first interface;
querying a wake-up word set of the first interface according to the voice content, and determining a target wake-up word corresponding to the voice content, wherein the wake-up word set comprises a first correspondence between reference components and wake-up words, a reference component is a component of the at least one component to which a response event is bound, and the reference component supports only a touch operation function in the original control logic of the first interface, the original control logic being the initial control policy of the first interface or the initial control policy of the original application program to which the first interface belongs;
querying a response event set of the first interface according to the target wake-up word, and determining a target response event corresponding to the target wake-up word, wherein the response event set is generated by calling the video user interface operation module and comprises a second correspondence between the wake-up words of the components and the response events; and
executing the target response event.
2. The method of claim 1, wherein the at least one component comprises at least one foreground component and/or at least one background component, a foreground component being a component displayed on the first interface and a background component being a component not displayed on the first interface.
3. The method of claim 2, wherein a single reference component in the first correspondence corresponds to one or more wake-up words.
4. The method according to any one of claims 1-3, wherein the determining the voice content of the first voice control instruction comprises:
acquiring a preset voice model; and
determining the voice content of the first voice control instruction according to the preset voice model.
5. The method of claim 4, wherein before the first voice control instruction for the first interface displayed by the first applet is detected, the method further comprises:
performing the following operations through the video user interface operation module while the first applet loads and renders the first interface:
acquiring the at least one component of the first interface;
traversing the at least one component;
determining one or more components of the at least one component to which a response event is respectively bound;
generating at least one reference component from the one or more components; and
generating a wake-up word subset for each reference component of the at least one reference component according to the description information of that reference component, to obtain the wake-up word set.
6. The method of claim 5, wherein the generating at least one reference component from the one or more components comprises:
determining that each component of the one or more components is a reference component, to obtain the at least one reference component.
7. The method of claim 5, wherein the generating at least one reference component from the one or more components comprises:
determining a usage record for each of the one or more components;
determining the number of times each component has been used according to the usage record of that component; and
determining those of the one or more components used more than a preset number of times to be reference components, to obtain the at least one reference component.
8. The method of claim 5, wherein the generating the wake-up word subset of each reference component according to the description information of each reference component of the at least one reference component, to obtain the wake-up word set, comprises:
identifying the description information of each reference component;
performing word segmentation on the description information to obtain the wake-up word subset corresponding to each reference component; and
establishing a correspondence between the wake-up word subset corresponding to each reference component and that component, to obtain the wake-up word set.
9. A voice control apparatus, applied to an applet engine of an applet operating platform run by a vehicle-mounted central control device, the applet engine being provided with a video user interface operation module for maintaining and updating the active components in the current visible area of the applet operating platform, the applet operating platform running a first applet, the apparatus comprising:
a determining unit configured to determine, when a first voice control instruction for a first interface displayed by the first applet is detected, the voice content of the first voice control instruction, the first interface comprising at least one component, and the voice content comprising interface content displayed on the current first interface or content not displayed on the current first interface;
the determining unit being further configured to query, according to the voice content, a wake-up word set of the first interface and determine a target wake-up word corresponding to the voice content, wherein the wake-up word set comprises a first correspondence between reference components and wake-up words, a reference component is a component of the at least one component to which a response event is bound, and the reference component supports only a touch operation function in the original control logic of the first interface, the original control logic being the initial control policy of the first interface or the initial control policy of the original application program to which the first interface belongs;
the determining unit being further configured to query a response event set of the first interface according to the target wake-up word and determine a target response event corresponding to the target wake-up word, wherein the response event set is generated by calling the video user interface operation module and comprises a second correspondence between the wake-up words of the components and the response events; and
an executing unit configured to execute the target response event.
10. A terminal, comprising an input device and an output device, and further comprising:
a processor adapted to implement one or more instructions; and
a computer storage medium storing one or more instructions adapted to be loaded by the processor to perform the voice control method according to any one of claims 1-8.
11. A computer storage medium storing one or more instructions adapted to be loaded by a processor to perform the voice control method according to any one of claims 1-8.
CN201910972320.6A 2019-10-12 2019-10-12 Voice control method, device, terminal and storage medium Active CN112652302B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910972320.6A CN112652302B (en) 2019-10-12 2019-10-12 Voice control method, device, terminal and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910972320.6A CN112652302B (en) 2019-10-12 2019-10-12 Voice control method, device, terminal and storage medium

Publications (2)

Publication Number Publication Date
CN112652302A CN112652302A (en) 2021-04-13
CN112652302B true CN112652302B (en) 2024-05-24

Family

ID=75343156

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910972320.6A Active CN112652302B (en) 2019-10-12 2019-10-12 Voice control method, device, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN112652302B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113760377A (en) * 2021-09-08 2021-12-07 广东优碧胜科技有限公司 Determination method for executing operation and electronic equipment
CN114512129A (en) * 2022-03-28 2022-05-17 北京小米移动软件有限公司 Voice wake-up data processing method and device, electronic equipment and storage medium
CN114639384B (en) * 2022-05-16 2022-08-23 腾讯科技(深圳)有限公司 Voice control method and device, computer equipment and computer storage medium
CN115527537A (en) * 2022-11-24 2022-12-27 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108270903A (en) * 2012-02-24 2018-07-10 三星电子株式会社 Pass through the method and apparatus of locking/unlocked state of speech recognition controlled terminal
CN107430502A (en) * 2015-01-30 2017-12-01 谷歌技术控股有限责任公司 The voice command for software application is inferred by help information dynamic
CN106775555A (en) * 2016-11-24 2017-05-31 歌尔科技有限公司 A kind of input control method of virtual reality device and virtual reality device
CN109584879A (en) * 2018-11-23 2019-04-05 华为技术有限公司 A kind of sound control method and electronic equipment
CN110211589A (en) * 2019-06-05 2019-09-06 广州小鹏汽车科技有限公司 Awakening method, device and vehicle, the machine readable media of onboard system

Also Published As

Publication number Publication date
CN112652302A (en) 2021-04-13

Similar Documents

Publication Publication Date Title
CN112652302B (en) Voice control method, device, terminal and storage medium
US11682380B2 (en) Systems and methods for crowdsourced actions and commands
US11520610B2 (en) Crowdsourced on-boarding of digital assistant operations
CN108877791B (en) Voice interaction method, device, server, terminal and medium based on view
CN109002510B (en) Dialogue processing method, device, equipment and medium
CN106098063B (en) Voice control method, terminal device and server
KR101770358B1 (en) Integration of embedded and network speech recognizers
US9400633B2 (en) Methods and apparatus for voiced-enabling a web application
CN109192212B (en) Voice control method and device
KR20140047633A (en) Speech recognition repair using contextual information
US9781262B2 (en) Methods and apparatus for voice-enabling a web application
CN107329750A (en) The recognition methods of advertisement page, jump method and mobile terminal in application program
US10157612B2 (en) Methods and apparatus for voice-enabling a web application
US20140040746A1 (en) Methods and apparatus for voiced-enabling a web application
CN110610701B (en) Voice interaction method, voice interaction prompting method, device and equipment
CN111341315B (en) Voice control method, device, computer equipment and storage medium
CN112286485B (en) Method and device for controlling application through voice, electronic equipment and storage medium
CN110727410A (en) Man-machine interaction method, terminal and computer readable storage medium
CN110874200A (en) Interaction method, device, storage medium and operating system
JP2023506087A (en) Voice Wakeup Method and Apparatus for Skills
CN115237301A (en) Method and device for processing bullet screen in interactive novel
JP2007509418A (en) System and method for personalizing handwriting recognition
CN110874176B (en) Interaction method, storage medium, operating system and device
CN111814492A (en) Translation method, terminal and computer storage medium
CN109522187B (en) Method and device for quickly extracting state information

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40041964

Country of ref document: HK

GR01 Patent grant
GR01 Patent grant