CN112652302A - Voice control method, device, terminal and storage medium

Voice control method, device, terminal and storage medium

Info

Publication number: CN112652302A
Application number: CN201910972320.6A
Authority: CN (China)
Prior art keywords: component, word, interface, wake, voice
Legal status: Pending
Other languages: Chinese (zh)
Inventor: 陈泽钦
Current Assignee: Tencent Technology Shenzhen Co Ltd
Original Assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910972320.6A
Publication of CN112652302A

Classifications

    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G06F16/953: Retrieval from the web; querying, e.g. by the use of web search engines
    • G06F16/954: Retrieval from the web; navigation, e.g. using categorised browsing
    • G06F3/0481: Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
    • G06F3/0488: Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. input of commands through traced gestures using a touch-screen or digitiser
    • G10L15/26: Speech to text systems
    • G10L2015/223: Execution procedure of a spoken command

Abstract

An embodiment of the invention discloses a voice control method, apparatus, terminal and computer storage medium, applied to an applet engine of an applet running platform running on a vehicle-mounted central control device. The method comprises the following steps: when a first voice control instruction for a first interface displayed by a first applet is detected, determining the voice content of the first voice control instruction; querying a wake-up word set of the first interface according to the voice content, and determining a target wake-up word corresponding to the voice content; querying a response event set of the first interface according to the target wake-up word, and determining a target response event corresponding to the target wake-up word, wherein the response event set is generated by calling a video user interface running module and comprises a second correspondence between the wake-up words of components and response events; and executing the target response event. According to the embodiment of the invention, the component corresponding to a voice control instruction can be located quickly and its bound response event triggered, realizing fast voice control.

Description

Voice control method, device, terminal and storage medium
Technical Field
The present invention relates to the field of speech technology, and in particular to speech processing, and more particularly to a voice control method, a voice control apparatus, a terminal and a computer storage medium.
Background
With the increasing popularity of applets, they can be applied in various APPs, vehicle-mounted systems and smart home systems. An applet page is composed of a large number of components (view, text, button and so on) with which users can interact through gesture operations such as clicking. The inventor has found in practice that existing voice control methods 1) make speech recognition time-consuming and 2) occupy considerable performance resources: from voice input to locating the target component and initiating a response often takes 2 to 3 seconds, which seriously degrades the interaction experience of voice control.
Disclosure of Invention
The embodiment of the invention provides a voice control method, apparatus, terminal and computer storage medium that can quickly locate the component corresponding to a voice control instruction and trigger the response event bound to that component, thereby realizing fast voice control.
In one aspect, an embodiment of the present invention provides a voice control method applied to an applet engine of an applet running platform run by a vehicle-mounted central control device, where the applet engine is provided with a video user interface running module and a first applet runs on the applet running platform. The voice control method includes:
when a first voice control instruction for a first interface displayed by the first applet is detected, determining the voice content of the first voice control instruction, wherein the first interface comprises at least one component;
querying a wake-up word set of the first interface according to the voice content, and determining a target wake-up word corresponding to the voice content, wherein the wake-up word set comprises a first correspondence between reference components and wake-up words, a reference component is a component of the at least one component that is bound with a response event, and the reference component supports only a touch operation function in the original control logic of the first interface;
querying a response event set of the first interface according to the target wake-up word, and determining a target response event corresponding to the target wake-up word, wherein the response event set is generated by calling the video user interface running module and comprises a second correspondence between the wake-up words of components and response events;
and executing the target response event.
The at least one component comprises at least one foreground component and/or at least one background component; a foreground component is a component displayed on the first interface, and a background component is a component not displayed on the first interface.
Wherein a single reference component in the first correspondence corresponds to one or more wake-up words.
Wherein the determining the voice content of the first voice control instruction comprises:
acquiring a preset voice model;
and determining the voice content of the first voice control instruction according to the preset voice model.
Wherein, before the first voice control instruction for the first interface displayed by the first applet is detected, the method further comprises:
performing the following operations through the video user interface running module while the first applet loads and renders the first interface:
acquiring the at least one component of the first interface;
traversing the at least one component;
determining one or more components, of the at least one component, that are each bound with a response event;
generating at least one reference component from the one or more components;
and generating a wake-up word subset for each of the at least one reference component according to its description information, to obtain the wake-up word set.
Wherein the generating at least one reference component from the one or more components comprises:
determining that each of the one or more components is a reference component, to obtain the at least one reference component.
Wherein the generating at least one reference component from the one or more components comprises:
determining a usage record for each of the one or more components;
determining the number of uses of each component according to its usage record;
and determining those of the one or more components whose number of uses is greater than a preset use count as reference components, to obtain the at least one reference component.
Wherein the generating a wake-up word subset for each reference component according to its description information comprises:
identifying the description information of each reference component;
performing word segmentation on the description information to obtain the wake-up word subset corresponding to each component;
and establishing a correspondence between each component's wake-up word subset and the component, to obtain the wake-up word set.
In another aspect, an embodiment of the present invention provides a voice control apparatus applied to a vehicle-mounted central control device, the apparatus comprising:
a determining unit, configured to determine, when a first voice control instruction for a first interface displayed by a first applet is detected, the voice content of the first voice control instruction, wherein the first interface comprises at least one component;
the determining unit is further configured to query a wake-up word set of the first interface according to the voice content and determine a target wake-up word corresponding to the voice content, wherein the wake-up word set comprises a first correspondence between reference components and wake-up words, a reference component is a component of the at least one component that is bound with a response event, and the reference component supports only a touch operation function in the original control logic of the first interface;
the determining unit is further configured to query a response event set of the first interface according to the target wake-up word and determine a target response event corresponding to the target wake-up word, wherein the response event set is generated by calling a video user interface running module and comprises a second correspondence between the wake-up words of components and response events;
and an execution unit, configured to execute the target response event.
The at least one component comprises at least one foreground component and/or at least one background component; a foreground component is a component displayed on the first interface, and a background component is a component not displayed on the first interface.
Wherein a single reference component in the first correspondence corresponds to one or more wake-up words.
Wherein, the determining unit is specifically configured to:
acquire a preset voice model;
and determine the voice content of the first voice control instruction according to the preset voice model.
The apparatus further comprises a generating unit, specifically configured to, before the first voice control instruction for the first interface displayed by the first applet is detected:
perform the following operations through the video user interface running module while the first applet loads and renders the first interface:
acquiring the at least one component of the first interface;
traversing the at least one component;
determining one or more components, of the at least one component, that are each bound with a response event;
generating at least one reference component from the one or more components;
and generating a wake-up word subset for each of the at least one reference component according to its description information, to obtain the wake-up word set.
Wherein, when generating at least one reference component from the one or more components, the generating unit is specifically configured to:
determine that each of the one or more components is a reference component, to obtain the at least one reference component.
Wherein, when generating at least one reference component from the one or more components, the generating unit is specifically configured to:
determine a usage record for each of the one or more components;
determine the number of uses of each component according to its usage record;
and determine those of the one or more components whose number of uses is greater than a preset use count as reference components, to obtain the at least one reference component.
Wherein, when generating a wake-up word subset for each of the at least one reference component according to its description information to obtain the wake-up word set, the generating unit is specifically configured to:
identify the description information of each reference component;
perform word segmentation on the description information to obtain the wake-up word subset corresponding to each component;
and establish a correspondence between each component's wake-up word subset and the component, to obtain the wake-up word set.
In another aspect, an embodiment of the present invention provides a terminal, the terminal comprising an input device and an output device, and further comprising:
a processor adapted to implement one or more instructions; and
a computer storage medium storing one or more instructions adapted to be loaded by the processor to perform the following steps:
when a first voice control instruction for a first interface displayed by a first applet is detected, determining the voice content of the first voice control instruction, wherein the first interface comprises at least one component;
querying a wake-up word set of the first interface according to the voice content, and determining a target wake-up word corresponding to the voice content, wherein the wake-up word set comprises a first correspondence between reference components and wake-up words, a reference component is a component of the at least one component that is bound with a response event, and the reference component supports only a touch operation function in the original control logic of the first interface;
querying a response event set of the first interface according to the target wake-up word, and determining a target response event corresponding to the target wake-up word, wherein the response event set is generated by calling the video user interface running module and comprises a second correspondence between the wake-up words of components and response events;
and executing the target response event.
In yet another aspect, an embodiment of the present invention provides a computer storage medium storing one or more instructions adapted to be loaded by a processor to perform the following steps:
when a first voice control instruction for a first interface displayed by a first applet is detected, determining the voice content of the first voice control instruction, wherein the first interface comprises at least one component;
querying a wake-up word set of the first interface according to the voice content, and determining a target wake-up word corresponding to the voice content, wherein the wake-up word set comprises a first correspondence between reference components and wake-up words, a reference component is a component of the at least one component that is bound with a response event, and the reference component supports only a touch operation function in the original control logic of the first interface;
querying a response event set of the first interface according to the target wake-up word, and determining a target response event corresponding to the target wake-up word, wherein the response event set is generated by calling the video user interface running module and comprises a second correspondence between the wake-up words of components and response events;
and executing the target response event.
In the embodiment of the invention, when a first voice control instruction for a first interface displayed by a first applet is detected, the voice content of the first voice control instruction is first determined, the first interface comprising at least one component. Next, a wake-up word set of the first interface is queried according to the voice content and a target wake-up word corresponding to the voice content is determined; the wake-up word set comprises a first correspondence between reference components and wake-up words, a reference component being a component of the at least one component that is bound with a response event and that supports only a touch operation function in the original control logic of the first interface. Then, a response event set of the first interface is queried according to the target wake-up word and a target response event corresponding to the target wake-up word is determined; the response event set is generated by calling a video user interface running module and comprises a second correspondence between the wake-up words of components and response events. Finally, the target response event is executed. In this way, the wake-up word can be located quickly from the voice control instruction, the component corresponding to the wake-up word queried, and the bound response event triggered, realizing fast voice control, reducing the time consumed by speech recognition and the performance resources occupied, and improving the efficiency and intelligence of the voice control method.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in describing the embodiments are briefly introduced below. The drawings described here show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic architecture diagram of a communication system according to an embodiment of the present invention;
FIG. 2a is a flow chart of a voice control method according to an embodiment of the present invention;
FIG. 2b is a schematic view of an interface provided by an embodiment of the present invention;
FIG. 3a is a diagram of a software architecture provided by an embodiment of the present invention;
FIG. 3b is a response flow chart provided by an embodiment of the present invention;
FIG. 4 is a flow chart illustrating a voice control method according to another embodiment of the present invention;
FIG. 5 is a flow chart illustrating a voice control method according to another embodiment of the present invention;
FIG. 6a is a schematic view of an interface provided by an embodiment of the present invention;
FIG. 6b is a schematic view of another interface provided by an embodiment of the present invention;
FIG. 6c is a diagram of a speech input provided by an embodiment of the present invention;
FIG. 6d is a schematic interface diagram of a play page according to an embodiment of the present invention;
FIG. 7 is a schematic interface diagram of an in-vehicle central control device according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a voice control apparatus according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
The embodiment of the invention provides a voice control scheme that can quickly locate the component corresponding to a voice control instruction and trigger the response event bound to that component, thereby realizing fast voice control. The voice control scheme can be applied to a vehicle-mounted central control device, and also to other devices, which may include but are not limited to smart phones, tablets, laptops and desktops. The vehicle-mounted central control device can connect to a server and communicate with it interactively, and executes the voice control scheme in the corresponding applet according to actual service requirements. For example, an applet can run in the vehicle-mounted central control device to control and process the user's voice instructions.
The voice control scheme proposed by the embodiment of the present invention is described below taking as an example its application to the communication system shown in fig. 1, where a car applet is invoked to execute it. As shown in fig. 1, after detecting the user's voice input, the vehicle-mounted central control device can match the voice control instruction against the content of a component in the current page and then quickly respond with that component's interactive operation; the content of the current page is obtained through interactive communication between the vehicle-mounted central control device and the server.
Based on the above description, an embodiment of the present invention provides a voice control method applied to an applet engine of an applet running platform running on a vehicle-mounted central control device, where the applet engine is provided with a video user interface running module and a first applet runs on the applet running platform. Referring to fig. 2a, the voice control method may include the following steps S201 to S204:
S201, when a first voice control instruction for a first interface displayed by the first applet is detected, determining the voice content of the first voice control instruction, wherein the first interface comprises at least one component.
The voice content may correspond to interface content displayed on the current first interface or to content not displayed on it, which is not limited here.
When the first voice control instruction for the first interface is detected, the voice content may be obtained against a first interface that has been loaded in advance, or against a first interface that has not been loaded in advance.
S202, querying a wake-up word set of the first interface according to the voice content, and determining a target wake-up word corresponding to the voice content, wherein the wake-up word set comprises a first correspondence between reference components and wake-up words, a reference component is a component of the at least one component that is bound with a response event, and the reference component supports only a touch operation function in the original control logic of the first interface.
For example, if the first interface currently displayed by the vehicle-mounted central control device is as shown in fig. 2b, the wake-up word set corresponding to fig. 2b is: {moon, legend, lightning, cloud, drift, my heart, heart, corresponding, today, very funny, my, just, laugh, sunshine, fly}. The first correspondence between reference components and wake-up words may then be: moon legend: moon, legend; lightning legend: lightning, legend; cloud drift: cloud, drift; my heart: my heart, heart; heart correspondence: heart, corresponding; today very funny: today, very funny; I just laugh: my, just, laugh; sunshine flying: sunshine, fly.
The original control logic refers to the initial control strategy of the first interface as pushed by the server, or the initial control strategy of the original APP to which the first interface belongs. When the first interface is hosted by the applet platform, the voice control capability of the components in the first interface is conferred by the applet platform. The touch operation function consists of receiving a user's touch operation instruction, determining the corresponding component according to that instruction, and then executing the response event corresponding to the component.
For example, as shown in fig. 2b, the vehicle-mounted central control device detects a touch instruction for "moon legend" and enters the display interface of "moon legend" according to the touch instruction.
The first correspondence may be one-to-one, one-to-many or many-to-many, which is not limited here.
A component bound with a response event can be executed quickly, without traversing the components again.
S203, querying a response event set of the first interface according to the target wake-up word, and determining a target response event corresponding to the target wake-up word, wherein the response event set is generated by calling the video user interface running module and comprises a second correspondence between the wake-up words of components and response events.
S204, executing the target response event.
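A minimal sketch of steps S201 to S204 may help fix ideas; the map-based structures and names below are assumptions, since the embodiment does not prescribe concrete data structures:

```typescript
// Hypothetical shapes for the per-interface lookup tables; the patent does
// not fix concrete data structures, so plain maps are assumed here.
type ComponentId = string;

interface ResponseEvent {
  type: string;          // event type taken out by the handler
  handler: () => void;   // response function bound to the component
}

// First correspondence (S202): wake-up word -> reference component.
type WakeWordSet = Map<string, ComponentId>;
// Second correspondence (S203): wake-up word of a component -> response event.
type ResponseEventSet = Map<string, ResponseEvent>;

// S201-S204: resolve recognized voice content to a bound response event.
function handleVoiceInstruction(
  voiceContent: string,
  wakeWordSet: WakeWordSet,
  responseEventSet: ResponseEventSet,
): boolean {
  // S202: determine the target wake-up word; here the recognized content is
  // matched directly against the set's keys.
  const targetWakeWord = wakeWordSet.has(voiceContent) ? voiceContent : null;
  if (targetWakeWord === null) return false;

  // S203: determine the target response event for the target wake-up word.
  const event = responseEventSet.get(targetWakeWord);
  if (event === undefined) return false;

  // S204: execute the target response event.
  event.handler();
  return true;
}
```

Because both lookups are plain key queries rather than a component traversal, the cost per instruction stays constant regardless of page size, which is the source of the speed-up the embodiment claims.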
In this way, once the first voice control instruction is detected, the voice content is determined, the wake-up word set is queried for the target wake-up word, the response event set is queried for the target response event, and the event is executed. The wake-up word is thus located quickly from the voice control instruction and the bound response event triggered, realizing fast voice control with low recognition latency and low performance-resource usage.
Fig. 3a is a diagram of a software architecture according to an embodiment of the present invention. In the present application, the above method can be implemented by the software architecture shown in fig. 3a, which mainly involves three modules: a video user interface running module (VUI Runtime), a voice processing module (Skill Handler) and a voice wake-up-free module. Integrating these three modules into the applet framework enables "visible means speakable" voice control of any applet. The modules involved in the voice control scheme are described as follows, and sketched in the code after the list:
VUI Runtime module: responsible for maintaining and updating the active components in the applet's current visible area;
Voice wake-up-free module: responsible for collecting the latest wake-up words and storing them as a list, acoustically matching input voice and outputting the wake-up word matched by the input voice signal;
Skill Handler module: responsible for deciding the event matched by the wake-up word from the list, taking out its event type and response function, issuing the corresponding event, and finally triggering the bound operation.
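The division of labour among the three modules could be expressed as interfaces roughly like the following; all names and signatures are illustrative assumptions, not the actual API of any applet framework:

```typescript
// Illustrative interfaces for the three modules; all names are assumptions.
interface VuiNode {
  wakeWords: string[];   // words segmented from the component's text content
  eventName: string;     // name of the response event bound to the component
  trigger: () => void;   // fires the operation bound to the component
}

interface VuiRuntime {
  // Maintains and updates the voice-operable nodes in the current visible area.
  getNodeList(): VuiNode[];
}

interface VoiceWakeFreeModule {
  // Stores the latest wake-up words as a list and acoustically matches input
  // voice, returning the wake-up word matched by the input voice signal.
  setWakeWords(words: string[]): void;
  match(voiceSignal: ArrayBuffer): string | null;
}

interface SkillHandler {
  // Decides the node matched by the wake-up word and issues its event.
  dispatch(wakeWord: string, nodes: VuiNode[]): void;
}
```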
Fig. 3b is a response flow chart corresponding to the software architecture diagram, provided in an embodiment of the present invention. The specific flow across the three modules is as follows:
When the applet starts, the VUI Runtime module retrieves the visible area of the page, binds components that have response events, generates a voice-operable node list (VUI Node List), and creates monitors such as a page change monitor (DOM Observer) and a page scroll monitor (Scroll Observer); that is, as soon as the page content changes or the user scrolls the page, the components in the current visible area are retrieved again and the voice-operable node list is updated. The text content in the components is segmented into words according to the voice-operable node list to obtain a number of wake-up words, which are passed to the voice wake-up-free module, and the voice-operable node list is synchronized to the Skill Handler module.
The VUI Node List stores, for the page's current visible area, the information of all nodes (components) that can be interactively operated, including: the text content in the component, the name of the bound event, the position and size of the component's presentation, and so on.
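A sketch of how the VUI Runtime might rebuild this list is given below; the component shape, the visible-area query, the segmentation routine and the two monitors are all assumed stand-ins for facilities the text leaves unspecified:

```typescript
// Hypothetical component shape and helpers standing in for framework facilities.
interface Component {
  text: string;                                    // displayed text content
  boundEvent?: { name: string; fire: () => void }; // bound response event, if any
  rect: { x: number; y: number; w: number; h: number }; // position/size
}

interface VuiNode {
  text: string;
  eventName: string;
  wakeWords: string[];
  rect: Component["rect"];
  trigger: () => void;
}

declare function queryVisibleComponents(): Component[]; // visible-area query (assumed)
declare function segment(text: string): string[];       // word segmentation (assumed)
declare function onPageMutation(cb: () => void): void;  // DOM Observer (assumed)
declare function onPageScroll(cb: () => void): void;    // Scroll Observer (assumed)

let vuiNodeList: VuiNode[] = [];

function rebuildVuiNodeList(): void {
  // Only components bound with a response event become voice-operable nodes.
  vuiNodeList = queryVisibleComponents()
    .filter((c) => c.boundEvent !== undefined)
    .map((c) => ({
      text: c.text,
      eventName: c.boundEvent!.name,
      wakeWords: segment(c.text), // e.g. "moon legend" -> ["moon", "legend"]
      rect: c.rect,
      trigger: c.boundEvent!.fire,
    }));
}

// Re-retrieve the visible area whenever the page changes or the user scrolls.
onPageMutation(rebuildVuiNodeList);
onPageScroll(rebuildVuiNodeList);
```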
In the voice wake-up-free module, the wake-up words passed in by the VUI Runtime module are received and stored as a list; when the user's voice input is detected, the input voice is acoustically matched and the target wake-up word matched by the input voice signal is output to the Skill Handler module.
The wake-up words are matched using a wake-up technique that decides directly whether to wake up by matching the signal against an acoustic model; this wake-up technique is an offline algorithm and is much faster than full speech recognition.
In the Skill Handler module, the voice-operable node list from the VUI Runtime module and the target wake-up word from the voice wake-up-free module are received; the target voice node matched by the wake-up word is then decided from the VUI Node List, its event type and response function are taken out, and the corresponding event is issued, finally triggering the operation bound to the component.
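The decision step could look roughly like the following sketch, under the same assumed VuiNode shape:

```typescript
// Sketch of the Skill Handler decision step: given the target wake-up word
// from the wake-up-free module, pick the matching node from the VUI Node
// List and fire its bound event.
interface VuiNode { wakeWords: string[]; eventName: string; trigger: () => void; }

function dispatchWakeWord(targetWakeWord: string, nodeList: VuiNode[]): boolean {
  // Decide the target voice node matched by the wake-up word.
  const node = nodeList.find((n) => n.wakeWords.includes(targetWakeWord));
  if (node === undefined) return false;

  // Take out the event type/response function and issue the corresponding
  // event, finally triggering the operation bound to the component.
  node.trigger();
  return true;
}
```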
Fig. 4 is a schematic flow chart of another voice control method according to an embodiment of the present invention. The voice control method is applied to an applet engine of an applet running platform running on a vehicle-mounted central control device, the applet engine is provided with a video user interface running module, and a first applet runs on the applet running platform. As shown in fig. 4, the voice control method may include the following steps S401 to S409:
S401, while the first applet loads and renders the first interface, acquiring at least one component of the first interface through the video user interface running module.
S402, traversing the at least one component.
S403, determining one or more components, of the at least one component, that are each bound with a response event.
S404, generating at least one reference component according to the one or more components.
In one implementation, step S404 may include the following step s11:
s11, determining that each of the one or more components is a reference component, to obtain the at least one reference component.
Thus, in this example, every component is set as a reference component, so the reference components are obtained comprehensively, avoiding lost or omitted components and improving the accuracy of the voice control method.
In another implementation, step S404 may include the following steps s21 to s23:
s21, determining a usage record for each of the one or more components.
s22, determining the number of uses of each component according to its usage record.
s23, determining those of the one or more components whose number of uses is greater than a preset use count as reference components, to obtain the at least one reference component.
The preset use count may be set by the manufacturer at the factory or by the user, and is not uniquely limited here.
Optionally, the usage record of each component may be collected periodically over a usage time window, where the window may be set to a preset period, such as a week or a month, which is not limited here.
Optionally, the at least one component obtained by this screening by use count may be marked as a common component, i.e. loaded together with the interface the next time it is started after initialization.
Thus, in this example, the high-frequency components are determined from each component's usage record; that is, the components the user is most likely to wake up are obtained according to user preference, which improves the intelligence of component generation.
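A sketch of this screening under assumed record shapes (the text fixes neither the record format nor how the preset period interacts with counting):

```typescript
// Sketch of steps s21-s23: keep only components whose number of uses
// exceeds a preset threshold as reference components.
interface UsageRecord {
  componentId: string;
  usedAt: Date[]; // one timestamp per use (assumed record format)
}

function selectReferenceComponents(
  records: UsageRecord[],
  presetUseCount: number, // set at the factory or by the user
  periodMs?: number,      // optional preset period, e.g. the last week or month
): string[] {
  const cutoff = periodMs === undefined ? 0 : Date.now() - periodMs;
  return records
    .map((r) => ({
      id: r.componentId,
      // s22: determine the number of uses, optionally within the period.
      uses: r.usedAt.filter((t) => t.getTime() >= cutoff).length,
    }))
    .filter((r) => r.uses > presetUseCount) // s23: keep high-frequency components
    .map((r) => r.id);
}
```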
S405, generating a wake-up word subset for each of the at least one reference component according to its description information, to obtain the wake-up word set.
Accordingly, step S405 may include the following steps S31 to S33:
S31, identifying the description information of each reference component.
S32, performing word segmentation on the description information to obtain the wake-up word subset corresponding to each component.
S33, establishing a correspondence between each component's wake-up word subset and the component, to obtain the wake-up word set.
The description information may be the text content, the name of the bound event, the position and size of the component's presentation, and so on, which is not limited here.
The word segmentation may segment according to morpheme composition, such as reduplication, affixation and compounding, or may proceed by segmentation units, dictionary entries and grammatical words, which is not limited here.
Thus, in this example, the components are traversed in advance while the first interface is loaded and rendered, which helps improve the speed and accuracy of the voice control method.
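Steps S31 to S33 might be sketched as follows; the segmentation routine is assumed, and note that the text does not say how to handle the same wake-up word appearing in several components (the last writer wins here):

```typescript
// Sketch of steps S31-S33: segment each reference component's description
// information into a wake-up word subset and map every word back to its
// component. segment() stands in for an unspecified segmentation routine.
declare function segment(text: string): string[];

interface ReferenceComponent {
  id: string;
  description: string; // e.g. the component's text content
}

function buildWakeWordSet(components: ReferenceComponent[]): Map<string, string> {
  const wakeWordSet = new Map<string, string>(); // wake-up word -> component id
  for (const c of components) {
    for (const word of segment(c.description)) { // S32: word segmentation
      wakeWordSet.set(word, c.id);               // S33: establish correspondence
    }
  }
  return wakeWordSet;
}
```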
S406, when a first voice control instruction for the first interface displayed by the first applet is detected, determining the voice content of the first voice control instruction, wherein the first interface comprises at least one component.
S407, querying a wake-up word set of the first interface according to the voice content, and determining a target wake-up word corresponding to the voice content, wherein the wake-up word set comprises a first correspondence between reference components and wake-up words, a reference component is a component of the at least one component that is bound with a response event, and the reference component supports only a touch operation function in the original control logic of the first interface.
S408, querying a response event set of the first interface according to the target wake-up word, and determining a target response event corresponding to the target wake-up word, wherein the response event set is generated by calling the video user interface running module and comprises a second correspondence between the wake-up words of components and response events.
S409, executing the target response event.
As in the embodiment of fig. 2a, the wake-up word is located quickly from the voice control instruction, the component corresponding to it is queried, and the bound response event is triggered, which realizes fast voice control, reduces the time consumed by speech recognition, occupies few performance resources, and improves the efficiency and intelligence of the voice control method.
In one embodiment, the at least one component includes at least one foreground component and/or at least one background component; a foreground component is a component displayed on the first interface, and a background component is a component not displayed on the first interface.
A background component may be a general component; for example, the general component may be "previous page", "next page", "back", "return" or "close applet", which is not limited here.
Thus, in this example, a default wake-up word can be located quickly from the voice control instruction, the general component corresponding to it queried, and the bound response event triggered (see the sketch below), realizing fast voice control with low recognition latency and low performance-resource usage, and improving the efficiency and intelligence of the voice control method.
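A small sketch of such general components with default wake-up words, using the example words from the text and assumed navigation primitives:

```typescript
// Assumed navigation primitives standing in for whatever the applet
// platform actually provides.
declare function pageUp(): void;
declare function pageDown(): void;
declare function navigateBack(): void;
declare function closeApplet(): void;

// Default wake-up words for general (background) components; these respond
// to voice even though no component is displayed for them.
const generalWakeWords = new Map<string, () => void>([
  ["previous page", pageUp],
  ["next page", pageDown],
  ["back", navigateBack],
  ["return", navigateBack],
  ["close applet", closeApplet],
]);

// Lookup works exactly like the foreground case: match, then execute.
generalWakeWords.get("next page")?.(); // scrolls to the next page
```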
In one embodiment, a single reference component in the first correspondence corresponds to one or more wake-up words.
Optionally, the text content of each reference component is recognized and segmented into words to obtain the wake-up words corresponding to each component; a correspondence is then established between each component's wake-up words and the component, yielding the first correspondence.
Thus, in this example, once the target wake-up word is obtained, the corresponding event can be executed accurately according to the correspondence between reference components and wake-up words, improving the timeliness and accuracy of the voice control method.
In one embodiment, the determining the voice content of the first voice control instruction includes: acquiring a preset voice model; and determining the voice content of the first voice control instruction according to the preset voice model.
The preset voice model is usually composed of an acoustic model and a language model, corresponding respectively to computing the speech-to-syllable probability and the syllable-to-word probability.
Thus, in this example, the voice content input by the user is obtained accurately through the preset voice model, avoiding recognizing wrong voice content and executing an operation by mistake, which improves the accuracy of the voice control method.
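As a rough sketch of this two-part composition, assuming log-domain scores and assumed scorer helpers (the patent does not specify how the two models are combined):

```typescript
// Sketch of the preset voice model's two-part scoring: an acoustic model
// (speech-to-syllable probability) and a language model (syllable-to-word
// probability). Both scorers are assumed helpers.
declare function acousticLogProb(audio: ArrayBuffer, syllables: string[]): number;
declare function languageLogProb(syllables: string[], words: string): number;

interface Candidate { syllables: string[]; words: string }

function recognize(audio: ArrayBuffer, candidates: Candidate[]): string | null {
  let best: string | null = null;
  let bestScore = -Infinity;
  for (const c of candidates) {
    // Log-probabilities add where probabilities would multiply.
    const score =
      acousticLogProb(audio, c.syllables) + languageLogProb(c.syllables, c.words);
    if (score > bestScore) {
      bestScore = score;
      best = c.words;
    }
  }
  return best;
}
```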
The voice control method is described in detail below, taking as an example an applet of a vehicle-mounted central control device that applies the voice control method provided by the embodiment of the invention.
The applets here may include, but are not limited to: game applets, video applets, audio applets and the like, where a video applet plays video through the vehicle-mounted central control device and an audio applet plays audio through it.
In the embodiment of the invention, interactive components in the applet page are identified automatically; when the user inputs text content by voice and it matches the content of a component in the current page, the interactive operation of that component can be responded to quickly. The specific process is shown in fig. 5:
When the vehicle-mounted central control device detects an opening operation for the first applet, it establishes a network communication connection with the server. While loading and rendering the current interface of the first applet, it obtains the interface content to load from the server, recognizes the current first applet interface, traverses all components and takes out their text content; it then obtains the voice input by the user, finds the component matching the voice text, and jumps to that component's interface.
As shown in fig. 6a, fig. 6a is a schematic view of the interface while it is rendering.
As shown in fig. 6b, fig. 6b is a schematic view of the interface after loading completes. The components in the current interface are obtained as {moon legend, lightning legend, cloud drift, rush champion, love reading, sea, I just laugh, sunshine flying, close, return, next page, play}; the current wake-up words are then obtained as {moon, legend, lightning, cloud, drift, rush, champion, I, love, reading, if, smile, sunshine, flying, sea, close, return, next page, play}. In the current interface, "search" displays the applet's internal search box and switches it to the input state, and "view more" jumps to the hot movie list page.
When the user's voice is obtained, the wake-up word matching the voice text is found. As shown in fig. 6c, the voice input by the user is obtained as "champion"; the corresponding wake-up word found is "champion" and the corresponding component is "rush champion", i.e. "rush champion" is selected and the interface jumps to the playing page of "rush champion", as shown in fig. 6d. Fig. 6d is a schematic interface diagram of the playing page.
If the user's voice obtained is "close the current interface", the corresponding wake-up word found is "close"; since this wake-up word is a default wake-up word, the interface jumps to that shown in fig. 7, which may be an interface diagram of the vehicle-mounted central control device after the applet is closed.
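The walkthrough can be tied back to the earlier sketches with a small illustrative example; the component ids and handlers below are invented for illustration:

```typescript
// Illustrative data matching the fig. 6 walkthrough.
const wakeWordSet = new Map<string, string>([
  ["moon", "moon-legend"],
  ["legend", "moon-legend"],
  ["champion", "rush-champion"],
  ["close", "general-close"], // default wake-up word of a general component
]);

const responseEventSet = new Map<string, () => void>([
  ["champion", () => console.log("jump to the playing page (fig. 6d)")],
  ["close", () => console.log("close the applet (fig. 7)")],
]);

// The user says "champion": the wake-up word "champion" is matched, the
// component "rush champion" is located, and its bound jump is executed.
const word = "champion";
if (wakeWordSet.has(word)) {
  responseEventSet.get(word)?.(); // prints: jump to the playing page (fig. 6d)
}
```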
As described for the foregoing embodiments, this example likewise locates the wake-up word quickly from the voice control instruction, queries the component corresponding to it, and triggers the bound response event, realizing fast voice control with low recognition latency and low resource usage.
Based on the description of the foregoing voice control method embodiments, an embodiment of the present invention further discloses a voice control apparatus, which may be a computer program (including program code) running in a terminal. The voice control apparatus can perform the method shown in fig. 2a or fig. 4. Referring to fig. 8, the voice control apparatus may operate through the following units:
a determining unit 101, configured to determine, when a first voice control instruction for a first interface displayed by the first applet is detected, the voice content of the first voice control instruction, wherein the first interface comprises at least one component;
the determining unit 101 is further configured to query a wake-up word set of the first interface according to the voice content and determine a target wake-up word corresponding to the voice content, wherein the wake-up word set comprises a first correspondence between reference components and wake-up words, a reference component is a component of the at least one component that is bound with a response event, and the reference component supports only a touch operation function in the original control logic of the first interface;
the determining unit 101 is further configured to query a response event set of the first interface according to the target wake-up word and determine a target response event corresponding to the target wake-up word, wherein the response event set is generated by calling the video user interface running module and comprises a second correspondence between the wake-up words of components and response events;
and an execution unit 102, configured to execute the target response event.
In one embodiment, the at least one component includes at least one foreground component and/or at least one background component; a foreground component is a component displayed on the first interface, and a background component is a component not displayed on the first interface.
In yet another embodiment, a single reference component in the first correspondence corresponds to one or more wake-up words.
In another embodiment, when determining the voice content of the first voice control instruction, the determining unit 101 is specifically configured to: acquire a preset voice model; and determine the voice content of the first voice control instruction according to the preset voice model.
In another embodiment, the voice control apparatus further includes a generating unit 103, which is specifically configured to, before the first voice control instruction for the first interface displayed by the first applet is detected, perform the following operations through the video user interface running module while the first applet loads and renders the first interface: acquire the at least one component of the first interface; traverse the at least one component; determine one or more components, of the at least one component, that are each bound with a response event; generate at least one reference component from the one or more components; and generate a wake-up word subset for each of the at least one reference component according to its description information, to obtain the wake-up word set.
In another embodiment, when generating at least one reference component from the one or more components, the generating unit 103 is specifically configured to: determine that each of the one or more components is a reference component, to obtain the at least one reference component.
In another embodiment, when generating at least one reference component from the one or more components, the generating unit 103 is specifically configured to: determine a usage record for each of the one or more components; determine the number of uses of each component according to its usage record; and determine those of the one or more components whose number of uses is greater than a preset use count as reference components, to obtain the at least one reference component.
In another embodiment, when generating the wake-up word subset of each reference component according to its description information to obtain the wake-up word set, the generating unit 103 is specifically configured to: identify the description information of each reference component; perform word segmentation on the description information to obtain the wake-up word subset corresponding to each component; and establish a correspondence between each component's wake-up word subset and the component, to obtain the wake-up word set.
According to an embodiment of the present invention, the steps involved in the method shown in fig. 2a or fig. 4 may be performed by the units of the voice control apparatus shown in fig. 8. For example, steps S201, S202 and S203 shown in fig. 2a may be performed by the determining unit 101 shown in fig. 8, and step S204 by the execution unit 102; likewise, steps S401 to S405 shown in fig. 4 may be performed by the generating unit 103, steps S406 to S408 by the determining unit 101, and step S409 by the execution unit 102.
According to another embodiment of the present invention, the units of the voice control apparatus shown in fig. 8 may be combined, separately or wholly, into one or several other units, or one (or more) of them may be split into multiple functionally smaller units, without affecting the achievement of the technical effects of the embodiments of the present invention. The above units are divided based on logical functions; in practical applications, the function of one unit may be realized by multiple units, or the functions of multiple units by one unit. In other embodiments of the present invention, the voice control apparatus may also include other units, and in practical applications these functions may be realized with the assistance of, and through the cooperation of, multiple other units.
According to another embodiment of the present invention, the voice control apparatus shown in fig. 8 may be constructed, and the voice control method of the embodiments of the present invention implemented, by running a computer program (including program code) capable of executing the steps of the methods shown in fig. 2a or fig. 4 on a general-purpose computing device, such as a computer, that includes processing elements and storage elements such as a central processing unit (CPU), a random access storage medium (RAM) and a read-only storage medium (ROM). The computer program may be recorded on, for example, a computer-readable recording medium, and loaded into and run on the above computing device via that medium.
The apparatus thus achieves the same effect as the method embodiments: the wake-up word is located quickly from the voice control instruction, the corresponding component is queried, and the bound response event is triggered, realizing fast, efficient and intelligent voice control with low recognition latency and low performance-resource usage.
Based on the descriptions of the foregoing method and apparatus embodiments, an embodiment of the present invention further provides a terminal. Referring to fig. 9, the terminal includes at least a processor 201, an input device 202, an output device 203, and a computer storage medium 204, which may be connected within the terminal by a bus or in other ways.
The computer storage medium 204 may be stored in a memory of the terminal and is adapted to store a computer program comprising program instructions, and the processor 201 is adapted to execute the program instructions stored in the computer storage medium 204. The processor 201 (or CPU, Central Processing Unit) is the computing and control core of the terminal, adapted to implement one or more instructions and, in particular, to load and execute one or more instructions so as to realize the corresponding method flow or function. In one embodiment, the processor 201 of the embodiment of the present invention may be configured to perform a series of voice control processes, including: when a first voice control instruction for a first interface displayed by the first applet is detected, determining the voice content of the first voice control instruction, wherein the first interface comprises at least one component; querying a wake-up word set of the first interface according to the voice content, and determining a target wake-up word corresponding to the voice content, wherein the wake-up word set comprises a first correspondence between a reference component and a wake-up word, the reference component is a component of the at least one component that is bound with a response event, and the reference component supports only a touch operation function in the original control logic of the first interface; querying a response event set of the first interface according to the target wake-up word, and determining a target response event corresponding to the target wake-up word, wherein the response event set is generated by invoking the video user interface running module and comprises a second correspondence between the wake-up words of the components and the response events; executing the target response event; and so on.
An embodiment of the present invention further provides a computer storage medium (memory), which is a memory device in the terminal used to store programs and data. It is understood that the computer storage medium here may include a storage medium built into the terminal and may also include an extended storage medium supported by the terminal. The computer storage medium provides storage space that stores the operating system of the terminal. One or more instructions suitable for being loaded and executed by the processor 201, which may be one or more computer programs (including program code), are also stored in this storage space. The computer storage medium may be a high-speed RAM memory, or a non-volatile memory, such as at least one magnetic disk memory; optionally, it may also be at least one computer storage medium located remotely from the aforementioned processor.
In one embodiment, one or more instructions stored in the computer storage medium may be loaded and executed by the processor 201 to implement the corresponding steps of the methods described in the foregoing embodiments; in a particular implementation, the one or more instructions in the computer storage medium are loaded by the processor 201 to perform the following steps:
when a first voice control instruction for a first interface displayed by the first applet is detected, determining the voice content of the first voice control instruction, wherein the first interface comprises at least one component;
querying a wake-up word set of the first interface according to the voice content, and determining a target wake-up word corresponding to the voice content, wherein the wake-up word set comprises a first correspondence between a reference component and a wake-up word, the reference component is a component of the at least one component that is bound with a response event, and the reference component supports only a touch operation function in the original control logic of the first interface;
querying a response event set of the first interface according to the target wake-up word, and determining a target response event corresponding to the target wake-up word, wherein the response event set is generated by invoking the video user interface running module and comprises a second correspondence between the wake-up words of the components and the response events;
and executing the target response event.
In one embodiment, the at least one component includes at least one foreground component and/or at least one background component, where a foreground component is a component displayed on the first interface and a background component is a component not displayed on the first interface.
In yet another embodiment, a single reference component in the first correspondence corresponds to one or more wake-up words.
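As a purely illustrative sketch of these data shapes (the field names, the onTap handler, and the example wake-up words are assumptions, not taken from the disclosure):

```typescript
// Illustrative component shape; fields are assumed for the sketch.
interface UIComponent {
  id: string;
  description: string;   // e.g. a button label, from which wake-up words are derived
  visible: boolean;      // true: foreground component; false: background component
  onTap?: () => void;    // response event bound in the original touch-only control logic
}

// A single reference component may answer to several wake-up words,
// e.g. a playback button responding to both "play" and "start".
const wakeWordsByComponent = new Map<string, string[]>([
  ["btn-play", ["play", "start"]],
]);
```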
In still another embodiment, when determining the voice content of the first voice control instruction, the one or more instructions may be further loaded and specifically executed by the processor 201 to: acquire a preset voice model; and determine the voice content of the first voice control instruction according to the preset voice model.
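A minimal sketch of this step, assuming the preset voice model is an already-acquired object with a transcribe method (the SpeechModel interface is hypothetical); with the model acquired in advance, recognition at instruction time is a single call:

```typescript
// Hypothetical interface for the preset voice model.
interface SpeechModel {
  transcribe(audio: ArrayBuffer): Promise<string>; // audio in, recognized text out
}

// Determine the voice content of a voice control instruction with a preloaded model.
async function determineVoiceContent(audio: ArrayBuffer, model: SpeechModel): Promise<string> {
  return model.transcribe(audio);
}
```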
In yet another embodiment, before the first voice control instruction for the first interface displayed by the first applet is detected, the one or more instructions may be further loaded and specifically executed by the processor 201 to perform the following operations through the video user interface running module while the first applet loads and renders the first interface: acquiring the at least one component of the first interface; traversing the at least one component; determining, among the at least one component, one or more components each bound with a response event; generating at least one reference component from the one or more components; and generating a wake-up word subset for each of the at least one reference component according to the description information of each reference component, to obtain the wake-up word set.
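For illustration, this load-and-render pass can be sketched as below; RefComponent, onTap, and segment are assumed names, and the word-segmentation step is left abstract:

```typescript
// Assumed shape for components encountered while rendering the first interface.
interface RefComponent {
  id: string;
  description: string;   // description information, e.g. a visible label
  onTap?: () => void;    // present when a response event is bound to the component
}

// Build the wake-up word set while the applet loads and renders the interface.
function buildWakeWordSet(
  components: RefComponent[],
  segment: (text: string) => string[],  // word-segmentation function, assumed interface
): Map<string, string[]> {
  const wakeWordSet = new Map<string, string[]>();  // component id -> wake-up word subset
  for (const component of components) {             // traverse the at least one component
    if (component.onTap === undefined) continue;    // keep only components bound with a response event
    wakeWordSet.set(component.id, segment(component.description));
  }
  return wakeWordSet;
}
```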
In yet another embodiment, when the at least one reference component is generated from the one or more components, the one or more instructions may be further loaded and specifically executed by the processor 201 to: determine each of the one or more components to be a reference component, to obtain the at least one reference component.
In yet another embodiment, when the at least one reference component is generated from the one or more components, the one or more instructions may be further loaded and specifically executed by the processor 201 to: determine a usage record of each of the one or more components; determine the number of uses of each component according to its usage record; and determine, among the one or more components, the components whose number of uses is greater than a preset number of uses to be reference components, to obtain the at least one reference component.
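A sketch of this usage-count variant, under assumed names (usageLog and minUses are illustrative, not from the disclosure); filtering by usage keeps the wake-up word set small, so the table lookups at instruction time stay cheap:

```typescript
// Keep only components used more often than a preset threshold as reference components.
function selectReferenceComponents<T extends { id: string }>(
  components: T[],
  usageLog: Map<string, number>,  // component id -> recorded number of uses
  minUses: number,                // the preset number of uses
): T[] {
  return components.filter((c) => (usageLog.get(c.id) ?? 0) > minUses);
}
```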
In another embodiment, when the wake-up word subset of each reference component is generated according to the description information of each of the at least one reference component to obtain the wake-up word set, the one or more instructions may be further loaded and specifically executed by the processor 201 to: identify the description information of each reference component; perform word segmentation on the description information to obtain the wake-up word subset corresponding to each component; and establish a correspondence between the wake-up word subset corresponding to each component and the corresponding component, to obtain the wake-up word set.
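A toy word-segmentation step, for illustration only; a real implementation would use a proper tokenizer (for Chinese description text in particular, whitespace splitting would not suffice):

```typescript
// Derive a wake-up word subset from a component's description information.
function wakeWordSubset(description: string): string[] {
  const words = description.split(/\s+/).filter((w) => w.length > 0); // naive segmentation
  return [...new Set(words.map((w) => w.toLowerCase()))];             // deduplicate
}

// e.g. a component described as "Play Music" answers to "play" and "music".
```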
In summary, in the embodiment of the present invention, when a first voice control instruction for a first interface displayed by a first applet is detected, the voice content of the first voice control instruction is first determined, the first interface comprising at least one component. Next, a wake-up word set of the first interface is queried according to the voice content, and a target wake-up word corresponding to the voice content is determined; the wake-up word set comprises a first correspondence between reference components and wake-up words, a reference component being a component of the at least one component that is bound with a response event and that supports only a touch operation function in the original control logic of the first interface. Then, a response event set of the first interface is queried according to the target wake-up word, and a target response event corresponding to the target wake-up word is determined; the response event set is generated by invoking the video user interface running module and comprises a second correspondence between the wake-up words of the components and the response events. Finally, the target response event is executed. In this way, the wake-up word can be quickly located from the voice control instruction, the component corresponding to the wake-up word can be queried, and the response event bound to it can be triggered, realizing a fast voice control function: the time consumed by speech recognition is reduced, few performance resources are occupied, and the efficiency and intelligence of the voice control method are improved.
The above disclosure is merely a preferred embodiment of the present invention and certainly cannot be taken to limit the scope of the claims of the present invention; equivalent variations made according to the claims of the present invention therefore still fall within the scope of the invention.

Claims (11)

1. A voice control method, applied to an applet engine of an applet running platform running on a vehicle-mounted central control device, wherein the applet engine is provided with a video user interface running module and a first applet runs on the applet running platform, the method comprising:
when a first voice control instruction for a first interface displayed by the first applet is detected, determining the voice content of the first voice control instruction, wherein the first interface comprises at least one component;
querying a wake-up word set of the first interface according to the voice content, and determining a target wake-up word corresponding to the voice content, wherein the wake-up word set comprises a first correspondence between a reference component and a wake-up word, the reference component is a component of the at least one component that is bound with a response event, and the reference component supports only a touch operation function in the original control logic of the first interface;
querying a response event set of the first interface according to the target wake-up word, and determining a target response event corresponding to the target wake-up word, wherein the response event set is generated by invoking the video user interface running module and comprises a second correspondence between the wake-up word of a component and the response event;
and executing the target response event.
2. The method of claim 1, wherein the at least one component comprises at least one foreground component and/or at least one background component, the foreground component being a component displayed on the first interface and the background component being a component not displayed on the first interface.
3. The method of claim 2, wherein a single reference component in the first correspondence corresponds to one or more wake-up words.
4. The method of any of claims 1-3, wherein said determining the voice content of the first voice control instruction comprises:
acquiring a preset voice model;
and determining the voice content of the first voice control instruction according to the preset voice model.
5. The method of claim 4, wherein before the detecting of the first voice control instruction for the first interface displayed by the first applet, the method further comprises:
performing, through the video user interface running module, the following operations while the first applet loads and renders the first interface:
acquiring at least one component of the first interface;
traversing the at least one component;
determining, among the at least one component, one or more components each bound with a response event;
generating at least one reference component from the one or more components;
and generating a wake-up word subset for each of the at least one reference component according to the description information of each reference component, to obtain the wake-up word set.
6. The method of claim 5, wherein the generating at least one reference component from the one or more components comprises:
and determining each of the one or more components to be a reference component, to obtain the at least one reference component.
7. The method of claim 5, wherein the generating at least one reference component from the one or more components comprises:
determining a usage record of each of the one or more components;
determining the number of uses of each component according to its usage record;
and determining, among the one or more components, the components whose number of uses is greater than a preset number of uses to be reference components, to obtain the at least one reference component.
8. The method of claim 5, wherein the generating a wake-up word subset for each of the at least one reference component according to the description information of each reference component to obtain the wake-up word set comprises:
identifying the description information of each reference component;
performing word segmentation on the description information to obtain the wake-up word subset corresponding to each component;
and establishing a correspondence between the wake-up word subset corresponding to each component and the corresponding component, to obtain the wake-up word set.
9. A voice control apparatus, applied to an applet engine of an applet running platform running on a vehicle-mounted central control device, wherein the applet engine is provided with a video user interface running module and a first applet runs on the applet running platform, the apparatus comprising:
the determining unit is used for determining the voice content of a first voice control instruction when the first voice control instruction of a first interface displayed by the first applet is detected, wherein the first interface comprises at least one component;
the determining unit is further configured to query a wake-up word set of the first interface according to the voice content, and determine a target wake-up word corresponding to the voice content, where the wake-up word set includes a first correspondence between a reference component and a wake-up word, the reference component is a component bound with a response event in the at least one component, and the reference component only supports a touch operation function in an original control logic of the first interface;
the determining unit is further configured to query a response event set of the first interface according to the target wake-up word, and determine a target response event corresponding to the target wake-up word, where the response event set is generated by invoking the video user interface running module, and the response event set includes a second correspondence between the wake-up word of the component and the response event;
and the execution unit is used for executing the target response event.
10. A terminal comprising an input device and an output device, and further comprising:
a processor adapted to implement one or more instructions; and
a computer storage medium storing one or more instructions adapted to be loaded by the processor to perform the voice control method of any of claims 1-8.
11. A computer storage medium storing one or more instructions adapted to be loaded by a processor to perform the voice control method of any of claims 1-8.
CN201910972320.6A 2019-10-12 2019-10-12 Voice control method, device, terminal and storage medium Pending CN112652302A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910972320.6A CN112652302A (en) 2019-10-12 2019-10-12 Voice control method, device, terminal and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910972320.6A CN112652302A (en) 2019-10-12 2019-10-12 Voice control method, device, terminal and storage medium

Publications (1)

Publication Number Publication Date
CN112652302A true CN112652302A (en) 2021-04-13

Family

ID=75343156

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910972320.6A Pending CN112652302A (en) 2019-10-12 2019-10-12 Voice control method, device, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN112652302A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113760377A (en) * 2021-09-08 2021-12-07 广东优碧胜科技有限公司 Determination method for executing operation and electronic equipment
CN114639384A (en) * 2022-05-16 2022-06-17 腾讯科技(深圳)有限公司 Voice control method, device, equipment and computer storage medium
CN115527537A (en) * 2022-11-24 2022-12-27 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108270903A (en) * 2012-02-24 2018-07-10 三星电子株式会社 Pass through the method and apparatus of locking/unlocked state of speech recognition controlled terminal
US20160225371A1 (en) * 2015-01-30 2016-08-04 Google Technology Holdings LLC Dynamic inference of voice command for software operation from help information
CN107430502A (en) * 2015-01-30 2017-12-01 谷歌技术控股有限责任公司 The voice command for software application is inferred by help information dynamic
CN106775555A (en) * 2016-11-24 2017-05-31 歌尔科技有限公司 A kind of input control method of virtual reality device and virtual reality device
CN109584879A (en) * 2018-11-23 2019-04-05 华为技术有限公司 A kind of sound control method and electronic equipment
CN110211589A (en) * 2019-06-05 2019-09-06 广州小鹏汽车科技有限公司 Awakening method, device and vehicle, the machine readable media of onboard system

Similar Documents

Publication Publication Date Title
CN108701454B (en) Parameter collection and automatic dialog generation in dialog systems
CN109002510B (en) Dialogue processing method, device, equipment and medium
CN106098063B (en) Voice control method, terminal device and server
CN112652302A (en) Voice control method, device, terminal and storage medium
CN109192212B (en) Voice control method and device
CN111341315B (en) Voice control method, device, computer equipment and storage medium
CN107894827B (en) Application cleaning method and device, storage medium and electronic equipment
CN111949240A (en) Interaction method, storage medium, service program, and device
CN111105800A (en) Voice interaction processing method, device, equipment and medium
CN112286485B (en) Method and device for controlling application through voice, electronic equipment and storage medium
CN110136713A (en) Dialogue method and system of the user in multi-modal interaction
CN111796747B (en) Multi-open application processing method and device and electronic equipment
CN110874200A (en) Interaction method, device, storage medium and operating system
CN105278970A (en) Data processing method, device and terminal
JP2023506087A (en) Voice Wakeup Method and Apparatus for Skills
CN111107147B (en) Message pushing method and device
CN110741365A (en) Data structure query for managing load time in multimedia content
CN111814492A (en) Translation method, terminal and computer storage medium
CN112269473B (en) Man-machine interaction method and system based on flexible scene definition
CN106371905B (en) Application program operation method and device and server
CN111625451B (en) Initialization method, device, equipment and medium for automatic test environment
CN113962316A (en) Model training method and device and electronic equipment
CN108228307B (en) Application display method and device, storage medium and electronic equipment
CN112000877A (en) Data processing method, device and medium
CN110287365B (en) Data processing method and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40041964

Country of ref document: HK