CN114639384B - Voice control method and device, computer equipment and computer storage medium

Info

Publication number
CN114639384B
CN114639384B (application CN202210526270.0A)
Authority
CN
China
Prior art keywords
voice
target
application
light application
voice control
Prior art date
Legal status
Active
Application number
CN202210526270.0A
Other languages
Chinese (zh)
Other versions
CN114639384A (en)
Inventor
李曼曼
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210526270.0A priority Critical patent/CN114639384B/en
Publication of CN114639384A publication Critical patent/CN114639384A/en
Application granted granted Critical
Publication of CN114639384B publication Critical patent/CN114639384B/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60R VEHICLES, VEHICLE FITTINGS, OR VEHICLE PARTS, NOT OTHERWISE PROVIDED FOR
    • B60R 16/00 Electric or fluid circuits specially adapted for vehicles and not otherwise provided for; Arrangement of elements of electric or fluid circuits specially adapted for vehicles and not otherwise provided for
    • B60R 16/02 Electric or fluid circuits specially adapted for vehicles and not otherwise provided for; Arrangement of elements of electric or fluid circuits specially adapted for vehicles and not otherwise provided for electric constitutive elements
    • B60R 16/037 Electric or fluid circuits specially adapted for vehicles and not otherwise provided for; Arrangement of elements of electric or fluid circuits specially adapted for vehicles and not otherwise provided for electric constitutive elements for occupant comfort, e.g. for automatic adjustment of appliances according to personal settings, e.g. seats, mirrors, steering wheel
    • B60R 16/0373 Voice control
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 Execution procedure of a spoken command
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02B CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO BUILDINGS, e.g. HOUSING, HOUSE APPLIANCES OR RELATED END-USER APPLICATIONS
    • Y02B 20/00 Energy efficient lighting technologies, e.g. halogen lamps or gas discharge lamps
    • Y02B 20/40 Control techniques providing energy savings, e.g. smart controller or presence detection

Abstract

The application discloses a voice control method, a voice control device, computer equipment and a computer storage medium, and relates to the technical field of voice control.

Description

Voice control method and device, computer equipment and computer storage medium
Technical Field
The application relates to the technical field of computers, in particular to the technical field of voice control, and provides a voice control method, a voice control device, computer equipment and a computer storage medium.
Background
At present, the in-vehicle ("car machine") functions on the automobile market are increasingly rich. While driving, there is often a need to operate vehicle components or the terminal device, but such operations easily cause the driver to make driving errors and therefore affect driving safety. To improve driving safety and convenience of operation, realizing these operations by voice control is a good alternative, so more and more terminal devices are now equipped with a voice control system; for example, a window can be closed or an application (APP) opened by voice.
At present, more and more application programs introduce light application forms such as mini programs (applets). Mini programs can be used immediately without installation, and the vehicle manufacturer does not need to develop or integrate them; they only need to be maintained by the application vendor. This fits the characteristics of the in-vehicle scenario well: lightweight use, limited device capacity, and long project cycles.
However, the current voice control function usually stays at the application level and can only realize relatively simple functions such as opening, closing or switching applications; it cannot provide voice control of the application functions inside a mini program. Control of light applications therefore still depends on manual operation, the interaction mode is limited, and the safety risk of vehicle driving increases.
Disclosure of Invention
The embodiments of the present application provide a voice control method and apparatus, a computer device, and a computer storage medium, which are used to control a light application by voice and thereby improve the safety of vehicle driving.
In one aspect, a voice control method is provided, which is applied to a voice control system, and the method includes:
responding to the starting of a voice control function of a target light application, acquiring atomic capability data from the target light application, and performing atomic capability registration based on the atomic capability data, wherein the atomic capability data comprises: atomic capabilities that the target light application can provide, each atomic capability for implementing at least one function of the target light application;
responding to first voice data input by a target object aiming at the target light application, and converting the first voice data into corresponding voice control events based on the registered atomic capabilities, wherein the voice control events comprise: at least one atomic capability that the target light application needs to invoke in order to achieve the target intent of the first voice data;
sending the voice control event to the target light application to cause the target light application to invoke the at least one atomic capability to achieve the target intent.
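The data exchanged in these three steps can be illustrated with a minimal TypeScript sketch; all type and field names below are assumptions made for illustration and are not defined by the patent.

    // Hypothetical shapes for illustration only.
    interface AtomicCapability {
      name: string;                        // e.g. a callable function of the light application
      description: string;                 // what the capability does, used when matching intents
      params: Record<string, string>;      // parameter name -> expected type
    }

    interface AtomicCapabilityData {
      appId: string;                       // identifies the target light application
      version: string;
      capabilities: AtomicCapability[];    // the capabilities the light application can provide
    }

    interface VoiceControlEvent {
      targetIntent: string;                // intent recognized from the first voice data
      calls: Array<{ capability: string; args: Record<string, unknown> }>; // capabilities to invoke
    }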
In one aspect, a voice control apparatus is provided, which is applied in a voice control system, and the apparatus includes:
a capability registration unit, configured to, in response to starting of a voice control function of a target light application, obtain atomic capability data from the target light application, and perform atomic capability registration based on the atomic capability data, where the atomic capability data includes: atomic capabilities that the target light application can provide, each atomic capability for implementing at least one function of the target light application;
an event conversion unit, configured to, in response to first voice data input by a target object for the target light application, convert the first voice data into a corresponding voice control event based on each registered atomic capability, where the voice control event includes: at least one atomic capability that the target light application needs to invoke in order to achieve the target intent of the first voice data;
a transmission unit, configured to send the voice control event to the target light application, so that the target light application invokes the at least one atomic capability to achieve the target intent.
Optionally, the event conversion unit is specifically configured to:
performing voice recognition on the first voice data to acquire text information contained in the first voice data;
performing semantic recognition on the text information, and determining a target intention corresponding to the text information;
converting the target intention into the voice control event based on preset event configuration information and the registered atomic capabilities, wherein the event configuration information comprises: guidance information for parameter configuration of the at least one atomic capability based on the target intent.
Optionally, the event conversion unit is specifically configured to:
obtaining application description information from the target light application, the application description information including: at least one of basic description information and voice control context information of the target light application;
and performing semantic recognition on the text information based on the application description information, and determining the target intention.
Optionally, the event conversion unit is specifically configured to:
sending the atomic capability data to a cloud server so as to register each atomic capability in the cloud server;
obtaining application description information from the target light application, the application description information including: at least one of basic description information and voice control context information of the target light application;
sending the application description information and the first voice data to the cloud server, and receiving the voice control event returned by the cloud server, wherein the voice control event is obtained by converting the target intention based on the registered atomic capability after the target intention is determined by the cloud server based on the application description information and the first voice data.
Optionally, the voice control system includes an interface component and a voice component, and the interface component encapsulates a third-party service interface provided by the target light application;
the capability registration unit is specifically configured to:
triggering the interface component to call the third-party service interface to acquire the atomic capability data in response to the starting of the voice control function of the target light application;
and sending the atomic capability data to the voice component through the interface component so as to perform atomic capability registration at the voice component.
Optionally, the terminal device where the target light application is located includes an audio acquisition device, and the interface component is packaged with a voice data acquisition interface provided by an operating system of the terminal device; the apparatus further comprises an application unit for:
calling the voice component to apply the use permission of the audio acquisition device to the operating system through the voice data acquisition interface;
and calling the voice component to receive the first voice data acquired by the audio acquisition device through the voice data acquisition interface.
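For example (a sketch only; the interface and method names are assumptions, not defined by the patent), the voice component could request the audio permission and then receive audio through the encapsulated voice data acquisition interface:

    interface VoiceDataAcquisitionInterface {
      requestMicrophonePermission(): Promise<boolean>;          // apply for the audio device usage right
      onAudio(callback: (chunk: ArrayBuffer) => void): void;    // deliver collected voice data
    }

    async function startCapture(
      iface: VoiceDataAcquisitionInterface,
      onVoiceData: (chunk: ArrayBuffer) => void,
    ): Promise<void> {
      const granted = await iface.requestMicrophonePermission();
      if (!granted) throw new Error("audio acquisition device unavailable");
      iface.onAudio(onVoiceData); // receive the first voice data as it is collected
    }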
Optionally, the event conversion unit is further configured to:
receiving an execution result returned by the target light application after the target light application executes the voice control event;
if the execution result indicates that feedback control is needed, starting a recording function to acquire second voice data input by the target object;
and converting the second voice data into corresponding voice control events based on the registered atomic capabilities, and sending the obtained voice control events to the target light application for execution.
Optionally, the event conversion unit is further configured to:
if the execution result indicates that feedback control is not needed, emptying the registered atomic capability;
and informing the operating system to release the use authority of the audio acquisition device occupied by the operating system.
Optionally, the apparatus further includes a wake-up unit, configured to:
responding to third voice data input by a target object aiming at a target application, and performing voice endpoint detection on the third voice data;
if the voice starting position in the third voice data is detected, performing voice recognition on the third voice data from the voice starting position until the voice ending position in the third voice data is detected;
and when the target application is determined to be awakened based on the obtained voice recognition result, performing awakening operation on the target application.
Optionally, the wake-up unit is specifically configured to:
responding to the starting of the voice awakening function of the target application, acquiring an awakening word set corresponding to an activated page in the target application, and registering awakening words based on the awakening word set;
if the voice recognition result contains the registered awakening words, determining to awaken the target application;
and performing the awakening operation on the target application, and sending text information contained in the voice recognition result to the target application.
In one aspect, a computer device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the steps of any one of the methods are implemented.
In one aspect, a computer storage medium is provided having computer program instructions stored thereon that, when executed by a processor, implement the steps of any of the methods described above.
In one aspect, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps of any of the methods described above.
In the embodiments of the present application, when a target light application starts its voice control function, atomic capability data is obtained from the target light application and atomic capability registration is performed. When first voice data input for the target light application is received, the first voice data is converted into a corresponding voice control event based on the registered atomic capabilities, that is, into the atomic capabilities the target light application needs to invoke in order to achieve the target intent of the first voice data. The voice control event is then sent to the target light application, and the target light application invokes the corresponding atomic capabilities to fulfil the voice control intent. In other words, when the voice control function is started, the atomic capabilities of the target light application are registered, so the system knows which functions the target light application can implement; when voice data is later received, its intent can be converted, based on the registered atomic capabilities, into a voice control event the target light application can execute. A light application can thus be controlled by voice, manual operation is no longer required while driving, and the safety of vehicle driving is improved.
Drawings
To illustrate the technical solutions in the embodiments or the related art more clearly, the drawings needed for the description are briefly introduced below. Obviously, the drawings described below show only embodiments of the present application; other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application;
fig. 2 is a schematic flowchart of a voice control method according to an embodiment of the present application;
FIGS. 3A-3C are schematic diagrams of voice-controlled interfaces provided in an embodiment of the present application;
FIG. 4 is a flow chart illustrating event conversion according to an embodiment of the present application;
fig. 5 is a schematic flowchart of a multi-round voice session control process provided in an embodiment of the present application;
FIGS. 6A-6C are schematic diagrams illustrating the execution results of the target light application according to the embodiment of the present disclosure;
fig. 7 is a schematic flowchart of voice control for an application according to an embodiment of the present application;
fig. 8A and 8B are schematic diagrams of pages awakened by an application according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a voice control system according to an embodiment of the present application;
FIG. 10 is a diagram illustrating relationships between function classes provided by an embodiment of the present application;
FIG. 11 is a block diagram of an overall system speech framework according to an embodiment of the present application;
FIG. 12 is a flow chart illustrating the use of voice control by an applet according to an embodiment of the present application;
FIG. 13 is an exemplary diagram of semantic recognition and execution provided by embodiments of the present application;
FIG. 14 is a schematic flow chart of intent recognition provided by an embodiment of the present application;
fig. 15 is a schematic flowchart of a voice wake-up process performed by an application according to an embodiment of the present application;
fig. 16 is a schematic structural diagram of a voice control apparatus according to an embodiment of the present application;
fig. 17 is a schematic structural diagram of a computer device according to an embodiment of the present application;
fig. 18 is a schematic structural diagram of another computer device according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions, and advantages of the present application clearer, the technical solutions in the embodiments are described below completely and clearly with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art without creative effort based on the embodiments given herein fall within the protection scope of the present application. The embodiments and the features of the embodiments may be combined with each other arbitrarily as long as they do not conflict. Also, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.
For the convenience of understanding the technical solutions provided by the embodiments of the present application, some key terms used in the embodiments of the present application are explained first:
light application: the light application is a program which is hosted by a client and runs in a hosting running environment, a user can obtain services provided by a light application provider by directly opening the light application in the hosting running environment, and the light application can be in the form of an applet, a public number and the like. The characteristic that the client does not need to be downloaded or used for searching is that the client is installed without occupying the storage space of the terminal equipment, so that the storage space is saved, and the operation is quick. In the embodiment of the application, the host running environment of the light application can be an application program client or an operating system of the vehicle-mounted terminal device.
Applet (mp-parser, MPP): a form of light application, which likewise can be used as soon as it is found by search, without being downloaded.
Atomic capability (aa): for a piece of program code, atomicity means that either all operations in it are completed or none of them are; execution cannot stop at an intermediate step. Accordingly, an atomic operation cannot be interrupted and either succeeds or fails as a whole. An atomic capability is therefore the smallest capability unit that can be executed independently. For a light application, one atomic capability implements at least one function of the light application. For example, for "XX Storytelling", a reading-type light application, the atomic capabilities may implement functions such as playing a story and searching for a story; for "XX Video", a video playback application, the atomic capabilities may implement functions such as video search, video playback, and video pause.
In an application program, each function is usually wrapped as a function to be called to implement it. From another perspective, an atomic capability can therefore also be regarded as a function, containing a function name, parameters, and other related information; invoking an atomic capability means executing a function. For example, video playback is implemented by calling a video playback interface function.
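Viewed this way, a capability registry can be sketched as a map from capability names to functions. The sketch below is only an illustration of the "atomic capability as function" idea (TypeScript, with invented capability names), not code from the patent.

    // Each handler either completes as a whole or fails as a whole.
    type CapabilityHandler = (args: Record<string, unknown>) => Promise<void>;

    const capabilities: Record<string, CapabilityHandler> = {
      // Hypothetical capabilities of a video-playing light application.
      playVideo: async (args) => { /* call the player interface with args.id */ },
      searchVideo: async (args) => { /* call the search interface with args.keyword */ },
    };

    async function invoke(name: string, args: Record<string, unknown>): Promise<void> {
      const handler = capabilities[name];
      if (!handler) throw new Error(`unregistered capability: ${name}`);
      await handler(args); // one invocation = one function call, with no intermediate state
    }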
Intent: the purpose expressed by the input voice data. For example, if the input is "play the XX story with XX Storytelling", the intent is to play the content "XX story" through the application (which may be a client or a light application) named "XX Storytelling". It is generally understood that, to achieve this purpose, the "XX story" content resource needs to be searched for in the "XX Storytelling" application and the player of "XX Storytelling" needs to be invoked to start playing it; other steps may of course be involved in practice.
Voice endpoint detection: also called Voice Activity Detection (VAD) or voice boundary detection; it detects the speech state of a segment of audio data, where the speech state includes a start state, a pause/resume state, and an end state.
Automatic Speech Recognition (ASR): its goal is to convert the lexical content of speech data into computer-readable input such as key presses, binary codes, or character sequences.
Active page: when an application is running in the foreground, the active page is the page currently open in the application; when the application is running in the background, the active page may be the page that was open before the application was switched away.
The embodiments of the present application relate to Artificial Intelligence (AI). Artificial intelligence is the theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines so that the machines can perceive, reason, and make decisions.
Artificial intelligence is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, automatic driving, and intelligent transportation.
The key technologies of Speech Technology are automatic speech recognition, speech synthesis, and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and speech is regarded as one of the most promising interaction modes.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics; research in this field involves natural language, i.e. the language people use every day, so it is closely related to linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
The solution provided by the embodiments of the present application involves artificial intelligence technologies such as speech technology and natural language processing. Specifically, in the embodiments of the present application, when the target light application starts the voice control function, atomic capability data is obtained from the target light application and atomic capability registration is performed. When voice data is received, speech recognition is performed on the collected voice data using speech technology, and semantic understanding is performed on the recognition result using natural language processing to determine the target intent corresponding to the voice data; the target intent is then converted, based on the registered atomic capabilities, into a voice control event that the target light application can execute. In addition, speech synthesis can be used to generate and play voice for human-machine interaction with the driver.
The following briefly introduces the design concept of the embodiments of the present application.
At present, more and more application programs introduce light application forms such as mini programs. However, the voice control function currently stays at the application level and can only realize simple functions such as opening, closing or switching applications; it cannot provide voice control of the application functions inside a mini program. Control of light applications therefore still depends on manual operation, the interaction mode is limited, and the safety risk of vehicle driving increases.
In view of this, an embodiment of the present application provides a voice control method. When a target light application starts its voice control function, atomic capability data is obtained from the target light application and atomic capability registration is performed. When first voice data input for the target light application is received, the first voice data is converted, based on the registered atomic capabilities, into a corresponding voice control event, that is, into the atomic capabilities the target light application needs to invoke to achieve the target intent of the first voice data. The voice control event is sent to the target light application, which invokes the corresponding atomic capabilities to fulfil the voice control intent. In other words, registering the atomic capabilities of the target light application when the voice control function is started makes it known which functions the target light application can implement; when voice data is later received, its intent can be converted on that basis into a voice control event the target light application can execute. A light application can thus be controlled by voice, no manual operation is needed while driving, and the safety of vehicle driving is improved.
In addition, in the embodiments of the present application, a target application on the terminal device can also be woken up by voice. Specifically, the wake-up word set of the target application's active page is registered; when voice data is received, speech recognition is performed and the recognition result is matched against the wake-up word set; when a match is found, a wake-up operation is performed on the target application and the corresponding function is executed, realizing voice wake-up.
Some simple descriptions are given below to application scenarios to which the technical solution of the embodiment of the present application can be applied, and it should be noted that the application scenarios described below are only used for describing the embodiment of the present application and are not limited. In the specific implementation process, the technical scheme provided by the embodiment of the application can be flexibly applied according to actual needs.
The scheme provided by the embodiment of the application can be suitable for most voice control scenes, and is particularly suitable for the voice control scene of the vehicle-mounted terminal equipment. As shown in fig. 1, an application scenario schematic diagram provided in the embodiment of the present application may include a vehicle-mounted terminal device 101 and a cloud server 102.
The vehicle-mounted terminal device 101 may be, for example, a vehicle central control device or a terminal device communicatively connected with the vehicle, including but not limited to a mobile phone, a computer, an intelligent voice interaction device, a smart home appliance, a vehicle-mounted terminal, an aircraft, and the like. The vehicle-mounted terminal device 101 may run the voice control process, the controlled application, and the light application. The cloud server 102 may provide background services for the voice control process; for example, it may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, Content Delivery Network (CDN), and big data and artificial intelligence platforms, but is not limited thereto.
It should be noted that, the voice control method in the embodiment of the present application may be executed by the vehicle-mounted terminal device 101 or the cloud server 102 alone, or may be executed by both the cloud server 102 and the vehicle-mounted terminal device 101. For example, in response to the start of the voice control function of the target light application, the in-vehicle terminal device 101 acquires the atomic capability data from the target light application, performs atomic capability registration based on the atomic capability data, converts the input first voice data into a corresponding voice control event based on each atomic capability that has been registered, and transmits the voice control event to the target light application, so that the target light application calls at least one atomic capability to achieve the target intention. Alternatively, the cloud server 102 performs the above steps. Or, the vehicle-mounted terminal device 101 responds to the start of the voice control function of the target light application, acquires the atomic capability data from the target light application, performs atomic capability registration based on the atomic capability data, receives the input first voice data, performs semantic recognition and conversion on the first voice data by using rich computing resources of the cloud server 102, generates a corresponding voice control event, and sends the voice control event to the target light application by the vehicle-mounted terminal device 101, so that the target light application calls at least one atomic capability to realize a target intention.
Taking the vehicle-mounted terminal device 101 to perform the above steps as an example, the vehicle-mounted terminal device 101 may include one or more processors, memories, I/O interfaces for interacting with terminals, and the like. The memory of the vehicle-mounted terminal device 101 may further store program instructions of the voice control method provided in the embodiment of the present application, and when the program instructions are executed by the processor, the program instructions can be used to implement the steps of the voice control method provided in the embodiment of the present application, so as to implement the corresponding voice control process.
In the embodiment of the present application, the vehicle-mounted terminal device 101 and the cloud server 102 may be directly or indirectly connected through one or more networks 103. The network 103 may be a wired network or a Wireless network, for example, the Wireless network may be a mobile cellular network, or may be a Wireless-Fidelity (WIFI) network, or may also be other possible networks, which is not limited in this embodiment of the present invention.
It should be noted that fig. 1 is only an example, and actually, the number of the vehicle-mounted terminal device and the cloud server is not limited, and is not specifically limited in the embodiment of the present application. Of course, the method provided in the embodiment of the present application is not limited to be used in the application scenario shown in fig. 1, and may also be used in other possible application scenarios, and the embodiment of the present application is not limited. The functions that can be implemented by each device in the application scenario shown in fig. 1 will be described in the following method embodiments, and will not be described in detail herein.
In a possible application scenario, to reduce communication delay, cloud servers 102 may be deployed in various regions; alternatively, to balance load, different cloud servers 102 may serve the vehicle-mounted terminal devices 101 of different regions. For example, a vehicle-mounted terminal device 101 located at place a establishes a communication connection with the cloud server 102 serving place a, and a vehicle-mounted terminal device 101 located at place b establishes a communication connection with the cloud server 102 serving place b. The multiple cloud servers 102 form a data sharing system and share data through a blockchain.
Each cloud server 102 in the data sharing system has a node identifier corresponding to the cloud server 102, and each cloud server 102 in the data sharing system may store node identifiers of other cloud servers 102 in the data sharing system, so that the generated block is broadcast to other cloud servers 102 in the data sharing system according to the node identifiers of other cloud servers 102 in the following. Each cloud server 102 may maintain a node identifier list, and the name of the cloud server 102 and the node identifier are correspondingly stored in the node identifier list. The node identifier may be an Internet Protocol (IP) address of an interconnection between networks and any other information that can be used to identify the node.
The method provided by the exemplary embodiments of the present application is described below with reference to the accompanying drawings and the application scenarios described above. It should be noted that the above application scenarios are shown only for the convenience of understanding the spirit and principle of the present application, and the embodiments of the present application are not limited in this respect.
Referring to fig. 2, a schematic flow chart of a voice control method provided in the embodiment of the present application is shown, and a specific implementation flow of the method is as follows:
step 201: responding to the starting of the voice control function of the target light application, acquiring atomic capability data from the target light application, and performing atomic capability registration based on the atomic capability data, wherein the atomic capability data comprises: atomic capabilities that the targeted light application can provide, each atomic capability for implementing at least one function of the targeted light application.
In the embodiment of the present application, the starting of the voice control function of the target light application may be triggered based on a variety of conditions, including but not limited to the following conditions:
(1) Starting the target light application: when the target light application is started, its voice control function is started accordingly.
Specifically, the target light application may be started with a button, for example by clicking its icon or by a hardware quick-launch button, or it may be started by voice. Fig. 3A is a schematic diagram of starting the target light application by voice: the current page of the vehicle-mounted terminal device is a "favorite movies" page, and to enter the light application page of "XX Reading" the user inputs "XX Reading" by voice. The vehicle-mounted terminal device receives the voice data, determines through speech recognition that the text content input by the user is "XX Reading", displays the text in the voice control display area of the current page, and at the same time performs the corresponding action, namely starting the "XX Reading" light application; the voice control function for "XX Reading" is started accordingly.
(2) Operating the voice control function button in a page of the target light application. Referring to Fig. 3B and Fig. 3C, Fig. 3B is a schematic diagram of a page in the light application: it is a search page containing a search text input box as well as a voice control function button, through which the search content can be entered by voice. Operating this button can therefore trigger the start of the light application's voice control function.
It should be noted that, as can be seen in Fig. 3B, the voice control function button can be placed at multiple positions: besides being bound to a specific function, such as next to the text input box in Fig. 3B, it can also be placed in the menu bar, so that the voice control function can be started quickly; even if a voice control function button is not integrated in the current page, the start of the light application's voice control function can still be triggered.
In the embodiments of the present application, when the voice control function of the target light application is started, atomic capability data can be obtained from the target light application and atomic capability registration performed based on it; the system thereby learns which functions the target light application can realize and how to realize them, so that voice data can subsequently be converted into voice control events the target light application can execute.
In a possible implementation, a voice control process may be integrated in the vehicle-mounted terminal device. The operating system of a vehicle-mounted terminal device that supports voice control usually already integrates a voice control process; to distinguish them, the voice control process provided by the operating system is called the system voice control process, or system voice for short, and the voice control process provided by the embodiments of this application is called the extended voice control process, or extended voice for short. System voice and extended voice run as independent processes on the vehicle-mounted terminal device; that is, the voice control function is further extended on top of the device's own voice control.
Then, when the voice control function of the target light application is started, the extended voice is triggered to acquire the atomic capability data from the target light application and to perform atomic capability registration based on it. The purpose of this registration process is to let the extended voice analyze and understand which functions the target light application can realize and how to realize them.
Specifically, the extended voice may run as an independent process, i.e., as a program independent of other applications; alternatively, it may be integrated, in the form of a Software Development Kit (SDK), into an application program capable of hosting light applications.
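As a rough sketch of this registration step (the interface and method names below are assumptions, not prescribed by the patent), the extended voice could keep the capabilities it obtains from the target light application in a local registry:

    interface Capability { name: string; description: string; params: Record<string, string>; }

    interface LightApp {
      // Exposed by the light application, e.g. through the third-party service interface.
      getAtomicCapabilityData(): Promise<{ appId: string; version: string; capabilities: Capability[] }>;
    }

    class ExtendedVoice {
      private registry = new Map<string, Capability>();

      // Called when the voice control function of the target light application is started.
      async onVoiceControlStarted(app: LightApp): Promise<void> {
        const data = await app.getAtomicCapabilityData();
        for (const cap of data.capabilities) {
          this.registry.set(cap.name, cap); // the extended voice now knows what the app can do
        }
      }
    }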
Step 202: responding to first voice data input by the target object aiming at the target light application, and converting the first voice data into corresponding voice control events based on the registered atomic capabilities, wherein the voice control events comprise: to achieve the target intent of the first voice data, the target light application requires at least one atomic capability to invoke.
In the embodiment of the application, after the voice control function of the target light application is started, the target object can perform voice input for the target light application so as to perform corresponding function control. As shown in fig. 3C, after the voice control function is triggered, a corresponding voice control input area is presented, so that text information corresponding to the voice data input by the target object is displayed conveniently.
In a possible implementation manner, the extended voice in the vehicle-mounted terminal device may integrate the functions of voice recognition and semantic recognition, and then the event conversion may be performed by the extended voice of the vehicle-mounted terminal device for the first voice data input by the target object.
Specifically, referring to fig. 4, a flow chart of converting the first voice data into the corresponding voice control event is shown.
When the extended voice obtains the first voice data, it may first perform speech recognition on it to acquire the text information it contains. Referring to Fig. 4, while music is playing, the target object inputs a segment of voice data; speech recognition yields the corresponding text "next", which can also be displayed to the target object, as shown on the display interface in Fig. 4. Semantic recognition is then performed on the recognized text to determine the corresponding target intent, i.e., the function the target object wants the target light application to realize; in Fig. 4, semantic recognition shows that the intent is to switch to the next track. In addition, event configuration information can be stored in advance. The event configuration information contains guidance for configuring the parameters of at least one atomic capability according to the target intent, i.e., it guides how to select atomic capabilities and configure their parameters once the intent is known. It may be configured as a script: for a given target intent, the event configuration information specifies the scripted procedure needed to achieve it. In this way, the target intent can be converted into a voice control event (action) based on the preset event configuration information and the registered atomic capabilities. As shown in Fig. 4, event conversion yields the action "WeCarFlow: canPlayNext", where "canPlayNext" is the name of the function called to switch to the next track and "WeCarFlow" is the name of the APP currently playing music; that is, the action instructs "WeCarFlow" to execute its "canPlayNext" function to switch to the next song.
In the embodiments of the present application, intent recognition is performed on the voice data input by the target object using artificial-intelligence-based speech recognition and natural language understanding, so the target object's intent is obtained accurately and the accuracy of the converted action is improved. By pre-configuring the event configuration information and registering the atomic capabilities, the extended voice gains the ability to convert actions, turning the target object's voice data into an action the target light application can execute; this realizes the voice control process for the light application and improves the safety of vehicle driving.
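The conversion pipeline of Fig. 4 can be sketched as follows. asr() and nlu() stand in for the speech recognition and semantic recognition services, and the intent name and mapping table are invented for illustration; only the "WeCarFlow" / "canPlayNext" example comes from the description above.

    declare function asr(audio: ArrayBuffer): Promise<string>;              // speech -> text
    declare function nlu(text: string): Promise<{ name: string }>;          // text -> intent

    interface Action { app: string; capability: string; args: Record<string, unknown>; }

    // Event configuration information: which capability an intent maps to and how its
    // parameters are filled (shown here as a static table; the patent describes it as a script).
    const eventConfig: Record<string, { capability: string; args?: Record<string, unknown> }> = {
      switch_next: { capability: "canPlayNext" },
    };

    async function toVoiceControlEvent(audio: ArrayBuffer): Promise<Action> {
      const text = await asr(audio);          // e.g. "next"
      const intent = await nlu(text);         // e.g. { name: "switch_next" }
      const rule = eventConfig[intent.name];
      if (!rule) throw new Error(`no event configuration for intent: ${intent.name}`);
      return { app: "WeCarFlow", capability: rule.capability, args: rule.args ?? {} };
    }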
Considering that, in practice, intent recognition is related to information about the light application, and that the same text may be recognized as different intents in different light applications, the application description information of the target light application can be combined during intent recognition to improve its accuracy.
Specifically, application description (context) information may be further obtained from the target light application, and the context information may include at least one of the following information:
(1) Basic description information of the target light application, for example light application information (APP info) such as the light application identifier (APP ID) and version number, account information of the current login, location information of the current position, and current system information (system info), where system info indicates the operating system of the current device, such as a Linux or Windows system.
(2) Context information of the voice control. There may be multiple rounds of conversation during voice control, and these rounds may be related to each other. For example, if the target object said "play the XX story" in the previous round and the search returned multiple results, the target object may be prompted to choose which one to play; if the target object then inputs "play the first one", understanding it in combination with the context makes it possible to determine accurately that what should be played is the first item of the current search results.
Furthermore, during intent recognition, semantic recognition can be performed on the text information in combination with the application description information, so that an accurate target intent is obtained.
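A possible shape for this application description (context) information is sketched below; the field names are assumptions made for illustration.

    interface AppContext {
      appInfo: { appId: string; version: string };              // basic description of the light application
      account?: string;                                         // currently logged-in account
      location?: { latitude: number; longitude: number };       // current position
      systemInfo?: string;                                      // e.g. "Linux" or "Windows"
      dialogue?: Array<{ role: "user" | "app"; text: string }>; // earlier rounds, for multi-round understanding
    }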
In a possible implementation, considering that the computing capability of the vehicle-mounted terminal device is usually limited, while semantic recognition usually relies on a pre-trained semantic recognition model and therefore consumes considerable computing resources, the intent recognition process may take long and delay the feedback to the target object. To speed up intent recognition, the rich computing resources of the cloud can provide strong computing power for semantic recognition.
Therefore, when the extended voice receives the first voice data input by the target object, it transmits the first voice data to the cloud server; the cloud server performs the speech recognition and semantic recognition and returns the recognized text and the action to the extended voice. It should be noted that, in practice, inputting the voice data takes a certain amount of time, and the extended voice or the cloud server can perform speech recognition while the target object is still speaking, i.e., the target object inputs while the background recognizes; there is no need to wait until the user finishes before continuing the subsequent recognition, which improves the real-time performance of voice control.
Specifically, if the cloud server is used for the speech recognition and semantic recognition, then for the process to run normally the extended voice also sends the atomic capability data to the cloud server, so that the atomic capabilities are registered in the cloud server.
Considering that the cloud server has a certain storage capacity, it can store the atomic capabilities of each light application. That is, when the atomic capabilities of the target light application are not yet registered in the cloud server, the extended voice sends the acquired atomic capability data of the target light application to the cloud server for registration. Once registered, the cloud server can either de-register and clear the related data when the whole process ends, or keep the registered atomic capabilities, so that the next time this vehicle-mounted terminal device, or another one, voice-controls the target light application, only the identification information of the target light application to be controlled, such as its APP ID and version number, needs to be sent, without sending the atomic capability data again for registration. This saves transmission resources, reduces the amount of transmitted data, and speeds up the response of the whole process.
After the cloud server registers the atomic capabilities, the extended voice obtains the application description information from the target light application and can then send the application description information and the first voice data to the cloud server. The cloud server determines the target intent from the application description information and the first voice data, converts the target intent into an action based on the registered atomic capabilities, and returns the action to the extended voice.
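One way the extended voice might avoid re-uploading capability data that the cloud server already holds is sketched below; the endpoint URL, paths, and payloads are assumptions, not an API defined by the patent.

    async function ensureRegistered(
      appId: string,
      version: string,
      capabilities: unknown[],
    ): Promise<void> {
      const base = "https://cloud.example.com/capabilities"; // placeholder endpoint
      const res = await fetch(`${base}/${appId}/${version}`);
      if (res.status === 404) {
        // Not registered yet: upload the full atomic capability data once.
        await fetch(`${base}/${appId}/${version}`, {
          method: "PUT",
          headers: { "Content-Type": "application/json" },
          body: JSON.stringify(capabilities),
        });
      }
      // Otherwise only the APP ID and version number need to accompany later requests.
    }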
Step 203: the voice control event is sent to the target light application such that the target light application invokes at least one atomic capability to achieve the target intent.
In the embodiments of the present application, the atomic capabilities of the target light application are registered when the voice control function is started. When voice control data is then received, its intent can be converted, based on the registered atomic capabilities, into an action the target light application can execute. A light application can thus be controlled by voice, no manual operation is needed while driving, and the safety of vehicle driving is improved.
Considering that there may be multiple rounds of conversation during voice control, after step 203 the process shown in Fig. 5, a flowchart of the multi-round voice conversation control process provided by the embodiments of the present application, may also be carried out.
Step 204: and receiving an execution result returned after the target light application executes the voice control event.
Step 205: whether feedback control is necessary is determined based on the execution result.
After the target light application executes the action, the expanded voice can obtain an execution result from the target light application.
For example, referring to Figs. 6A to 6C: as shown in Fig. 6A, when the text of the input first voice data is "play a children's song with XX Music", the result of executing this in XX Music is a search for children's songs, but it cannot yet be determined which children's song the target object wants to play. As shown in Fig. 6A, XX Music may output the feedback voice "found children's songs XX for you; which one should be played" and display the search results on the display interface. From this execution result it can be seen that the purpose of playing a song has not yet been achieved, so it can be determined that feedback from the target object is required.
Or, when the text information of the input first voice data is "switch next", the XX music performs a switching operation, and the purpose is achieved by switching to the next song, so that it can be determined that feedback is not required according to the execution result.
Step 206: if the determination result in step 205 is yes, that is, if the execution result indicates that feedback control is required, the recording function is started to obtain the second voice data input by the target object.
Step 207: and converting the second voice data into corresponding voice control events based on the registered atomic capabilities, and sending the obtained voice control events to the target light application for execution.
Referring to Fig. 6B, when it is determined that feedback is needed, recording is started; after the second voice data "play the 2nd" input by the target object is acquired and the speech recognition and semantic recognition are performed, it can be determined that the target object intends to play song 2 displayed on the current page. An action for playing that song is generated and sent to the target light application for execution, so the target light application performs the operation of playing song 2 and presents the music playing interface shown in Fig. 6C.
The process of steps 206 and 207 is similar to the first voice data conversion process, and therefore will not be described herein.
Step 208: if the result of the determination in step 205 is no, that is, if the execution result indicates that feedback control is not needed, the target light application is de-registered, for example by clearing the atomic capabilities registered by the target light application and notifying the operating system to release the usage right of the audio acquisition device that is currently occupied.
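A minimal sketch of the control flow of steps 204 to 208 is given below, assuming hypothetical helpers for recording, conversion/execution and de-registration; the names (RunMultiTurnDialogue, ExecutionResult, needsFeedback) are illustrative only.

    #include <functional>
    #include <iostream>
    #include <string>

    // Hypothetical result returned by the light application after executing a voice control event.
    struct ExecutionResult {
        bool needsFeedback;   // true when the target intention is not yet fully achieved
        std::string prompt;   // e.g. "which song should be played?"
    };

    // Keep recording and converting follow-up voice data until the execution result
    // no longer asks for feedback, then de-register the light application.
    void RunMultiTurnDialogue(std::function<std::string()> record,
                              std::function<ExecutionResult(const std::string&)> convertAndExecute,
                              std::function<void()> deRegister) {
        ExecutionResult result = convertAndExecute(record());   // first voice data
        while (result.needsFeedback) {
            result = convertAndExecute(record());               // second and later voice data
        }
        deRegister();   // clear atomic capabilities and release the audio acquisition device
    }

    int main() {
        int turn = 0;
        RunMultiTurnDialogue(
            [&] { return turn == 0 ? std::string("play children songs") : std::string("play the 2nd one"); },
            [&](const std::string& text) { return ExecutionResult{turn++ == 0, text}; },
            [] { std::cout << "de-registered\n"; });
        return 0;
    }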
In the embodiment of the application, the intention of the target object can be realized step by step through multiple rounds of voice conversation, and the target object does not need to perform manual operation in any round, so that the safety of vehicle driving is improved.
In the embodiment of the present application, in addition to voice control of the light application, voice control can also be performed on applications in the vehicle-mounted terminal device; fig. 7 is a schematic flow diagram of performing voice control on an application.
Step 701: and responding to the starting of the voice awakening function of the target application, acquiring an awakening word set of the target application, and registering the awakening word based on the awakening word set.
In this embodiment of the present application, the waking of the application refers to starting the application to execute a certain function, and may include waking in the following cases:
(1) When the vehicle-mounted terminal device is in the screen-off state, the application is woken up to perform the function.
(2) When the application runs in the background, the application is woken up to perform the function.
(3) When the application is not started, the application is woken up to perform the function.
Of course, other possible wake-up situations may also be included, which is not limited in this embodiment of the application.
Specifically, in order to wake up an application, wake-up words for the application need to be set first; when input voice data matches a wake-up word, it is determined that the target object wants to wake up that application. An application that needs to be woken up therefore has to register its wake-up words with the extended voice before the subsequent process is carried out. For example, an instant messaging application may register wake-up words such as "open XX application", "login", "send message" and "cancel".
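A minimal registration sketch follows, assuming a hypothetical per-application wake-word table; WakeupRegistry and RegisterWakeupWords are illustrative names, not the interface of the disclosed system.

    #include <iostream>
    #include <map>
    #include <set>
    #include <string>
    #include <utility>

    // Hypothetical wake-word table kept by the voice component for each application.
    class WakeupRegistry {
    public:
        void RegisterWakeupWords(const std::string& app, std::set<std::string> words) {
            words_[app] = std::move(words);
        }
        const std::set<std::string>& WordsOf(const std::string& app) const { return words_.at(app); }
    private:
        std::map<std::string, std::set<std::string>> words_;
    };

    int main() {
        WakeupRegistry registry;
        // An instant messaging application registers its wake-up words (step 701).
        registry.RegisterWakeupWords("chat_app", {"open XX application", "login", "send message", "cancel"});
        std::cout << registry.WordsOf("chat_app").size() << " wake-up words registered\n";
        return 0;
    }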
Referring to fig. 8A, a schematic page diagram provided in the embodiment of the present application is shown. The target object may start the system voice and pull up the extended voice; for example, the target object says "hello, X car", which starts the system voice of the current vehicle-mounted terminal device and pulls up the extended voice, and a screen as shown in fig. 8A is presented, in which selectable shortcut options are displayed. Of course, the target object may also make a selection by manual operation or by voice input.
Step 702: and responding to third voice data input by the target object aiming at the target application, and performing voice endpoint detection on the third voice data.
In the embodiment of the application, when the application is controlled by voice, the extended voice starts recording to obtain the third voice data input by the target object. Since not every part of the voice data input by the target object contains text information, voice endpoint detection is used to find the valid voice data in the third voice data, namely the voice segments in which the target object is actually speaking.
Taking the above fig. 8A as an example, the user may enter the third voice data "send chat message" by using a voice input method.
Specifically, based on the position of the voice segment, the states detected by the voice endpoint may include:
(1) the start state, i.e. the starting position of a valid speech segment.
(2) The pause state, i.e. the position where no valid voice has been detected within the set duration since valid voice was last detected, or where the target object requests to pause the recording.
(3) The ending state, i.e. the termination position of a valid speech segment.
(4) The timeout state, i.e., no valid speech is detected for more than a certain length of time.
(5) The network connection timeout state, i.e. the connection to the cloud server has exceeded a certain duration.
(6) A pause resume state, i.e. a state from the pause state to when a valid speech is detected, or the target object requests to resume recording.
Step 703: if the voice starting position in the third voice data is detected, performing voice recognition on the third voice data from the voice starting position until the voice ending position in the third voice data is detected.
In a specific implementation, when the voice start position in the third voice data is detected, that is, the position where valid voice first appears, voice recognition is performed on the third voice data from that position until the voice end position, i.e. the end state, is detected. This reduces the amount of data fed to voice recognition and further improves the speed of voice recognition.
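For illustration, the six endpoint states listed above can be thought of as an enumeration that gates whether audio is fed to the recognizer; the enum and function names below (VadState, ShouldFeedRecognizer) are assumptions made for this sketch.

    #include <iostream>

    // The six endpoint-detection states described above, as a hypothetical enumeration.
    enum class VadState {
        Start,            // start position of a valid speech segment
        Pause,            // no valid speech within the set duration, or recording paused on request
        End,              // termination position of a valid speech segment
        Timeout,          // no valid speech detected for more than a certain duration
        NetworkTimeout,   // connection to the cloud server exceeded the time limit
        PauseResume       // valid speech detected again, or recording resumed on request
    };

    // Recognition is only fed with audio between the Start and End states (step 703),
    // which reduces the amount of data passed to speech recognition.
    bool ShouldFeedRecognizer(VadState state, bool recognizing) {
        if (state == VadState::Start) return true;
        if (state == VadState::End || state == VadState::Timeout) return false;
        return recognizing;   // keep the current behaviour for the other states
    }

    int main() {
        std::cout << std::boolalpha << ShouldFeedRecognizer(VadState::Start, false) << "\n";
        return 0;
    }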
Step 704: and determining whether to wake up the target application based on the obtained voice recognition result.
Step 705: if the determination result in step 704 is yes, that is, it is determined that the target application is woken up, a wake-up operation is performed on the target application.
Specifically, the registered wake-up words are traversed for the obtained voice recognition result to determine whether the result contains a registered wake-up word. If it does, it is determined that the target application needs to be woken up, and a wake-up operation is performed: for example, the target application is opened and the voice recognition result is sent to it for corresponding processing, or the voice recognition result is converted into a corresponding control instruction and sent to the target application. Wake-up control of the target application is thus achieved without the target object performing manual wake-up control, which improves driving safety.
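The traversal step can be sketched as follows; ShouldWakeUp and the example word list are hypothetical, and a real implementation would also handle scene-specific word sets and fuzzy matching.

    #include <iostream>
    #include <string>
    #include <vector>

    // Traverse the registered wake-up words and report whether the recognition result contains one.
    bool ShouldWakeUp(const std::string& recognitionResult,
                      const std::vector<std::string>& registeredWakeupWords) {
        for (const auto& word : registeredWakeupWords) {
            if (recognitionResult.find(word) != std::string::npos) return true;
        }
        return false;
    }

    int main() {
        std::vector<std::string> words = {"send chat message", "login", "cancel"};
        std::cout << std::boolalpha << ShouldWakeUp("please send chat message to Li", words) << "\n";
        return 0;
    }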
Referring to fig. 8B, when "send chat message" matches the chat application, a wake-up operation is performed on the chat application and the chat message sending interface shown in fig. 8B is presented. The target object may then record the chat voice to be sent for the chat application and send it; the chat object may be determined based on an operation or voice selection of the target object, or may be the chat object corresponding to the currently active page of the chat application.
In the embodiment of the application, to address the problem of false wake-up, wake-up words are registered per page scene. Specifically, each page scene corresponds to a different wake-up word set; when the target application switches its active page, the wake-up word set corresponding to the new active page is obtained from the target application and wake-up word registration is performed based on that set. Each page scene thus goes through the complete process shown in fig. 7.
In a possible implementation manner, the wake-up word set may be divided according to a page scene type, for example, into a login scene, a message sending scene, a voice chat scene, or a video chat scene, where the same page scene may correspond to the same wake-up word set, and when a new page is switched, the wake-up word set corresponding to the page scene is obtained and registered. For example, when the target application is switched from page 1 to page 2, the wake-up word set corresponding to the page scene type of page 2 is acquired for registration.
In another possible implementation manner, in order to control each page more accurately, the wake-up word set may also be set for each page, and when switching to a new page, the wake-up word set corresponding to the page is acquired and registered. For example, when the target application is switched from page 1 to page 2, the wake word set of page 2 is acquired for registration.
Fig. 9 is a schematic architecture diagram of a voice control system according to an embodiment of the present application. In the embodiment of the present application, the above voice control method can be implemented by the architecture shown in fig. 9. The framework mainly comprises a light application MPP, an interface component Moss, a voice component Speech and a cloud server, where the interface component Moss and the voice component Speech together form the extended voice. In practice the interface component Moss and the voice component Speech may be two parts of the same process or two independent processes; for example, the voice component Speech runs as an independent process module and the interface component Moss serves as the adaptation layer between the system voice and Speech. It should be noted that the functions of the cloud server shown in fig. 9 can also be implemented in the vehicle-mounted terminal device; the description here takes the cloud server implementation as an example. The modules are introduced as follows:
First, interface component Moss
The Moss layer mainly serves as an encapsulation layer for operating system service interfaces and third-party service interfaces: it encapsulates the third-party service interfaces provided by the light applications in the vehicle-mounted terminal device and the system service interfaces provided by the operating system, for example the voice data acquisition interface used for collecting voice data. As shown in fig. 9, the Moss layer includes the following two modules:
(1) the Application Interface management module, i.e., the client module shown in fig. 9, encapsulates a mobile platform Application Program Interface (API), which may include an API provided by an Application, a light Application, and an extended voice itself.
(2) The system interface management module, i.e. the server module shown in fig. 9, encapsulates the system service interfaces provided by the operating system and can apply to the operating system for the audio focus in order to implement the recording (recorder) function. That is, the usage right of the voice acquisition device is requested by calling the voice data acquisition interface provided by the operating system, and after the application succeeds, the voice data collected by the voice acquisition device can be obtained. In addition, the server module encapsulates the transmission protocol used to communicate with the operating system, for example the protocol buffer message protocol shown in fig. 9, although other protocols may also be adopted, which is not limited in this embodiment of the present application.
Referring to fig. 9, the Moss and the MPP, and the Moss and the Speech may communicate through Inter-Process Communication (IPC), and arrows shown in fig. 9 indicate data flows in the voice recognition Process.
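As a rough illustration of the server module's role, the sketch below applies for the audio focus and forwards a buffer of recorded data towards Speech; SystemAudioService, IpcChannel and their methods are stand-ins invented for this sketch and do not reflect the actual operating system or IPC interfaces.

    #include <cstdint>
    #include <functional>
    #include <iostream>
    #include <vector>

    // Hypothetical stand-ins for the interfaces the Moss layer encapsulates: a system
    // service interface for the audio focus / recorder, and an IPC channel towards Speech.
    struct SystemAudioService {
        bool RequestAudioFocus() { return true; }   // ask the OS for the recorder
        void ReleaseAudioFocus() {}
    };

    struct IpcChannel {
        void Send(const std::vector<int16_t>& pcm) { std::cout << "forward " << pcm.size() << " samples\n"; }
    };

    // Apply for the audio focus, then forward recorded PCM data to the Speech process over IPC.
    void StartRecording(SystemAudioService& os, IpcChannel& toSpeech,
                        std::function<std::vector<int16_t>()> readBuffer) {
        if (!os.RequestAudioFocus()) return;   // the usage right could not be obtained
        toSpeech.Send(readBuffer());           // in practice this runs in a loop
        os.ReleaseAudioFocus();
    }

    int main() {
        SystemAudioService os;
        IpcChannel channel;
        StartRecording(os, channel, [] { return std::vector<int16_t>(320, 0); });
        return 0;
    }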
Second, Speech component Speech
Speech is a text-fusion voice module and the core module for implementing the voice control process in the embodiment of the present application; it implements processes such as speech recognition and semantic recognition in voice control and may include the following modules:
(1) The voice recognition and execution (VrLogic) module mainly interacts with the semantics-related service modules of the cloud server and uses the online semantic recognition function provided by the cloud server. The VrLogic module comprises an IntraDmResponse sub-module and a DmActomic sub-module: the IntraDmResponse sub-module is the interface module that is woken up and called inside Speech and implements internal interface calls and responses, and the DmActomic sub-module interacts with the cloud server to realize the online semantic recognition function.
(2) The AISDK module has multiple built-in speech processing capabilities based on artificial intelligence. As shown in fig. 9, the AISDK module contains a VAD sub-module, an ASR sub-module, an NLU sub-module and a wake-up engine sub-module; the VAD sub-module implements the VAD function, the ASR sub-module implements the ASR function, the NLU sub-module implements the semantic recognition related functions, and the wake-up engine sub-module implements the wake-up control related functions.
(3) The application management (ClientManager) module maintains the 6 states of the VAD sub-module, i.e. the start state, pause state and so on mentioned in the above embodiments, and also registers the skills (such as atomic capabilities and wake-up words), wake-up, recognition and callback functions of each application.
(4) When multiple sessions exist, the session management (SessionManager) module performs session management, implementing session interruption and multi-round session logic management.
(5) The Dispatcher module is mainly used to dispatch the domain and arbitration results and to return the voice data returned by the cloud server to the different applications or light applications.
(6) The SpeechMsgListener interface is the calling interface that Speech provides externally, used for receiving and managing information sent by other external processes.
(7) The SpeechClientService interface is the call interface that Speech provides internally, used for coordinating internal calls. When the SpeechMsgListener interface receives information sent from the outside, the SpeechClientService interface is called to pass the information to the corresponding internal module, for example to the ClientManager module in fig. 9.
Referring to fig. 9, when performing voice recognition, the SpeechMsgListener interface of Speech receives information sent from the outside and notifies the internal ClientManager module through the SpeechClientService interface. The ClientManager module registers the client's atomic capabilities, pulls up the engine module through the SessionManager module, and passes the voice data into the AISDK for recognition; if online voice recognition is required, the VrLogic module can be called to interact with the cloud server. The obtained voice result is then distributed through the Dispatcher module, that is, the SpeechClientService interface is called to send it to the different clients.
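To make that call chain easier to follow, here is a heavily simplified sketch in which each module of fig. 9 is replaced by a stub; the member functions and the hard-coded recognition result are assumptions for illustration only.

    #include <iostream>
    #include <string>

    // Hypothetical stubs standing in for the modules of fig. 9.
    struct Engine     { std::string Recognize(const std::string&) { return "play children songs"; } };
    struct Dispatcher { void Distribute(const std::string& r)     { std::cout << "to client: " << r << "\n"; } };
    struct SessionManager {
        Engine engine;
        std::string StartSession(const std::string& audio) { return engine.Recognize(audio); }
    };
    struct ClientManager {
        SessionManager sessions;
        Dispatcher dispatcher;
        void RegisterClient(const std::string&) {}   // atomic capability registration
        void HandleAudio(const std::string& audio) { dispatcher.Distribute(sessions.StartSession(audio)); }
    };
    struct SpeechClientService {
        ClientManager clients;
        void OnExternalMessage(const std::string& audio) {   // called by SpeechMsgListener
            clients.RegisterClient("light_app");
            clients.HandleAudio(audio);
        }
    };

    int main() {
        SpeechClientService service;
        service.OnExternalMessage("<pcm data>");
        return 0;
    }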
Third, cloud server
The cloud server mainly implements semantic recognition related functions, and as shown in fig. 9, the cloud server may include the following modules:
(1) The dialogue management proxy service (DmProxy Server) module is the access layer of the cloud server and interacts with the extended voice, for example receiving the Context information and ASR results sent by the extended voice and returning the finally obtained dialogue management (DM) results to the extended voice.
(2) The dialogue management service (Dm Server) module performs semantic arbitration and selects the preferred semantic recognition result. As shown in fig. 9, the Dm Server module sends the Context information and the ASR result to the NLU module for semantic recognition, performs semantic arbitration based on the semantic recognition results returned by the NLU module, and selects the final preferred NLU result.
(3) The NLU service (NLU Server) module receives the application description information and the ASR result transmitted by the Dm Server module, calls the semantic recognition capability of the cloud server to perform semantic recognition, and returns the semantic recognition result to the Dm Server module. As shown in fig. 9, the cloud server may integrate semantic recognition capabilities in advance, such as DingDang (a semantic recognition engine), AISpeech, and a self-built semantic recognition model.
(4) The skill service (SkillServer) module and the script configuration module: the script configuration module stores event configuration information, and the SkillServer module converts the target intention into an action based on this event configuration information; the action is then returned to the vehicle-mounted terminal device through the Dm Server module and the DmProxy Server module. Referring to fig. 9, the Dm Server module sends the NLU result to the SkillServer module so that the SkillServer module can perform the action conversion based on the event configuration information of the script configuration module.
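A minimal sketch of this cloud-side conversion step, assuming the event configuration is a simple intent-to-function mapping; ScriptConfig, Action and ConvertToAction are illustrative names only, and the real script configuration is of course richer than a flat map.

    #include <iostream>
    #include <map>
    #include <optional>
    #include <string>

    // Hypothetical event configuration: maps a recognized intent to the atomic capability
    // function the light application should call.
    using ScriptConfig = std::map<std::string, std::string>;

    struct Action {
        std::string function;                        // atomic capability function name
        std::map<std::string, std::string> params;   // parameters filled from the NLU slots
    };

    // SkillServer step in miniature: convert an NLU result into an action using the config.
    std::optional<Action> ConvertToAction(const std::string& intent,
                                          const std::map<std::string, std::string>& slots,
                                          const ScriptConfig& config) {
        auto it = config.find(intent);
        if (it == config.end()) return std::nullopt;
        return Action{it->second, slots};
    }

    int main() {
        ScriptConfig config = {{"music.search", "searchSong"}};
        auto action = ConvertToAction("music.search", {{"keyword", "children songs"}}, config);
        if (action) std::cout << "action: " << action->function << "\n";
        return 0;
    }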
Fourth, light application MPP
The light application mainly needs to provide information related to voice control.
(1) The Context information comprises account information, location, APP info, system info and other information; when Speech needs the Context information, the SpeechClient interface provided by Speech can be called to send it to Speech.
(2) In addition, vehicle association unified gateway access (TAACC) information may also be obtained from the light application.
(3) When voice recognition is completed, the display interface of the light application needs to perform the corresponding display, broadcast control or similar operations. At this time, the SpeechMessageHandler interface can be called to send the relevant information to the corresponding MPP module: if display through the user interface (UI) is required, the SpeechMessageHandler interface is called to send the information to the MPP and the MPP performs the corresponding display; if the MPP needs to be controlled to play music, the SpeechMessageHandler interface is called to send a play request to the MPP and the MPP plays the corresponding music.
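For illustration, a handler on the MPP side for such messages might look like the following sketch; the message type strings and the SpeechMessage structure are assumptions made here and are not defined by the disclosed interface.

    #include <iostream>
    #include <string>

    // Hypothetical message arriving through the SpeechMessageHandler interface.
    struct SpeechMessage {
        std::string type;      // e.g. "ui.show" or "media.play"
        std::string payload;   // text to display, or the identifier of the song to play
    };

    void HandleSpeechMessage(const SpeechMessage& msg) {
        if (msg.type == "ui.show") {
            std::cout << "display on UI: " << msg.payload << "\n";
        } else if (msg.type == "media.play") {
            std::cout << "start playing: " << msg.payload << "\n";
        }
    }

    int main() {
        HandleSpeechMessage({"media.play", "song_2"});
        return 0;
    }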
Corresponding to the functional modules in fig. 9, in the embodiment of the present application a functional class is designed for each functional module. Fig. 10 is a schematic diagram of the relationship between the functional classes involved in Speech provided in the embodiment of the present application, where Speech may include the following functional classes:
(1) The WeAppSpeechMsgListener class corresponds to the SpeechMsgListener interface in fig. 9; it listens for messages when the APP sends a message to Speech and may include the following functions:
registered_clients // for client registration
-app_session_listener // for APP session listening
HandleStartWechatRecord // for handling the APP start-recording request
HandleStopWechatRecord // for handling the WeChat stop-recording request
Referring to fig. 10, the WeAppSpeechMsgListener class may make a call (use) to the SpeechClientService class.
(2) The SpeechClientService class corresponds to the SpeechClientService interface in fig. 9 and may include the following functions (private members are denoted by - and public members by +):
-client_manager // for implementing session management functions
-StartWechatRecord // for starting recording
-HandleVadEvent // for handling VAD events
-HandleRegisterClient // for handling client registration
-HandleRegisterWakeup // for handling wake-up registration
-HandleUnRegisterWakeup // for handling wake-up de-registration
+ProcessRecordData // for processing recorded data
Referring to fig. 10, the SpeechClientService class may call a wake-up callback (WechatWakeupCallback) class after a wake-up word registered by the APP is triggered during recording; its functions include the service for processing voice requests (p_client_service) and the post-wake-up callback (onWakeupWithDOA), where DOA denotes the sound zone. The SpeechClientService class may also call the APP's recording session listening (WechatRecordSessionListener) class, whose functions include the service for handling voice requests (p_client_service), the engine state callback (onEngineState), and the voice recognition volume callback (onSrVolume). In addition, the APP's recording callback class (WechatRecordCallback) can be called, together with its callback functions.
(3) The ClientManager class corresponds to the ClientManager module shown in fig. 9 and may include the following functions:
-scene_id_map // for scene ID management
-client_set // for client management
-session_manager // for session management
+RegisterClient // for client registration
+RegisterWakeupCallback // for wake-up callback registration
+RegisterListener // for listener registration
Referring to fig. 10, the ClientManager class may call the Client class to manage APP-related content. The functions of the Client class include wake-up scene (scene_wakeup) management, client type (IClientType) management, APP wake-up event registration (RegisterAppWakeup), the wake-up callback for APP-registered wake-up words (AppWakeup), and the APP voice recognition event callback (AppSrEvent). The Client class may further call the wake-up scene class to configure wake-up scenes; its functions include event setting (event_set), scene identifier (scene_id) setting, and adding a wake-up event (AddWakeupEvent). The wake-up scene class may in turn call the wake-up event class to configure wake-up events; its functions include event identifier (event_id) management, wake-up word setting (wakeword_set), and getting the scene word (getservice).
In addition, the ClientManager class may call the client management (IClientManager) interface, and the Client class may call the IClientRegisterAppDomain interface, which is the domain drop interface used when the client registers the APP.
(4) The SessionManager class corresponds to the SessionManager module shown in fig. 9, and implements session management related functions, which may include the following functions:
-engine_ // for engine management
+struct EngineConfig // for engine configuration management
-engine_list_ // for receiving engine messages
-session_listeners_ // for session listening management
-session_container_ // for session container management
+RegisterWakeupWords // for wake-up word registration
+StartWakeup // for triggering wake-up
Referring to fig. 10, the SessionManager class implements the session management related functions when called by the ClientManager class, and can call the session container (SessionContainer) class to implement the session container related functions. The SessionContainer class includes functions for session list (session_list) management, distribution (dispatch), starting a wake-up session (StartWakeupSession), and handling semantics (HandleSemantic). The SessionContainer class calls the Session class to realize the session functions; the Session class comprises functions such as distribution (dispatcher), session starting (StartSession), and SR result processing (HandleSrResult). The distribution function of the Session class may be implemented by calling the FrameworkDispatcher class, which includes functions such as client management (client_mgr), session management (session_mgr), distributing wake-up events (DispatchWakeupEvent), and distributing SR events (DispatchSREvent).
In addition, Speech may further include an isossuncelistener class, which inherits (extends) the SessionManager class and may implement functions such as the recorded data callback (onRecordData) and the recording state change callback (onRecordStatusChange).
(5) The Engine class corresponds to the Engine module shown in fig. 9, and implements corresponding Engine-related functions, which may include the following function classes:
-mIEngineListener // for listening to engine messages
+aisdkInit // for AISDK engine initialization
+singing_uploading_wakeup_keywords_json // for loading the wake-up word file for the DingDang engine
+RegisterEventCallback // for registering the engine event callback
-onEventCallback // for implementing the event callback function
-onWakeupCallback // for implementing the wake-up callback function
Referring to fig. 10, the Engine class may call the Recorder class to obtain the relevant voice data. The Recorder class implements functions such as management of the IPC communication client (mIPCClient), the recorded data callback (onDataProcess), and starting recording (startRecorder). The Recorder class may call the PlatformIPC class to implement the IPC communication related functions; the PlatformIPC class specifically includes functions for opening the process queue (open_queue), closing it (close_queue), and applying for the audio focus (requestAudioFocus).
To implement its functions, the Engine class may further call the AISDK, and may call the IEngineListener class to obtain engine-related information. The IEngineListener class includes functions such as enumerating the engine mode (enum EngineMode), returning the speech recognition result (onSpeechResult), returning the engine state (onEngineState), and the normal wake-up word wake-up callback (onNormalWordWakeup).
In addition, Speech may further include an IEngine class, which inherits from the Engine class and can implement functions such as mode switching (switchMode), starting recording (startRecord), registering wake-up words (registerWakeupWords), and unregistering wake-up words (unregisterWakeupWords).
It can be seen that the main business logic above is implemented by the SessionManager class, the ClientManager class manages and maintains the different clients that use Speech, the Engine class uses the specific engine modules of the AISDK, and the SpeechClientService class is responsible for the IPC communication interface.
Fig. 11 is a schematic diagram of the overall system voice framework according to an embodiment of the present application. The framework is mainly divided into a client side and a cloud side. The client side is similar to the voice control system described above: Speech is an independent process module and the Moss layer is the adaptation layer between the system voice control process and Speech, so the functions of Speech and the Moss layer are not repeated here.
Referring to fig. 11, the operating system of the vehicle-mounted terminal device includes the system voice and a Tai controller, where the Tai controller refers to the system layer of the vehicle-mounted operating system. The system voice may also be assisted by a cloud server. As shown in fig. 11, the cloud may include two parts: the cloud server corresponding to the system voice and the cloud server corresponding to the extended voice. The system voice interacts with its own cloud server to obtain services; for example, after the system voice acquires voice data, the voice data may be sent to its corresponding cloud server. After the access layer of that cloud server receives the voice data, the text information in the voice data can be extracted based on its ASR capability, semantic recognition can be performed on the text information based on its NLU capability to obtain a corresponding semantic recognition result, and the semantic recognition results can be combined based on its DM capability to determine the content of the reply to the target object.
The Moss layer encapsulates the third-party service interfaces and the system service interfaces provided by the operating system, and serves as the adaptation layer between Speech, the third-party applications and the operating system. Interaction between the Moss layer and the APP or the MPP, as well as between the Moss layer and Speech, is achieved through IPC.
As shown in fig. 11, after the cloud of the system voice performs ASR processing, the ASR result may be transmitted to the cloud of the extended voice for processing. Similar to the cloud of the system voice, the cloud of the extended voice may also perform semantic recognition using its own NLU capability and carry out session management using its DM capability. Finally, the semantic recognition results obtained by the two can be combined and the most preferable result selected through semantic arbitration, thereby obtaining the target intention of the target object.
In the following, with reference to the above-mentioned architecture, taking the light application as an applet for example, a process of performing voice control on the light application is described, and refer to fig. 12, which is a schematic flow chart of using voice control by an applet.
Step 1201: and the system voice receives a voice command of opening the XX applet, calls a service interface packaged by the Moss layer and initiates a request of starting the XX applet.
Step 1202: the Moss layer sends a start request to the XX applet.
Step 1203: the Moss layer initiates a play request To a Text To Speech (TTS) module of the vehicle-mounted terminal device, so that the TTS module plays a converted welcome sentence, for example, "welcome To use XX Speech system".
Step 1204: the Moss layer initiates a display request to a UI module of the vehicle-mounted terminal equipment, so that the UI module performs try corpus display, for example, "welcome to XX voice system".
Step 1205: and the Moss layer registers the client atomic capability of the applet to the Speech, and calls the speed to start a voice recognition interface to trigger the interior of the Speech to perform voice recognition preparation.
Specifically, when the atomic capability registration is carried out, the Moss layer acquires the atomic capability data of the applet from the third-party service interface packaged by the Moss layer and sends the atomic capability data to the Speech, so that the Speech can carry out the atomic capability registration of the applet based on the atomic capability data.
Step 1206: the Speech applies for audio focus and switches the internal engine mode to a Speech Recognition (SR) mode.
Specifically, Speech needs to apply for the audio focus in order to start recording and to realize functions such as audio data acquisition, transmission and online recognition. For example, the vehicle-mounted terminal device may include an audio acquisition device such as a microphone, and the voice data acquisition interface provided by the operating system of the vehicle-mounted terminal device is encapsulated in the Moss layer; Speech can call this voice data acquisition interface to apply to the operating system for the usage right of the audio acquisition device, that is, the audio acquisition device is then occupied by Speech, which can obtain the voice data it collects.
In addition, to improve the speed of speech recognition, besides applying for permission to use the audio acquisition device of the vehicle-mounted terminal, an audio focus can also be applied for towards the cloud and the SR mode set up, which may include occupying the transmission channel and the speech recognition capability. In this way, as soon as voice data is input, it can be transmitted to the cloud immediately for speech recognition, improving the speed of speech recognition.
Step 1207: Speech acquires the voice data, performs speech recognition, and then sends a request to the intelligent platform background, which forwards the recognized text information to the text fusion background. The intelligent platform background and the text fusion background are parts of the cloud server of the extended voice and realize different functions: for example, the intelligent platform background corresponds to the DmProxy Server module, the Dm Server module, the SkillServer module and the script configuration module in fig. 9, while the text fusion background corresponds to the NLU module and so on; of course, in practical applications they can be set up according to actual requirements.
Step 1208: and the intelligent platform background sends the text information to Speech.
Step 1209: the Moss layer transmits Context information to spech.
Step 1210: and uploading the Context information to the smart platform background by the Speech.
In the embodiment of the application, the Moss layer can acquire the Context information, including the context information and the basic information of the applet, from the applet and send it to Speech. In addition, when the Context information changes, the update is uploaded immediately.
Step 1211: The text fusion background performs NLU processing on the text information and outputs an NLU result, which mainly comprises the function name and parameters of the atomic capability, the TTS text and so on, and returns the NLU result to the intelligent platform background.
Step 1212: the intelligent platform background can carry out semantic arbitration and return an arbitrated NLU result to Speech.
It should be noted that, as shown in fig. 12, the smart platform backend sends the NLU result to the cloud backend of the system voice for semantic arbitration, or after acquiring the system DM (including the NLU result identified by the cloud of the system voice) from the cloud backend of the system voice, the smart platform backend performs semantic arbitration.
Step 1213: speech performs UI display based on the text information and the NLU result, for example, Speech can display the text information on a device screen in a streaming screen-up mode, namely, the recognition result can be displayed on the device screen in real time. Although not shown in fig. 12, here, spech may transmit the content to be displayed to the MPP through the Moss layer and display the content.
Step 1214: speech requests a TTS module to broadcast TTS based on the text information and the NLU result. Referring to fig. 12, spech makes a request to the TTS module to play text information or voice control results, etc.
Step 1215: and after the action is generated, the intelligent platform issues the action to the Speech.
Step 1216: speech distributes action to the Moss layer.
Step 1217: the Moss layer notifies the action to the MPP for execution, for example, the Moss layer notifies the MPP to execute the last atomic capability function in a specific manner through an IPC communication method.
Step 1218: and the Moss layer receives a returned result after the MPP is executed.
Step 1219: if multi-round interaction is involved, namely, when the feedback is needed, the Moss layer informs Speech that the feedback is needed.
Step 1220: the Speech requests to the text fusion background again based on the re-entered voice data and the MPP execution result. For example, voice data and MPP execution results may be communicated through the taacc.
Step 1221: and the text fusion background generates a new action again and sends the action to the Speech.
Step 1222: speech sends a new action to the Moss layer.
Step 1223: the Moss layer informs the MPP to perform a new action.
Step 1224: the Moss layer receives the MPP exit indication.
Step 1225: the Moss layer notifies Speech to counterregister the client.
Step 1226: the Speech registers the client atomic capability reversely, the audio focus is released, and the voice recognition process is finished.
Step 1227: the SR mode is stopped, and a Model View viewer (MVW) mode is set.
In the embodiment of the application, if multi-round interaction is involved, Speech also needs to send the MPP execution result to the text fusion background again, and the MPP executes the new action in the second round, and so on, until no further feedback is needed. After the applet finishes the action, it de-registers the client and calls the stop-voice-recognition interface of Speech; Speech switches to the idle mode according to the stop call, and the whole recognition process ends.
In the embodiment of the application, the use of Speech with light applications mainly relies on the online speech recognition function of Speech: the extended voice is pulled up by the system voice, and the recognized semantic result is returned to the MPP for execution, so that opening the MPP, playing in the MPP, searching in the MPP and playback control are realized. In some scenarios a second round of interaction with the DM is carried out, and the execution result is transmitted to the semantic background in the cloud.
Referring to fig. 13, an exemplary diagram of semantic recognition and execution is shown.
Specifically, Speech performs ASR recognition on the input voice "hello, X car" ("X car" being a nickname set for the current vehicle-mounted terminal device), uploads the ASR result of "hello, X car" to the smart platform background, and at the same time displays the ASR result on the UI. After the smart platform background obtains the action corresponding to "hello, X car", it issues that action to Speech, and Speech calls the TTS module to perform the corresponding voice broadcast. In addition, Speech sends the action corresponding to "hello, X car" to the Moss layer, which pulls up the system voice to execute the action; if no feedback is needed, Speech releases the audio focus and sets the MVW mode.
Referring to fig. 14, a schematic flow chart of intent recognition provided in the embodiment of the present application is shown. The MPP provides the context information used by Speech, such as the account, system information, the applet's foreground state, page text and other information. When semantic recognition is carried out, the MPP passes the applet's basic information into Speech to enrich the domain drop reference information, while the system voice or the extended voice performs ASR recognition on the received voice data to obtain an ASR result. On the one hand, the ASR result and the context information are passed into the system voice for recognition to obtain the system semantics; on the other hand, they are passed into the NLU Server module of the extended voice, which, after obtaining the context information and the ASR result, requests the semantic platform for the specific domain and intention. Finally, semantic arbitration is performed between the system semantics and the semantics output by the NLU Server module, and the optimal semantic result is selected and issued to the client. Semantic priorities can be configured in advance, and during semantic arbitration the optimal semantic result can be selected according to the semantic priority configuration, which improves semantic accuracy, ensures correct recognition of the voice, and captures the real intention of the user.
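A minimal sketch of priority-based semantic arbitration follows; the SemanticResult fields, the domain priority map and the tie-breaking rule (fall back to confidence) are assumptions made for this sketch rather than the configured arbitration policy of the disclosed system.

    #include <iostream>
    #include <map>
    #include <optional>
    #include <string>
    #include <vector>

    // One candidate semantic result, either from the system voice or from the extended voice.
    struct SemanticResult {
        std::string source;    // "system" or "extended"
        std::string domain;    // e.g. "music", "chat"
        std::string intent;
        double confidence;
    };

    // Pick the candidate whose domain has the highest pre-configured priority,
    // falling back to confidence when priorities tie.
    std::optional<SemanticResult> Arbitrate(const std::vector<SemanticResult>& candidates,
                                            const std::map<std::string, int>& domainPriority) {
        std::optional<SemanticResult> best;
        for (const auto& c : candidates) {
            int p  = domainPriority.count(c.domain) ? domainPriority.at(c.domain) : 0;
            int bp = best && domainPriority.count(best->domain) ? domainPriority.at(best->domain) : 0;
            if (!best || p > bp || (p == bp && c.confidence > best->confidence)) best = c;
        }
        return best;
    }

    int main() {
        std::map<std::string, int> priority = {{"music", 2}, {"chat", 1}};
        auto best = Arbitrate({{"system", "chat", "send", 0.8}, {"extended", "music", "play", 0.7}}, priority);
        if (best) std::cout << "selected: " << best->source << "/" << best->intent << "\n";
        return 0;
    }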
In the embodiment of the present application, the role of Speech for the target application is the offline wake-up function. The following describes the process of performing voice wake-up on an application in combination with the above architecture; fig. 15 is a schematic flow chart of performing voice wake-up on an application.
Step 1501: the target application registers wake-up words such as 'login', 'message sending', 'cancel' and the like through an interface of Speech.
Step 1502: and the Speech starts to record and loads the recording data into a VAD submodule and a wake-up engine in the AISDK module.
Step 1503: the VAD sub-module performs voice detection and judges whether a VAD start event is triggered.
Step 1504: when the vad start event is triggered, the notification of the vad start event is returned to the AISDK module, wherein the notification comprises the starting time and the state of the voice.
Step 1505: and the awakening engine carries out voice recognition on the voice data and feeds back a voice recognition result to the AISDK module in a streaming manner.
Step 1506: the AISDK module judges whether the voice recognition result contains the awakening words registered by the target application or not by traversing the registered awakening words.
Step 1507: and if the AISDK module determines that the voice recognition result contains the registered awakening words, informing the Speech of the result of awakening needing to be triggered.
Step 1508: speech informs the target application of the awakened text result to perform corresponding processing.
Step 1509: and after the voice data is recorded, the AISDK module informs the VAD submodule to reset the state of the VAD submodule.
To sum up, the embodiment of this application proposes a vehicle-mounted voice framework supporting functions such as APP wake-up, speech recognition and semantic recognition, so that services such as APPs and applets can be connected efficiently and then provided with the voice control function. This scheme accurately guarantees correct recognition of the voice and accurately captures the real intention; it makes access convenient and friendly for the service side, separates out the business-independent VFramework layer, separates the UI and logic layers, decouples the voice capability from the applications, separates the functional modules, makes the flow traceable, and displays the speech recognition result in a streaming manner, which reduces the user's waiting time and improves the experience. In addition, a complete integrated client-and-cloud framework is provided, which is flexible and controllable, combines the service's own customization requirements, makes fuller use of the information on the peer side, and grasps the user's real intention more accurately.
Referring to fig. 16, based on the same inventive concept, an embodiment of the present application further provides a voice control apparatus 160, applied to a voice control system, the apparatus including:
a capability registration unit 1601, configured to, in response to a start of a voice control function of a target light application, acquire atomic capability data from the target light application, and perform atomic capability registration based on the atomic capability data, where the atomic capability data includes: atomic capabilities that the target light application can provide, each atomic capability for implementing at least one function of the target light application;
an event conversion unit 1602, configured to, in response to first voice data input by a target object for a target light application, convert the first voice data into corresponding voice control events based on each registered atomic capability, where the voice control events include: at least one atomic capability that the target light application needs to invoke in order to achieve the target intent of the first voice data;
a transmission unit 1603 for sending the voice control event to the target light application to enable the target light application to invoke at least one atomic capability to achieve the target intent.
Optionally, the event conversion unit 1602 is specifically configured to:
performing voice recognition on the first voice data to acquire text information contained in the first voice data;
performing semantic recognition on the text information, and determining a target intention corresponding to the text information;
converting the target intention into a voice control event based on preset event configuration information and registered atomic capabilities, wherein the event configuration information comprises: and guidance information for parameter configuration of at least one atomic capability based on the target intent.
Optionally, the event conversion unit 1602 is specifically configured to:
obtaining application description information from the target light application, the application description information comprising: at least one of basic description information and voice control context information of the target light application;
and performing semantic recognition on the text information based on the application description information to determine the target intention.
Optionally, the event conversion unit 1602 is specifically configured to:
sending the atomic capability data to a cloud server so as to register each atomic capability in the cloud server;
obtaining application description information from the target light application, the application description information comprising: at least one of basic description information and voice control context information of the target light application;
the method comprises the steps of sending application description information and first voice data to a cloud server, and receiving a voice control event returned by the cloud server, wherein the voice control event is obtained by converting a target intention based on registered atomic capability after the cloud server determines the target intention based on the application description information and the first voice data.
Optionally, the voice control system includes an interface component and a voice component, where the interface component encapsulates a third-party service interface provided by the target light application; the capability registration unit 1601 is specifically configured to:
responding to the starting of the voice control function of the target light application, and triggering an interface component to call a third-party service interface to acquire atomic capability data;
atomic capability data is sent to the voice component through the interface component for atomic capability registration at the voice component.
Optionally, the terminal device includes an audio acquisition device, and the interface component encapsulates a voice data acquisition interface provided by an operating system of the terminal device; the apparatus further comprises an applying unit 1604, configured to:
calling a voice component to apply the use authority of the audio acquisition device to an operating system through a voice data acquisition interface;
and calling a voice component to receive first voice data acquired by the audio acquisition device through the voice data acquisition interface.
Optionally, the event conversion unit 1602 is further configured to:
receiving an execution result returned after the target light application executes the voice control event;
if the execution result indicates that feedback control is needed, starting a recording function and acquiring second voice data input by the target object;
and converting the second voice data into corresponding voice control events based on the registered atomic capabilities, and sending the obtained voice control events to the target light application for execution.
Optionally, the event conversion unit 1602 is further configured to:
if the execution result indicates that feedback control is not needed, clearing the registered atomic capability;
and informing the operating system to release the use authority of the audio acquisition device occupied by the operating system.
Optionally, the apparatus further includes a wakeup unit 1605, configured to:
responding to third voice data input by the target object aiming at the target application, and performing voice endpoint detection on the third voice data;
if the voice starting position in the third voice data is detected, performing voice recognition on the third voice data from the voice starting position until the voice ending position in the third voice data is detected;
and when the target application is determined to be awakened based on the obtained voice recognition result, awakening the target application.
Optionally, the wakeup unit 1605 is specifically configured to:
responding to the starting of a voice awakening function of the target application, acquiring an awakening word set corresponding to an activated page in the target application, and registering awakening words based on the awakening word set;
if the voice recognition result contains the registered awakening words, determining to awaken the target application;
and performing awakening operation on the target application, and sending text information contained in the voice recognition result to the target application.
The apparatus may be configured to execute the method shown in each embodiment of the present application, and therefore, for functions and the like that can be realized by each functional module of the apparatus, reference may be made to the description of the foregoing embodiment, which is not repeated herein.
Referring to fig. 17, based on the same technical concept, an embodiment of the present application further provides a computer device. In one embodiment, the computer device may be the cloud server shown in fig. 1, and as shown in fig. 17, the computer device includes a memory 1701, a communication module 1703, and one or more processors 1702.
The memory 1701 is used for storing computer programs executed by the processor 1702. The memory 1701 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, a program required for running an instant messaging function, and the like; the storage data area can store various instant messaging information, operation instruction sets and the like.
The memory 1701 may be a volatile memory (volatile memory), such as a random-access memory (RAM); the memory 1701 may also be a non-volatile memory (non-volatile memory), such as a read-only memory (rom), a flash memory (flash memory), a hard disk (HDD) or a solid-state drive (SSD); or the memory 1701 is any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 1701 may be a combination of the above memories.
The processor 1702, which may include one or more Central Processing Units (CPUs), a digital processing unit, or the like. The processor 1702 is configured to implement the above-described voice control method when calling a computer program stored in the memory 1701.
The communication module 1703 is used for communicating with the terminal device and other servers.
The embodiment of the present application does not limit the specific connection medium among the memory 1701, the communication module 1703 and the processor 1702. In the embodiment of the present application, the memory 1701 and the processor 1702 are connected through the bus 1704 in fig. 17, the bus 1704 is depicted by a thick line in fig. 17, and the connection manner between other components is merely illustrative and not limited. The bus 1704 may be divided into an address bus, a data bus, a control bus, and the like. For ease of description, only one thick line is depicted in FIG. 17, but only one bus or one type of bus is not depicted.
The memory 1701 stores a computer storage medium in which computer-executable instructions for implementing the voice control method of the embodiments of the present application are stored. The processor 1702 is configured to perform the voice control method of the above embodiments.
In another embodiment, the computer device may also be another computer device, such as the vehicle-mounted terminal device shown in fig. 1. In this embodiment, the structure of the computer device may be as shown in fig. 18, including: a communications component 1810, memory 1820, a display unit 1830, a camera 1840, a sensor 1850, audio circuitry 1860, a bluetooth module 1870, a processor 1880, and the like.
The communication component 1810 is utilized to communicate with a server. In some embodiments, a Wireless Fidelity (WiFi) module may be included, the WiFi module being a short-range Wireless transmission technology, through which the computer device may help the user to send and receive information.
The memory 1820 may be used for storing software programs and data. The processor 1880 performs various functions of the terminal device and data processing by executing software programs or data stored in the memory 1820. The memory 1820 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. The memory 1820 stores an operating system that enables the terminal device to operate. The memory 1820 may store an operating system and various application programs, and may also store codes for executing the voice control method according to the embodiment of the present application.
The display unit 1830 may also be used to display a Graphical User Interface (GUI) of information input by or provided to the user and various menus of the terminal device. Specifically, the display unit 1830 may include a display screen 1832 disposed on the front surface of the terminal device. The display 1832 may be configured in the form of a liquid crystal display, a light emitting diode, or the like. The display unit 1830 may be used to display various voice control pages in the embodiments of the present application and related display pages on board, such as applet pages or application pages.
The display unit 1830 may also be used to receive input numeric or character information and generate signal inputs related to user settings and function control of the terminal device, and particularly, the display unit 1830 may include a touch screen 1831 disposed on a front surface of the terminal device and capable of collecting touch operations of a user thereon or nearby, such as clicking a button, dragging a scroll box, and the like.
The touch screen 1831 may be covered on the display screen 1832, or the touch screen 1831 and the display screen 1832 may be integrated to implement an input and output function of the terminal device, and after the integration, the touch screen may be referred to as a touch display screen for short. The display unit 1830 in the present application may display the application programs and the corresponding operation steps.
The camera 1840 may be used to capture still images, and the user may post comments on the images captured by the camera 1840 through the application. The number of the camera 1840 may be one or plural. The object generates an optical image through the lens and projects the optical image to the photosensitive element. The photosensitive element may be a Charge Coupled Device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The light sensing elements convert the light signals into electrical signals which are then passed to the processor 1880 for conversion into digital image signals.
The terminal device may further comprise at least one sensor 1850, such as an acceleration sensor 1851, a distance sensor 1852, a fingerprint sensor 1853, a temperature sensor 1854. The terminal device may also be configured with other sensors such as a gyroscope, barometer, hygrometer, thermometer, infrared sensor, light sensor, motion sensor, and the like.
Audio circuitry 1860, speakers 1861, microphone 1862 may provide an audio interface between a user and a terminal device. The audio circuit 1860 may transmit the electrical signal converted from the received audio data to the speaker 1861, and convert the electrical signal into an audio signal by the speaker 1861 and output the audio signal. The terminal device may be further provided with a volume button for adjusting the volume of the sound signal. On the other hand, the microphone 1862 converts the collected sound signals into electrical signals, which are received by the audio circuit 1860 and converted into audio data, which are output to the communication component 1810 for transmission to, for example, another terminal device, or output to the memory 1820 for further processing.
The bluetooth module 1870 is used for information interaction with other bluetooth devices having bluetooth modules through a bluetooth protocol. For example, the terminal device may establish a bluetooth connection with a wearable computer device (e.g., a smart watch) that is also equipped with a bluetooth module via the bluetooth module 1870 for data interaction.
The processor 1880 is the control center of the terminal device. It connects the various parts of the entire terminal device using various interfaces and lines, and performs the functions of the terminal device and processes its data by running or executing software programs stored in the memory 1820 and invoking data stored in the memory 1820. In some embodiments, the processor 1880 may include one or more processing units; the processor 1880 may also integrate an application processor, which mainly handles the operating system, user interfaces, and application programs, and a baseband processor, which mainly handles wireless communication. It is to be appreciated that the baseband processor may also not be integrated into the processor 1880. In the present application, the processor 1880 may run the operating system, application programs, user interface display, and touch response, as well as the voice control method of the embodiments of the present application. Further, the processor 1880 is coupled with the display unit 1830.
In some possible embodiments, aspects of the voice control method provided by the present application may also be implemented in the form of a program product that includes program code. When the program product runs on a computer device, the program code causes the computer device to perform the steps of the voice control method according to the various exemplary embodiments of the present application described above in this specification; for example, the computer device may perform the steps of the embodiments.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product of embodiments of the present application may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a computing device. However, the program product of the present application is not limited thereto, and in the context of the present application, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with a command execution system, apparatus, or device.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with a command execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out the operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java or C++, as well as conventional procedural programming languages such as the "C" programming language or similar languages. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the internet using an internet service provider).
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, according to the embodiments of the present application, the features and functions of two or more units described above may be embodied in a single unit. Conversely, the features and functions of one unit described above may be further divided so as to be embodied by a plurality of units.
Further, while the operations of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all changes and modifications that fall within the scope of the present application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (12)

1. A voice control method, applied to a voice control system, wherein the voice control system comprises a voice component, an interface component, and a cloud server, and the interface component encapsulates a third-party service interface provided by a light application and a system service interface provided by an operating system; the method comprises the following steps:
triggering the interface component to call the third-party service interface based on a request for starting a target light application sent by a system voice through the system service interface, and acquiring atomic capability data from the target light application; wherein the atomic capability data comprises: the atomic capabilities that the target light application can provide, each atomic capability being a corresponding function in the target light application and serving to implement at least one function of the target light application;
sending the atomic capability data to the voice component, so that the voice component performs atomic capability registration in the cloud server based on the atomic capability data;
sending first voice data, input by a target object for the target light application, to the cloud server through the voice component, so that the cloud server determines a corresponding target intention according to its own semantic recognition result for the first voice data and a semantic recognition result, for the first voice data, from the cloud end of the system voice, and converts the target intention into a corresponding voice control event based on each registered atomic capability, wherein the voice control event comprises: at least one atomic capability that the target light application needs to call in order to achieve the target intention;
receiving, by the voice component, the voice control event, and invoking the interface component to send the voice control event to the target light application, so that the target light application invokes the at least one atomic capability to achieve the target intention.
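Read together, the steps of claim 1 describe a launch-time registration phase followed by a per-utterance dispatch phase. The Java sketch below is only an illustration of that flow under assumed names: LightApp, CloudServer, VoiceControlSystem, and the data records are hypothetical stand-ins introduced here for illustration, not interfaces defined by the patent.

```java
import java.util.List;

// Hypothetical data types for the sketch.
record AtomicCapability(String name, String description) {}
record AtomicCapabilityData(List<AtomicCapability> capabilities) {}
record VoiceControlEvent(List<AtomicCapability> capabilitiesToInvoke) {}

interface LightApp {                       // stands in for the third-party service interface
    AtomicCapabilityData getAtomicCapabilities();
    void execute(VoiceControlEvent event);
}

interface CloudServer {                    // cloud-side registration and recognition
    void registerCapabilities(AtomicCapabilityData data);
    VoiceControlEvent resolve(byte[] voiceData);   // recognition + intent -> event
}

/** Orchestrates the two phases sketched from claim 1. */
class VoiceControlSystem {
    private final LightApp targetLightApp;
    private final CloudServer cloud;

    VoiceControlSystem(LightApp targetLightApp, CloudServer cloud) {
        this.targetLightApp = targetLightApp;
        this.cloud = cloud;
    }

    /** Phase 1: on the launch request, pull the capability data and register it. */
    void onLightAppLaunched() {
        AtomicCapabilityData data = targetLightApp.getAtomicCapabilities();
        cloud.registerCapabilities(data);
    }

    /** Phase 2: per utterance, let the cloud map speech to an event, then dispatch it. */
    void onVoiceInput(byte[] firstVoiceData) {
        VoiceControlEvent event = cloud.resolve(firstVoiceData);
        targetLightApp.execute(event);     // the light app invokes the listed capabilities
    }
}
```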
2. The method of claim 1, wherein the method further comprises:
performing, by the voice component, an atomic capability registration based on the atomic capability data, and based on the registered atomic capability, converting the first voice data into a corresponding voice control event by:
performing voice recognition on the first voice data to acquire text information contained in the first voice data;
performing semantic recognition on the text information, and determining a target intention corresponding to the text information;
converting the target intention into the voice control event based on preset event configuration information and the registered atomic capabilities, wherein the event configuration information comprises: guidance information for parameter configuration of the at least one atomic capability based on the target intent.
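One way to picture the preset event configuration information of claim 2 is as a lookup from an intent to the registered atomic capabilities it requires, together with guidance on how to fill each capability's parameters from the intent's slots. The sketch below assumes that shape; the EventConfig structure, the slot naming, and all identifiers are illustrative assumptions rather than the patent's own format.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical structures for intent-to-event conversion.
record Intent(String name, Map<String, String> slots) {}
record CapabilityCall(String capabilityName, Map<String, String> parameters) {}
record VoiceControlEvent(List<CapabilityCall> calls) {}

/** Guidance for configuring one capability's parameters from an intent's slots. */
record EventConfig(String capabilityName, Map<String, String> slotToParameter) {}

class EventConverter {
    // Preset event configuration info: intent name -> capabilities to call.
    private final Map<String, List<EventConfig>> eventConfiguration;
    private final List<String> registeredCapabilities;

    EventConverter(Map<String, List<EventConfig>> eventConfiguration,
                   List<String> registeredCapabilities) {
        this.eventConfiguration = eventConfiguration;
        this.registeredCapabilities = registeredCapabilities;
    }

    VoiceControlEvent convert(Intent intent) {
        List<CapabilityCall> calls = new ArrayList<>();
        for (EventConfig cfg : eventConfiguration.getOrDefault(intent.name(), List.of())) {
            // Only capabilities that were actually registered may be invoked.
            if (!registeredCapabilities.contains(cfg.capabilityName())) continue;
            Map<String, String> params = new HashMap<>();
            // Fill each capability parameter from the matching intent slot.
            cfg.slotToParameter().forEach((slot, param) ->
                    params.put(param, intent.slots().getOrDefault(slot, "")));
            calls.add(new CapabilityCall(cfg.capabilityName(), params));
        }
        return new VoiceControlEvent(calls);
    }
}
```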
3. The method of claim 2, wherein the semantically recognizing the text information and determining the target intention corresponding to the text information comprises:
acquiring application description information from the target light application, wherein the application description information comprises: at least one of basic description information and voice control context information of the target light application;
and performing semantic recognition on the text information based on the application description information, and determining the target intention.
4. The method of claim 1, wherein the method further comprises:
calling the third-party service interface through the interface component, acquiring application description information from the target light application, and sending the application description information to the voice component, wherein the application description information comprises: at least one of basic description information and voice control context information of the target light application;
sending the application description information and the first voice data to the cloud server through the voice component, and receiving the voice control event returned by the cloud server, wherein the voice control event is obtained by converting the target intention based on the registered atomic capability after the cloud server determines the target intention based on the application description information and the first voice data.
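Claims 3 and 4 both depend on application description information (basic description plus voice-control context) accompanying the voice data so that the intent can be resolved against the state of the target light application. A minimal sketch of context-biased intent resolution follows; AppDescription, IntentResolver, and the phrase-to-intent mapping are assumptions made for illustration only.

```java
import java.util.List;
import java.util.Map;

// Hypothetical description info obtained from the target light application.
record AppDescription(String basicDescription, List<String> voiceControlContext) {}

class IntentResolver {
    // Phrases the currently active light-application page expects, mapped to intents.
    private final Map<String, String> contextPhraseToIntent;

    IntentResolver(Map<String, String> contextPhraseToIntent) {
        this.contextPhraseToIntent = contextPhraseToIntent;
    }

    /** Prefer an intent suggested by the page's voice-control context; otherwise
     *  report that no context-specific intent matched the recognised text. */
    String resolve(String recognisedText, AppDescription description) {
        for (String phrase : description.voiceControlContext()) {
            if (recognisedText.contains(phrase)) {
                return contextPhraseToIntent.getOrDefault(phrase, "unknown");
            }
        }
        return "unknown";
    }
}
```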
5. The method according to claim 1, wherein the terminal device where the target light application is located comprises an audio acquisition device, and the interface component encapsulates a voice data acquisition interface provided by an operating system of the terminal device;
before sending, by the voice component, the first voice data input by a target object for the target light application to the cloud server, the method further includes:
calling the voice component to apply to the operating system, through the voice data acquisition interface, for permission to use the audio acquisition device;
and calling the voice component to receive the first voice data acquired by the audio acquisition device through the voice data acquisition interface.
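Claim 5 separates permission acquisition from audio capture. A hypothetical sketch of that two-step sequence follows; PermissionService, AudioCaptureService, and AudioListener are invented here for illustration and do not correspond to any particular operating system API.

```java
// Hypothetical system-service interfaces wrapped by the interface component.
interface PermissionService {
    boolean requestMicrophonePermission(String requesterId);
}

interface AudioCaptureService {
    void startCapture(AudioListener listener);
    void stopCapture();
}

interface AudioListener {
    void onAudioChunk(byte[] pcmChunk);   // raw audio handed to the voice component
}

/** Voice-component side of claim 5: obtain permission first, then receive audio. */
class VoiceDataAcquirer {
    private final PermissionService permissions;
    private final AudioCaptureService capture;

    VoiceDataAcquirer(PermissionService permissions, AudioCaptureService capture) {
        this.permissions = permissions;
        this.capture = capture;
    }

    boolean beginListening(AudioListener onFirstVoiceData) {
        // Step 1: apply to the operating system for use of the audio acquisition device.
        if (!permissions.requestMicrophonePermission("voice-component")) {
            return false;                 // permission denied, no recording started
        }
        // Step 2: receive the first voice data collected by the audio device.
        capture.startCapture(onFirstVoiceData);
        return true;
    }
}
```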
6. The method of any of claims 1-4, wherein after invoking the interface component to send the voice control event to the target light application, the method further comprises:
receiving an execution result returned by the target light application after the target light application executes the voice control event;
if the execution result indicates that feedback control is needed, starting a recording function to acquire second voice data input by the target object;
and converting the second voice data into corresponding voice control events based on the registered atomic capabilities, and sending the obtained voice control events to the target light application for execution.
7. The method of claim 6, wherein after receiving an execution result returned after the target light application executed the voice control event, the method further comprises:
if the execution result indicates that feedback control is not needed, emptying the registered atomic capability;
and notifying the operating system to release the occupied use permission of the audio acquisition device.
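Claims 6 and 7 together form a small feedback loop driven by the execution result returned from the target light application. The following sketch assumes a two-valued result and hypothetical Recorder, CapabilityRegistry, and MicrophoneHolder abstractions; none of these names come from the patent itself.

```java
// Hypothetical execution result returned by the light application.
enum ExecutionResult { NEEDS_FEEDBACK, DONE }

class FeedbackController {
    interface Recorder { byte[] recordNextUtterance(); }   // captures the second voice data
    interface CapabilityRegistry { void clear(); }          // registered atomic capabilities
    interface MicrophoneHolder { void release(); }          // audio-device use permission

    private final Recorder recorder;
    private final CapabilityRegistry registry;
    private final MicrophoneHolder microphone;

    FeedbackController(Recorder recorder, CapabilityRegistry registry, MicrophoneHolder microphone) {
        this.recorder = recorder;
        this.registry = registry;
        this.microphone = microphone;
    }

    /** Called after the voice control event has been sent to the light application.
     *  Returns the second voice data when feedback control is needed, or null when
     *  the voice session has been torn down. */
    byte[] onExecutionResult(ExecutionResult result) {
        if (result == ExecutionResult.NEEDS_FEEDBACK) {
            // Claim 6: keep the dialogue open and capture the second voice data.
            return recorder.recordNextUtterance();
        }
        // Claim 7: no feedback needed, so tear down this voice session.
        registry.clear();          // empty the registered atomic capabilities
        microphone.release();      // let the OS reclaim the audio acquisition device
        return null;
    }
}
```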
8. The method of any of claims 1-4, wherein the method further comprises:
in response to third voice data input by a target object for a target application, performing voice endpoint detection on the third voice data;
if the voice starting position in the third voice data is detected, performing voice recognition on the third voice data from the voice starting position until the voice ending position in the third voice data is detected;
and when it is determined, based on the obtained voice recognition result, that the target application is to be awakened, performing a wake-up operation on the target application.
9. The method of claim 8, wherein, before the voice endpoint detection is performed on the third voice data input by the target object for the target application, the method further comprises:
in response to the voice wake-up function of the target application being enabled, acquiring a wake-up word set corresponding to the currently activated page in the target application, and registering wake-up words based on the wake-up word set;
wherein, when it is determined based on the obtained voice recognition result that the target application is to be awakened, performing the wake-up operation on the target application comprises:
if the voice recognition result contains the registered wake-up words, determining to awaken the target application;
and performing the wake-up operation on the target application, and sending the text information contained in the voice recognition result to the target application.
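Claims 8 and 9 combine page-specific wake-word registration with endpoint-detected speech recognition. The sketch below assumes the endpoint detector has already isolated the speech segment between the detected start and end positions; all types and names here are illustrative assumptions.

```java
import java.util.Set;

/** Hypothetical wake-up flow: register the active page's wake words,
 *  then wake the application when recognised speech contains one of them. */
class WakeUpEngine {
    interface SpeechRecognizer { String recognize(byte[] speechSegment); }
    interface TargetApp {
        Set<String> wakeWordsForActivePage();
        void wakeUp(String recognizedText);
    }

    private final SpeechRecognizer recognizer;
    private final TargetApp app;
    private Set<String> registeredWakeWords = Set.of();

    WakeUpEngine(SpeechRecognizer recognizer, TargetApp app) {
        this.recognizer = recognizer;
        this.app = app;
    }

    /** Claim 9: when the voice wake-up function is enabled, register the active page's wake words. */
    void onWakeFunctionEnabled() {
        registeredWakeWords = app.wakeWordsForActivePage();
    }

    /** Claim 8: recognise the speech between the detected start and end points,
     *  and wake the application if a registered wake word is present. */
    void onSpeechSegment(byte[] segmentBetweenEndpoints) {
        String text = recognizer.recognize(segmentBetweenEndpoints);
        for (String wakeWord : registeredWakeWords) {
            if (text.contains(wakeWord)) {
                app.wakeUp(text);   // claim 9: also forward the recognised text
                return;
            }
        }
    }
}
```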
10. A voice control device, applied to a voice control system, wherein the voice control system comprises a voice component, an interface component, and a cloud server, and the interface component encapsulates a third-party service interface provided by a light application and a system service interface provided by an operating system; the device comprises:
a capability registration unit, configured to trigger the interface component to call the third-party service interface based on a request for starting a target light application sent by a system voice through the system service interface, acquire atomic capability data from the target light application, and send the atomic capability data to the voice component, so that the voice component performs atomic capability registration at the cloud server based on the atomic capability data, where the atomic capability data includes: atomic capabilities that the target light application can provide, each atomic capability being a corresponding function in the target light application for implementing at least one function of the target light application;
an event conversion unit, configured to send, through the voice component, first voice data input by a target object for the target light application to the cloud server, so that the cloud server determines a corresponding target intention according to its own semantic recognition result for the first voice data and a semantic recognition result, for the first voice data, from the cloud end of the system voice, and converts the target intention into a corresponding voice control event based on each registered atomic capability, wherein the voice control event comprises: at least one atomic capability that the target light application needs to invoke in order to achieve the target intention of the first voice data;
a transmission unit, configured to receive the voice control event through the voice component, and invoke the interface component to send the voice control event to the target light application, so that the target light application invokes the at least one atomic capability to achieve the target intent.
11. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor,
wherein the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 9.
12. A computer storage medium having computer program instructions stored thereon, wherein,
the computer program instructions, when executed by a processor, implement the steps of the method of any one of claims 1 to 9.
CN202210526270.0A 2022-05-16 2022-05-16 Voice control method and device, computer equipment and computer storage medium Active CN114639384B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210526270.0A CN114639384B (en) 2022-05-16 2022-05-16 Voice control method and device, computer equipment and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210526270.0A CN114639384B (en) 2022-05-16 2022-05-16 Voice control method and device, computer equipment and computer storage medium

Publications (2)

Publication Number Publication Date
CN114639384A CN114639384A (en) 2022-06-17
CN114639384B true CN114639384B (en) 2022-08-23

Family

ID=81952758

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210526270.0A Active CN114639384B (en) 2022-05-16 2022-05-16 Voice control method and device, computer equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN114639384B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115277925A (en) * 2022-07-12 2022-11-01 广州汽车集团股份有限公司 Method and system for realizing breakpoint continuous playing of audio data

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107329843B (en) * 2017-06-30 2021-06-01 百度在线网络技术(北京)有限公司 Application program voice control method, device, equipment and storage medium
CN113794800B (en) * 2018-11-23 2022-08-26 华为技术有限公司 Voice control method and electronic equipment
CN110322878A (en) * 2019-07-01 2019-10-11 华为技术有限公司 A kind of sound control method, electronic equipment and system
CN110797022B (en) * 2019-09-06 2023-08-08 腾讯科技(深圳)有限公司 Application control method, device, terminal and server
CN110580904A (en) * 2019-09-29 2019-12-17 百度在线网络技术(北京)有限公司 Method and device for controlling small program through voice, electronic equipment and storage medium
CN112652302A (en) * 2019-10-12 2021-04-13 腾讯科技(深圳)有限公司 Voice control method, device, terminal and storage medium
CN111599358A (en) * 2020-04-09 2020-08-28 华为技术有限公司 Voice interaction method and electronic equipment
CN111724785B (en) * 2020-06-29 2023-07-04 百度在线网络技术(北京)有限公司 Method, device and storage medium for controlling small program voice
CN114067790A (en) * 2020-07-29 2022-02-18 大众问问(北京)信息科技有限公司 Voice information processing method, device, equipment and storage medium
CN112269607A (en) * 2020-11-17 2021-01-26 北京百度网讯科技有限公司 Applet control method, system, server and terminal device
CN113763946A (en) * 2021-01-04 2021-12-07 北京沃东天骏信息技术有限公司 Message processing method, voice processing method, device, terminal and storage medium

Also Published As

Publication number Publication date
CN114639384A (en) 2022-06-17

Similar Documents

Publication Publication Date Title
JP7297836B2 (en) Voice user interface shortcuts for assistant applications
US11822857B2 (en) Architecture for a hub configured to control a second device while a connection to a remote system is unavailable
US11677690B2 (en) Method for providing service by using chatbot and device therefor
CN106558310B (en) Virtual reality voice control method and device
JP2019117623A (en) Voice dialogue method, apparatus, device and storage medium
EP3866011A1 (en) Resource processing method and apparatus for mobile terminal, computer device and storage medium
JP7353497B2 (en) Server-side processing method and server for actively proposing the start of a dialogue, and voice interaction system capable of actively proposing the start of a dialogue
KR102209092B1 (en) Method and system for controlling artificial intelligence device using plurality wake up word
EP3992962A1 (en) Voice interaction method and related device
CN109240107A (en) A kind of control method of electrical equipment, device, electrical equipment and medium
JP6619488B2 (en) Continuous conversation function in artificial intelligence equipment
CN113497909A (en) Equipment interaction method and electronic equipment
CN114639384B (en) Voice control method and device, computer equipment and computer storage medium
KR20200113349A (en) Electronic Device and the Method for Supporting Multitasking thereof
CN110418181B (en) Service processing method and device for smart television, smart device and storage medium
KR20230118164A (en) Combining device or assistant-specific hotwords into a single utterance
CN115150501A (en) Voice interaction method and electronic equipment
CN110659361B (en) Conversation method, device, equipment and medium
US10923122B1 (en) Pausing automatic speech recognition
CN109658924B (en) Session message processing method and device and intelligent equipment
WO2020192245A1 (en) Application starting method and apparatus, and computer system and medium
US11127400B2 (en) Electronic device and method of executing function of electronic device
KR102241792B1 (en) Continuous coversation function in artificial intelligence device
US11533283B1 (en) Voice user interface sharing of content
CN114661218A (en) Call state adjusting method and device, vehicle-mounted terminal and computer program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant