CN113409805A - Man-machine interaction method and device, storage medium and terminal equipment

Publication number: CN113409805A
Application number: CN202011202667.1A
Authority: CN (China)
Inventor: 胡孝波
Applicant/Assignee: Tencent Technology (Shenzhen) Co., Ltd.
Original language: Chinese (zh)
Legal status: Pending

Classifications

    • G10L21/0208: Noise filtering (speech enhancement, e.g. noise reduction or echo cancellation)
    • G10L2021/02082: Noise filtering where the noise is echo or reverberation of the speech
    • G10L2021/02161: Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166: Microphone arrays; beamforming
    • G01S5/22: Position of source determined by co-ordinating a plurality of position lines defined by path-difference measurements
    • G06F16/3329: Natural language query formulation or dialogue systems
    • G06F40/30: Semantic analysis
    • G06N20/00: Machine learning
    • G10L13/02: Methods for producing synthetic speech; speech synthesisers
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G06F17/142: Fast Fourier transforms, e.g. using a Cooley-Tukey type algorithm

Abstract

The application discloses a human-computer interaction method and device, a storage medium and a terminal device, belonging to the technical field of artificial intelligence. The method is applied to a terminal device that integrates a voice interaction component, N service components and a custom acoustic model provided by an access party. The voice interaction component encapsulates the SDKs related to voice interaction, and the N service components are selected by the access party, according to its own product requirements, from a set of service components provided by the developer, each service component providing at least one service for the terminal device. The method comprises: receiving, through the voice interaction component, audio data collected by the custom acoustic model; sending the audio data to a server through the voice interaction component, the audio data instructing the server to perform audio processing and generate response data; and sending the response data returned by the server to a first service component through the voice interaction component. The method and device make flexible and simple intelligent voice interaction possible for the access party.

Description

Man-machine interaction method and device, storage medium and terminal equipment
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a human-computer interaction method, an apparatus, a storage medium, and a terminal device.
Background
Human-computer interaction refers to the process in which a person and a computer exchange information, in a certain interaction mode and using a certain dialogue language, in order to complete a given task. With the rapid development of internet and Internet of Things technologies, intelligent voice interaction has become one of the mainstream modes of human-computer interaction.
Intelligent voice interaction relies on artificial intelligence techniques. Artificial Intelligence (AI) is the theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best results. With the support of AI technology, users can interact with devices directly through voice.
In the context of intelligent voice interaction, flexibility and simplicity of implementation have long been goals pursued by those skilled in the art. That is, how to implement intelligent voice interaction flexibly is a problem that needs to be solved by those skilled in the art.
Disclosure of Invention
The embodiments of the application provide a human-computer interaction method and device, a storage medium and a terminal device, with which flexible and simple intelligent voice interaction can be achieved. The technical solution is as follows:
on one hand, a man-machine interaction method is provided, which is applied to terminal equipment, wherein the terminal equipment is integrated with a voice interaction component, N service components and a user-defined acoustic model provided by an access party; a Software Development Kit (SDK) related to voice interaction is packaged in the voice interaction component; the N business components are selected from a business component set provided by a developer by the access party according to the product requirements of the access party; one of the service components is used for providing at least one service for the terminal equipment, and N is a positive integer;
the method comprises the following steps:
receiving audio data collected by the user-defined acoustic model through the voice interaction component, wherein the audio data is input by a user;
sending the audio data to a server through the voice interaction component, wherein the audio data is used for instructing the server to execute audio processing and generate response data matched with the audio data;
sending the response data returned by the server to a first service component through the voice interaction component; in response to the user voice input being a task-type question, the response data is used to trigger the first service component to perform a target operation indicated by the user voice input.
On the other hand, a man-machine interaction device is provided and is applied to terminal equipment, and the terminal equipment is integrated with a voice interaction component, N service components and a user-defined acoustic model provided by an access party; a Software Development Kit (SDK) related to voice interaction is packaged in the voice interaction component; the N business components are selected from a business component set provided by a developer by the access party according to the product requirements of the access party; one of the service components is used for providing at least one service for the terminal equipment, and N is a positive integer;
the custom acoustic model is configured to collect audio data, the audio data being user speech input;
the voice interaction component is configured to receive audio data collected by the custom acoustic model;
the voice interaction component is further configured to transmit the audio data to a server, wherein the audio data is used for instructing the server to execute audio processing and generate response data matched with the audio data;
the voice interaction component is also configured to send the response data returned by the server to the first service component; in response to the user voice input being a task-type question, the response data is used to trigger the first business component to perform a target operation indicated by the user voice input.
In one possible implementation, the SDK includes: voice recognition SDK, voice synthesis SDK and character recognition SDK;
the path of the acoustic model is configured under the voice interaction component, and the voice interaction component externally provides an audio data receiving interface for waking up the terminal device.
In one possible implementation, the audio data is used to instruct the server to perform the following audio processing:
performing semantic analysis on the audio data, and acquiring semantic skill data of the audio data based on a semantic analysis result, wherein the semantic skill data comprises: the question intention, the knowledge domain to which the question belongs, the question text and the response data.
In one possible implementation, the first business component is configured to display the response data in non-voice form in response to the user voice input not being a task-type question;
the voice interaction component further configured to play the response data in voice form in response to the user voice input not being a task-type question;
the voice interaction component is further configured to display the response data in a non-voice form in response to the user voice input not being a task-type question and the terminal device not integrating the first business component.
In one possible implementation, a long connection is established between the voice interaction component and the server;
the voice interaction component is further configured to receive a push message issued by the server over the long connection, and to notify a second service component to receive the push message by means of directional broadcast, the second service component having registered a callback function or a listener with the voice interaction component in advance.
In one possible implementation, the voice interaction component is configured to notify the first service component in a directional broadcast manner to receive the response data, and the first service component has registered a callback function or a registration listener with the voice interaction component in advance.
In one possible implementation, the apparatus further includes:
the microphone comprises a sound source positioning module, a noise source positioning module and a processing module, wherein the sound source positioning module is configured to acquire a first voice signal acquired by a first microphone, and the first voice signal comprises a first sound source signal and a first noise signal; acquiring a second voice signal acquired by a second microphone, wherein the second voice signal comprises a second sound source signal and a second noise signal; acquiring cross-power spectrums of the first voice signal and the second voice signal on a frequency domain; transforming the cross-power spectrum from a frequency domain to a time domain to obtain a cross-correlation function; determining a time value corresponding to the maximum cross-correlation value as a propagation delay, wherein the propagation delay is an arrival time difference of the voice signal between the first microphone and the second microphone; and carrying out sound source positioning based on the propagation delay, wherein the first microphone and the second microphone are from a microphone array of the terminal equipment.
In one possible implementation, the apparatus further includes:
the echo cancellation module is configured to perform echo cancellation processing on the voice signals received by the microphone array based on the first filter; wherein a filter function of the first filter approaches an impulse response of the loudspeaker to the microphone array infinitely; the voice signal received by the microphone array is determined according to a sound source signal, a noise signal, a voice signal played by the loudspeaker and the impulse response.
In one possible implementation, the apparatus further includes:
the reverberation elimination module is configured to transform the voice signals received by the microphone array from a time domain to a frequency domain to obtain frequency domain signals; carrying out inverse filtering processing on the frequency domain signal based on a second filter to recover a sound source signal; wherein the speech signal received by the microphone array is determined according to a sound source signal, a noise signal and a room impulse response of the sound source.
In another aspect, a terminal device is provided, where the device includes a processor and a memory, where the memory stores at least one program code, and the at least one program code is loaded and executed by the processor to implement the above human-computer interaction method.
In another aspect, a computer-readable storage medium is provided, in which at least one program code is stored, and the at least one program code is loaded and executed by a processor to implement the above-mentioned human-computer interaction method.
In another aspect, a computer program product or a computer program is provided, the computer program product or the computer program comprising computer program code, the computer program code being stored in a computer-readable storage medium, the computer program code being read by a processor of a terminal device from the computer-readable storage medium, the computer program code being executed by the processor such that the terminal device performs the above-mentioned human-computer interaction method.
The technical scheme provided by the embodiment of the application has the following beneficial effects:
aiming at intelligent voice interaction, a developer can realize a voice interaction component and various service components in advance, so that an access party can conveniently carry out free combination access on the service components according to own product requirements, and further own terminal equipment is formed; that is, the access party can freely combine the service components to form its own product scheme, and the access can freely select access or no access in the face of various service components provided by the development party. In other words, the access party can define the functions of the terminal device according to the product requirement of the access party.
The voice interaction-based componentization solution can be conveniently applied to various terminal devices, such as IOT (Internet of Things) devices, screen devices, non-screen devices and the like; the access is simple, the customization of the equipment is strong, the access period of an access party can be shortened as much as possible, the development cost is saved, and the flexibility is strong. In summary, the embodiment of the application provides possibility for the access party to realize flexible and simple intelligent voice interaction.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic structural diagram of a voice interaction system provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of an implementation environment related to a human-computer interaction method provided by an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a target frame platform according to an embodiment of the present disclosure;
FIG. 4 is a flowchart of a human-computer interaction method provided by an embodiment of the present application;
fig. 5 is a schematic diagram of a basic architecture of an acoustic front-end acquisition system provided in an embodiment of the present application;
fig. 6 is a schematic diagram of a sound signal and a microphone array according to an embodiment of the present disclosure;
fig. 7 is a schematic diagram of performing echo cancellation according to an embodiment of the present application;
fig. 8 is a schematic interaction diagram of a terminal device and a backend server according to an embodiment of the present application;
fig. 9 is a schematic diagram of a background server for pushing a message according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a human-computer interaction device according to an embodiment of the present disclosure;
fig. 11 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
It will be understood that the terms "first," "second," and the like as used herein may be used herein to describe various concepts, which are not limited by these terms unless otherwise specified. These terms are only used to distinguish one concept from another. For example, at least one user may be an integer number of users greater than or equal to one, such as one user, two users, three users, and the like. The plurality of users may be two or more, for example, the plurality of users may be two users, three users, or any integer number of users equal to or greater than two.
The embodiment of the application provides a man-machine interaction method, a man-machine interaction device, a storage medium and electronic equipment. The method relates to the field of Artificial Intelligence (AI) and cloud technology.
The AI is a theory, method, technique and application system that simulates, extends and expands human intelligence, senses the environment, acquires knowledge and uses the knowledge to obtain the best results using a digital computer or a machine controlled by a digital computer. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
In detail, the artificial intelligence technology is a comprehensive subject, and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning and the like.
The key technologies of Speech Technology (ST) are Automatic Speech Recognition (ASR), speech synthesis (Text To Speech, TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, and voice is regarded as one of the most promising human-computer interaction modes of the future.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics. Research in this field therefore involves natural language, i.e. the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robot question answering, knowledge graphs, and the like.
In addition, the method also relates to the field of cloud technology. Cloud technology is a hosting technology that unifies hardware, software, network and other resources in a wide area network or local area network to realize the computation, storage, processing and sharing of data. Cloud technology can also be understood as the general term for network technology, information technology, integration technology, management platform technology, application technology and so on based on the cloud computing business model; resources can be pooled and used on demand, which is flexible and convenient. Cloud computing technology will become an important support. The background services of technical network systems, such as video websites, picture websites and other web portals, require large amounts of computing and storage resources. With the development of the internet industry, each item may carry its own identification mark that must be transmitted to a background system for logical processing; data at different levels are processed separately, and all kinds of industry data require strong background system support, which can only be realized through cloud computing.
Illustratively, the method relates to artificial intelligence cloud services in the field of cloud technology. The so-called artificial intelligence cloud service is also generally called AIaaS (AI as a Service). It is currently the mainstream service mode of artificial intelligence platforms; in detail, an AIaaS platform splits several types of common AI services and provides independent or packaged services in the cloud. This service model is similar to an AI-themed app store: all developers can access one or more of the artificial intelligence services provided by the platform through APIs (application programming interfaces), and some qualified developers can also use the AI framework and AI infrastructure provided by the platform to deploy, operate and maintain their own dedicated cloud artificial intelligence services.
Some noun terms or abbreviations related to the embodiments of the present application are described below.
Voice interaction: the method refers to a new generation of interaction mode based on the voice input of a user, and the user can obtain a feedback result given by a machine by speaking. A more typical application scenario is a voice assistant.
Componentization: also called modularization, it means that code belonging to the same function/service is isolated or split into an independent module located at the service layer. Exemplary components include, but are not limited to: desktop components, setting components, notification bar components, music components, video call components, video components, account components, and components formed by various business applications.
Fig. 1 is a schematic structural diagram of a voice interaction system according to an embodiment of the present application.
Referring to fig. 1, a complete speech interactive system comprises: hardware 101, software system 102, acoustic model (also called front-end acoustic acquisition model or front-end acoustic acquisition system) 103, voice AI assistant 104, and upper business Application (APP) 105.
The hardware 101 may be terminal device hardware of the access party, and the software system 102 may be an operating system of the terminal device.
The first point to be noted is that the acoustic model 103 can be customized: it may be an acoustic model provided by the developer, or an acoustic model customized by the access party. In other words, in the embodiments of the application, besides the developer providing an acoustic model for the access party, the access party is also supported in using its own customized acoustic model, i.e. an access mode for custom acoustic models is provided. Since different access parties may require different acoustic models depending on their hardware, the solution provides the capability of connecting an acoustic model defined by the access party to the target framework platform, so that the audio collected by the acoustic model 103 is transferred to the target framework platform for subsequent audio processing.
The second point to be noted is that the target framework platform, also referred to herein as the voice interaction component, provides the voice interaction function and is the core of the voice-interaction-based terminal componentization solution. Illustratively, the voice AI assistant 104 is encapsulated within the target framework platform; the voice interaction development kits include, but are not limited to: a speech recognition SDK (Software Development Kit), a speech synthesis SDK, a recording recognition SDK, a text recognition SDK, and the like.
The various upper-layer service applications 105 are referred to herein as service components or functional modules. The functional implementation of these service components or functional modules also depends on the target framework platform.
Fig. 2 is a schematic diagram of an implementation environment related to a human-computer interaction method provided in an embodiment of the present application.
Illustratively, referring to FIG. 2, the implementation environment includes: a user 201, a terminal device 202 and a server 203. Wherein, a target frame platform 2021 is integrated in the terminal device 202.
In a possible implementation manner, after audio data is collected by an acoustic model on the terminal device, the audio data is transmitted to the target framework platform 2021, and the target framework platform 2021 is responsible for transmitting the audio data to the server 203 through the gateway interface for audio processing; further, after the processing is completed, the server 203 returns a processing result to the target frame platform 2021; illustratively, the target framework platform 2021, after receiving the processing result returned by the server 203, further distributes the processing result to the business component or the functional module for data processing.
In a possible implementation manner, the server 203 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and an artificial intelligence platform. The terminal device 202 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal device 202 and the server 203 may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
The following explains possible application scenarios of the human-computer interaction method provided by the embodiment of the application.
In a broad sense, the solution can be applied to any scenario that requires voice-AI-based interaction, such as access control machines, smart home products and household appliances. These products can be accessed quickly and in a customizable way through the solution. That is, the embodiments of the application provide the concept of componentized access.
With this solution, once the target framework platform and the individual service components have been implemented, access is convenient, and the various service components can be freely combined and accessed according to the access party's own product requirements; that is, access parties can freely combine service components to form their own product schemes. Any access party must access the target framework platform, while for common components such as the desktop component, setting component, account component and video call component, each access party can conveniently choose whether or not to access them according to its own product requirements. This kind of access is called combined access. The target framework platform, also referred to herein as the voice interaction component, provides the voice interaction function and is the core of the voice-interaction-based terminal componentization solution. Illustratively, the voice AI assistant is encapsulated in the target framework platform, i.e. software development kits related to voice interaction are encapsulated in the target framework platform; these include, but are not limited to: a speech recognition SDK, a speech synthesis SDK, a recording recognition SDK, a text recognition SDK, and the like.
Combined access means that when each access party implements its own terminal device, it can freely select service components for access according to its own product requirements. This self-defined, self-combined mode is combined access.
Fig. 3 is a schematic structural diagram of a target frame platform according to an embodiment of the present disclosure.
In the embodiment of the application, after the terminal device is awakened through the front-end acoustic model, the audio data acquired by the front-end acoustic model can be subjected to audio processing by using the target framework platform.
In this solution, the various background functions are encapsulated and interfaced through the target framework platform, so that the access party does not need to care about the specific implementation details of voice interaction; functions are passed through transparently by the target framework platform. That is, the target framework platform is the intermediary for interaction between the background part and the terminal device. The part above the target framework platform in fig. 3 belongs to the background part. As shown in FIG. 3, the left portion 301 above the target framework platform is strongly related to speech and semantics, including but not limited to: the skill configuration service, ASR (Automatic Speech Recognition), NLP and TTS (Text To Speech). The right portion 302 above the target framework platform provides various skill services. As shown in fig. 3, the skill services provided include, but are not limited to: an account service, a PUSH service, a weather service, an alarm service, an operation service, and the like, which are not specifically limited in the embodiments of the application.
Wherein the skills configuration service is used to configure skills that are strongly related to speech and semantics. For example, after the voice is converted into the semantic meaning, the target framework platform may obtain various service messages through the provided skill services, for example, may obtain a weather message through a weather service, and then issue the received weather message to a weather component accessed by the terminal device.
Referring to fig. 3, in the present solution, besides providing a target framework platform for an access party, a plurality of self-contained service components 303 are provided, so that each access party can freely select a service component according to its product requirement and access to its own terminal device. Illustratively, the business components provided by the present solution include, but are not limited to: desktop components, setup components, notification bar components, account components, video call components, video components, music components, IOT components, and the like.
In addition, the target framework platform interacts with the background server and receives various service messages 304 sent by the background server. Illustratively, these service messages include, but are not limited to: TTS message, PUSH message, voice recognition message, weather message, account message, video call message, authorization message, wakeup message, broadcast control message, and the like, which is not specifically limited in this embodiment of the present application.
Based on the target framework platform shown in fig. 3, the present solution can support three docking schemes of the access party.
1. Componentized access mode. The developer of the solution implements the various service components, such as the desktop component, setting component, notification bar component and video call component; the access party only needs to combine whichever service components its product requires. The target framework platform is the one component that must always be carried; with it a complete AI voice interaction process can be realized, hardware-related characteristics are shielded, and the access party only needs to configure the service components according to its own product requirements.
2. SDK access mode. For access parties with development capability, or access parties that need to implement a variety of interactions, the SDK access mode is supported at the same time; the SDK access scheme is not detailed herein.
3. Whole-machine access mode. For an access party that has no development experience, or does not want to invest in development but still wants to realize voice interaction quickly, the solution implements a full-process package covering the front-end acoustic model, ASR, NLP, TTS, the various skill configuration services and the various AI voice interaction processes, so that the access party does not need to develop anything; in other words, the solution also supports the developer providing a whole machine, such as a smart speaker, for the access party to use. In some embodiments, the developer may select different functional components for whole-machine integration based on the various application scenarios of terminal devices, so as to provide different access parties with terminal devices offering different functional services to select and use, which is not specifically limited in the embodiments of the application.
It should be noted that the implementation of any docking scheme involves speech acquisition, speech recognition, TTS, semantic understanding, and semantic skill execution. In particular, for a componentized access scheme, messaging and state management techniques between multiple components are involved; a User Interface (UI), an audio/video and background double-channel synchronization technology; TTS, ASR, NLP and semantic to skill technology; long connection and PUSH channel technology of App; audio acquisition and echo cancellation techniques, etc.
The man-machine interaction method provided by the embodiment of the present application is explained in detail below by taking a componentized access manner as an example.
Fig. 4 is a flowchart of a human-computer interaction method according to an embodiment of the present disclosure. The execution main body of the method is terminal equipment, and the terminal equipment is access side equipment. The terminal device is integrated with a voice interaction component (namely the target framework platform), N service components and a custom acoustic model provided by an access party. Wherein, the voice interaction component is used for providing voice interaction functions. Such as the voice interaction component packaged with a software development kit associated with voice interaction. Illustratively, the software development kit includes at least: speech recognition SDK, speech synthesis SDK, and text recognition SDK. In addition, the N service components are selected from a service component set provided by a developer by an access party according to the product requirements of the access party; one business component is used for providing a service for the terminal equipment; for example, the video call module provides the video call function for the terminal device. Referring to fig. 4, a method flow provided by the embodiment of the present application includes:
401. A voice interaction component of the terminal device receives audio data collected by the custom acoustic model, the audio data being user voice input.
In the embodiment of the application, the acoustic model of the terminal device is responsible for collecting the sound. And audio data collected by the acoustic model is transferred to the voice interaction component. That is, the audio data received by the voice interaction component is captured by the acoustic model.
In one possible implementation, the audio data is user voice input. It may be a task-type question, such as asking to play a piece of music; it may be a question-type (knowledge) question, such as asking what the weather is like today; or it may be a chat-type question without any particular purpose, which is not specifically limited in the embodiments of the application.
It should be noted that, besides the acoustic model provided by the developer for the access party, the acoustic model may be a custom acoustic model provided by the access party. The path of the custom acoustic model is configured under the voice interaction component, and the voice interaction component externally provides an audio data receiving interface for waking up the terminal device. In other words, the solution not only provides the whole-machine acoustic model, but also supports access by the access party's own custom acoustic model. The detailed process is as follows: the access party prepares a custom acoustic model; the path of the custom acoustic model is set under the target framework platform; the target framework platform provides an interface for wake-up, and this interface is implemented by the access party; based on this interface implementation, the audio data collected by the custom acoustic model can be transmitted to the target framework platform.
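The sketch below (written in Python purely for readability) illustrates the wiring just described: the access party plugs its custom acoustic model into the platform and hands wake-up audio over through a receiving interface. All names here (VoiceInteractionComponent, register_acoustic_model, on_wakeup_audio, the model path) are hypothetical illustrations, not actual interfaces of the target framework platform.

```python
# Illustrative sketch only: the class and method names are hypothetical.

class CustomAcousticModel:
    """Access party's own front-end acoustic model."""

    def __init__(self, model_path: str):
        # Path of the custom acoustic model, configured under the
        # voice interaction component (target framework platform).
        self.model_path = model_path

    def collect_audio(self) -> bytes:
        # On a real device this would return microphone audio after
        # front-end processing (echo/reverberation cancellation, etc.).
        return b"\x00\x01\x02"


class VoiceInteractionComponent:
    """Target framework platform exposing an audio-receiving interface."""

    def __init__(self):
        self._acoustic_model = None

    def register_acoustic_model(self, model: CustomAcousticModel) -> None:
        # The access party plugs its custom acoustic model into the platform.
        self._acoustic_model = model

    def on_wakeup_audio(self, audio: bytes) -> None:
        # Implemented by the access party: called when the wake-up keyword
        # is detected, handing the collected audio to the platform.
        print(f"received {len(audio)} bytes of wake-up audio")


# Usage: the access party wires its model to the platform.
platform = VoiceInteractionComponent()
model = CustomAcousticModel(model_path="/vendor/acoustic/custom_model")
platform.register_acoustic_model(model)
platform.on_wakeup_audio(model.collect_audio())
```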
Based on the above steps, the terminal device is woken up based on the custom acoustic model: the user can wake the terminal device by speaking the wake-up keyword to its microphone. The embodiments of the application thus provide a flexible and diverse front-end acoustic model access scheme.
In another possible implementation, the basic architecture of the acoustic front-end acquisition system is shown in fig. 5 and includes multiple parts, such as multi-microphone pickup, echo cancellation, microphone array beamforming, far-field reverberation control, nonlinear noise suppression and keyword wake-up. In far-field voice interaction, sound attenuates during transmission and is accompanied by noise, interference signals and the like, so a microphone array is currently adopted to enhance the target sound, suppress interfering sounds and improve recognition accuracy. Fig. 6 shows 6 microphones, namely microphone 1, microphone 2, microphone 3, microphone 4, microphone 5 and microphone 6, which form a microphone array. In addition, as shown in fig. 6, besides the sound signal emitted by the sound source, the environment generally also contains reflections of the sound source signal, microphone echo, ambient noise and other ambient interference.
In another possible implementation manner, the acoustic model further includes sound source localization, echo cancellation, and reverberation cancellation before the collected audio data is transferred to the target frame platform.
Illustratively, sound source localization can be performed using the generalized cross-correlation method, which is essentially a TDOA (Time Difference Of Arrival) calculation method: the cross-power spectrum of two speech signals is calculated in the frequency domain and then converted from the frequency domain to the time domain by an inverse Fourier transform, and the propagation delay is determined by finding the delay corresponding to the maximum cross-correlation value.
For the embodiment of the present application, the detailed process of sound source localization may be: acquiring a first voice signal acquired by a first microphone of a microphone array of a terminal device, wherein the first voice signal comprises a first sound source signal and a first noise signal; acquiring a second voice signal acquired by a second microphone of the microphone array, wherein the second voice signal comprises a second sound source signal and a second noise signal; acquiring cross-power spectrums of the first voice signal and the second voice signal on a frequency domain; transforming the cross-power spectrum from a frequency domain to a time domain to obtain a cross-correlation function; determining a time value corresponding to the maximum cross-correlation value as a propagation delay, wherein the propagation delay is an arrival time difference of the voice signal between the first microphone and the second microphone; and carrying out sound source positioning based on the propagation delay. For example, after the propagation delay is determined, the sound source position may be determined according to the sound velocity and the distance between the two microphones, which is not specifically limited in the embodiment of the present application.
For example, two microphones spaced apart by a distance L may be used to capture the sound signal from a distant source, expressed by the following equations: x1(t) = s1(t) + z1(t), x2(t) = s2(t + D) + z2(t).
Here s1(t) and s2(t + D) are the sound source signals received by the two microphones, and z1(t) and z2(t) are the noise signals received by the two microphones. As shown in fig. 7, after the sound signals x1(t) and x2(t) are collected, they are first filtered by filters H1(t) and H2(t); an FFT (Fast Fourier Transform) is then performed and the power spectra are computed to obtain X1(ω) and X2(ω); the cross-power spectrum of the two speech signals is computed in the frequency domain and converted back to the time domain by an IFFT (Inverse Fast Fourier Transform), yielding the propagation delay D, which approximates the real delay. Φ(ω) in fig. 7 is a phase transformation (weighting) function, which can be selected according to the actual situation. After sound source localization, a beam pattern can be designed to point at the sound source, reducing noise and interfering sources and improving the signal-to-interference-plus-noise ratio.
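The following NumPy sketch illustrates the generalized cross-correlation idea described above, using the PHAT weighting as one possible choice of the phase transform Φ(ω). It is an assumed, simplified illustration of TDOA estimation, not the implementation of this application.

```python
import numpy as np

def gcc_phat_delay(x1: np.ndarray, x2: np.ndarray, fs: float) -> float:
    """Estimate the arrival-time difference (TDOA) between two microphone
    signals via generalized cross-correlation with PHAT weighting."""
    n = x1.size + x2.size
    # FFT of both signals (frequency domain).
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    # Cross-power spectrum, weighted by the phase transform (PHAT).
    cross_spectrum = X1 * np.conj(X2)
    cross_spectrum /= np.abs(cross_spectrum) + 1e-12
    # Back to the time domain: the cross-correlation function.
    cc = np.fft.irfft(cross_spectrum, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    # Lag of the maximum cross-correlation value is the propagation delay.
    delay_samples = np.argmax(np.abs(cc)) - max_shift
    return delay_samples / fs

# Example: x2 is x1 delayed by 5 samples at a 16 kHz sampling rate.
fs = 16000.0
t = np.arange(1024)
x1 = np.sin(2 * np.pi * 300 * t / fs)
x2 = np.roll(x1, 5)
# Magnitude is roughly 5 / 16000 s; the sign depends on the correlation convention.
print(gcc_phat_delay(x1, x2, fs))
```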
Regarding echo cancellation: in a terminal device based on voice interaction, during operation, for example while playing audio or video, the device sometimes needs to respond to a voice interaction instruction input by the user. In this case, the sound signal played by the local loudspeaker must be removed from the voice signal received by the microphone array, so that the terminal device can correctly recognize the voice interaction instruction input by the user. In one possible implementation, the sound signal played by the local loudspeaker can be modeled using the loudspeaker-to-microphone-array impulse response: x(t) = n(t) * r(t) + s(t) + z(t).
Here n(t) is the speech signal played by the loudspeaker, r(t) is the impulse response from the loudspeaker to the microphone array, s(t) is the real sound source signal, z(t) is the noise signal, and x(t) is the speech signal finally received by the microphone array (* denotes convolution). Illustratively, the echo cancellation process consists of obtaining a filter H(t) that approaches the true impulse response r(t) as closely as possible:
f(t) = x(t) - n(t) * H(t)
     = n(t) * r(t) - n(t) * H(t) + s(t) + z(t)
The filter H(t), also referred to herein as the first filter, is used to perform echo cancellation processing on the speech signal received by the microphone array; the filter function of the first filter approaches the loudspeaker-to-microphone-array impulse response as closely as possible. The speech signal x(t) received by the microphone array is determined by the sound source signal s(t), the noise signal z(t), the speech signal n(t) played by the loudspeaker and the loudspeaker-to-microphone-array impulse response r(t).
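The application only states that a filter H(t) approximating the true impulse response r(t) is obtained. The sketch below assumes a standard NLMS adaptive filter as one common way to estimate such a filter and subtract the estimated echo; it is purely illustrative and not part of the claimed implementation.

```python
import numpy as np

def nlms_echo_cancel(mic: np.ndarray, ref: np.ndarray,
                     taps: int = 128, mu: float = 0.5) -> np.ndarray:
    """Adaptively estimate a filter H approximating the loudspeaker-to-microphone
    impulse response r, then subtract the estimated echo n*H from the microphone
    signal x, leaving f = x - n*H (sound source plus residual noise)."""
    h = np.zeros(taps)          # current estimate of the echo path
    out = np.zeros_like(mic)
    eps = 1e-8
    for i in range(taps, mic.size):
        n_block = ref[i - taps:i][::-1]      # most recent loudspeaker samples n(t)
        echo_est = h @ n_block               # estimated echo n(t) * H(t)
        err = mic[i] - echo_est              # f(t): echo-cancelled sample
        out[i] = err
        # NLMS update, pushing H(t) toward the true impulse response r(t).
        h += mu * err * n_block / (n_block @ n_block + eps)
    return out

# Example: the microphone hears the loudspeaker through a short simulated echo path.
rng = np.random.default_rng(0)
ref = rng.standard_normal(16000)                     # signal played by the loudspeaker
r = np.array([0.6, 0.3, 0.1])                        # simulated impulse response r(t)
echo = np.convolve(ref, r)[:ref.size]
mic = echo + 0.01 * rng.standard_normal(ref.size)    # x(t) with a little noise
clean = nlms_echo_cancel(mic, ref)
print(float(np.mean(clean[2000:] ** 2)))             # residual echo power after adaptation
```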
Regarding reverberation cancellation: reverberation refers to the phenomenon in which a sound signal meets obstacles such as the ground and walls, forms reflected sound, and is superposed on the real sound source signal. Illustratively, reverberation can be modeled by the Room Impulse Response (RIR).
The final speech signal received by the microphone array is determined by the sound source signal, the noise signal and the room impulse response of the sound source, that is, x(t) = r(t) * s(t) + z(t), where x(t) is the final speech signal received by the microphone array, s(t) is the real sound source signal, r(t) is the room impulse response of the sound source, and z(t) is the noise signal.
Converting x(t) to the frequency domain gives X(ω) = R(ω)S(ω) + Z(ω).
Illustratively, the reverberation cancellation method may consist of estimating a filter that approximates the inverse of R(ω), so that applying it to X(ω) recovers S(ω), thereby achieving the purpose of eliminating the reverberant sound.
This method of removing reverberation is called the inverse filtering method. The estimated inverse filter is also referred to herein as the second filter: after the speech signal received by the microphone array is converted from the time domain to the frequency domain, the embodiments of the application perform inverse filtering on the resulting frequency-domain signal based on the second filter, so as to recover the real sound source signal s(t).
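The sketch below illustrates the inverse filtering idea under two simplifying assumptions that the application does not specify: that an estimate of the room impulse response is available, and that a small regularization term is used to avoid dividing by near-zero frequency bins.

```python
import numpy as np

def inverse_filter_dereverb(x: np.ndarray, rir: np.ndarray,
                            reg: float = 1e-3) -> np.ndarray:
    """Recover an estimate of the source s from x = r * s + z by inverse
    filtering in the frequency domain: S_hat(w) = X(w) / R(w), regularized."""
    n = x.size
    X = np.fft.rfft(x, n=n)                  # time domain -> frequency domain
    R = np.fft.rfft(rir, n=n)                # room impulse response spectrum
    # Regularized inverse filter approximating 1 / R(w) (the "second filter").
    G = np.conj(R) / (np.abs(R) ** 2 + reg)
    return np.fft.irfft(G * X, n=n)          # back to the time domain

# Example: a source convolved with a toy room impulse response.
rng = np.random.default_rng(1)
s = rng.standard_normal(4096)                # "true" sound source signal s(t)
rir = np.zeros(256); rir[0] = 1.0; rir[40] = 0.5; rir[120] = 0.25
x = np.convolve(s, rir)[:s.size] + 0.001 * rng.standard_normal(s.size)
s_hat = inverse_filter_dereverb(x, rir)
print(float(np.corrcoef(s, s_hat)[0, 1]))    # close to 1 if dereverberation worked
```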
It should be noted that, through the above method steps, the user can wake up the terminal device by speaking the wake-up keyword towards the microphone array.
402. The voice interaction component of the terminal device sends the audio data to the server, the audio data instructing the server to perform audio processing and generate response data matched with the audio data.
The voice interaction component of the terminal equipment interacts with the background server to obtain response data matched with the received audio data.
Wherein the response data includes a speech form and a non-speech form. For example, for a non-voice form, the response data may be UI data, which is not specifically limited in this embodiment of the application.
In this embodiment of the application, as shown in fig. 8, audio data received by the acoustic model of the terminal device 801 is transferred to the voice interaction component, and the voice interaction component transfers the audio data to the backend server 802 through the gateway interface of the terminal device for audio processing. The terminal device sends the audio data received by the acoustic model to the background server through the voice interaction component, and the audio data is used for indicating the background server to perform audio processing and generate response data matched with the audio data.
In one possible implementation, fig. 8 illustrates a specific architecture of background server 802. As shown in fig. 8, backend servers 802 include, but are not limited to: AIProxy, ASR module, TTS module, target service module, TSKM platform (skills configuration platform) and skills service module.
Based on the background server architecture shown in fig. 8, the background server may perform the following audio processing on the received audio:
4021. The voice interaction component of the terminal device transmits the received audio data to AIProxy through the gateway interface.
4022. AIProxy passes the received audio data to the ASR module, and the ASR module performs speech-to-text processing on the audio data to obtain text data, i.e. the ASR text.
4023. AIProxy passes the received audio data to the target service module, and semantic skill data is then obtained based on the TSKM platform and the skill service module. In this step, semantic analysis is performed on the audio data, i.e. speech-to-semantics processing; for example, the speech-to-semantics processing may be performed by the target service module, which is not specifically limited in the embodiments of the application.
Illustratively, the TSKM platform is also referred to as a skill configuration platform for configuring skills strongly related to semantics, such as weather skills, alarm clock skills, music skills, and the like; and the acquisition of skill data such as weather messages also needs the support of skill service, namely, the TSKM platform pulls specific skill data based on semantic parsing results through interaction with a skill service module. For example, if the semantic parsing result is weather search, the weather message of the current position of the user is pulled through interaction with the weather service.
As an example, the semantic skill data includes: the question intention, the knowledge domain to which the question belongs, the question text and the response data in non-voice form. For instance, assuming the received audio data is "S city weather", the corresponding semantic skill data may include: domain - weather, intention - comprehensive search, question - S city weather, response data - there is a lightning warning in S city today. As shown above, the semantic skill data includes the question intention (intent), the knowledge domain (domain) to which the question belongs, the question text (query), and the non-voice response data "there is a lightning warning in S city today".
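Purely as an illustration, the "S city weather" example above could be represented by a structure such as the following; the field names are hypothetical and do not reflect the actual message format exchanged between the background server and the voice interaction component.

```python
from dataclasses import dataclass

@dataclass
class SemanticSkillData:
    """Semantic skill data returned for one round of voice input."""
    domain: str          # knowledge domain the question belongs to
    intent: str          # question intention
    query: str           # recognized question text
    response: str        # non-voice response data

# The "S city weather" example from the text, expressed in this structure.
payload = SemanticSkillData(
    domain="weather",
    intent="comprehensive search",
    query="S city weather",
    response="There is a lightning warning in S city today.",
)
print(payload.domain, "->", payload.response)
```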
4024. The TTS module carries out text-to-speech processing on the non-speech response data to obtain TTS data, namely the speech response data.
In this embodiment of the application, the background server may return the ASR text, TTS data, and response data to the voice interaction component of the terminal device. And the voice interaction component can respond to the voice input of the user based on the data when receiving the data.
Referring to fig. 8, for interaction between a voice interaction component and each service component of a terminal device, the embodiment of the present application provides three callback modes. That is, after the voice interaction component receives the data returned by the background server 802, the embodiment of the present application provides three interaction schemes between the components.
For the interaction mode of cross-process message distribution, if data returned by the background server 802 needs to be transmitted to a certain service component integrated by the terminal device, the service component needs to register a callback function or register a listener when being started, so that the voice interaction component performs remote callback after receiving the data sent by the background server. That is, the embodiment of the present application further includes the following step 403.
403. The voice interaction component of the terminal device sends the response data to the first service component, and in response to the user voice input being a task-type question, the response data is used to trigger the first service component to perform the target operation indicated by the user voice input.
That is, the response data is used to trigger the first business component to respond to the user speech input. For example, the music component is instructed to play a certain piece of music or the alarm clock component rings at a target time, and the like, which is not specifically limited in this embodiment of the application.
In one possible implementation manner, the response data is sent to the first service component through the voice interaction component, which includes but is not limited to: informing the first service component to receive the response data in a directional broadcast mode through a voice interaction component; wherein the first service component has previously registered a callback function (callback) or a registration listener (listener) with the speech interaction component.
A callback function is a function called through a function pointer. If a pointer to a function is passed as a parameter to another function, and that pointer is later used to call the function it points to, the called function is a callback function. A callback function is not called directly by the side that implements it; instead, it is called by the other side when a specific event or condition occurs and is used to respond to that event or condition. In most cases, a callback is simply a method that is invoked when certain events occur. For example, a user may want to buy something in a store, but the item happens to be out of stock; the user leaves contact information with a clerk, the goods arrive a few days later, the clerk contacts the user using the contact information provided, and the user then goes to the store to pick up the goods. In this example, the user's contact information is the callback function; leaving the contact information with the clerk is registering the callback function; the goods arriving at the store is the event that triggers the callback; the clerk notifying the user is calling the callback function; and the user going to the store to pick up the goods is the response to the callback event.
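The following is a minimal sketch of this registration pattern, assuming a hypothetical listener interface exposed by the voice interaction component; the names VoiceInteractionComponent, OnResponseListener, and dispatch are illustrative only and do not come from the actual SDK.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a business component registers a listener with the
// voice interaction component at startup so it can be called back when
// response data arrives from the background server.
public class VoiceInteractionComponent {

    // Callback contract that business components implement (illustrative).
    public interface OnResponseListener {
        void onResponse(String responseData);
    }

    private final Map<String, OnResponseListener> listeners = new ConcurrentHashMap<>();

    // Called by a business component (e.g. the music component) when it starts.
    public void registerListener(String componentName, OnResponseListener listener) {
        listeners.put(componentName, listener);
    }

    // Called after response data is received from the background server; only
    // the component targeted by the skill result is called back.
    public void dispatch(String targetComponent, String responseData) {
        OnResponseListener listener = listeners.get(targetComponent);
        if (listener != null) {
            listener.onResponse(responseData);
        }
    }
}
```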
In the embodiment of the application, the user's voice input may be directed to a task scenario, where a task-type dialog is intended to complete a specific task; for example, booking an air ticket requires not only answering the user but also querying ticket availability and executing the corresponding action. Additionally, the user's voice input may also be directed to non-task scenarios, such as question-and-answer scenarios and chit-chat scenarios. A question-and-answer dialog is mainly used to answer user questions, much like an encyclopedic knowledge base, for example how to refund a train ticket or what to pay attention to when taking a flight; generally it only needs to answer questions without executing tasks. Chit-chat dialogs are open-ended, generally without a task goal and without a fixed answer.
For different types of user voice input, the manner in which the response data triggers the first service component to respond also differs. In another possible implementation manner, in response to the user's voice input not being a task-type question, the response data is used to trigger the first service component to display the response data in non-voice form through a UI template. It should be noted that, in addition to the above callback method for cross-process message distribution, the embodiment of the present application further includes a callback method based on the original thread, with the detailed process as follows.
In another possible implementation manner, the voice interaction component of the terminal device plays the response data in the form of voice.
In other words, since the voice interaction component encapsulates various voice interaction development kits, it supports voice playback, so response data in voice form can be played through the voice interaction component. For example, assuming the user's voice input is not a task-type question, the response data may be played in voice form through the voice interaction component.
Illustratively, response data is usually returned from the background server to the terminal device as the result of a network request, so the callback can be performed directly on the thread that initiated the network request.
It should be noted that, in addition to the above callback method for cross-process message distribution, the embodiment of the present application further includes a callback method for a UI thread, which is described in detail below.
In another possible implementation manner, the voice interaction component of the terminal device displays the non-voice form response data through the UI template.
In the embodiment of the application, data that needs to be displayed based on the UI template, for example response data in the form of UI data, may be called back to the UI thread of the voice interaction component and then presented interactively through the UI template provided by the voice interaction component. Illustratively, assuming that the user's voice input is not a task-type question and the terminal device does not integrate the corresponding business component, the non-voice form of the response data may be displayed through the UI template of the voice interaction component.
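As a sketch of the UI-thread callback described above, assuming an Android-style main looper; the showWithUiTemplate method is a hypothetical placeholder for the UI template rendering of the voice interaction component.

```java
import android.os.Handler;
import android.os.Looper;

// Hypothetical sketch: posting UI data back onto the UI thread of the voice
// interaction component so it can be rendered through the UI template.
public class UiCallbackHelper {
    private final Handler uiHandler = new Handler(Looper.getMainLooper());

    public void callbackOnUiThread(final String uiData) {
        uiHandler.post(new Runnable() {
            @Override
            public void run() {
                // Render the non-voice response through the UI template (illustrative).
                showWithUiTemplate(uiData);
            }
        });
    }

    private void showWithUiTemplate(String uiData) {
        // Placeholder for the UI-template rendering of the voice interaction component.
    }
}
```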
In this scheme, the data issued by the background server is called back to the corresponding target, which provides greater convenience for the access party.
The method provided by the embodiment of the application has at least the following beneficial effects:
For intelligent voice interaction, the developer can implement the voice interaction component and various service components in advance, so that the access party can conveniently combine and access the service components freely according to its own product requirements, thereby forming its own terminal device; that is, access parties can freely combine the business components to form their own product solutions. Facing the various service components provided by the developer, the access party is free to choose which to access and which not to access. In other words, the access party may define the functions of its terminal device according to its product requirements. This componentized solution based on voice interaction can be conveniently applied to various intelligent terminal devices, such as IoT (Internet of Things) devices, devices with a screen, devices without a screen, and the like; access is simple, device customization is strong, the access cycle of the access party can be shortened as much as possible, development costs are saved, and flexibility is high. In summary, the embodiment of the application makes flexible and simple intelligent voice interaction possible for the access party.
In another embodiment, for the interaction between the voice interaction component and the background server, in addition to the active request-and-response manner, the embodiment of the present application also supports a PUSH mechanism. Illustratively, the scheme provides a centralized, registration-based PUSH transceiving mode. As shown in fig. 9, a PUSH management channel is set in the voice interaction component to receive, in real time, messages pushed by the background server to the terminal device. A long connection is established between the voice interaction component and the background server; as an example, in this scheme PushManager is used to establish the long connection with the background server. PushManager provides, among other things, the capability to receive push messages from the background server.
In the embodiment of the application, the PUSH functions required by each service component, the voice interaction component, and the like integrated on the terminal device are all supported by the PushManager component; this ensures that the system has only one unified PUSH channel, which is not only convenient to manage but also saves memory and network traffic. Based on the above description, the method provided in the embodiment of the present application further includes: the terminal device receives, through the voice interaction component, a push message issued by the server over the long connection; and notifies, through the voice interaction component, a second service component to receive the push message in a directional broadcast manner; where the second service component has previously registered a callback function or a listener with the voice interaction component.
In other words, in order for each service component integrated in the terminal device to receive push messages from the background server, the service components also need to register with the voice interaction component; after receiving a push message, the voice interaction component can then notify the service component that needs to receive it in a directional broadcast manner. The service component receiving the push message does not need to run as a resident process, which saves memory and Central Processing Unit (CPU) overhead and improves performance.
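A minimal sketch of this directed push dispatch is shown below, assuming an Android-style directed broadcast (an explicit target package); the action name and the "payload" extra are illustrative and not part of any published protocol of this scheme.

```java
import android.content.Context;
import android.content.Intent;

// Hypothetical sketch: after PushManager receives a push message over the long
// connection, the voice interaction component forwards it to the registered
// business component with a directed (explicit-package) broadcast, so the
// receiver does not have to run as a resident process.
public class PushDispatcher {
    // Action and extra names are illustrative only.
    public static final String ACTION_PUSH = "com.example.voice.ACTION_PUSH_MESSAGE";

    public void dispatchPush(Context context, String targetPackage, String pushPayload) {
        Intent intent = new Intent(ACTION_PUSH);
        intent.setPackage(targetPackage); // directed broadcast: only this package receives it
        intent.putExtra("payload", pushPayload);
        context.sendBroadcast(intent);
    }
}
```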
The following specifically exemplifies a componentized access scheme based on intelligent voice interaction provided in the embodiment of the present application.
After the developer has implemented the voice interaction component and the various service components, the access party can freely combine and access the service components according to its own product requirements. Illustratively, the developer can implement components such as a desktop component, a settings component, an account component, and a video call component. The access party can freely combine the service components to form its own product scheme. In addition, the embodiment of the present application also supports the access party customizing service components according to the external interface provided by the developer, which is not specifically limited in this embodiment of the present application.
The following illustrates the componentized access scheme by taking access to a video call component as an example. If the access party wants its own terminal device to have a video call function, it can access the video call component (video call) and the voice interaction component (target framework platform) provided by the developer, that is, integrate these two components on its terminal device. Illustratively, the video call process based on these two components is as follows:
Step a: the acoustic model of the terminal device collects the user's voice, such as "make a call".
Step b: the acoustic model transmits the collected user voice to the voice interaction component.
Step c: the voice interaction component transmits the user voice to the background server.
Step d: the background server performs voice-to-semantic processing and skill service matching on the received user voice to obtain response data, and sends the response data to the voice interaction component.
Step e: the voice interaction component notifies the video call component, in a broadcast manner, to receive the response data.
Step f: the response data starts the interaction flow of the video call component to initiate a call; that is, the video call component initiates a video call, and the call proceeds after the other party answers.
Illustratively, the broadcast manner is a directional broadcast manner; a receiving process using directional broadcast does not need to be resident, which saves system resources. That is, the current broadcast implementation is a directed broadcast, i.e., only the designated process can receive the broadcast. The same access logic applies to the other service components. This combined access scheme can quickly assemble a minimal set of functions from atomic requirements, which both meets the access party's requirements and is convenient.
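For illustration, the receiving side of step e and step f might look like the following sketch, assuming an Android-style broadcast receiver inside the video call component; the "payload" extra and the startCallFlow method are hypothetical.

```java
import android.content.BroadcastReceiver;
import android.content.Context;
import android.content.Intent;

// Hypothetical sketch: the video call component registers a BroadcastReceiver
// for the directed broadcast sent by the voice interaction component and starts
// its call flow when the "make a call" response data arrives.
public class VideoCallReceiver extends BroadcastReceiver {
    @Override
    public void onReceive(Context context, Intent intent) {
        String responseData = intent.getStringExtra("payload");
        if (responseData != null) {
            startCallFlow(context, responseData);
        }
    }

    private void startCallFlow(Context context, String responseData) {
        // Parse the callee from the response data and initiate the video call
        // (placeholder for the actual interaction flow of the video call component).
    }
}
```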
Fig. 10 is a schematic structural diagram of a human-computer interaction apparatus according to an embodiment of the present application. The apparatus is applied to a terminal device. Referring to fig. 10, the terminal device integrates a voice interaction component 1001, N service components 1002, and a custom acoustic model 1003 provided by an access party; a software development kit (SDK) related to voice interaction is packaged in the voice interaction component 1001; the N service components 1002 are selected by the access party from a service component set provided by the developer according to the access party's product requirements; each service component 1002 is used for providing at least one service to the terminal device, and N is a positive integer;
a custom acoustic model 1003 configured to collect audio data, the audio data being user speech input;
a voice interaction component 1001 configured to receive audio data collected by the custom acoustic model;
a voice interaction component 1001 further configured to transmit the audio data to a server, the audio data being used to instruct the server to perform audio processing and generate response data matching the audio data;
the voice interaction component 1001 is further configured to send the response data returned by the server to the first service component; in response to the user voice input being a task-type question, the response data is used to trigger the first business component to perform a target operation indicated by the user voice input.
For intelligent voice interaction, the developer can implement the voice interaction component and various service components in advance, so that the access party can conveniently combine and access the service components freely according to its own product requirements, thereby forming its own terminal device; that is, access parties can freely combine the business components to form their own product solutions. Facing the various service components provided by the developer, the access party is free to choose which to access and which not to access. In other words, the access party may define the functions of its terminal device according to its product requirements. This componentized solution based on voice interaction can be conveniently applied to various intelligent terminal devices, such as IoT (Internet of Things) devices, devices with a screen, devices without a screen, and the like; access is simple, device customization is strong, the access cycle of the access party can be shortened as much as possible, development costs are saved, and flexibility is high. In summary, the embodiment of the application makes flexible and simple intelligent voice interaction possible for the access party.
In one possible implementation, the SDK includes: voice recognition SDK, voice synthesis SDK and character recognition SDK;
the path of the acoustic model is arranged under the voice interaction assembly, and the voice interaction assembly externally provides an audio data receiving interface for awakening the terminal equipment.
In one possible implementation, the audio data is used to instruct the server to perform the following audio processing:
performing semantic analysis on the audio data, and acquiring semantic skill data of the audio data based on a semantic analysis result, wherein the semantic skill data comprises: the question intention, the knowledge domain to which the question belongs, the question text and the response data.
In one possible implementation, the first business component is configured to display the response data in non-voice form in response to the user voice input not being a task-type question;
the voice interaction component further configured to play the response data in voice form in response to the user voice input not being a task-type question;
the voice interaction component is further configured to display the response data in a non-voice form in response to the user voice input not being a task-type question and the terminal device not integrating the first business component.
In one possible implementation, a long connection is established between the voice interaction component and the server;
the voice interaction component is further configured to receive a push message issued by the server based on the long connection, and to notify a second service component, in a directional broadcast manner, to receive the push message, where the second service component has registered a callback function or a listener with the voice interaction component in advance.
In one possible implementation, the voice interaction component is configured to notify the first service component, in a directional broadcast manner, to receive the response data, where the first service component has registered a callback function or a listener with the voice interaction component in advance.
In one possible implementation, the apparatus further includes:
the microphone comprises a sound source positioning module, a noise source positioning module and a processing module, wherein the sound source positioning module is configured to acquire a first voice signal acquired by a first microphone, and the first voice signal comprises a first sound source signal and a first noise signal; acquiring a second voice signal acquired by a second microphone, wherein the second voice signal comprises a second sound source signal and a second noise signal; acquiring cross-power spectrums of the first voice signal and the second voice signal on a frequency domain; transforming the cross-power spectrum from a frequency domain to a time domain to obtain a cross-correlation function; determining a time value corresponding to the maximum cross-correlation value as a propagation delay, wherein the propagation delay is an arrival time difference of the voice signal between the first microphone and the second microphone; and carrying out sound source positioning based on the propagation delay, wherein the first microphone and the second microphone are from a microphone array of the terminal equipment.
In one possible implementation, the apparatus further includes:
an echo cancellation module, configured to perform echo cancellation processing on the voice signal received by the microphone array based on a first filter; where the filter function of the first filter approximates, as closely as possible, the impulse response from the loudspeaker to the microphone array; the voice signal received by the microphone array is determined according to the sound source signal, the noise signal, the voice signal played by the loudspeaker, and the impulse response.
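A sketch of the signal model implied by this description, in our own notation: y(n) is the microphone signal, s(n) the sound source, v(n) the noise, x(n) the loudspeaker playback, h the loudspeaker-to-microphone impulse response, and \hat{h} the first filter.

```latex
y(n) = s(n) + v(n) + (h * x)(n)                 % microphone observation
e(n) = y(n) - (\hat{h} * x)(n)                  % echo-cancelled output
\hat{h} \approx h \;\Rightarrow\; e(n) \approx s(n) + v(n)
```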
In one possible implementation, the apparatus further includes:
a reverberation cancellation module, configured to transform the voice signal received by the microphone array from the time domain to the frequency domain to obtain a frequency-domain signal, and to perform inverse filtering processing on the frequency-domain signal based on a second filter to recover the sound source signal; where the voice signal received by the microphone array is determined according to the sound source signal, the noise signal, and the room impulse response of the sound source.
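In our own notation, a sketch of the inverse-filtering model described above, where H(f) is the room transfer function (the frequency response of the room impulse response), N(f) the noise, and G(f) the second filter:

```latex
Y(f) = S(f)\,H(f) + N(f)                                   % observed signal in the frequency domain
\hat{S}(f) = G(f)\,Y(f), \qquad G(f) \approx H^{-1}(f)     % inverse filtering recovers the source
```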
All the above optional technical solutions may be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.
It should be noted that: in the human-computer interaction device provided in the above embodiment, only the division of the service components is exemplified when performing human-computer interaction, and in practical applications, the function distribution may be completed by different service components as needed, that is, the internal structure of the device is divided into different service components to complete all or part of the functions described above. In addition, the human-computer interaction device provided by the above embodiment and the human-computer interaction method embodiment belong to the same concept, and specific implementation processes thereof are described in the method embodiment and are not described herein again.
Fig. 11 shows a block diagram of a terminal device 1100 according to an exemplary embodiment of the present application. The terminal device 1100 may be a portable mobile terminal such as: a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. Terminal device 1100 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, and the like.
In general, the terminal device 1100 includes: a processor 1101 and a memory 1102.
Processor 1101 may include one or more processing cores, such as a 4-core processor, an 8-core processor, or the like. The processor 1101 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 1101 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 1101 may be integrated with a GPU (Graphics Processing Unit) that is responsible for rendering and rendering content that the display screen needs to display. In some embodiments, the processor 1101 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 1102 may include one or more computer-readable storage media, which may be non-transitory. Memory 1102 can also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1102 is used to store at least one program code for execution by processor 1101 to implement the human-computer interaction methods provided by method embodiments herein.
In some embodiments, the terminal device 1100 may further include: a peripheral interface 1103 and at least one peripheral. The processor 1101, memory 1102 and peripheral interface 1103 may be connected by a bus or signal lines. Various peripheral devices may be connected to the peripheral interface 1103 by buses, signal lines, or circuit boards. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1104, display screen 1105, camera assembly 1106, audio circuitry 1107, positioning assembly 1108, and power supply 1109.
The peripheral interface 1103 may be used to connect at least one peripheral associated with I/O (Input/Output) to the processor 1101 and the memory 1102. In some embodiments, the processor 1101, memory 1102, and peripheral interface 1103 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1101, the memory 1102 and the peripheral device interface 1103 may be implemented on separate chips or circuit boards, which is not limited by this embodiment.
The Radio Frequency circuit 1104 is used to receive and transmit RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 1104 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 1104 converts an electric signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electric signal. Optionally, the radio frequency circuit 1104 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 1104 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: the world wide web, metropolitan area networks, intranets, generations of mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the rf circuit 1104 may further include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display screen 1105 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1105 is a touch display screen, the display screen 1105 also has the ability to capture touch signals on or over the surface of the display screen 1105. The touch signal may be input to the processor 1101 as a control signal for processing. At this point, the display screen 1105 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display screen 1105 may be one, provided on the front panel of the terminal device 1100; in other embodiments, the display screens 1105 may be at least two, respectively disposed on different surfaces of the terminal device 1100 or in a folded design; in other embodiments, display 1105 may be a flexible display disposed on a curved surface or on a folded surface of terminal device 1100. Even further, the display screen 1105 may be arranged in a non-rectangular irregular pattern, i.e., a shaped screen. The Display screen 1105 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), and the like.
Camera assembly 1106 is used to capture images or video. Optionally, camera assembly 1106 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 1106 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
The audio circuitry 1107 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 1101 for processing or inputting the electric signals to the radio frequency circuit 1104 to achieve voice communication. For the purpose of stereo sound collection or noise reduction, a plurality of microphones may be provided at different positions of the terminal device 1100. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 1101 or the radio frequency circuit 1104 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, the audio circuitry 1107 may also include a headphone jack.
The positioning component 1108 is used to locate the current geographic position of the terminal device 1100 for navigation or LBS (Location Based Service). The positioning component 1108 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, or the Galileo system of the European Union.
Power supply 1109 is used to provide power to various components within terminal device 1100. The power supply 1109 may be alternating current, direct current, disposable or rechargeable. When the power supply 1109 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, terminal device 1100 also includes one or more sensors 1110. The one or more sensors 1110 include, but are not limited to: acceleration sensor 1111, gyro sensor 1112, pressure sensor 1113, fingerprint sensor 1114, optical sensor 1115, and proximity sensor 1116.
The acceleration sensor 1111 can detect the magnitude of acceleration on three coordinate axes of the coordinate system established with the terminal device 1100. For example, the acceleration sensor 1111 may be configured to detect components of the gravitational acceleration in three coordinate axes. The processor 1101 may control the display screen 1105 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 1111. The acceleration sensor 1111 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 1112 may detect a body direction and a rotation angle of the terminal device 1100, and the gyro sensor 1112 may cooperate with the acceleration sensor 1111 to acquire a 3D motion of the user on the terminal device 1100. From the data collected by gyroscope sensor 1112, processor 1101 may implement the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
Pressure sensor 1113 may be disposed on a side bezel of terminal device 1100 and/or underlying display screen 1105. When the pressure sensor 1113 is disposed on the side frame of the terminal device 1100, the holding signal of the user to the terminal device 1100 can be detected, and the processor 1101 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 1113. When the pressure sensor 1113 is disposed at the lower layer of the display screen 1105, the processor 1101 controls the operability control on the UI interface according to the pressure operation of the user on the display screen 1105. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 1114 is configured to collect a fingerprint of the user, and the processor 1101 identifies the user according to the fingerprint collected by the fingerprint sensor 1114, or the fingerprint sensor 1114 identifies the user according to the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, the user is authorized by the processor 1101 to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying for and changing settings, etc. The fingerprint sensor 1114 may be disposed on the front, back, or side of the terminal device 1100. When a physical key or vendor Logo is provided on the terminal device 1100, the fingerprint sensor 1114 may be integrated with the physical key or vendor Logo.
Optical sensor 1115 is used to collect ambient light intensity. In one embodiment, the processor 1101 may control the display brightness of the display screen 1105 based on the ambient light intensity collected by the optical sensor 1115. Specifically, when the ambient light intensity is high, the display brightness of the display screen 1105 is increased; when the ambient light intensity is low, the display brightness of the display screen 1105 is reduced. In another embodiment, processor 1101 may also dynamically adjust the shooting parameters of camera assembly 1106 based on the ambient light intensity collected by optical sensor 1115.
The proximity sensor 1116, also called a distance sensor, is usually provided on the front panel of the terminal device 1100. The proximity sensor 1116 is used to capture the distance between the user and the front face of the terminal device 1100. In one embodiment, when the proximity sensor 1116 detects that the distance between the user and the front face of the terminal device 1100 is gradually decreasing, the processor 1101 controls the display screen 1105 to switch from the screen-on state to the screen-off state; when the proximity sensor 1116 detects that the distance between the user and the front face of the terminal device 1100 is gradually increasing, the processor 1101 controls the display screen 1105 to switch from the screen-off state to the screen-on state.
Those skilled in the art will appreciate that the configuration shown in fig. 11 does not constitute a limitation of terminal device 1100, and may include more or fewer components than those shown, or may combine certain components, or may employ a different arrangement of components.
In an exemplary embodiment, a computer-readable storage medium, such as a memory including a program code, which is executable by a processor in a terminal to perform the human-computer interaction method in the above-described embodiments, is also provided. For example, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product or a computer program is also provided, which includes computer program code stored in a computer-readable storage medium, which is read by a processor of a terminal device from the computer-readable storage medium, and which is executed by the processor to cause the terminal device to execute the above-mentioned human-computer interaction method.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (15)

1. A man-machine interaction method is characterized in that the method is applied to terminal equipment, and the terminal equipment is integrated with a voice interaction component, N service components and a user-defined acoustic model provided by an access party; a Software Development Kit (SDK) related to voice interaction is packaged in the voice interaction component; the N business components are selected from a business component set provided by a developer by the access party according to the product requirements of the access party; one of the service components is used for providing at least one service for the terminal equipment, and N is a positive integer;
the method comprises the following steps:
receiving audio data collected by the user-defined acoustic model through the voice interaction component, wherein the audio data is input by a user;
sending the audio data to a server through the voice interaction component, wherein the audio data is used for instructing the server to execute audio processing and generate response data matched with the audio data;
sending the response data returned by the server to a first service component through the voice interaction component; in response to the user voice input being a task-type question, the response data is used to trigger the first business component to perform a target operation indicated by the user voice input.
2. The method of claim 1, wherein the SDK comprises: a voice recognition SDK, a voice synthesis SDK, and a character recognition SDK;
the path of the acoustic model is arranged under the voice interaction component, and the voice interaction component externally provides an audio data receiving interface for waking up the terminal device.
3. The method of claim 1, wherein the audio data is used to instruct the server to perform the following audio processing:
performing semantic analysis on the audio data, and acquiring semantic skill data of the audio data based on a semantic analysis result, wherein the semantic skill data comprises: the question intention, the knowledge domain to which the question belongs, the question text and the response data.
4. The method of claim 1, further comprising:
in response to the user voice input not being a task-type question, displaying, by the first business component, the response data in non-voice form; or
in response to the user voice input not being a task-type question, playing the response data in voice form through the voice interaction component; or
in response to the user voice input not being a task-type question and the terminal device not integrating the first business component, displaying the response data in non-voice form through the voice interaction component.
5. The method of claim 1, wherein a long connection is established between the voice interaction component and the server; the method further comprises the following steps:
receiving, by the voice interaction component, a push message issued by the server based on the long connection;
and informing a second service component to receive the push message in a directional broadcast mode through the voice interaction component, wherein the second service component registers a callback function or a listener with the voice interaction component in advance.
6. The method of claim 1, wherein the sending the response data returned by the server to a first service component through the voice interaction component comprises:
and informing the first service component to receive the response data in a directional broadcast mode through the voice interaction component, wherein the first service component registers a callback function or a listener with the voice interaction component in advance.
7. The method of claim 1, wherein during the capturing of audio data, the method further comprises:
acquiring a first voice signal acquired by a first microphone, wherein the first voice signal comprises a first sound source signal and a first noise signal;
acquiring a second voice signal acquired by a second microphone, wherein the second voice signal comprises a second sound source signal and a second noise signal;
acquiring cross-power spectrums of the first voice signal and the second voice signal on a frequency domain;
transforming the cross-power spectrum from a frequency domain to a time domain to obtain a cross-correlation function;
determining a time value corresponding to the maximum cross-correlation value as a propagation delay, wherein the propagation delay is an arrival time difference of the voice signal between the first microphone and the second microphone;
and carrying out sound source positioning based on the propagation delay, wherein the first microphone and the second microphone are from a microphone array of the terminal equipment.
8. The method of claim 1, wherein during the capturing of audio data, the method further comprises:
based on a first filter, carrying out echo cancellation processing on a voice signal received by a microphone array;
wherein the filter function of the first filter approximates, as closely as possible, the impulse response from the loudspeaker to the microphone array; the voice signal received by the microphone array is determined according to a sound source signal, a noise signal, a voice signal played by the loudspeaker and the impulse response.
9. The method of claim 1, wherein during the capturing of audio data, the method further comprises:
converting a voice signal received by a microphone array from a time domain to a frequency domain to obtain a frequency domain signal;
carrying out inverse filtering processing on the frequency domain signal based on a second filter to recover a sound source signal;
wherein the speech signal received by the microphone array is determined according to a sound source signal, a noise signal and a room impulse response of the sound source.
10. A man-machine interaction device is characterized in that a voice interaction component, N service components and a custom acoustic model provided by an access party are integrated in terminal equipment; a Software Development Kit (SDK) related to voice interaction is packaged in the voice interaction component; the N business components are selected from a business component set provided by a developer by the access party according to the product requirements of the access party; one of the service components is used for providing at least one service for the terminal equipment, and N is a positive integer;
the custom acoustic model is configured to collect audio data, the audio data being user speech input;
the voice interaction component is configured to receive audio data collected by the custom acoustic model;
the voice interaction component is further configured to transmit the audio data to a server, wherein the audio data is used for instructing the server to execute audio processing and generate response data matched with the audio data;
the voice interaction component is also configured to send the response data returned by the server to the first service component; in response to the user voice input being a task-type question, the response data is used to trigger the first business component to perform a target operation indicated by the user voice input.
11. The apparatus of claim 10, wherein the SDK comprises: a voice recognition SDK, a voice synthesis SDK, and a character recognition SDK;
the path of the acoustic model is arranged under the voice interaction component, and the voice interaction component externally provides an audio data receiving interface for waking up the terminal device.
12. The apparatus of claim 10, wherein the audio data is configured to instruct the server to perform the following audio processing:
performing semantic analysis on the audio data, and acquiring semantic skill data of the audio data based on a semantic analysis result, wherein the semantic skill data comprises: the question intention, the knowledge domain to which the question belongs, the question text and the response data.
13. The apparatus of claim 10,
the first business component configured to display the response data in non-voice form in response to the user voice input not being a task-based question;
the voice interaction component further configured to play the response data in voice form in response to the user voice input not being a task-type question;
the voice interaction component is further configured to display the response data in a non-voice form in response to the user voice input not being a task-type question and the terminal device not integrating the first business component.
14. A terminal device, characterized in that it comprises a processor and a memory, in which at least one program code is stored, which is loaded and executed by the processor to implement the human-computer interaction method according to any one of claims 1 to 9.
15. A computer-readable storage medium, having stored therein at least one program code, which is loaded and executed by a processor, to implement the human-computer interaction method according to any one of claims 1 to 9.
CN202011202667.1A 2020-11-02 2020-11-02 Man-machine interaction method and device, storage medium and terminal equipment Pending CN113409805A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011202667.1A CN113409805A (en) 2020-11-02 2020-11-02 Man-machine interaction method and device, storage medium and terminal equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011202667.1A CN113409805A (en) 2020-11-02 2020-11-02 Man-machine interaction method and device, storage medium and terminal equipment

Publications (1)

Publication Number Publication Date
CN113409805A true CN113409805A (en) 2021-09-17

Family

ID=77677452

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011202667.1A Pending CN113409805A (en) 2020-11-02 2020-11-02 Man-machine interaction method and device, storage medium and terminal equipment

Country Status (1)

Country Link
CN (1) CN113409805A (en)


Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050261907A1 (en) * 1999-04-12 2005-11-24 Ben Franklin Patent Holding Llc Voice integration platform
CN101878637A (en) * 2007-11-29 2010-11-03 艾利森电话股份有限公司 A method and arrangement for echo cancellation of voice signals
DE102012219019A1 (en) * 2011-10-21 2013-04-25 GM Global Technology Operations LLC (n.d. Ges. d. Staates Delaware) Mobile voice platform for user speech interface used for providing e.g. cloud service to driver of e.g. sports utility vehicle, receives service result from desired service, and provides text-based service response to user
US20130102295A1 (en) * 2011-10-21 2013-04-25 GM Global Technology Operations LLC Mobile voice platform architecture with remote service interfaces
CN103095325A (en) * 2011-10-21 2013-05-08 通用汽车环球科技运作有限责任公司 Mobile voice platform architecture with remote service interfaces
CN102693729A (en) * 2012-05-15 2012-09-26 北京奥信通科技发展有限公司 Customized voice reading method, system, and terminal possessing the system
CN108632653A (en) * 2018-05-30 2018-10-09 腾讯科技(深圳)有限公司 Voice management-control method, smart television and computer readable storage medium
CN109887505A (en) * 2019-03-11 2019-06-14 百度在线网络技术(北京)有限公司 Method and apparatus for wake-up device
CN111833857A (en) * 2019-04-16 2020-10-27 阿里巴巴集团控股有限公司 Voice processing method and device and distributed system
CN110265012A (en) * 2019-06-19 2019-09-20 泉州师范学院 It can interactive intelligence voice home control device and control method based on open source hardware
CN110856023A (en) * 2019-11-15 2020-02-28 四川长虹电器股份有限公司 System and method for realizing customized broadcast of smart television based on TTS
CN111031141A (en) * 2019-12-24 2020-04-17 苏州思必驰信息科技有限公司 Method and server for realizing customized configuration of voice skills
CN111142833A (en) * 2019-12-26 2020-05-12 苏州思必驰信息科技有限公司 Method and system for developing voice interaction product based on contextual model
CN111063353A (en) * 2019-12-31 2020-04-24 苏州思必驰信息科技有限公司 Client processing method allowing user-defined voice interactive content and user terminal
CN111341345A (en) * 2020-05-21 2020-06-26 深圳市友杰智新科技有限公司 Control method and device of voice equipment, voice equipment and storage medium
CN111816168A (en) * 2020-07-21 2020-10-23 腾讯科技(深圳)有限公司 Model training method, voice playing method, device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WeChat Geek (WEGEEK): "With 'Tencent Xiaowei', every terminal can have an intelligent voice assistant", Retrieved from the Internet <URL:https://segmentfault.com/a/1190000022031626> *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114047900A (en) * 2021-10-12 2022-02-15 中电金信软件有限公司 Service processing method and device, electronic equipment and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN110379430B (en) Animation display method and device based on voice, computer equipment and storage medium
CN113965807B (en) Message pushing method, device, terminal, server and storage medium
CN111031386B (en) Video dubbing method and device based on voice synthesis, computer equipment and medium
CN111445901B (en) Audio data acquisition method and device, electronic equipment and storage medium
CN111739517B (en) Speech recognition method, device, computer equipment and medium
CN111246227A (en) Bullet screen publishing method and equipment
CN111524501A (en) Voice playing method and device, computer equipment and computer readable storage medium
CN109243479B (en) Audio signal processing method and device, electronic equipment and storage medium
CN111343346B (en) Incoming call pickup method and device based on man-machine conversation, storage medium and equipment
CN111613213B (en) Audio classification method, device, equipment and storage medium
CN112052354A (en) Video recommendation method, video display method and device and computer equipment
CN110798327B (en) Message processing method, device and storage medium
CN111681655A (en) Voice control method and device, electronic equipment and storage medium
CN111986691A (en) Audio processing method and device, computer equipment and storage medium
CN114333774A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN111428079B (en) Text content processing method, device, computer equipment and storage medium
CN113190307A (en) Control adding method, device, equipment and storage medium
CN113409805A (en) Man-machine interaction method and device, storage medium and terminal equipment
CN111554314A (en) Noise detection method, device, terminal and storage medium
CN110045958B (en) Texture data generation method, device, storage medium and equipment
CN112764600A (en) Resource processing method, device, storage medium and computer equipment
CN115331689A (en) Training method, device, equipment, storage medium and product of voice noise reduction model
CN111245629B (en) Conference control method, device, equipment and storage medium
CN113763932A (en) Voice processing method and device, computer equipment and storage medium
CN112750449A (en) Echo cancellation method, device, terminal, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40051737

Country of ref document: HK

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination