CN116030790A - Distributed voice control method and electronic equipment - Google Patents

Distributed voice control method and electronic equipment

Info

Publication number
CN116030790A
CN116030790A (application CN202111234615.7A)
Authority
CN
China
Prior art keywords
model
information
terminal
voice
voice information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111234615.7A
Other languages
Chinese (zh)
Inventor
孟亚洲
兰国兴
白立勋
俞清华
石巍巍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202111234615.7A
Priority to PCT/CN2022/116804 (published as WO2023065854A1)
Publication of CN116030790A

Classifications

    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/26 Speech to text systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • H04L12/28 Data switching networks characterised by path configuration, e.g. LAN [Local Area Networks] or WAN [Wide Area Networks]
    • H04N21/41 Structure of client; Structure of client peripherals
    • H04N21/422 Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N21/4415 Acquiring end-user identification using biometric characteristics of the user, e.g. by voice recognition or fingerprint scanning
    • H04N21/485 End-user interface for client configuration
    • Y02P90/02 Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephone Function (AREA)

Abstract

A distributed voice control method and an electronic device relate to the field of terminal technologies and can improve the efficiency of voice control. The method includes: a first terminal, in response to voice information input by a user, inputs the voice information into a first model and obtains, through the first model, feature information corresponding to the voice information; the first terminal sends the feature information to a second terminal, so that the second terminal inputs the feature information into a second model, determines, through the second model, operation information corresponding to the voice information, and performs a corresponding operation according to the operation information. The first model exists in the first terminal, and the second model exists in the second terminal.

Description

Distributed voice control method and electronic equipment
Technical Field
The application relates to the technical field of terminals, in particular to a distributed voice control method and electronic equipment.
Background
With the popularity of smart devices, users increasingly use a variety of smart devices in various smart scenarios, among which is the voice control scenario. In a voice control scenario, one electronic device can voice-control the other devices in a distributed voice control system. For example, in the scenario shown in fig. 1, a user inputs the voice information "turn on the television" to a mobile phone; the mobile phone parses the operation information represented by the voice information (i.e., that the user wants to turn on the television), generates a control signal, and sends the control signal to the television so as to turn the television on.
In some solutions, the mobile phone parses the user's voice information by means of a machine learning model. However, because different devices may come from different vendors, whenever a new type of device, or a device from a new vendor, establishes a wireless connection with the mobile phone, the mobile phone vendor typically needs to retrain the machine learning model so that it can correctly parse the voice information used to control that device. In the prior art, this frequent retraining imposes a large development burden on mobile phone vendors, who must keep retraining and maintaining the entire model over time. In addition, the model running on the mobile phone is complex, its load is heavy, its processing delay is high, and the resulting voice control efficiency is low.
Disclosure of Invention
The distributed voice control method and the electronic device provided by the embodiments of the present application can improve the efficiency of voice control.
In order to achieve the above purpose, the embodiment of the present application provides the following technical solutions:
A first aspect provides a distributed voice control method, which can be applied to a first terminal or to a component (such as a chip system) capable of implementing the functions of the first terminal. The first terminal, in response to voice information input by a user, inputs the voice information into a first model and obtains, through the first model, feature information corresponding to the voice information, where the first model exists in the first terminal. The first terminal sends the feature information to a second terminal, so that the feature information is input into a second model in the second terminal, operation information corresponding to the voice information is determined through the second model, and a corresponding operation is performed according to the operation information, where the second model exists in the second terminal.
In the prior art, the first terminal (such as a mobile phone) has to complete the whole process from extracting features of the voice to identifying the operation information, so the computation load on the first terminal is large and the voice control efficiency is low. In contrast, the technical solution of the present application decouples feature extraction from operation information identification in a voice control scenario such as controlling smart home devices. For example, a complete voice control model may be split into at least a first model and a second model. The first model exists in the first terminal, and the first terminal extracts the feature information corresponding to the voice information through the first model. The second model exists in the second terminal (such as each smart home device controlled by the mobile phone), and the second terminal identifies the operation information through the second model. Because the first terminal no longer performs every step of voice control, such as the identification of operation information, its computation load is reduced, its running speed can be improved, and the efficiency of voice control is improved accordingly.
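To make the split concrete, the following sketch separates a voice control network into a terminal-side feature extractor (the first model) and an appliance-side operation head (the second model). It is only an illustration of the idea under assumed conditions: the PyTorch framework, the class names, the layer sizes, and the number of operations are assumptions and are not taken from the patent.

    # Illustrative split of a voice control model (assumed PyTorch; names and sizes are hypothetical).
    import torch
    import torch.nn as nn

    class FirstModel(nn.Module):
        """Runs on the first terminal: maps a mel spectrogram to feature information."""
        def __init__(self, n_mels: int = 80, feat_dim: int = 128):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv1d(n_mels, feat_dim, kernel_size=5, padding=2), nn.ReLU(),
                nn.Conv1d(feat_dim, feat_dim, kernel_size=5, padding=2), nn.ReLU(),
            )

        def forward(self, mel: torch.Tensor) -> torch.Tensor:
            # mel: (batch, n_mels, frames) -> feature information: (batch, feat_dim, frames)
            return self.encoder(mel)

    class SecondModel(nn.Module):
        """Runs on each second terminal: maps feature information to operation scores."""
        def __init__(self, feat_dim: int = 128, n_operations: int = 8):
            super().__init__()
            self.head = nn.Sequential(
                nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                nn.Linear(feat_dim, n_operations),
            )

        def forward(self, features: torch.Tensor) -> torch.Tensor:
            # one score per candidate operation; argmax gives the predicted operation information
            return self.head(features)

Only FirstModel runs on the first terminal; the feature tensor it produces is what is sent out, and each second terminal runs its own SecondModel trained for the operations it supports.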
In one possible design, the first model is a model trained based on at least one first sample data, the first sample data including first speech information, the characteristic information of the first speech information being known; and/or the second model is a model trained based on at least one second sample data, the second sample data including first characteristic information, the operation information corresponding to the first characteristic information being known.
In one possible design, the first terminal, the at least one second terminal are in the same local area network;
or the first terminal and the at least one second terminal are in different local area networks.
In one possible design, the first terminal sends the feature information to the second terminal, including: the first terminal broadcasts the characteristic information to the second terminal.
In one possible design, the characteristic information corresponding to the voice information includes a sound spectrum corresponding to the voice information and phonemes of the sound spectrum.
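As an illustration of the "sound spectrum" part of this characteristic information, the sketch below computes a mel spectrogram from a recorded voice clip; phoneme-level features would typically be produced by an acoustic model such as the first model itself. The use of torchaudio and the parameter values are assumptions, not requirements of the patent.

    # Computing a spectrogram as one possible form of the characteristic information (torchaudio is an assumption).
    import torchaudio

    def voice_to_spectrum(wav_path: str):
        waveform, sample_rate = torchaudio.load(wav_path)   # waveform: (channels, samples)
        mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=400, hop_length=160, n_mels=80
        )(waveform)
        return mel                                          # (channels, n_mels, frames)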
A second aspect provides a distributed speech control method, the method comprising:
the second terminal receives, from the first terminal, characteristic information corresponding to voice information, where the characteristic information is obtained by the first terminal inputting the voice information into a first model, and the first model exists in the first terminal;
the second terminal inputs the characteristic information into a second model, and determines operation information corresponding to the voice information through the second model, wherein the second model exists in the second terminal;
and the second terminal performs a corresponding operation according to the operation information.
In one possible design, the second terminal performs a corresponding operation according to the operation information, including:
if the operation information corresponding to the voice information is operation information matched with the second terminal, the second terminal performs the target operation according to the operation information corresponding to the voice information; and/or,
if the operation information corresponding to the voice information is not operation information matched with the second terminal, the second terminal discards the operation information.
In one possible design, the first model is a model trained based on at least one first sample data, the first sample data including first speech information, characteristic information of the first speech information being known; and/or the second model is a model trained based on at least one second sample data, wherein the second sample data comprises first characteristic information, and operation information corresponding to the first characteristic information is known.
In one possible design, the first terminal and the second terminal are in the same local area network, or the first terminal and the second terminal are in different local area networks.
In one possible design, the characteristic information corresponding to the voice information includes a sound spectrum corresponding to the voice information and phonemes of the sound spectrum.
A third aspect provides a speech recognition method that may be applied to a first terminal or a component (such as a system-on-a-chip) implementing a function of the first terminal. Taking the first terminal as an example, the method comprises the following steps:
the method comprises the steps that a first terminal receives first voice information of a first language input by a user;
the first terminal responds to the first voice information, inputs the first voice information into a first model, and obtains feature information corresponding to the first voice information through the first model; the first model exists at the first terminal;
the first terminal sends the characteristic information to a second terminal so that the second terminal inputs the characteristic information into a second model, subtitle information corresponding to the first voice information is determined through the second model, and the second model exists in the second terminal.
In one possible design, the first model is a model trained based on at least one first sample data, the first sample data including first speech information, characteristic information of the first speech information being known; and/or the second model is a model trained based on at least one second sample data, wherein the second sample data comprises first characteristic information, and operation information corresponding to the first characteristic information is known.
In one possible design, the caption information is caption information in a second language.
In one possible design, the first language is different from the second language.
The method can be applied to scenarios in which voice is converted into subtitles, such as a teleconference: the second terminal may need to generate subtitles from the voice information of the speaker at the first terminal and display the subtitles on a screen, so that the speaker's content can be followed more clearly. Further, when the second terminal has enabled a voice translation function, the second terminal can translate the first voice information of the speaker using the first terminal (for example, English voice information) into subtitles in a corresponding language (for example, Chinese) according to the characteristic information of the first voice information, so that the user of the second terminal can better understand what the speaker at the far end means.
In addition, because the conversion of voice into subtitles is realized jointly by the first terminal and the second terminal, the first terminal does not need to be responsible for converting the voice information into the corresponding operation information; its computation load is therefore reduced, its running speed can be improved, and the efficiency of converting voice into subtitles is further improved.
In one possible design, the second terminal includes a terminal that turns on a speech translation function.
In one possible design, the first terminal sends the feature information to a second terminal, including: the first terminal broadcasts the characteristic information.
A fourth aspect provides a speech recognition method that may be applied to a second terminal or a component (such as a system-on-a-chip) implementing a function of the second terminal. Taking the second terminal as an example, the method comprises the following steps:
the second terminal receives characteristic information corresponding to the first voice information; the first voice information is voice information of a first language;
the second terminal inputs the characteristic information into a second model, and determines subtitle information corresponding to the first voice information through the second model; the second model exists at the second terminal.
In one possible design, the first model is a model trained based on at least one first sample data, the first sample data including first speech information, characteristic information of the first speech information being known; and/or the second model is a model trained based on at least one second sample data, wherein the second sample data comprises first characteristic information, and operation information corresponding to the first characteristic information is known.
In one possible design, the caption information is caption information in a second language.
In one possible design, the first language is different from the second language.
In one possible design, the method further includes: determining second voice information in a second language corresponding to the first voice information, and playing the second voice information in the second language, where the first language is different from the second language. Similar to simultaneous interpretation, in this solution the second terminal may translate the first voice information of the speaker using the first terminal (for example, English voice information) into second voice information (for example, Chinese voice information) and play it, while also displaying subtitles in the corresponding language (for example, Chinese subtitles). Alternatively, the second terminal may play bilingual voice information while displaying bilingual subtitles; or play single-language voice information while displaying bilingual subtitles; or play bilingual voice information while displaying single-language subtitles. The technical solution of the present application is not limited in this respect.
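As a sketch of these presentation choices (not an implementation defined by the patent), the function below assumes that second_model yields subtitle text in the first language and that translate and tts stand in for a translation engine and a speech synthesizer; all three names are hypothetical.

    # Hypothetical presentation logic for the bilingual subtitle / voice options described above.
    def present(features, second_model, translate, tts,
                play_bilingual: bool = False, show_bilingual: bool = True):
        src_subtitle = second_model(features)        # subtitle information in the first language
        dst_subtitle = translate(src_subtitle)       # subtitle information in the second language
        subtitles = [src_subtitle, dst_subtitle] if show_bilingual else [dst_subtitle]
        voices = ([tts(src_subtitle), tts(dst_subtitle)] if play_bilingual
                  else [tts(dst_subtitle)])          # second voice information to play
        return subtitles, voices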
In one possible design, the second terminal includes a terminal that turns on a speech translation function.
In one possible design, the first terminal sends the feature information to the second terminal, including: the first terminal broadcasts the characteristic information.
A fifth aspect provides a first terminal, comprising:
the processing module is used for responding to the voice information input by the user, inputting the voice information into the first model and obtaining the characteristic information corresponding to the voice information through the first model; the first model exists at the first terminal;
and the communication module is used for sending the characteristic information to the second terminal so that the second terminal inputs the characteristic information into a second model, determining operation information corresponding to the voice information through the second model, and executing corresponding operation according to the operation information, wherein the second model exists in the second terminal.
In one possible design, the first model is a model trained based on at least one first sample data, the first sample data including first speech information, characteristic information of the first speech information being known; and/or the second model is a model trained based on at least one second sample data, wherein the second sample data comprises first characteristic information, and operation information corresponding to the first characteristic information is known.
In one possible design, the first terminal, the at least one second terminal are in the same local area network;
or the first terminal and the at least one second terminal are in different local area networks.
In one possible design, the communication module is configured to send the feature information to the second terminal, and includes: the first terminal broadcasts the characteristic information.
In one possible design, the characteristic information corresponding to the voice information includes a sound spectrum corresponding to the voice information and phonemes of the sound spectrum.
A sixth aspect provides a second terminal, comprising:
the communication module is used for receiving characteristic information corresponding to the voice information from the first terminal; the characteristic information is obtained by inputting voice information into a first model by a first terminal and through the first model; the first model exists at the first terminal;
the processing module is used for inputting the characteristic information into the second model and determining operation information corresponding to the voice information through the second model; the second model exists at the second terminal;
and the processing module is used for executing corresponding operation according to the operation information.
In one possible design, the second terminal performs a corresponding operation according to the operation information, including:
if the operation information corresponding to the voice information is the operation information matched with the second terminal, the second terminal executes the target operation according to the operation information corresponding to the voice information; and/or if the operation information corresponding to the voice information is not the operation information matched with the second terminal, discarding the operation information by the second terminal.
In one possible design, the first model is a model trained based on at least one first sample data, the first sample data including first speech information, characteristic information of the first speech information being known; and/or the second model is a model trained based on at least one second sample data, wherein the second sample data comprises first characteristic information, and operation information corresponding to the first characteristic information is known.
In one possible design, the first terminal and the second terminal are in the same local area network, or the first terminal and the second terminal are in different local area networks.
In one possible design, the characteristic information corresponding to the voice information includes a sound spectrum corresponding to the voice information and phonemes of the sound spectrum.
A seventh aspect provides a first terminal, comprising:
the input module is used for receiving first voice information of a first language input by a user;
the processing module is used for responding to the first voice information, inputting the first voice information into a first model and obtaining characteristic information corresponding to the first voice information through the first model; the first model exists at the first terminal;
the communication module is used for sending the characteristic information to a second terminal so that the characteristic information is input into a second model by the second terminal, and subtitle information corresponding to the first voice information is determined through the second model; the second model exists at the second terminal.
In one possible design, the first model is a model trained based on at least one first sample data, the first sample data including first speech information, characteristic information of the first speech information being known; and/or the second model is a model trained based on at least one second sample data, wherein the second sample data comprises first characteristic information, and operation information corresponding to the first characteristic information is known.
In one possible design, the caption information is caption information in a second language.
In one possible design, the first language is different from the second language.
In one possible design, the second terminal includes a terminal that turns on a speech translation function.
In one possible design, the communication module is configured to send the feature information to the second terminal, and includes: broadcasting the characteristic information.
An eighth aspect provides a second terminal, comprising:
the input module is used for receiving characteristic information corresponding to the first voice information; the first voice information is voice information of a first language; the characteristic information is obtained by the first terminal inputting first voice information into a first model; the first model exists at the first terminal;
And the processing module is used for inputting the characteristic information into a second model and determining, through the second model, subtitle information corresponding to the first voice information, where the second model exists in the second terminal.
In one possible design, the first model is a model trained based on at least one first sample data, the first sample data including first speech information, characteristic information of the first speech information being known; and/or the second model is a model trained based on at least one second sample data, wherein the second sample data comprises first characteristic information, and operation information corresponding to the first characteristic information is known.
In one possible design, the caption information is caption information in a second language.
In one possible design, the first language is different from the second language.
In one possible design, the processing module is further configured to determine second speech information in a second language corresponding to the first speech information;
and the output module is used for playing the second voice information of the second language. Wherein the first language is different from the second language.
In one possible design, the second terminal includes a terminal that turns on a speech translation function.
In one possible design, the communication module is configured to send the feature information to the second terminal, and includes: broadcasting the characteristic information.
A ninth aspect provides an electronic device having the functionality to implement the distributed voice control method according to any one of the above aspects and any possible implementation thereof. The functions may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the functions described above.
A tenth aspect provides a computer readable storage medium comprising computer instructions which, when run on an electronic device, cause the electronic device to perform the distributed speech control method of any of the aspects and any of the possible implementations thereof.
An eleventh aspect provides a computer program product for, when run on an electronic device, causing the electronic device to perform the distributed speech control method as in any of the aspects and any of the possible implementations thereof.
A twelfth aspect provides circuitry comprising processing circuitry configured to perform the distributed speech control method of any of the aspects and any of the possible implementations described above.
A thirteenth aspect provides a first terminal comprising: a display screen; one or more processors; one or more memories; the memory stores one or more programs that, when executed by the processor, cause the first terminal to perform the method of any of the above aspects.
A fourteenth aspect provides a second terminal, comprising: a display screen; one or more processors; one or more memories; the memory stores one or more programs that, when executed by the processor, cause the second terminal to perform the method of any one of the above aspects.
A fifteenth aspect provides a system on a chip comprising at least one processor and at least one interface circuit for performing transceiving functions and for sending instructions to the at least one processor, the at least one processor performing the distributed speech control method as in any of the above aspects and any of the possible implementations thereof when the at least one processor executes the instructions.
Drawings
Fig. 1 is a schematic flow chart of a voice control method according to an embodiment of the present application;
fig. 2A and fig. 2B are schematic flow diagrams of a voice control method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a system architecture according to an embodiment of the present disclosure;
fig. 4 and fig. 5 are schematic structural diagrams of an electronic device according to an embodiment of the present application;
fig. 6 to fig. 8 are schematic flow diagrams of a voice control method according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a training method of a first model according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a training method of a second model according to an embodiment of the present disclosure;
fig. 11 and fig. 12 are schematic flow diagrams of a face recognition method according to an embodiment of the present application;
fig. 13 is a flow chart of a speech information translation method according to an embodiment of the present application;
fig. 14 is a flow chart of a voice control method according to an embodiment of the present application;
FIG. 15 is a schematic view of an apparatus provided in an embodiment of the present application;
fig. 16 is a schematic diagram of a chip system according to an embodiment of the present application.
Detailed Description
Fig. 2A shows a conventional speech recognition procedure. Take as an example a user who controls a television to turn up its volume through mobile phone voice: the mobile phone inputs the voice information "turn up the volume of the television" into a voice boundary detection (voice activity detection, VAD) model, the VAD model intercepts the human voice (speech) in the audio, and that speech is used as the input of an automatic speech recognition (automatic speech recognition, ASR) model. The ASR model converts the input sound signal into text and outputs the text. The text is then converted into the corresponding user operation information through a natural language understanding (natural language understanding, NLU) model or through regular matching. The mobile phone then generates a control signal according to the user operation information (i.e., turn up the volume of the television) and sends the control signal to the television, and the television turns up its volume according to the control signal.
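The control flow of that pipeline can be summarized with the sketch below; vad, asr and nlu are placeholders for the three models (they are not APIs defined by the patent), and only the wiring between the stages is meant to be illustrative.

    # Conventional on-phone pipeline of fig. 2A; the three stage callables are placeholders.
    def conventional_pipeline(audio, vad, asr, nlu):
        speech = vad(audio)   # voice activity detection: keep only the voiced segments
        text = asr(speech)    # automatic speech recognition: sound signal -> text
        return nlu(text)      # language understanding / regular matching: text -> operation information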
In the implementation corresponding to fig. 2A, if a new type of device (for example, a device from a different manufacturer than the mobile phone) is connected to the mobile phone, the mobile phone manufacturer may retrain the NLU model used for speech recognition, or update the regular matching, to keep the mobile phone compatible with the new device. The retrained NLU model or updated regular matching may be packaged in the installation package of the application used for voice control (for example, an application for managing smart home devices), so that the user can download the new version of the application to the mobile phone by updating it, and the new application can then use the relevant model to process artificial intelligence tasks (such as speech recognition tasks). Illustratively, the mobile phone is currently connected to a television and a speaker, and controls them through a smart home APP. The mobile phone detects that a new type of device (such as a smart desk lamp) has established a connection with it, and reports the information of the detected new device to the server. After the mobile phone manufacturer learns from the server that a new type of device is connected to the mobile phone, it retrains the NLU model. Once trained, the model can be packaged in the installation package of the smart home APP, and the updated smart home APP is stored on the server. The user can then download the updated smart home APP to the mobile phone and, through the updated APP, control the newly added smart desk lamp in the network; for example, the user can turn the smart desk lamp on or off, or adjust its brightness, through voice information.
From the user's perspective, in the current technical solution, frequent model retraining by mobile phone manufacturers means that users need to update the application frequently, which results in a poor user experience. From the mobile phone's perspective, in the current technical solution, processing a speech recognition task generally requires the mobile phone to complete everything up to and including operation information identification, so the load on the mobile phone is high, the processing delay is high, and the voice control efficiency is low.
Fig. 2B shows another existing speech recognition solution. In this solution, a spoken language understanding (spoken language understanding, SLU) model is used instead of the ASR model and the NLU model (or regular matching) described above. The SLU model can convert the sound signal directly into user operation information. Although this solution converts the voice signal directly into user operation information, the SLU model still has to be retrained whenever a new type of device is detected to have connected to the mobile phone, so the later model maintenance cost remains high. Secondly, as the types and number of devices connected to the mobile phone increase, the SLU model has to recognize more and more operation information, which requires a more complex model structure and slows down the mobile phone. In addition, the SLU model requires precise voice commands as input and is prone to misrecognition during casual everyday conversation.
In the technical solutions of fig. 2A and fig. 2B, the mobile phone has to complete multiple tasks including operation information identification, so its load is high, and every time a new type of device is detected to have connected to the mobile phone, the mobile phone manufacturer has to re-develop and train a new neural network to match it. It can be seen that in the existing speech recognition solutions, the load on the mobile phone is high and the processing delay is high, so the efficiency of voice control is low.
In order to improve the efficiency of voice control, an embodiment of the present application provides a speech recognition method. The method is applicable to systems requiring voice control. Fig. 3 shows an example of a system architecture provided in an embodiment of the present application; the system includes one or more electronic devices, such as the electronic device 100 and the electronic devices 200 (e.g., smart home devices 1-3).
Connection relationships can be established between the electronic devices. Optionally, the manner in which a connection is established between devices includes, but is not limited to, one or more of the following: establishing a communication connection by scanning a two-dimensional code or bar code, establishing a connection via the wireless fidelity (wireless fidelity, Wi-Fi) protocol, Bluetooth, or another communication protocol, and establishing a connection via a near-field communication service. After the communication connection is established, data and/or signaling can be exchanged between the devices. The embodiments of the present application do not limit the manner in which a connection is established between electronic devices.
In some scenarios, one device can voice-control the other connected devices. Taking voice control of smart home devices through a mobile phone as an example, a user inputs the voice information "turn up the volume of the television" to the mobile phone 100. The mobile phone 100 extracts the characteristic information corresponding to the voice information and sends it to the smart home devices 1-3 connected to the mobile phone 100. Each smart home device processes the characteristic information to obtain the operation information corresponding to the voice information and judges, according to that operation information, whether it needs to respond. It should be understood that the operation information includes, but is not limited to, operation instructions and control instructions. Optionally, the operation information further includes a classification result obtained by the smart home device from the characteristic information, for example a classification over different operation instructions. A smart home device can perform the corresponding operation according to the operation information; different types of operation information (such as different control instructions) are used to control the smart home devices to perform different operations.
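One concrete way the mobile phone could deliver the characteristic information to every device on the local network is a UDP broadcast, as sketched below; the transport, port number and JSON serialization are illustrative assumptions and are not prescribed by the patent.

    # Hypothetical LAN broadcast of the characteristic information (UDP and JSON are assumptions).
    import json
    import socket

    def broadcast_feature_info(features, port: int = 50007):
        # features: a nested list of floats produced by the first model
        payload = json.dumps({"type": "feature_info", "data": features}).encode("utf-8")
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
            sock.sendto(payload, ("255.255.255.255", port))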
Specifically, after the smart home device 3 (the television) receives the characteristic information, it processes the characteristic information, determines that the operation information corresponding to the voice information is "turn up the volume of the television", and performs the operation corresponding to that operation information, i.e., turns up the volume. After the smart home device 1 (the desk lamp) receives the characteristic information from the mobile phone 100, it processes the characteristic information to obtain the operation information corresponding to the voice information and determines, from that operation information, not to perform any operation; optionally, the desk lamp discards the operation information. Similarly, the smart home device 2 (the air conditioner) does not perform any operation either. In this process, the step of identifying the operation information (operation instruction) is completed by each smart home device rather than by the mobile phone, which reduces the computation load on the mobile phone and improves the efficiency of the intelligent voice control flow.
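The appliance-side decision can be pictured with the sketch below; the OPERATIONS table, the device type strings and the assumption that the second model returns one score per operation are all hypothetical, chosen only to illustrate the match-or-discard behaviour.

    # Hypothetical appliance-side handling of received characteristic information.
    OPERATIONS = {
        0: ("television", "volume_up"),
        1: ("desk_lamp", "turn_on"),
        2: ("air_conditioner", "cool"),
    }

    def handle_feature_info(features, second_model, my_device_type: str):
        scores = second_model(features)                 # operation information: one score per operation
        device_type, action = OPERATIONS[int(scores.argmax())]
        if device_type != my_device_type:
            return None                                 # operation information does not match: discard it
        return action                                   # matched: perform the corresponding operation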
Optionally, the mobile phone extracting the characteristic information corresponding to the voice information can be implemented as follows: the mobile phone inputs the voice information into the first model, and the first model outputs the characteristic information corresponding to the voice information. The first model is used to convert voice information into the corresponding characteristic information.
Optionally, the smart home device processing the characteristic information from the mobile phone to obtain the operation information (such as a control instruction) corresponding to the voice information can be implemented as follows: the smart home device inputs the characteristic information from the mobile phone into the second model, and the second model outputs the operation information corresponding to the voice information. The second model is used to convert characteristic information into the corresponding operation information. The first model, the second model, the characteristic information, and so on are described in detail below.
In the embodiment of the present application, the electronic device may also be referred to as a terminal.
Optionally, the system further comprises one or more servers 300. The server may establish a connection with the electronic device. In some embodiments, the electronic devices may be connected by a server. For example, in the system shown in fig. 1, the mobile phone 100 may remotely control the smart home device through the server 300.
In some embodiments, the first model and the second model may be trained by the server 300, and after the server 300 trains the first model and the second model, the trained first model and second model may be issued to each terminal. In other embodiments, the first model, the second model may be trained by the terminal, such as by a cell phone.
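The two training stages implied above can be sketched as follows; the choice of losses, the optimizer interface and the data loaders are assumptions (the patent only states that the labels, i.e. the known characteristic information and the known operation information, are available), and second_model is assumed to output one score per operation as in the earlier sketch.

    # Hypothetical training of the first and second models (e.g., on the server 300).
    import torch.nn.functional as F

    def train_first_model(first_model, loader, optimizer):
        # each sample: (first voice information, known characteristic information)
        for voice, target_features in loader:
            loss = F.mse_loss(first_model(voice), target_features)
            optimizer.zero_grad(); loss.backward(); optimizer.step()

    def train_second_model(second_model, loader, optimizer):
        # each sample: (first characteristic information, known operation label)
        for features, operation_label in loader:
            loss = F.cross_entropy(second_model(features), operation_label)
            optimizer.zero_grad(); loss.backward(); optimizer.step()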
Alternatively, the first model and the second model may be models obtained with any algorithm, for example neural-network-based models, and may be a combination of one or more of convolutional neural networks (Convolutional Neural Networks, CNN), recurrent neural networks (Recurrent Neural Network, RNN), deep neural networks (Deep Neural Networks, DNN), multi-layer perceptrons (Multi-Layer Perceptron, MLP), and gradient boosting decision trees (Gradient Boosting Decision Tree, GBDT).
By way of example, the electronic device 100 and the electronic device 200 may each be a mobile phone, a tablet computer, a personal computer (personal computer, PC), a personal digital assistant (personal digital assistant, PDA), a smart watch, a netbook, a wearable electronic device, an augmented reality (augmented reality, AR) device, a virtual reality (Virtual Reality, VR) device, a vehicle-mounted device, a smart car, a smart speaker, a robot, an earphone, a camera, or another device that can be used for voice control or be controlled by voice; the specific forms of the electronic device 100 and the electronic device 200 are not particularly limited in this application.
The terms "first" and "second" and the like in the description and in the drawings of the present application are used for distinguishing between different objects or for distinguishing between different processes of the same object. The words "first," "second," and the like may distinguish between identical or similar items that have substantially the same function and effect. For example, the first device and the second device are merely for distinguishing between different devices, and are not limited in their order of precedence. It will be appreciated by those of skill in the art that the words "first," "second," and the like do not limit the amount and order of execution, and that the words "first," "second," and the like do not necessarily differ. "at least one" means one or more, and "a plurality" means two or more. "and/or", describes an association relationship of an association object, and indicates that there may be three relationships, for example, a and/or B, and may indicate: a alone, a and B together, and B alone, wherein a, B may be singular or plural. The character "/" generally indicates that the context-dependent object is an "or" relationship. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or plural.
Furthermore, the terms "comprising" and "having," and any variations thereof, as used in the description of the present application, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed but may optionally include other steps or elements not listed or inherent to such process, method, article, or apparatus.
It should be noted that, in the embodiments of the present application, words such as "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
In the description and drawings of the present application, "of" and "corresponding to" are sometimes used interchangeably; it should be noted that, when the distinction is not emphasized, their intended meaning is consistent.
Taking the electronic device 100 as an example of a mobile phone, fig. 4 shows a schematic structural diagram of the electronic device 100.
The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (universal serial bus, USB) interface 130, a charge management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, keys 190, a motor 191, an indicator 192, a camera 193, a display 194, and a subscriber identity module (subscriber identification module, SIM) card interface 195, etc. The sensor module 180 may include a pressure sensor 180A, a gyro sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It should be understood that the illustrated structure of the embodiment of the present invention does not constitute a specific limitation on the electronic device 100. In other embodiments of the present application, electronic device 100 may include more or fewer components than shown, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units, such as: the processor 110 may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), a controller, a video codec, a digital signal processor (digital signal processor, DSP), a baseband processor, and/or a neural network processor (neural-network processing unit, NPU), etc. Wherein the different processing units may be separate devices or may be integrated in one or more processors.
The controller can generate operation control signals according to the instruction operation codes and the time sequence signals to finish the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that the processor 110 has just used or recycled. If the processor 110 needs to reuse the instruction or data, it can be called directly from the memory. Repeated accesses are avoided and the latency of the processor 110 is reduced, thereby improving the efficiency of the system.
In some embodiments, the processor 110 may include one or more interfaces.
In some embodiments of the present application, the process of processing the voice information by the electronic device 100 to obtain the feature information, and the process of processing the feature information from the electronic device 100 by the electronic device 200 to obtain the operation information corresponding to the voice information may also be implemented in the processor 110 in the electronic device 100. The electronic device 100 is also referred to as a first terminal and the electronic device 200 is also referred to as a second terminal.
It should be understood that the interfacing relationship between the modules illustrated in the embodiments of the present invention is only illustrative, and is not meant to limit the structure of the electronic device 100. In other embodiments of the present application, the electronic device 100 may also use different interfacing manners, or a combination of multiple interfacing manners in the foregoing embodiments.
The charge management module 140 is configured to receive a charge input from a charger. The charger can be a wireless charger or a wired charger. In some wired charging embodiments, the charge management module 140 may receive a charging input of a wired charger through the USB interface 130. In some wireless charging embodiments, the charge management module 140 may receive wireless charging input through a wireless charging coil of the electronic device 100. The charging management module 140 may also supply power to the electronic device through the power management module 141 while charging the battery 142.
The power management module 141 is used for connecting the battery 142, and the charge management module 140 and the processor 110. The power management module 141 receives input from the battery 142 and/or the charge management module 140 to power the processor 110, the internal memory 121, the display 194, the camera 193, the wireless communication module 160, and the like.
The wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, and the like.
The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in the electronic device 100 may be used to cover a single or multiple communication bands. Different antennas may also be multiplexed to improve the utilization of the antennas. For example: the antenna 1 may be multiplexed into a diversity antenna of a wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
The mobile communication module 150 may provide a solution for wireless communication including 2G/3G/4G/5G/6G, etc. applied on the electronic device 100. The mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (low noise amplifier, LNA), etc. The mobile communication module 150 may receive electromagnetic waves from the antenna 1, perform processes such as filtering, amplifying, and the like on the received electromagnetic waves, and transmit the processed electromagnetic waves to the modem processor for demodulation. The mobile communication module 150 can amplify the signal modulated by the modem processor, and convert the signal into electromagnetic waves through the antenna 1 to radiate. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be disposed in the processor 110. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be provided in the same device as at least some of the modules of the processor 110.
The modem processor may include a modulator and a demodulator. The modulator is used for modulating the low-frequency baseband signal to be transmitted into a medium-high frequency signal. The demodulator is used for demodulating the received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then transmits the demodulated low frequency baseband signal to the baseband processor for processing. The low frequency baseband signal is processed by the baseband processor and then transferred to the application processor. The application processor outputs sound signals through an audio device (not limited to the speaker 170A, the receiver 170B, etc.), or displays images or video through the display screen 194. In some embodiments, the modem processor may be a stand-alone device. In other embodiments, the modem processor may be provided in the same device as the mobile communication module 150 or other functional module, independent of the processor 110.
The wireless communication module 160 may provide solutions for wireless communication including wireless local area network (wireless local area networks, WLAN) (e.g., wireless fidelity (wireless fidelity, wi-Fi) network), bluetooth (BT), global navigation satellite system (global navigation satellite system, GNSS), frequency modulation (frequency modulation, FM), near field wireless communication technology (near field communication, NFC), infrared technology (IR), etc., as applied to the electronic device 100. The wireless communication module 160 may be one or more devices that integrate at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, modulates the electromagnetic wave signals, filters the electromagnetic wave signals, and transmits the processed signals to the processor 110. The wireless communication module 160 may also receive a signal to be transmitted from the processor 110, frequency modulate it, amplify it, and convert it to electromagnetic waves for radiation via the antenna 2.
In some embodiments, antenna 1 and mobile communication module 150 of electronic device 100 are coupled, and antenna 2 and wireless communication module 160 are coupled, such that electronic device 100 may communicate with a network and other devices through wireless communication techniques.
The electronic device 100 implements display functions through a GPU, a display screen 194, an application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The display screen 194 is used to display images, videos, and the like. The display 194 includes a display panel. In some embodiments, the electronic device 100 may include 1 or N display screens 194, N being a positive integer greater than 1.
The electronic device 100 may implement photographing functions through an ISP, a camera 193, a video codec, a GPU, a display screen 194, an application processor, and the like.
The ISP is used to process data fed back by the camera 193. For example, when photographing, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electric signal, and the camera photosensitive element transmits the electric signal to the ISP for processing, where it is converted into an image visible to the naked eye. The ISP can also optimize the noise, brightness, and skin color of the image, and can further optimize parameters such as exposure and color temperature of a shooting scene. In some embodiments, the ISP may be provided in the camera 193.
The camera 193 is used to capture still images or video. The object generates an optical image through the lens and projects the optical image onto the photosensitive element. The photosensitive element converts the optical signal into an electrical signal, which is then transferred to the ISP to be converted into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard RGB, YUV, or the like format. In some embodiments, electronic device 100 may include 1 or N cameras 193, N being a positive integer greater than 1.
The digital signal processor is used for processing digital signals, and can process other digital signals besides digital image signals. For example, when the electronic device 100 selects a frequency bin, the digital signal processor is used to perform a Fourier transform on the frequency bin energy, or the like.
Video codecs are used to compress or decompress digital video. The electronic device 100 may support one or more video codecs. In this way, the electronic device 100 may play or record video in a variety of encoding formats, such as: moving picture experts group (MPEG) 1, MPEG2, MPEG3, MPEG4, etc.
The NPU is a neural-network (NN) computing processor, and can rapidly process input information by referencing a biological neural network structure, for example, referencing a transmission mode between human brain neurons, and can also continuously perform self-learning. Applications such as intelligent awareness of the electronic device 100 may be implemented through the NPU, for example: image recognition, face recognition, speech recognition, text understanding, etc.
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to enable expansion of the memory capabilities of the electronic device 100. The external memory card communicates with the processor 110 through an external memory interface 120 to implement data storage functions.
The internal memory 121 may be used to store computer executable program code including instructions. The internal memory 121 may include a storage program area and a storage data area. The storage program area may store an application program (such as a sound playing function, an image playing function, etc.) required for at least one function of the operating system, etc. The storage data area may store data created during use of the electronic device 100 (e.g., audio data, phonebook, etc.), and so on. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a nonvolatile memory such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (universal flash storage, UFS), and the like. The processor 110 performs various functional applications of the electronic device 100 and data processing by executing instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor.
The electronic device 100 may implement audio functions through an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, an application processor, and the like. Such as music playing, recording, etc.
The audio module 170 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be disposed in the processor 110, or a portion of the functional modules of the audio module 170 may be disposed in the processor 110.
The speaker 170A, also referred to as a "horn," is used to convert audio electrical signals into sound signals. The electronic device 100 may listen to music or answer hands-free calls through the speaker 170A.
A receiver 170B, also referred to as an "earpiece," is used to convert the audio electrical signal into a sound signal. When the electronic device 100 is answering a telephone call or a voice message, voice may be received by placing the receiver 170B close to the human ear.
Microphone 170C, also referred to as a "mic," is used to convert sound signals into electrical signals. When making a call or transmitting voice information, the user can speak near the microphone 170C, inputting a sound signal into the microphone 170C. The electronic device 100 may be provided with at least one microphone 170C. In other embodiments, the electronic device 100 may be provided with two microphones 170C, which may implement a noise reduction function in addition to collecting sound signals. In other embodiments, the electronic device 100 may also be provided with three, four, or more microphones 170C to enable collection of sound signals, noise reduction, identification of sound sources, directional recording functions, etc.
The earphone interface 170D is used to connect a wired earphone. The earphone interface 170D may be the USB interface 130, a 3.5 mm open mobile terminal platform (OMTP) standard interface, or a Cellular Telecommunications Industry Association of the USA (CTIA) standard interface.
The keys 190 include a power-on key, a volume key, etc. The keys 190 may be mechanical keys. Or may be a touch key. The electronic device 100 may receive key inputs, generating key signal inputs related to user settings and function controls of the electronic device 100.
The motor 191 may generate a vibration cue.
The indicator 192 may be an indicator light, may be used to indicate a state of charge, a change in charge, a message indicating a missed call, a notification, etc.
The SIM card interface 195 is used to connect a SIM card.
The above-mentioned electronic device 100 is merely used to illustrate the structure of an electronic device in the embodiment of the present application; the embodiment of the present application does not limit the structure and form of the electronic device. By way of example, fig. 5 illustrates another exemplary architecture of an electronic device. As shown in fig. 5, the electronic device includes: a processor 501, a memory 502, and a transceiver 503. The processor 501 and the memory 502 may be implemented by the processor and the memory of the electronic device 100. The transceiver 503 is used for the electronic device to interact with other devices, such as the electronic device 100. The transceiver 503 may be a device based on a communication protocol such as Wi-Fi, Bluetooth, or others.
Alternatively, the structure of the server may refer to the structure shown in fig. 5, and is not described herein again.
In other embodiments of the present application, an electronic device or server may include more or fewer components than shown, or combine certain components, or split certain components, or replace certain components, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The technical solutions involved in the following embodiments may be implemented in devices having structures as shown in fig. 4 and 5.
For example, taking a smart home scenario as an example, as shown in fig. 6, a mobile phone includes a first model, and each smart home device includes a second model. The first model is trained and deployed on the mobile phone by a mobile phone manufacturer. The first model can be used for acquiring multi-dimensional characteristic information corresponding to the voice information. In the embodiment of the present application, the weight of the first model is generally fixed, and frequent updating of the first model is not required.
The second model is trained by the manufacturer of each smart home device. The second model can be used for converting the multidimensional characteristic information corresponding to the voice information into corresponding operation information. In this embodiment, the manufacturer of each smart home device may update the second model according to actual use requirements. That is, when a device is newly added later, generally only the manufacturer of the newly added device needs to retrain the second model used for identifying the operation information (such as classifying the control instructions) on that device, and the mobile phone manufacturer does not need to frequently update the first model used for extracting the feature information, so the model training and maintenance costs of manufacturers such as the mobile phone manufacturer can be reduced. In addition, because the second model only identifies the operation information of a specific device (such as classifying its control instructions), the model is small and convenient to train and update.
Optionally, updating the second model includes: the weights of the second model are updated.
It should be noted that, since different smart home devices may come from different device manufacturers, the algorithm used by each manufacturer to train the second model may be different, and thus, the second model on the different smart home devices may be different.
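For illustration only, the following is a minimal sketch of this division of responsibilities, assuming PyTorch-style modules; the class names, layer sizes, feature dimension, and instruction labels are assumptions made for the example and are not defined by the embodiment of the present application. It shows a fixed first model that outputs feature information in an agreed format, and vendor-specific second models that consume that same format but are trained and updated independently by each device manufacturer.

```python
import torch
import torch.nn as nn

FEATURE_DIM = 128  # assumed fixed interface between first and second models

class FirstModel(nn.Module):
    """Shared feature extractor deployed on the phone; its weights stay fixed."""
    def __init__(self, n_mfcc=13):
        super().__init__()
        self.rnn = nn.GRU(input_size=n_mfcc, hidden_size=FEATURE_DIM, batch_first=True)

    def forward(self, frames):            # frames: (batch, time, n_mfcc)
        _, hidden = self.rnn(frames)
        return hidden[-1]                 # (batch, FEATURE_DIM) feature information

class SecondModel(nn.Module):
    """Vendor-specific classifier head; each device ships and updates its own."""
    def __init__(self, labels):
        super().__init__()
        self.labels = labels              # last label is "other" (no match)
        self.head = nn.Linear(FEATURE_DIM, len(labels))

    def forward(self, features):
        return self.head(features)

# Each vendor trains a head over its own instruction set; the phone-side
# FirstModel and the FEATURE_DIM contract are left untouched.
tv_model = SecondModel(["volume_up", "volume_down", "power_off", "other"])
ac_model = SecondModel(["temp_up", "temp_down", "power_off", "other"])

# Updating the second model later (e.g. with a new firmware release) only
# replaces its weights, for instance via tv_model.load_state_dict(new_state_dict).
```

In such a sketch, adding a new device only requires its manufacturer to train another classifier head against the fixed feature format; the phone-side model is untouched.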
In the scenario shown in fig. 6, the user inputs voice information "turn up the volume of the television" to the mobile phone, and after the mobile phone detects the voice information input by the user, the voice information may be input to the first model, and the feature information of the voice information may be output by the first model. Alternatively, the characteristic information of the voice information may be output in a form such as, but not limited to, a characteristic matrix. After obtaining the feature information, the mobile phone may send the feature information to each smart home device (such as a desk lamp, an air conditioner, and a television shown in fig. 6) connected to the mobile phone.
After the smart home device receives the characteristic information from the mobile phone, the characteristic information can be input into the second model, and the second model outputs the operation information corresponding to the voice information. As shown in fig. 6, the television can recognize, according to the second model, that the operation information (the classification result of the control instruction) is to turn up the volume, so the television can perform the operation corresponding to the recognized operation information, that is, adjust the volume. The desk lamp cannot recognize the operation information, or the operation information output by its second model does not match the operation information (control instructions) executable by the desk lamp, so the desk lamp can determine that the voice information of the user is not used for controlling the desk lamp, and the desk lamp does not need to respond to the voice information of the user according to the operation information (such as the control instruction) output by the second model. Similarly, the air conditioner does not have to respond to the user's voice information.
In the prior art, the mobile phone is required to complete the whole process from feature information extraction to operation information identification, so the calculation amount of the mobile phone is large and the voice control efficiency is low. In contrast, in the voice control scenario of the smart home devices described above, the feature information extraction and the operation information identification processes are decoupled: the feature information extraction flow is executed by the mobile phone, and the operation information identification is executed by each smart home device. Because the mobile phone does not execute the operation information identification, its calculation amount is reduced, so the running speed of the mobile phone can be improved, and the efficiency of voice control is further improved.
Specific interactions between devices in the voice control process according to the embodiments of the present application are described below. As shown in fig. 7, taking an example that a user controls to increase the volume of a television through voice control of a mobile phone, the voice control method provided in the embodiment of the application includes:
S101, the mobile phone detects that a user inputs voice information.
Illustratively, the voice information input by the user is "turn up the volume of the television".
S102, the mobile phone converts the voice information into characteristic information.
Generally, the voice information is an analog signal. The mobile phone needs to convert the voice information into a digital signal through a coding model and extract feature information, so that other devices can then recognize the operation information corresponding to the voice information according to the extracted feature information.
The characteristic information refers to the identifiable components of the voice information; these identifiable components can accurately describe how a piece of speech differs from other speech. Optionally, the identifiable components of the voice information include, but are not limited to, the sound spectrum and phonemes of the sound spectrum. Phonemes of the sound spectrum include, but are not limited to, formants in the sound spectrum. The feature information in the embodiment of the present application is not limited to those listed above; the embodiment of the present application does not limit the feature information, which may be any identifiable information in the speech.
As one possible implementation manner, the mobile phone includes a first model, and the mobile phone can input voice information into the first model, calculate and output feature information corresponding to the voice information by the first model. The training manner of the first model can be seen in the following examples.
In this embodiment of the present application, the first model may be implemented as an encoding model (may also be referred to as an encoding module, an encoding neural network, or have other names), and the names do not constitute a limitation on the encoding model. The coding model can be regarded as a functional module on the mobile phone, and the functional module is used for converting information corresponding to voice information into characteristic information of voice.
Alternatively, other models, such as VAD models, may be integrated in the first model in addition to the coding model. Optionally, the first model may also integrate other functions, and the embodiment of the present application does not limit whether the first model integrates other functions or not and specific types of other functions.
Similarly, the second model in embodiments of the present application may be implemented as a decoding model (which may also be referred to as a decoding module, a decoding neural network, or by other names). The decoding model may also be regarded as a functional module in a device, such as a television, for converting characteristic information of speech into corresponding operational information.
Optionally, other models or modules may be integrated in the second model besides the decoding model, and the embodiment of the present application does not limit whether the second model integrates other functions or specific types of other functions.
In the embodiment of the present application, the first model may also be referred to as a first model file, or other names. The second model, which may also be referred to as a second model file, or other name. The names do not constitute a limitation on the first model, the second model.
S103, the mobile phone broadcasts the characteristic information.
Correspondingly, each device connected with the mobile phone such as the television receives the characteristic information from the mobile phone.
In the embodiment of the application, the mobile phone does not execute the operation of identifying the operation information; instead, each connected device completes the identification of the operation information, so the mobile phone does not know which device the voice information of the user is intended to control. Therefore, the mobile phone needs to broadcast the characteristic information to all connected devices. After each of the other devices recognizes the operation information corresponding to the characteristic information, that device judges whether the voice information of the user is used for controlling it to execute an operation. If so, the device responds to the voice information of the user and executes the operation corresponding to the voice information; if not, the device does not respond to the voice information of the user and does not execute the operation corresponding to the voice information.
S104, the television converts the characteristic information into operation information corresponding to the television.
As one possible implementation, the television includes a second model. After the television receives the characteristic information (such as the characteristic matrix of the voice) of the voice from the mobile phone, the characteristic information of the voice is input into the second model, and the operation information corresponding to the voice information is determined and output by the second model. Illustratively, in the scenario shown in fig. 6, the television inputs feature information (such as a feature matrix) of the voice into a second model, and the second model calculates and determines that the operation information corresponding to the feature information is "turn up the volume of the television".
S105, the television responds to the operation information and executes the operation corresponding to the operation information.
For example, still as shown in the scenario of fig. 6, after the television recognizes that the operation information corresponding to the voice information of the user is "turn up the volume of the television", and the operation information is the operation information matched with the television, the target operation corresponding to the operation information, that is, turn up the volume, may be performed in response to the operation information.
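To make the flow of S101-S105 concrete, the following is a minimal sketch in plain Python; the device objects, the broadcast helper, and the lambda stand-ins for the first and second models are illustrative assumptions rather than parts of the embodiment.

```python
def phone_handle_voice(waveform, first_model, connected_devices):
    """S101-S103: the phone only extracts feature information and broadcasts it."""
    features = first_model(waveform)             # S102: voice -> feature information
    for device in connected_devices:             # S103: broadcast to every device
        device.on_features(features)

class SmartDevice:
    def __init__(self, name, second_model, supported_ops):
        self.name = name
        self.second_model = second_model         # vendor-trained decoding model
        self.supported_ops = supported_ops       # operations this device can execute

    def on_features(self, features):
        """S104-S105: decode the features and act only on a matching operation."""
        operation = self.second_model(features)  # e.g. "volume_up" or "other"
        if operation in self.supported_ops:
            self.execute(operation)
        # otherwise the device simply does not respond to this voice input

    def execute(self, operation):
        print(f"{self.name}: executing {operation}")

# Illustrative wiring: the television recognizes the instruction, the lamp does not.
tv = SmartDevice("television", lambda f: "volume_up", {"volume_up", "volume_down"})
lamp = SmartDevice("desk lamp", lambda f: "other", {"light_on", "light_off"})
phone_handle_voice(waveform=[0.0] * 16000,
                   first_model=lambda w: w,      # stand-in for the coding model
                   connected_devices=[tv, lamp])
```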
Next, interactions between devices in the voice control method are explained in connection with functional modules inside the devices. As shown in fig. 8, the voice control method in the embodiment of the present application includes:
S201, the mobile phone detects voice information and inputs the voice information into the VAD model.
Illustratively, the user inputs the voice information "turn up the volume of the television" to the mobile phone, and after the mobile phone detects the voice input, it inputs the collected sound information corresponding to the voice information into the VAD model.
S202, the VAD model detects the speech part of the sound information and inputs the speech part into the coding model of the mobile phone.
Considering that, when a user inputs voice information, the mobile phone may also collect other sounds in the environment, the mobile phone can identify the speech and non-speech (noise) parts of the collected sound information through the VAD model, in order to reduce the data processing amount of the subsequent calculation process and avoid interference from environmental noise. The VAD model may be any type of model capable of performing the task of speech classification.
Optionally, the collected original sound information may be segmented into a plurality of segments (frames), for example, frames of 20 ms or 25 ms, which are input into the VAD model, and the VAD model outputs the corresponding classification results. Optionally, the VAD model outputs, for each frame, a classification result indicating whether the frame is speech or non-speech, and the frames classified as speech are used as the input of the subsequent coding model.
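A minimal sketch of this framing step is given below, assuming 16 kHz mono audio and 20 ms frames; the simple energy threshold used in place of a trained VAD model is purely a placeholder assumption.

```python
import numpy as np

def split_into_frames(audio, sample_rate=16000, frame_ms=20):
    """Cut the captured sound into fixed-length frames (e.g. 20 ms each)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    return audio[: n_frames * frame_len].reshape(n_frames, frame_len)

def keep_speech_frames(frames, vad_model=None):
    """Classify each frame as speech / non-speech and keep only the speech part.

    vad_model is assumed to return True for speech; a simple energy threshold
    stands in for a trained VAD model here.
    """
    if vad_model is None:
        energies = np.mean(frames ** 2, axis=1)
        vad_model = lambda frame: np.mean(frame ** 2) > 0.1 * energies.mean()
    return np.array([f for f in frames if vad_model(f)])

# Example: 1 second of audio -> 50 frames of 20 ms; only the speech frames are
# passed on to the coding model in S203.
audio = np.random.randn(16000).astype(np.float32)
speech_frames = keep_speech_frames(split_into_frames(audio))
```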
The training process of the VAD model according to the embodiments of the present application may refer to the prior art, and is not described herein again.
In the embodiment of the application, the VAD model may also be regarded as a functional module on the mobile phone, where the functional module has the function of identifying the voice and the non-voice.
S203, the coding model outputs characteristic information corresponding to the voice information according to the speech part of the sound information.
Optionally, the coding model divides the speech into a plurality of frames, and for each frame, extracts characteristic information according to a certain rule, such as, but not limited to, the Mel-frequency cepstral coefficient (MFCC) rule based on human auditory perception. Optionally, the coding model may convert the extracted feature information into feature vectors.
Illustratively, one way in which the coding model extracts the feature information is given below. First, the speech information is preprocessed. Preprocessing includes, but is not limited to: dividing the speech information into a plurality of frames. Thereafter, for each frame, the following operations are performed:
A frequency spectrum corresponding to the frame is obtained through a fast Fourier transform (FFT), and the obtained frequency spectrum is processed through a Mel filter bank to obtain the Mel spectrum corresponding to the frame. In this way, the linear natural spectrum is converted into a Mel spectrum that reflects human auditory properties. Then, cepstral analysis is performed on the Mel spectrum corresponding to the frame to obtain the corresponding MFCC, which can be used as the characteristic information corresponding to the speech of this frame.
After the feature information of each frame of the voice is obtained, the feature information of each frame may be combined to obtain feature information (such as feature vector) corresponding to the voice information.
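The per-frame computation described above (FFT, Mel filter bank, cepstral analysis) can be sketched as follows using NumPy and SciPy; the filter-bank construction is simplified, and the numbers of filters and coefficients are illustrative assumptions.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the Mel scale (simplified)."""
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def frame_to_mfcc(frame, sample_rate=16000, n_fft=512, n_filters=26, n_mfcc=13):
    """FFT -> Mel spectrum -> log -> DCT (cepstral analysis) for one frame."""
    spectrum = np.abs(np.fft.rfft(frame, n_fft)) ** 2           # power spectrum
    mel_spectrum = mel_filterbank(n_filters, n_fft, sample_rate) @ spectrum
    log_mel = np.log(mel_spectrum + 1e-10)
    return dct(log_mel, type=2, norm="ortho")[:n_mfcc]          # MFCC coefficients

# The MFCC vectors of all speech frames can then be stacked into a feature
# matrix, i.e. the feature information corresponding to the voice input.
```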
It should be noted that the method for extracting feature information by the coding model may be other, and is not limited to the above-listed method.
S204, the communication module of the mobile phone obtains the characteristic information corresponding to the voice information.
Optionally, the communication module is used for supporting the mobile phone to communicate with other electronic devices. For example, the communication module may be connected to a network via wireless communication or wired communication to communicate with other personal terminals or network servers. The wireless communication may employ at least one of cellular communication protocols such as 5G, long Term Evolution (LTE), long term evolution-advanced (LTE-a), code Division Multiple Access (CDMA), wideband Code Division Multiple Access (WCDMA), universal Mobile Telecommunications System (UMTS), wireless broadband (WiBro), or global system for mobile communications (GSM). The wireless communication may include, for example, short-range communication. The short-range communication may include at least one of wireless fidelity (Wi-Fi), bluetooth, near Field Communication (NFC), magnetic Stripe Transmission (MST), or GNSS.
As a possible implementation manner, a processing module (such as a processor) in the mobile phone may obtain the output result of the encoding module, that is, obtain the feature information corresponding to the voice information, and send the feature information corresponding to the voice information to the communication module, where the step S205 is executed by the communication module of the mobile phone.
S205, the communication module of the mobile phone broadcasts the characteristic information corresponding to the voice information.
Correspondingly, the communication module of the television receives the characteristic information of the voice from the mobile phone.
S206, the decoding model of the television obtains the characteristic information corresponding to the voice information.
As one possible implementation manner, after the communication module of the television receives the feature information corresponding to the voice information from the mobile phone, the feature information is sent to the processing module of the television, and the processing module inputs the feature information into the decoding model.
S207, outputting operation information corresponding to the characteristic information according to the characteristic information corresponding to the voice information by a decoding model of the television.
Alternatively, the decoding model may be a model for performing a classification task, whose output content is operation information corresponding to the voice information.
It should be noted that, the decoding model in the embodiment of the present application is different from a decoder in a conventional ASR, where the decoder in the conventional ASR may convert feature information corresponding to speech information into characters, and then the subsequent functional module converts the characters into corresponding operation information. It can be seen that the conversion efficiency of the decoding model in the embodiment of the present application is higher.
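The difference can be sketched schematically as follows; all callables are illustrative stand-ins, and the operation label is an assumption for the example.

```python
# Conventional ASR-style pipeline on the controlled device (two stages):
#   feature information -> ASR decoder -> text -> intent parsing -> operation information
def conventional_pipeline(features, asr_decoder, intent_parser):
    text = asr_decoder(features)       # e.g. "turn up the volume of the television"
    return intent_parser(text)         # e.g. "volume_up"

# Decoding model in this embodiment (single classification stage):
#   feature information -> second model -> operation information
def embodiment_pipeline(features, second_model):
    return second_model(features)      # directly classifies over control instructions

# Illustrative stand-ins only:
operation = embodiment_pipeline(features=[0.1, 0.2], second_model=lambda f: "volume_up")
```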
S208, the television judges whether the operation information output by the decoding model is operation information matched with the television. If yes, the following step S209 is executed; if no, S210 is executed.
S209, responding to the operation information, and executing the operation corresponding to the operation information.
For example, still as shown in fig. 6, after the television receives feature information (such as a feature matrix) corresponding to the voice information from the mobile phone, the television outputs operation information (such as a control instruction) to "turn up the volume of the television" through a second model (such as a decoding model), and according to the operation information, performs an operation corresponding to the operation information, that is, turns up the volume.
S210, not responding to the operation information and not executing the operation corresponding to the operation information.
For example, still as shown in the scenario of fig. 6, after the air conditioner receives the feature information corresponding to the voice information from the mobile phone, the air conditioner outputs, through a second model (such as a decoding model), operation information of the "other" type, which indicates that the voice information of the user is not used to control the air conditioner. Then, the air conditioner does not perform a corresponding operation according to the operation information. Similarly, after receiving the characteristic information corresponding to the voice information, the desk lamp outputs corresponding operation information according to the characteristic information, and determines that no corresponding operation needs to be executed.
Because the encoding neural network resides on the control device (e.g., mobile phone) side of the voice information and the decoding neural network resides on the controlled device (e.g., home device) side, training of the decoding neural network can be performed by various third-party vendors, and different vendors may train their respective decoding neural networks. On one hand, the coding neural network does not need to be frequently retrained, which greatly reduces the development cost after new equipment is added; on the other hand, the mobile phone side only executes the feature information extraction part of the voice recognition and does not execute the operation information identification process, so the computation amount and power consumption of the mobile phone can be reduced, the running speed is improved, and the delay of the voice recognition process is further reduced.
The training methods of the first model and the second model are described as follows. The first model is a model trained based on at least one first sample data including first speech information, characteristic information of which is known. The second model is a model trained based on at least one second sample data, the second sample data comprising first characteristic information, and operation information corresponding to the first characteristic information is known.
Fig. 9 illustrates one training method of the first model. As shown in fig. 9 (1), a model for identifying operation information is first trained, for which N (N is a positive integer) training samples including speech information (i.e., first speech information) with known operation information need to be provided. The voice data may be of multiple types so that the corpus is rich enough, which improves the recognition accuracy. Optionally, the training samples further include labels of the voice data, which characterize the operation information corresponding to the voice information. Training on the plurality of samples yields a model for extracting the feature information and identifying the operation information corresponding to the voice information. This model can output the operation information corresponding to the voice information.
In the training scenario described in fig. 9 (1), the trained model includes 32 layers of neurons. Layers L1-L16 are used for extracting the characteristic information corresponding to the voice information, and layers L17-L32 are used for identifying the operation information corresponding to the voice information. A neuron in one layer may be connected to one or more neurons in the next layer and outputs corresponding signals via these connections. The connections between some neurons in the trained model are shown in fig. 9 (1), such as the connection between the first neuron in layer L1 and the first neuron in layer L2 with weight w11, the connection between the first neuron in layer L1 and the second neuron in layer L2 with weight w12, and so on.
Optionally, in order to improve the recognition accuracy of the model, the model can be evaluated and tested. When the recognition rate of the model reaches a certain threshold, training of the model is complete. When the recognition rate of the model is low, the model can continue to be trained until its recognition accuracy reaches the threshold.
Alternatively, the training process of the model may be on the end side (such as a terminal like a mobile phone) or the cloud side (such as a server). The training may be offline training or online training. The specific training mode of the model is not limited in the embodiment of the application.
As shown in fig. 9 (2), after a complete model for extracting feature information and identifying operation information (e.g., classifying control instructions) is trained, a model for extracting feature information is obtained by removing the L17-L32 layer corresponding portion for identifying operation information (e.g., identifying control instructions) from the complete model. As shown in fig. 9 (2), the first model for extracting feature information includes 16 layers of L1 to L16, and after inputting voice data (also referred to as voice information) to the first model, the first model may output feature vectors (also referred to as feature information) corresponding to the voice data.
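As a rough illustration of this training-then-splitting procedure, the following sketch assumes PyTorch and uses 32 small fully connected layers; the layer widths, the per-frame MFCC input, and the number of operation classes are assumptions for the example and do not come from fig. 9 itself.

```python
import torch
import torch.nn as nn

N_MFCC, HIDDEN, N_OPS = 13, 64, 10      # illustrative sizes, not taken from the patent

def make_layer(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())

# L1-L16: feature extraction part; L17-L32: operation-information part.
feature_layers = [make_layer(N_MFCC, HIDDEN)] + [make_layer(HIDDEN, HIDDEN) for _ in range(15)]
decision_layers = [make_layer(HIDDEN, HIDDEN) for _ in range(15)] + [nn.Linear(HIDDEN, N_OPS)]
full_model = nn.Sequential(*feature_layers, *decision_layers)   # 32 layers in total

# (1) Train the complete model on voice samples whose operation information is known.
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(full_model.parameters(), lr=1e-3)
voice_batch = torch.randn(8, N_MFCC)            # stand-in for real training samples
labels = torch.randint(0, N_OPS, (8,))          # stand-in operation-information labels
optimizer.zero_grad()
loss = criterion(full_model(voice_batch), labels)
loss.backward()
optimizer.step()

# (2) Remove the L17-L32 part and keep L1-L16 as the first model (feature extractor).
first_model = nn.Sequential(*list(full_model.children())[:16])
features = first_model(voice_batch)             # feature vectors for the voice data
```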
In summary, taking the first model as an encoder and taking the second model as a decoder as an example, the encoder in the embodiment of the present application is an encoder part in a trained encoder-decoder model, which is equivalent to extracting an encoder part in one encoder-decoder model to form the first model.
One training and method of use of the second model is illustrated in fig. 10. As shown in fig. 10 (1), the first model is obtained before training the second model. As a possible implementation manner, if the first model for extracting the feature information corresponding to the voice information is trained by the mobile phone, the mobile phone may upload the first model to the server. Subsequently, the other device may obtain the first model from the server and train the second model according to the first model. Alternatively, the device may obtain the first model in the mobile phone through other manners, and the embodiment of the present application does not limit a specific manner in which the device obtains the first model.
As one possible implementation manner, when training the second model, the output of the first model is used as the input of the second model, so as to form a neural network for training. The input of the first model is used as the input of the whole neural network, the output of the second model is used as the output of the whole neural network, and the weight of the first model is kept unchanged in the training process. For example, as shown in fig. 10 (1), the output of the first model (i.e., the feature information corresponding to the voice information) may be used as a training sample, and the second model may be trained based on the training sample. The trained second model has the function of outputting operation information according to the input characteristic information. Taking the example that the television recognizes the operation information corresponding to the voice information through the second model, as shown in (2) of fig. 10, the television may input the feature information (such as the feature vector received from the mobile phone) corresponding to the voice information, which is unknown in the operation information, into the second model, and then output the operation information (such as turning up the volume of the television) corresponding to the voice information by the second model.
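A minimal sketch of this training setup is given below, reusing the illustrative first_model from the previous sketch (13-dimensional input, 64-dimensional features) and keeping its weights frozen; the label set, layer sizes, and training loop are assumptions for the example.

```python
import torch
import torch.nn as nn

# first_model: the trained encoder obtained above; its weights are kept fixed.
for p in first_model.parameters():
    p.requires_grad = False

# Vendor-defined second model: maps feature vectors to this device's operations,
# including an extra "other" class for voice input that is not meant for it.
TV_OPS = ["volume_up", "volume_down", "power_off", "other"]
second_model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, len(TV_OPS)))

combined = nn.Sequential(first_model, second_model)   # whole network used only for training
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(second_model.parameters(), lr=1e-3)  # only the second model learns

voice_batch = torch.randn(8, 13)                      # stand-in training samples
labels = torch.randint(0, len(TV_OPS), (8,))          # stand-in operation labels
for _ in range(100):                                  # illustrative training loop
    optimizer.zero_grad()
    loss = criterion(combined(voice_batch), labels)
    loss.backward()
    optimizer.step()
```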
In other embodiments, the apparatus may also train the second model alone, i.e. without obtaining the first model. In this implementation, the training sample is also a feature vector (an example of the first feature information) corresponding to the voice information, and the training sample is trained to obtain the second model.
The method is not limited to the voice control scene, and other distributed task processing scenes can be applicable to the method. Exemplary, distributed task processing scenarios include, but are not limited to: teleconferencing scenarios (including but not limited to real-time translation scenarios), face recognition verification scenarios.
In the face recognition verification scenario, taking the face recognition by a mobile phone as an example, as shown in fig. 11, the face recognition model may be split into at least a first model and a second model. The camera module (such as a camera) of the mobile phone comprises a first model. The first model is used for extracting feature information of a face in the face image. The processing module of the mobile phone comprises a second model. The second model is used for outputting the recognition result of the face according to the characteristic information of the face.
Fig. 12 shows an exemplary flow of the method of the embodiment of the present application in a face recognition scenario, the flow comprising the steps of:
S301, the camera module collects face images input by a user.
S302, the camera module inputs the face image into a first model, and the first model outputs characteristic information of the face image.
S303, the camera shooting module transmits the characteristic information of the face image to the processing module.
S304, the processing module inputs the characteristic information of the face image into a second model, and the second model outputs the recognition result of the face.
S305, the processing module judges whether the face is a legal face or not according to the face recognition result. If yes, S306 is executed, and if no, S307 is executed.
S306, executing the operation corresponding to the face information.
In an exemplary payment scenario, when a user inputs a face image and the mobile phone judges that the face is a legal face through a first model in the camera module and a second model in the processing module, a payment operation is executed. In the screen unlocking scene, a user inputs a face image, and when the mobile phone judges that the face is a legal face through a first model in the camera module and a second model in the processing module, the screen is unlocked.
S307, the corresponding operation of the face information is not executed.
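For illustration, the following sketch mirrors S301-S307 with NumPy; the feature extractor, the enrolled template, the cosine-similarity check, and the threshold value are all illustrative assumptions, not the actual face recognition algorithm of the embodiment.

```python
import numpy as np

def camera_first_model(face_image):
    """Runs inside the camera module: face image -> feature vector (stand-in)."""
    return face_image.astype(np.float32).ravel()[:128]

def processing_second_model(features, enrolled_template, threshold=0.8):
    """Runs inside the processing module: feature vector -> legal / not legal."""
    cos = features @ enrolled_template / (
        np.linalg.norm(features) * np.linalg.norm(enrolled_template) + 1e-10)
    return cos >= threshold

# S301-S307 in miniature: the camera module only sends feature information,
# never the raw image, to the processing module.
enrolled = np.random.rand(128).astype(np.float32)      # stored legal-face template
face_image = np.random.rand(64, 64)                    # captured image (stand-in)
features = camera_first_model(face_image)              # S302-S303
if processing_second_model(features, enrolled):        # S304-S305
    print("legal face: execute payment / unlock")      # S306
else:
    print("not a legal face: do nothing")              # S307
```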
In the above description, the camera is taken as a module on the mobile phone as an example. In other scenarios, the camera containing the first model may also be a module independent of the mobile phone, while the mobile phone includes the second model. In this way, the external camera of the mobile phone can complete the face recognition process together with the mobile phone, and splitting the model into the first model and the second model can improve the efficiency of face recognition.
Similarly, in other distributed intelligent scenarios, a device may split one or more models for performing one or more tasks into multiple sub-models, and deploy the multiple sub-models in multiple modules of the device, with the multiple modules sharing the model running load of a single module. The embodiment of the application does not limit the specific splitting mode of the model, and does not limit the specific distributed deployment in which modules after the model is split into a plurality of sub-models.
In the teleconference real-time translation scene, the existing model can be split into at least a first model and a second model, wherein the first model is operated at the speaker device, and the second model is operated at the receiver device. Fig. 13 shows an exemplary flow of the method of the embodiments of the present application in a teleconferencing translation scenario, the flow comprising the steps of:
S401, an audio acquisition module of the mobile phone A acquires first voice information in a source language (namely a first language) and inputs the first voice information into a first model of the mobile phone A.
Optionally, the audio acquisition module includes, but is not limited to, a microphone. Taking English-to-Chinese translation as an example, the first voice information in the source language is the English speech "this meeting is"; the audio acquisition module of the mobile phone A acquires the speaker's English voice information and inputs it into the first model.
S402, extracting characteristic information of the first voice information by the first model.
For example, feature information of the english voice information, that is, feature information corresponding to the english voice "this meeting is" is extracted.
S403, the communication module of the mobile phone A obtains the characteristic information of the first voice information.
As one possible implementation manner, the communication module of the mobile phone a obtains the feature information of the first voice information from the first model, or the communication module of the mobile phone a obtains the feature information of the first voice information from the processing module.
S404, the communication module of the mobile phone A sends the characteristic information of the first voice information to the communication module of the mobile phone B.
S405, the second model of the mobile phone B obtains the characteristic information of the first voice information.
As a possible implementation, the second model obtains the characteristic information of the first voice information from the communication module. Alternatively, the processing module inputs the feature information into the second model, i.e., the second model obtains the feature information of the first speech information from the processing module.
S406, the second model determines caption information and/or second voice information of a target language (second language) corresponding to the first voice information according to the characteristic information of the first voice information.
Optionally, the first language is different from or the same as the second language.
In an exemplary scenario, when the mobile phone B has started the voice translation (e.g., in-translation) function, after receiving the feature information of the first voice information, the mobile phone B may automatically input the feature information into the second model, and translate the feature information corresponding to the English voice information into a Chinese subtitle through the second model. For example, the Chinese subtitle corresponding to the English speech "this meeting is" is output.
For another example, in a scenario in which the mobile phone B has not started the cross-language translation function, the second model outputs the English recognition result according to the feature information corresponding to the English voice information. For example, the corresponding English result "this meeting is" is output.
S407, the processing module of the mobile phone B controls the display of the subtitle information in the second language and/or plays the second voice information in the second language.
Illustratively, the processing module controls the display module to display the Chinese subtitle translated from "this meeting is", and the display module may also display the English subtitle "this meeting is". Still further exemplary, the processing module controls the audio output module (e.g., a speaker) to play the Chinese speech translated from "this meeting is", and the speaker may also play the English speech "this meeting is".
In the teleconference translation scenario, the mobile phone B only needs to run the stage from the feature information of the source language to the translation result, and does not need to run the stage from the voice information of the source language to the feature information of the source language (namely, feature extraction), so the computation amount of the mobile phone B is reduced and the translation efficiency can be improved.
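A minimal sketch of the receiving side (mobile phone B) is given below in plain Python; the per-language decoding models, their outputs, and the display helper are illustrative stand-ins and are not defined by the embodiment.

```python
def phone_b_on_features(features, second_models, translation_enabled):
    """S405-S407 in miniature on mobile phone B.

    second_models is assumed to hold one decoding model per target language;
    the model objects and their outputs are illustrative stand-ins.
    """
    if translation_enabled:
        subtitle = second_models["zh"](features)   # Chinese subtitle for the English speech
    else:
        subtitle = second_models["en"](features)   # English recognition result as-is
    display(subtitle)                              # and/or synthesize and play it

def display(text):
    print("subtitle:", text)

# Dummy decoders for the utterance "this meeting is"; the strings are stand-ins only.
models = {"zh": lambda f: "本次会议是", "en": lambda f: "this meeting is"}
phone_b_on_features(features=[0.1, 0.2], second_models=models, translation_enabled=True)
```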
The training methods of the first model and the second model in the teleconference scenario and the face recognition scenario can be referred to the model training methods of fig. 9 and fig. 10, and will not be described herein. In one possible design, the first model is a model trained based on at least one first sample data, the first sample data including first speech information, characteristic information of the first speech information being known, and/or the second model is a model trained based on at least one second sample data, the second sample data including first characteristic information, the operation information corresponding to the first characteristic information being known.
It can be seen that, through the technical solution in the embodiment of the present application, a more complex parametric model (including but not limited to a machine learning model) and a non-parametric model may be split into multiple sub-models with smaller granularity, and the multiple sub-models are respectively run in different modules of the same device, or respectively run in different devices in the same network (such as the above-mentioned speech recognition scenario), or respectively run in multiple devices in different networks (such as the above-mentioned teleconferencing scenario). Therefore, the operation amount on a single module or equipment can be reduced, and the processing efficiency of the whole task processing flow is improved. The splitting granularity, splitting mode and modules or devices which are deployed after splitting of the sub-model are not limited, and the sub-model can be flexibly determined according to the characteristics of scenes, device types and the like.
Moreover, the foregoing only lists several possible application scenarios; the technical solution of the embodiment of the present application may also be applied to other scenarios, which are not exhaustively listed here due to space limitations. By way of example, the embodiments of the present application may be applied to bone voiceprint recognition scenarios. At present, one of the biometric recognition technologies is the bone voiceprint technology, which offers high recognition rate, speed, and convenience. The principle of identifying a person's identity is as follows: collect the person's voice information, and verify the legality of the person's identity according to the voice information. Because the bone structure of every person is unique, the reflected echo of sound between the bones is also unique. The reflected echo of sound between bones may be referred to as a bone voiceprint, which can be used to identify different users, similar to the principle that fingerprints can be used to identify different people.
In this embodiment of the present application, in a bone voiceprint recognition scenario, a model for bone voiceprint recognition may be split into two parts, where one part (a first model) is set, for example, in a Bluetooth headset, and the other part (a second model) is set in a mobile phone. After the headset collects the voice information of the user (for example, the user speaks a screen-unlocking command), the characteristic information of the sound signal (also called voice information) can be extracted through the first model and sent to the mobile phone. The mobile phone identifies, through the second model, whether the voice belongs to a legal user, and if so, executes the corresponding operation (such as unlocking the screen).
Fig. 14 shows a flow of a distributed voice control method provided in an embodiment of the present application. The method is applied to a first terminal and comprises the following steps:
S1401, the first terminal responds to voice information input by a user, inputs the voice information into a first model, and obtains feature information corresponding to the voice information through the first model.
Wherein the first model exists at the first terminal and the second model exists at the second terminal.
For example, taking the first terminal as a mobile phone, as shown in fig. 6, the mobile phone receives voice information input by a user, "turn up the volume of the television", and outputs feature information (i.e., feature matrix) of the voice information through the first model.
S1402, the first terminal sends the characteristic information to the second terminal, so that the second terminal inputs the characteristic information into the second model, determines operation information corresponding to the voice information through the second model, and executes corresponding operation according to the operation information.
Still taking fig. 6 as an example, the second terminal includes a desk lamp, an air conditioner, and a television connected to the mobile phone. After the mobile phone acquires the characteristic information corresponding to the voice information, the characteristic information is broadcasted to a desk lamp, an air conditioner and a television. The desk lamp, the air conditioner and the television recognize operation information (such as control instructions) through the second model. If the operation information identified by the television is matched with the television, the television executes a target operation corresponding to the operation information of turning up the television volume, namely turning up the playing volume of the television.
It should be noted that some operations in the flows of the above-described method embodiments are optionally combined, and/or the order of some operations is optionally changed. The execution order of the steps in each flow is merely exemplary and does not constitute a limitation; other execution orders may be used between the steps, and it is not intended to suggest that the described order is the only order in which the operations may be performed. Those of ordinary skill in the art will recognize a variety of ways to reorder the operations described herein. In addition, it should be noted that the details of other processes described herein in connection with other methods (e.g., the method corresponding to fig. 7 and the method corresponding to fig. 8) are likewise applicable in a similar manner to the method described above in connection with fig. 12.
Alternatively, some steps in method embodiments may be equivalently replaced with other possible steps. Alternatively, some steps in method embodiments may be optional and may be deleted in some usage scenarios. Alternatively, other possible steps may be added to the method embodiments.
Further embodiments of the present application provide an apparatus that may be an electronic device as described above (e.g., a folding screen phone). The apparatus may include: a display screen, a memory, and one or more processors. The display, memory, and processor are coupled. The memory is for storing computer program code, the computer program code comprising computer instructions. When the processor executes the computer instructions, the electronic device may perform the functions or steps performed by the mobile phone in the above-described method embodiments. The structure of the electronic device may refer to the electronic device shown in fig. 4 or fig. 5.
The core structure of the electronic device may be represented as the structure shown in fig. 15, and the core structure may include: processing module 1301, input module 1302, storage module 1303, display module 1304. The assembly of fig. 15 is merely exemplary, and an electronic device may include more or fewer components than shown, or may combine certain components, or split certain components, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Processing module 1301 may include at least one of a central processing unit (CPU), an application processor (AP), or a communication processor (CP). Processing module 1301 may perform operations or data processing related to control and/or communication of at least one of the other elements of the electronic device. Specifically, the processing module 1301 may be configured to control the content displayed on the home screen according to a certain trigger condition, or to determine the content displayed on the screen according to preset rules. The processing module 1301 is further configured to process input instructions or data, and determine a display style according to the processed data.
In this embodiment of the present application, if the structure shown in fig. 15 is a first electronic device (first terminal) or a chip system, the processing module 1301 is configured to respond to voice information input by a user, input the voice information into a first model, and obtain feature information corresponding to the voice information through the first model.
In this embodiment of the present application, if the structure shown in fig. 15 is a second electronic device (second terminal) or a chip system, the processing module 1301 is configured to input the feature information into a second model, and determine, according to the second model, operation information corresponding to the voice information;
and the processing module is used for executing corresponding operation according to the operation information.
In one possible design, the second terminal performs a corresponding operation according to the operation information, including:
and if the operation information corresponding to the voice information is determined to be the operation information matched with the second terminal, the second terminal executes target operation according to the operation information corresponding to the voice information, and/or if the operation information corresponding to the voice information is determined not to be the operation information matched with the second terminal, the second terminal discards the operation information.
The input module 1302 is configured to obtain an instruction or data input by a user, and transmit the obtained instruction or data to other modules of the electronic device. Specifically, the input mode of the input module 1302 may include touch, gesture, proximity screen, or voice input. For example, the input module may be a screen of an electronic device, acquire an input operation of a user, generate an input signal according to the acquired input operation, and transmit the input signal to the processing module 1301.
The storage module 1303 may include volatile memory and/or nonvolatile memory. The storage module is used for storing at least one relevant instruction or data in other modules of the user terminal equipment, and in particular, the storage module can store the first model and the second model.
The display module 1304 may include, for example, a liquid crystal display (LCD), a light emitting diode (LED) display, an organic light emitting diode (OLED) display, a microelectromechanical system (MEMS) display, or an electronic paper display, and is used for displaying user-viewable content (e.g., text, images, video, icons, symbols, etc.).
Optionally, the structure shown in fig. 15 may further include an output module (not shown in fig. 15). The output module may be used to output information. Illustratively, voice information is played and output. Output modules include, but are not limited to, speakers and the like.
Optionally, the structure shown in fig. 15 may further include a communication module 1305 for supporting the electronic device to communicate with other electronic devices. For example, the communication module may be connected to a network via wireless communication or wired communication to communicate with other personal terminals or network servers. The wireless communication may employ at least one of cellular communication protocols such as 5G, long Term Evolution (LTE), long term evolution-advanced (LTE-a), code Division Multiple Access (CDMA), wideband Code Division Multiple Access (WCDMA), universal Mobile Telecommunications System (UMTS), wireless broadband (WiBro), or global system for mobile communications (GSM). The wireless communication may include, for example, short-range communication. The short-range communication may include at least one of wireless fidelity (Wi-Fi), bluetooth, near Field Communication (NFC), magnetic Stripe Transmission (MST), or GNSS.
In this embodiment of the present application, if the structure shown in fig. 15 is a first electronic device or a chip system, the communication module 1305 is configured to send the feature information to the second terminal.
Optionally, the sending the feature information to the second terminal includes: broadcasting the feature information.
In this embodiment of the present application, if the structure shown in fig. 15 is a second electronic device or a chip system, the communication module 1305 is configured to receive feature information corresponding to voice information from the first terminal.
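One way, sketched purely for illustration (the patent does not specify a transport, port, or serialization format; all of those below are assumptions), in which the communication module 1305 of the first terminal could broadcast the feature information and the second terminal could receive it over a local network:

import json
import socket

FEATURE_PORT = 50505  # hypothetical port chosen for this sketch

def broadcast_features(features: list) -> None:
    # First terminal: broadcast the feature information to peers on the local network.
    payload = json.dumps({"features": features}).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(payload, ("<broadcast>", FEATURE_PORT))

def receive_features() -> list:
    # Second terminal: block until one feature message arrives from the first terminal.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("", FEATURE_PORT))
        data, _addr = sock.recvfrom(65535)
        return json.loads(data.decode("utf-8"))["features"]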
It should be noted that, for descriptions of the steps in the method embodiments of the present application, reference may be made to the corresponding modules of the apparatus; details are not repeated herein.
Embodiments of the present application also provide a chip system, as shown in fig. 16, comprising at least one processor 1401 and at least one interface circuit 1402. The processor 1401 and the interface circuit 1402 may be interconnected by wires. For example, interface circuit 1402 may be used to receive signals from other devices (e.g., a memory of an electronic apparatus). For another example, interface circuit 1402 may be used to send signals to other devices (e.g., processor 1401). Illustratively, the interface circuit 1402 may read instructions stored in the memory and send the instructions to the processor 1401. The instructions, when executed by the processor 1401, may cause the electronic device to perform the various steps in the embodiments described above. Of course, the chip system may also include other discrete devices, which are not specifically limited in this embodiment of the present application.
An embodiment of the present application further provides a computer storage medium. The computer storage medium includes computer instructions. When the computer instructions are run on the electronic device, the electronic device is caused to perform the functions or steps performed by the mobile phone in the foregoing method embodiments.
An embodiment of the present application further provides a computer program product. When the computer program product runs on a computer, the computer is caused to perform the functions or steps performed by the mobile phone in the foregoing method embodiments.
It will be apparent to those skilled in the art from the foregoing description that, for convenience and brevity, only the division into the foregoing functional modules is used as an example for illustration. In practical applications, the foregoing functions may be allocated to different functional modules as required; that is, the internal structure of the apparatus may be divided into different functional modules to perform all or some of the functions described above.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another apparatus, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and the parts shown as units may be one physical unit or a plurality of physical units, may be located in one place, or may be distributed in a plurality of different places. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a readable storage medium. Based on such an understanding, the technical solutions of the embodiments of the present application essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to perform all or some of the steps of the methods described in the embodiments of the present application. The foregoing storage medium includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing is merely a specific embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions within the technical scope of the present disclosure should be covered in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of distributed speech control, the method comprising:
a first terminal, in response to voice information input by a user, inputs the voice information into a first model, and obtains characteristic information corresponding to the voice information through the first model, wherein the first model exists in the first terminal;
the first terminal sends the characteristic information to a second terminal, so that the second terminal inputs the characteristic information into a second model, determines operation information corresponding to the voice information through the second model, and performs a corresponding operation according to the operation information, wherein the second model exists in the second terminal.
2. The method according to claim 1, wherein
the first model is a model obtained by training based on at least one piece of first sample data, wherein the first sample data comprises first voice information, and characteristic information of the first voice information is known; and/or,
the second model is a model obtained by training based on at least one piece of second sample data, wherein the second sample data comprises first characteristic information, and operation information corresponding to the first characteristic information is known.
3. The method according to claim 1 or 2, wherein the sending, by the first terminal, the characteristic information to the second terminal comprises: broadcasting, by the first terminal, the characteristic information.
4. A method of distributed speech control, the method comprising:
the second terminal receives, from a first terminal, characteristic information corresponding to voice information, wherein the characteristic information is obtained by the first terminal by inputting the voice information into a first model and processing the voice information through the first model, and the first model exists in the first terminal;
the second terminal inputs the characteristic information into a second model, and determines operation information corresponding to the voice information through the second model, wherein the second model exists in the second terminal;
and the second terminal performs a corresponding operation according to the operation information.
5. The method according to claim 4, wherein the performing, by the second terminal, a corresponding operation according to the operation information comprises:
if it is determined that the operation information corresponding to the voice information is operation information matching the second terminal, the second terminal performs a target operation according to the operation information corresponding to the voice information; and/or,
if it is determined that the operation information corresponding to the voice information is not operation information matching the second terminal, the second terminal discards the operation information.
6. The method according to claim 4 or 5, wherein
the first model is a model obtained by training based on at least one piece of first sample data, wherein the first sample data comprises first voice information, and characteristic information of the first voice information is known; and/or,
the second model is a model obtained by training based on at least one piece of second sample data, wherein the second sample data comprises first characteristic information, and operation information corresponding to the first characteristic information is known.
7. A first terminal, comprising:
a display screen;
one or more processors;
one or more memories;
the memory stores one or more programs that, when executed by the processor, cause the first terminal to perform the method of any of claims 1-3.
8. A second terminal, comprising:
a display screen;
one or more processors;
one or more memories;
the memory stores one or more programs that, when executed by the processor, cause the second terminal to perform the method of any of claims 4-6.
9. A computer readable storage medium storing computer instructions which, when run on a terminal, cause the terminal to perform the method of any one of claims 1 to 3 or to perform the method of any one of claims 4 to 6.
10. A computer program product, characterized in that the computer program product, when run on a terminal, causes the terminal to perform the method of any of claims 1 to 3 or to perform the method of any of claims 4 to 6.
CN202111234615.7A 2021-10-22 2021-10-22 Distributed voice control method and electronic equipment Pending CN116030790A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111234615.7A CN116030790A (en) 2021-10-22 2021-10-22 Distributed voice control method and electronic equipment
PCT/CN2022/116804 WO2023065854A1 (en) 2021-10-22 2022-09-02 Distributed speech control method and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111234615.7A CN116030790A (en) 2021-10-22 2021-10-22 Distributed voice control method and electronic equipment

Publications (1)

Publication Number Publication Date
CN116030790A true CN116030790A (en) 2023-04-28

Family

ID=86058787

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111234615.7A Pending CN116030790A (en) 2021-10-22 2021-10-22 Distributed voice control method and electronic equipment

Country Status (2)

Country Link
CN (1) CN116030790A (en)
WO (1) WO2023065854A1 (en)

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102197143B1 (en) * 2013-11-26 2020-12-31 현대모비스 주식회사 System for command operation using speech recognition and method thereof
WO2017054122A1 (en) * 2015-09-29 2017-04-06 深圳市全圣时代科技有限公司 Speech recognition system and method, client device and cloud server
US20200090035A1 (en) * 2018-09-19 2020-03-19 International Business Machines Corporation Encoder-decoder memory-augmented neural network architectures
CN109949808A (en) * 2019-03-15 2019-06-28 上海华镇电子科技有限公司 The speech recognition appliance control system and method for compatible mandarin and dialect
CN110503952B (en) * 2019-07-29 2022-02-22 北京搜狗科技发展有限公司 Voice processing method and device and electronic equipment
KR102476675B1 (en) * 2020-02-13 2022-12-13 고려대학교 산학협력단 Method and server for smart home control based on interactive brain-computer interface
CN111841007A (en) * 2020-07-29 2020-10-30 网易(杭州)网络有限公司 Game control method, device, equipment and storage medium
CN113205802B (en) * 2021-05-10 2022-11-04 芜湖美的厨卫电器制造有限公司 Updating method of voice recognition model, household appliance and server

Also Published As

Publication number Publication date
WO2023065854A1 (en) 2023-04-27

Similar Documents

Publication Publication Date Title
US20220139393A1 (en) Driver interface with voice and gesture control
WO2022033556A1 (en) Electronic device and speech recognition method therefor, and medium
US20220277752A1 (en) Voice interaction method and related apparatus
CN112885328B (en) Text data processing method and device
CN115312068B (en) Voice control method, equipment and storage medium
CN113539290B (en) Voice noise reduction method and device
CN113297843B (en) Reference resolution method and device and electronic equipment
CN114242037A (en) Virtual character generation method and device
CN113488042B (en) Voice control method and electronic equipment
CN116052648B (en) Training method, using method and training system of voice recognition model
CN114691839A (en) Intention slot position identification method
CN118098199B (en) Personalized speech synthesis method, electronic device, server and storage medium
EP4418264A1 (en) Speech interaction method and terminal
CN114090986A (en) Method for identifying user on public equipment and electronic equipment
CN117012205B (en) Voiceprint recognition method, graphical interface and electronic equipment
CN116030790A (en) Distributed voice control method and electronic equipment
CN115641867A (en) Voice processing method and terminal equipment
CN113572798B (en) Device control method, system, device, and storage medium
CN115691538A (en) Video processing method and electronic equipment
CN116524919A (en) Equipment awakening method, related device and communication system
CN114093368A (en) Cross-device voiceprint registration method, electronic device and storage medium
CN115841814A (en) Voice interaction method and electronic equipment
CN115731923A (en) Command word response method, control equipment and device
CN115995236A (en) Tone extraction and model training method, device, equipment, medium and program
CN113380240A (en) Voice interaction method and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination