WO2023065854A1 - Distributed voice control method and electronic device - Google Patents

Distributed voice control method and electronic device

Info

Publication number
WO2023065854A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
information
terminal
voice
voice information
Application number
PCT/CN2022/116804
Other languages
English (en)
French (fr)
Inventor
孟亚洲
兰国兴
白立勋
俞清华
石巍巍
Original Assignee
华为技术有限公司
Application filed by 华为技术有限公司
Publication of WO2023065854A1


Classifications

    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/26 Speech to text systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • H04L12/28 Data switching networks characterised by path configuration, e.g. LAN [Local Area Networks] or WAN [Wide Area Networks]
    • H04N21/41 Structure of client; Structure of client peripherals
    • H04N21/422 Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N21/4415 Acquiring end-user identification using biometric characteristics of the user, e.g. by voice recognition or fingerprint scanning
    • H04N21/485 End-user interface for client configuration
    • Y02P90/02 Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Definitions

  • the present application relates to the technical field of terminals, in particular to a distributed voice control method and electronic equipment.
  • the smart scene includes a voice control scene.
  • an electronic device can be used to perform voice control on other devices under the distributed voice control.
  • the user inputs voice information "turn on the TV" to the mobile phone, and the mobile phone analyzes the operation information represented by the voice information (that is, the user wants to turn on the TV), generates a control signal, and sends the control signal to the TV to control the TV to turn on.
  • the phone can parse the user's voice information with the help of machine learning models.
  • since different devices may come from different manufacturers, mobile phone manufacturers usually need to retrain the machine learning model whenever a device of a new type, or from a new manufacturer, establishes a wireless connection with the mobile phone, so that the model can correctly interpret the voice information used to control devices of the new type or manufacturer.
  • it can be seen that in the existing technology, frequent retraining of the model leads to a large development workload for mobile phone manufacturers, who also need to continuously retrain and maintain the entire model at a later stage.
  • the model running in the mobile phone is complex and the load is heavy, resulting in high processing delay and low voice control efficiency.
  • the present application provides a distributed voice control method and electronic equipment, which can improve the efficiency of voice control.
  • the first aspect provides a distributed voice control method, which can be applied to the first terminal or a component (such as a chip system) capable of realizing the function of the first terminal.
  • the first terminal, in response to voice information input by the user, inputs the voice information into a first model and obtains feature information corresponding to the voice information through the first model; the first model exists in the first terminal;
  • the first terminal sends the feature information to the second terminal, so that the second terminal inputs the feature information into a second model, determines the operation information corresponding to the voice information through the second model, and performs a corresponding operation according to the operation information; the second model exists in the second terminal.
  • compared with the prior art, in which the first terminal (such as a mobile phone) needs to complete the entire process from voice feature extraction to operation information recognition, resulting in a large amount of computation for the first terminal and low voice control efficiency, in the technical solution of the present application the complete model for voice control can be split into at least a first model and a second model (an illustrative sketch of this split follows).
  • the first model exists in the first terminal, and the first terminal may extract the feature information corresponding to the voice information through the first model.
  • the second model exists in the second terminal (such as a smart home device controlled by a mobile phone), and the second terminal can recognize the operation information through the second model. Since the first terminal no longer performs all the steps of voice control, such as the operation of identifying the operation information, its amount of computation is reduced and its running speed can be improved, thereby improving the efficiency of voice control.
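  • as an illustration only (not the claimed implementation), the following Python sketch mimics the described split: a hypothetical first_model on the first terminal turns raw audio into feature information (here a log-magnitude spectrogram), and a hypothetical second_model on the second terminal maps that feature information to operation information; all function names, the feature layout, and the command labels are assumptions made for this example.

```python
# Minimal sketch of the split inference pipeline (illustrative only).
# first_model / second_model, the feature layout, and the labels are
# hypothetical; they are not taken from the application text.
import numpy as np

def first_model(audio: np.ndarray, frame: int = 400, hop: int = 160) -> np.ndarray:
    """Runs on the first terminal: raw waveform -> feature information
    (a log-magnitude spectrogram; a real model could also emit phoneme
    information, since the application mentions spectrum plus phonemes)."""
    frames = [audio[i:i + frame] * np.hanning(frame)
              for i in range(0, len(audio) - frame, hop)]
    spectrum = np.abs(np.fft.rfft(np.stack(frames), axis=-1))
    return np.log(spectrum + 1e-6)

def second_model(features: np.ndarray,
                 labels=("tv_volume_up", "lamp_on", "none")) -> str:
    """Runs on the second terminal: feature information -> operation information.
    A stand-in classifier; a real second model would be trained per device."""
    scores = features.mean(axis=0)[: len(labels)]   # dummy scoring over the first few bins
    return labels[int(np.argmax(scores))]

audio = np.random.randn(16000)        # 1 s of placeholder 16 kHz audio
features = first_model(audio)         # computed on the mobile phone
operation = second_model(features)    # computed on the smart home device
print(operation)
```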
  • the first model is a model trained based on at least one piece of first sample data, where the first sample data includes first voice information whose feature information is known; and/or the second model is a model trained based on at least one piece of second sample data, where the second sample data includes first feature information whose corresponding operation information is known.
  • the first terminal and at least one second terminal are in the same local area network
  • the first terminal and at least one second terminal are in different local area networks.
  • the first terminal sending the feature information to the second terminal includes: the first terminal broadcasting the feature information to the second terminal.
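  • purely as an illustration, the broadcast step could look like the following UDP sketch; the port number, the JSON payload format, and the helper names are assumptions, not details from the application.

```python
# Illustrative only: broadcasting serialized feature information on a LAN.
# The port, payload format, and function names are assumptions for this example.
import json
import socket

PORT = 50007  # hypothetical port shared by the terminals

def broadcast_features(features) -> None:
    """First terminal: send feature information (a plain list) to all devices on the LAN."""
    payload = json.dumps({"type": "voice_features", "data": features}).encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(payload, ("255.255.255.255", PORT))

def receive_features():
    """Second terminal: wait for one feature-information broadcast."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("", PORT))
        payload, sender = sock.recvfrom(65535)
        return json.loads(payload)["data"], sender
```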
  • the feature information corresponding to the voice information includes a sound spectrum corresponding to the voice information and phonemes of the sound spectrum.
  • the second aspect provides a distributed voice control method, the method comprising:
  • the second terminal receives feature information corresponding to the voice information from the first terminal; the feature information is obtained by the first terminal inputting the voice information into a first model and using the first model, and the first model exists in the first terminal;
  • the second terminal inputs the characteristic information into the second model, and determines the operation information corresponding to the voice information through the second model, and the second model exists in the second terminal;
  • the second terminal performs corresponding operations according to the operation information.
  • the second terminal performs corresponding operations according to the operation information, including:
  • if it is determined that the operation information corresponding to the voice information is operation information matching the second terminal, the second terminal performs the target operation according to the operation information corresponding to the voice information; and/or,
  • if it is determined that the operation information corresponding to the voice information is not operation information matching the second terminal, the second terminal discards the operation information.
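  • a minimal sketch of that match-or-discard decision on the second terminal is shown below; the structure of the operation information and the device identifiers are invented for this example.

```python
# Illustrative only: how a second terminal might act on or discard operation information.
# The operation-information fields and device names are hypothetical.
MY_DEVICE = "tv"

def perform_target_operation(action: str) -> None:
    print(f"{MY_DEVICE}: executing {action}")

def handle_operation(op_info: dict) -> None:
    """op_info example: {"device": "tv", "action": "volume_up"}."""
    if op_info.get("device") == MY_DEVICE:
        perform_target_operation(op_info["action"])   # matched: execute the target operation
    # otherwise the operation information is simply discarded

handle_operation({"device": "tv", "action": "volume_up"})   # executed by the TV
handle_operation({"device": "lamp", "action": "turn_on"})   # discarded by the TV
```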
  • the first model is a model trained based on at least one piece of first sample data, where the first sample data includes first voice information whose feature information is known; and/or the second model is a model trained based on at least one piece of second sample data, where the second sample data includes first feature information whose corresponding operation information is known.
  • the first terminal and the second terminal are in the same local area network, or the first terminal and the second terminal are in different local area networks.
  • the feature information corresponding to the voice information includes a sound spectrum corresponding to the voice information and phonemes of the sound spectrum.
  • the third aspect provides a voice recognition method, which can be applied to a first terminal or a component (such as a chip system) that realizes a function of the first terminal.
  • the method includes:
  • the first terminal receives the first voice information in the first language input by the user;
  • the first terminal inputs the first voice information into a first model in response to the first voice information, and obtains feature information corresponding to the first voice information through the first model; the first model exists in the first terminal;
  • the first terminal sends the feature information to the second terminal, so that the second terminal inputs the feature information into a second model, and determines subtitle information corresponding to the first voice information through the second model , the second model exists in the second terminal.
  • the first model is a model trained based on at least one piece of first sample data, where the first sample data includes first voice information whose feature information is known; and/or the second model is a model trained based on at least one piece of second sample data, where the second sample data includes first feature information whose corresponding operation information is known.
  • the subtitle information is subtitle information in a second language.
  • the first language is different from the second language.
  • the second terminal may need to generate subtitles from the voice information of the speaker using the first terminal and display them on the screen, so that the speech content of the speaker using the first terminal can be understood more clearly.
  • if the voice translation function is enabled on the second terminal, the second terminal can translate the first voice information (such as English voice information) of the speaker using the first terminal, according to the feature information of the first voice information, into subtitles in the corresponding language (such as Chinese), which further enables the user of the second terminal to better understand the meaning of the speech of the speaker at the opposite end.
  • since the voice-to-subtitle operation is implemented jointly by the first terminal and the second terminal, the first terminal does not need to be responsible for converting the voice information into the corresponding subtitle information. Therefore, the amount of computation of the first terminal is reduced, its running speed can be improved, and the efficiency of voice-to-subtitle conversion can be improved.
  • the second terminal includes a terminal with a speech translation function enabled.
  • the first terminal sending the characteristic information to the second terminal includes: the first terminal broadcasting the characteristic information.
  • the fourth aspect provides a speech recognition method, which can be applied to a second terminal or a component (such as a chip system) that implements a function of the second terminal.
  • the method includes:
  • the second terminal receives feature information corresponding to the first voice information;
  • the first voice information is voice information in a first language;
  • the second terminal inputs the feature information into a second model, and determines subtitle information corresponding to the first voice information through the second model; the second model exists in the second terminal.
  • the first model is a model trained based on at least one piece of first sample data, where the first sample data includes first voice information whose feature information is known; and/or the second model is a model trained based on at least one piece of second sample data, where the second sample data includes first feature information whose corresponding operation information is known.
  • the subtitle information is subtitle information in a second language.
  • the first language is different from the second language.
  • the method further includes: determining second voice information in a second language corresponding to the first voice information, and playing the second voice information in the second language.
  • the first language is different from the second language.
  • the second terminal can translate the first voice information (such as English voice information) of the speaker using the first terminal into second voice information (such as Chinese voice information), play the second voice information, and also display subtitles in the corresponding language (such as Chinese subtitles).
  • the second terminal may also play bilingual voice information and simultaneously display bilingual subtitles.
  • the second terminal plays monolingual voice information and displays bilingual subtitle information, or the second terminal plays bilingual voice information and displays monolingual subtitle information.
  • the technical solution of the present application does not limit this.
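  • purely as an illustration of this flexibility, the sketch below shows a second terminal turning received feature information into subtitles and, optionally, translated speech; translate_features_to_text and synthesize_speech are hypothetical stand-ins for the second model and a text-to-speech stage, and the language pair is only an example.

```python
# Illustrative only: a second terminal producing subtitles and translated speech
# from received feature information. Both helper functions are placeholder stubs.
def translate_features_to_text(features, target_lang: str) -> str:
    # Stand-in for the second model; returns canned text for the example.
    return {"zh": "把电视音量调大", "en": "turn up the TV volume"}[target_lang]

def synthesize_speech(text: str, lang: str) -> bytes:
    return f"<{lang} audio for: {text}>".encode()   # stand-in TTS stage

def handle_remote_speech(features, bilingual_subtitles=True, play_translated_voice=True):
    subtitle_second_lang = translate_features_to_text(features, "zh")  # e.g. Chinese
    subtitle_first_lang = translate_features_to_text(features, "en")   # e.g. English
    subtitles = ([subtitle_second_lang, subtitle_first_lang]
                 if bilingual_subtitles else [subtitle_second_lang])
    print("display:", " / ".join(subtitles))
    if play_translated_voice:
        print("play:", synthesize_speech(subtitle_second_lang, "zh"))

handle_remote_speech(features=None)
```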
  • the second terminal includes a terminal with a speech translation function enabled.
  • the first terminal sending the characteristic information to the second terminal includes: the second terminal broadcasting the characteristic information.
  • the fifth aspect provides a first terminal, including:
  • a processing module configured to input the voice information into the first model in response to the voice information input by the user, and obtain feature information corresponding to the voice information through the first model; the first model exists in the first terminal;
  • the communication module is configured to send feature information to the second terminal, so that the second terminal inputs the feature information into the second model, determines the operation information corresponding to the voice information through the second model, and performs corresponding operations according to the operation information,
  • the second model exists in the second terminal.
  • the first model is a model trained based on at least one piece of first sample data, where the first sample data includes first voice information whose feature information is known; and/or the second model is a model trained based on at least one piece of second sample data, where the second sample data includes first feature information whose corresponding operation information is known.
  • the first terminal and at least one second terminal are in the same local area network
  • the first terminal and at least one second terminal are in different local area networks.
  • the communication module being configured to send the feature information to the second terminal includes: the first terminal broadcasting the feature information.
  • the feature information corresponding to the voice information includes a sound spectrum corresponding to the voice information and phonemes of the sound spectrum.
  • the sixth aspect provides a second terminal, including:
  • the communication module is configured to receive feature information corresponding to the voice information from the first terminal; the feature information is obtained by the first terminal inputting the voice information into the first model and using the first model; the first model exists in the first terminal;
  • a processing module configured to input feature information into a second model, and determine operation information corresponding to voice information through the second model; the second model exists in the second terminal;
  • the processing module is configured to perform corresponding operations according to the operation information.
  • the second terminal performs corresponding operations according to the operation information, including:
  • if it is determined that the operation information corresponding to the voice information is operation information matching the second terminal, the second terminal performs the target operation according to the operation information corresponding to the voice information; and/or, if it is determined that the operation information corresponding to the voice information is not operation information matching the second terminal, the second terminal discards the operation information.
  • the first model is a model trained based on at least one piece of first sample data, where the first sample data includes first voice information whose feature information is known; and/or the second model is a model trained based on at least one piece of second sample data, where the second sample data includes first feature information whose corresponding operation information is known.
  • the first terminal and the second terminal are in the same local area network, or the first terminal and the second terminal are in different local area networks.
  • the feature information corresponding to the voice information includes a sound spectrum corresponding to the voice information and phonemes of the sound spectrum.
  • the seventh aspect provides a first terminal, including:
  • An input module configured to receive first voice information in a first language input by a user
  • a processing module configured to respond to the first voice information, input the first voice information into a first model, and obtain feature information corresponding to the first voice information through the first model; the first model exists in the first terminal;
  • a communication module configured to send the feature information to a second terminal, so that the second terminal inputs the feature information into a second model and determines the subtitle information corresponding to the first voice information through the second model; the second model exists in the second terminal.
  • the first model is a model trained based on at least one piece of first sample data, where the first sample data includes first voice information whose feature information is known; and/or the second model is a model trained based on at least one piece of second sample data, where the second sample data includes first feature information whose corresponding operation information is known.
  • the subtitle information is subtitle information in a second language.
  • the first language is different from the second language.
  • the second terminal includes a terminal with a speech translation function enabled.
  • the communication module configured to send the feature information to the second terminal includes: broadcasting the feature information.
  • the eighth aspect provides a second terminal, including:
  • the input module is configured to receive feature information corresponding to the first voice information; the first voice information is voice information in a first language; the feature information is obtained by the first terminal inputting the first voice information into a first model and processing it through the first model; the first model exists in the first terminal;
  • a processing module configured to input the characteristic information into a second model, and determine subtitle information corresponding to the first voice information through the second model, the second model exists in the second terminal.
  • the first model is a model trained based on at least one piece of first sample data, where the first sample data includes first voice information whose feature information is known; and/or the second model is a model trained based on at least one piece of second sample data, where the second sample data includes first feature information whose corresponding operation information is known.
  • the subtitle information is subtitle information in a second language.
  • the first language is different from the second language.
  • the processing module is further configured to determine second voice information in a second language corresponding to the first voice information
  • the output module is used to play the second voice information in the second language.
  • the first language is different from the second language.
  • the second terminal includes a terminal with a speech translation function enabled.
  • the communication module configured to send the feature information to the second terminal includes: broadcasting the feature information.
  • a ninth aspect provides an electronic device, and the electronic device has a function of implementing the distributed voice control method in any of the foregoing aspects and any possible implementation manners.
  • This function may be implemented by hardware, or may be implemented by executing corresponding software on the hardware.
  • the hardware or software includes one or more modules corresponding to the above functions.
  • the tenth aspect provides a computer-readable storage medium, including computer instructions.
  • when the computer instructions are run on an electronic device, the electronic device is caused to execute the distributed voice control method according to any of the foregoing aspects and any possible implementation manner thereof.
  • An eleventh aspect provides a computer program product.
  • when the computer program product is run on an electronic device, the electronic device is caused to execute the distributed voice control method according to any of the foregoing aspects and any possible implementation manner thereof.
  • a twelfth aspect provides a circuit system, and the circuit system includes a processing circuit, and the processing circuit is configured to execute the distributed voice control method in any of the foregoing aspects and any possible implementation manners thereof.
  • a thirteenth aspect provides a first terminal, including: a display screen; one or more processors; and one or more memories; the memory stores one or more programs, and when the one or more programs are executed by the processor, the first terminal is caused to execute any of the methods in any of the above aspects.
  • a fourteenth aspect provides a second terminal, including: a display screen; one or more processors; and one or more memories; the memory stores one or more programs, and when the one or more programs are executed by the processor, the second terminal is caused to execute any of the methods in any of the above aspects.
  • a fifteenth aspect provides a system on a chip, including at least one processor and at least one interface circuit; the at least one interface circuit is used to perform transceiver functions and send instructions to the at least one processor, and when the at least one processor executes the instructions, the at least one processor executes the distributed voice control method in any of the foregoing aspects and any possible implementation manner thereof.
  • FIG. 1 is a schematic flowchart of a voice control method provided in an embodiment of the present application
  • FIGS. 2A and 2B are schematic flowcharts of the voice control method provided by the embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of a system provided by an embodiment of the present application.
  • Fig. 4 and Fig. 5 are schematic structural diagrams of the electronic equipment provided by the embodiment of the present application.
  • Fig. 9 is a schematic diagram of the training method of the first model provided by the embodiment of the present application.
  • Fig. 10 is a schematic diagram of the training method of the second model provided by the embodiment of the present application.
  • FIG. 11 and FIG. 12 are schematic flowcharts of the face recognition method provided by the embodiment of the present application.
  • FIG. 13 is a schematic flow chart of a voice information translation method provided in an embodiment of the present application.
  • FIG. 14 is a schematic flowchart of a voice control method provided in an embodiment of the present application.
  • Figure 15 is a schematic diagram of the device provided by the embodiment of the present application.
  • FIG. 16 is a schematic diagram of a chip system provided by an embodiment of the present application.
  • Figure 2A shows an existing voice recognition process.
  • the mobile phone inputs the voice information "increase the volume of the TV" into a voice activity detection (VAD) model; the VAD model intercepts the human voice in the speech, and the human voice is used as the input of an automatic speech recognition (ASR) model.
  • the ASR model converts the input sound signal into text and outputs it.
  • the text is converted into the corresponding user operation information by a natural language understanding (NLU) model or by regular-expression matching.
  • the mobile phone generates a control signal according to the user's operation information (that is, turning up the volume of the TV), and sends the control signal to the TV, and the TV turns up the volume according to the control signal.
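  • for contrast with the split approach of this application, the existing single-device flow can be sketched as the composition below, where every stage runs on the mobile phone; vad, asr, and nlu are placeholder stubs, not real library calls.

```python
# Illustrative only: the existing all-on-the-phone pipeline (as in Fig. 2A).
# vad, asr, and nlu are placeholder stubs, not real library APIs.
def vad(audio):            # voice activity detection: keep the human-voice segment
    return audio

def asr(voice_segment):    # automatic speech recognition: sound -> text
    return "increase the volume of the TV"

def nlu(text):             # natural language understanding / regex matching: text -> operation info
    return {"device": "tv", "action": "volume_up"}

def send_control_signal(operation):
    print("control signal ->", operation)

def phone_pipeline(audio):
    operation = nlu(asr(vad(audio)))   # every step runs on the phone
    send_control_signal(operation)     # the phone then signals the TV

phone_pipeline(audio=b"...")
```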
  • when a new type of device (such as a device belonging to a different manufacturer from the mobile phone) establishes a connection with the mobile phone, the mobile phone manufacturer can retrain the NLU model used for speech recognition or update the regular-expression matching.
  • the retrained NLU model or updated matching rules can be packaged in the installation package of an application for voice control (such as an application for controlling a smart home), so that users can download the new version of the application to the mobile phone and then use the related models to process artificial intelligence tasks (such as speech recognition tasks) through the new application.
  • the mobile phone is currently connected to a TV and a speaker.
  • the mobile phone can control the TV and speakers through the smart home APP.
  • the mobile phone detects that a new type of device (such as a smart desk lamp) has established a connection with it, and reports the detection of the new type of device to the server.
  • when the mobile phone manufacturer learns through the server that a new type of device is connected to the mobile phone, it retrains the NLU model.
  • after the mobile phone manufacturer has trained the model, it can package the trained model in the installation package of the smart home APP and store the updated smart home APP in the server.
  • the user can download the updated smart home APP to the mobile phone, and the mobile phone can control the newly added smart desk lamp in the network through the updated APP.
  • the user can control the smart desk lamp to turn on, off, and adjust the brightness of the smart desk lamp through voice information.
  • Fig. 2B shows another existing speech recognition scheme.
  • a spoken language understanding (SLU) model is used to replace the above-mentioned ASR model and NLU model (or regular matching).
  • the SLU model can directly convert sound signals into user operation information.
  • although this solution can directly convert sound signals into user operation information, the SLU model still needs to be retrained when a new type of device is detected to have established a connection with the mobile phone, and the later model maintenance cost is still high.
  • the SLU model needs to identify more and more operational information, which requires complex model structure support, resulting in slow operation of mobile phones.
  • the SLU model requires accurately spoken voice commands and is prone to misrecognition during daily chatting.
  • the mobile phone needs to complete various tasks including operation information identification, which makes the load of the mobile phone high, and every time a new type of device is detected to establish a connection with the mobile phone, the mobile phone manufacturer needs to redevelop and train new neural networks to match the new type of device. It can be seen that in the existing voice recognition schemes, the load of the mobile phone is high and the processing delay is also high, resulting in low efficiency of voice control.
  • an embodiment of the present application provides a voice recognition method.
  • the method is applicable to systems that need voice control.
  • FIG. 3 it is an example diagram of a system architecture provided by an embodiment of the present application.
  • the system includes one or more electronic devices, such as electronic device 100 and electronic device 200 (such as smart home devices 1-3).
  • a connection relationship may be established between electronic devices.
  • the ways of establishing a connection between devices include but are not limited to one or more of the following: establishing a communication connection by scanning a two-dimensional code or a barcode, establishing a connection through a wireless fidelity (Wi-Fi) protocol or a Bluetooth protocol, and establishing a connection through a near-field communication service (nearby service).
  • the embodiment of the present application does not limit the manner of establishing a connection between electronic devices.
  • a device can perform voice control on other connected devices.
  • taking voice control of smart home devices through a mobile phone as an example, the user inputs the voice message "turn up the volume of the TV" to the mobile phone 100; the mobile phone 100 extracts the feature information corresponding to the voice message and sends the feature information to the smart home devices 1-3 connected to the mobile phone 100; each of the smart home devices 1-3 processes the feature information to obtain the operation information corresponding to the voice information and judges, according to the operation information, whether a response is required.
  • the operation information includes but not limited to operation instructions and control instructions.
  • the operation information also includes classification results obtained by the smart home device according to the feature information, for example, classifying different operation instructions. Smart home devices can perform corresponding operations based on the operation information. Different types of operation information (such as different control instructions) are used to control smart home devices to perform different operations.
  • the smart home device 3 processes the feature information, determines that the operation information corresponding to the voice information is "I want to increase the volume of the TV", and executes the operation corresponding to the operation information, that is, turns up the volume of the TV.
  • the smart home device 1 (such as a desk lamp) processes the feature information to obtain the operation information corresponding to the voice information, determines according to the operation information that no corresponding operation needs to be performed, and discards the operation information.
  • similarly, the smart home device 2 (such as an air conditioner) processes the feature information and determines whether a response is required.
  • the step of identifying the operation information is completed by each smart home device and does not need to be completed in the mobile phone, thereby reducing the calculation amount of the mobile phone and improving the efficiency of the intelligent voice control process.
  • the mobile phone extracting the feature information corresponding to the voice information may be implemented as follows: the mobile phone inputs the voice information into the first model, and the first model outputs the feature information corresponding to the voice information.
  • the first model is used to convert speech information into corresponding feature information.
  • the smart home device processing the feature information from the mobile phone to obtain the operation information (such as a control instruction) corresponding to the voice information may be implemented as follows: the smart home device inputs the feature information from the mobile phone into the second model, and the second model outputs the operation information corresponding to the voice information.
  • the second model is used to transform feature information into corresponding operation information. The first model, the second model, and feature information will be described in detail below.
  • the foregoing electronic device may also be referred to as a terminal.
  • the system further includes one or more servers 300 .
  • the server can establish a connection with the electronic device.
  • electronic devices may be connected through a server.
  • the mobile phone 100 can remotely control smart home devices through the server 300 .
  • the first model and the second model can be trained by the server 300. After the server 300 has trained the first model and the second model, it can send the trained first model and the second model to each terminal. In some other embodiments, the first model and the second model may be trained by a terminal, such as a mobile phone.
  • the first model and the second model can be models obtained based on arbitrary algorithms, for example, models based on neural networks, such as a combination of one or more of convolutional neural networks (CNN), recurrent neural networks (RNN), deep neural networks (DNN), multi-layer perceptrons (MLP), and gradient boosting decision trees (GBDT).
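  • as one concrete but purely illustrative possibility, a device-side second model could be a small MLP classifier over the feature vectors; the dimensions, the class set, and the use of PyTorch are assumptions made for this example, not requirements of the application.

```python
# Illustrative only: a small MLP as a device-side "second model"
# (feature vector -> operation-information class). All sizes are arbitrary.
import torch
import torch.nn as nn

NUM_FEATURES = 201     # e.g. length of one spectrogram frame (assumed)
NUM_OPERATIONS = 4     # e.g. volume_up, volume_down, power_on, none (assumed)

second_model = nn.Sequential(
    nn.Linear(NUM_FEATURES, 64),
    nn.ReLU(),
    nn.Linear(64, NUM_OPERATIONS),
)

features = torch.randn(1, NUM_FEATURES)     # feature information from the first terminal
operation_id = second_model(features).argmax(dim=-1).item()
print("predicted operation class:", operation_id)
```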
  • the above-mentioned electronic device 100 and electronic device 200 may be mobile phones, tablet computers, personal computers (PCs), personal digital assistants (PDAs), smart watches, netbooks, wearable electronic devices, augmented reality (AR) devices, virtual reality (VR) devices, vehicle-mounted devices, smart cars, smart speakers, robots, earphones, cameras, and other devices that can be used for voice control or be controlled by voice
  • the specific forms of the electronic device 100 and the electronic device 200 are not particularly limited.
  • first and “second” in the specification and drawings of the present application are used to distinguish different objects, or to distinguish different processes for the same object. Words such as “first” and “second” can distinguish the same or similar items with basically the same function and effect. For example, the first device and the second device are only used to distinguish different devices, and their sequence is not limited. Those skilled in the art can understand that words such as “first” and “second” do not limit the number and execution order, and words such as “first” and “second” do not necessarily limit the difference. "At least one” means one or more, and “plurality” means two or more.
  • a and/or B describes the association relationship of associated objects, indicating that there may be three types of relationships, for example, A and/or B, which can mean: A exists alone, A and B exist simultaneously, and B exists alone, where A, B can be singular or plural.
  • the character "/" generally indicates that the contextual objects are an "or” relationship.
  • At least one of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one item (piece) of a, b, or c can represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, c can be single or multiple .
  • FIG. 4 shows a schematic structural diagram of the electronic device 100 .
  • the electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (universal serial bus, USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, and an antenna 2 , mobile communication module 150, wireless communication module 160, audio module 170, speaker 170A, receiver 170B, microphone 170C, earphone jack 170D, sensor module 180, button 190, motor 191, indicator 192, camera 193, display screen 194, and A subscriber identification module (subscriber identification module, SIM) card interface 195 and the like.
  • SIM subscriber identification module
  • the sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, bone conduction sensor 180M, etc.
  • the structure illustrated in the embodiment of the present invention does not constitute a specific limitation on the electronic device 100 .
  • the electronic device 100 may include more or fewer components than shown in the figure, or combine certain components, or separate certain components, or arrange different components.
  • the illustrated components can be realized in hardware, software or a combination of software and hardware.
  • the processor 110 may include one or more processing units, for example: the processor 110 may include an application processor (application processor, AP), a modem processor, a graphics processing unit (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), controller, video codec, digital signal processor (digital signal processor, DSP), baseband processor, and/or neural network processor (neural-network processing unit, NPU), etc. Wherein, different processing units may be independent devices, or may be integrated in one or more processors.
  • the controller can generate an operation control signal according to the instruction opcode and timing signal, and complete the control of fetching and executing the instruction.
  • a memory may also be provided in the processor 110 for storing instructions and data.
  • the memory in processor 110 is a cache memory.
  • the memory may hold instructions or data that the processor 110 has just used or recycled. If the processor 110 needs to use the instruction or data again, it can be called directly from the memory. Repeated access is avoided, and the waiting time of the processor 110 is reduced, thus improving the efficiency of the system.
  • processor 110 may include one or more interfaces.
  • for example, the electronic device 100 processes the voice information to obtain the feature information, and the electronic device 200 processes the feature information from the electronic device 100 to obtain the operation information corresponding to the voice information; part or all of this data processing on the electronic device 100 side can also be implemented in the processor 110 of the electronic device 100.
  • the electronic device 100 is also called a first terminal
  • the electronic device 200 is also called a second terminal.
  • the interface connection relationship between the modules shown in the embodiment of the present invention is only a schematic illustration, and does not constitute a structural limitation of the electronic device 100 .
  • the electronic device 100 may also adopt different interface connection manners in the foregoing embodiments, or a combination of multiple interface connection manners.
  • the charging management module 140 is configured to receive a charging input from a charger.
  • the charger may be a wireless charger or a wired charger.
  • the charging management module 140 can receive charging input from the wired charger through the USB interface 130 .
  • the charging management module 140 may receive a wireless charging input through a wireless charging coil of the electronic device 100 . While the charging management module 140 is charging the battery 142 , it can also provide power for electronic devices through the power management module 141 .
  • the power management module 141 is used for connecting the battery 142 , the charging management module 140 and the processor 110 .
  • the power management module 141 receives the input from the battery 142 and/or the charging management module 140 to provide power for the processor 110 , the internal memory 121 , the display screen 194 , the camera 193 , and the wireless communication module 160 .
  • the wireless communication function of the electronic device 100 can be realized by the antenna 1 , the antenna 2 , the mobile communication module 150 , the wireless communication module 160 , a modem processor, a baseband processor, and the like.
  • Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
  • Each antenna in electronic device 100 may be used to cover single or multiple communication frequency bands. Different antennas can also be multiplexed to improve the utilization of the antennas.
  • Antenna 1 can be multiplexed as a diversity antenna of a wireless local area network.
  • the antenna may be used in conjunction with a tuning switch.
  • the mobile communication module 150 can provide wireless communication solutions including 2G/3G/4G/5G/6G applied on the electronic device 100 .
  • the mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (low noise amplifier, LNA) and the like.
  • the mobile communication module 150 can receive electromagnetic waves through the antenna 1, filter and amplify the received electromagnetic waves, and send them to the modem processor for demodulation.
  • the mobile communication module 150 can also amplify the signals modulated by the modem processor, and convert them into electromagnetic waves through the antenna 1 for radiation.
  • at least part of the functional modules of the mobile communication module 150 may be set in the processor 110 .
  • at least part of the functional modules of the mobile communication module 150 and at least part of the modules of the processor 110 may be set in the same device.
  • a modem processor may include a modulator and a demodulator.
  • the modulator is used for modulating the low-frequency baseband signal to be transmitted into a medium-high frequency signal.
  • the demodulator is used to demodulate the received electromagnetic wave signal into a low frequency baseband signal. Then the demodulator sends the demodulated low-frequency baseband signal to the baseband processor for processing.
  • the low-frequency baseband signal is passed to the application processor after being processed by the baseband processor.
  • the application processor outputs sound signals through audio equipment (not limited to speaker 170A, receiver 170B, etc.), or displays images or videos through display screen 194 .
  • the modem processor may be a stand-alone device.
  • the modem processor may be independent from the processor 110, and be set in the same device as the mobile communication module 150 or other functional modules.
  • the wireless communication module 160 can provide wireless communication solutions applied on the electronic device 100, including wireless local area networks (WLAN) (such as a wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC) technology, infrared (IR) technology, and the like.
  • the wireless communication module 160 may be one or more devices integrating at least one communication processing module.
  • the wireless communication module 160 receives electromagnetic waves via the antenna 2 , frequency-modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 110 .
  • the wireless communication module 160 can also receive the signal to be sent from the processor 110 , frequency-modulate it, amplify it, and convert it into electromagnetic waves through the antenna 2 for radiation.
  • the antenna 1 of the electronic device 100 is coupled to the mobile communication module 150, and the antenna 2 is coupled to the wireless communication module 160, so that the electronic device 100 can communicate with the network and other devices through wireless communication technology.
  • the electronic device 100 realizes the display function through the GPU, the display screen 194 , and the application processor.
  • the GPU is a microprocessor for image processing, and is connected to the display screen 194 and the application processor. GPUs are used to perform mathematical and geometric calculations for graphics rendering.
  • Processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
  • the display screen 194 is used to display images, videos and the like.
  • the display screen 194 includes a display panel.
  • the electronic device 100 may include 1 or N display screens 194 , where N is a positive integer greater than 1.
  • the electronic device 100 can realize the shooting function through the ISP, the camera 193 , the video codec, the GPU, the display screen 194 and the application processor.
  • the ISP is used for processing the data fed back by the camera 193 .
  • the light is transmitted to the photosensitive element of the camera through the lens, the light signal is converted into an electrical signal, and the photosensitive element of the camera transmits the electrical signal to the ISP for processing, and converts it into an image visible to the naked eye.
  • ISP can also perform algorithm optimization on image noise, brightness, and skin color.
  • ISP can also optimize the exposure, color temperature and other parameters of the shooting scene.
  • the ISP may be located in the camera 193 .
  • Camera 193 is used to capture still images or video.
  • the object generates an optical image through the lens and projects it to the photosensitive element.
  • the photosensitive element converts the light signal into an electrical signal, and then transmits the electrical signal to the ISP to convert it into a digital image signal.
  • the ISP outputs the digital image signal to the DSP for processing.
  • DSP converts digital image signals into standard RGB, YUV and other image signals.
  • the electronic device 100 may include 1 or N cameras 193 , where N is a positive integer greater than 1.
  • Digital signal processors are used to process digital signals. In addition to digital image signals, they can also process other digital signals. For example, when the electronic device 100 selects a frequency point, the digital signal processor is used to perform Fourier transform on the energy of the frequency point.
  • Video codecs are used to compress or decompress digital video.
  • the electronic device 100 may support one or more video codecs.
  • the electronic device 100 can play or record videos in various encoding formats, for example: moving picture experts group (moving picture experts group, MPEG) 1, MPEG2, MPEG3, MPEG4 and so on.
  • the NPU is a neural-network (NN) computing processor.
  • Applications such as intelligent cognition of the electronic device 100 can be realized through the NPU, such as image recognition, face recognition, speech recognition, text understanding, and the like.
  • the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, so as to expand the storage capacity of the electronic device 100.
  • the external memory card communicates with the processor 110 through the external memory interface 120 to implement a data storage function.
  • the internal memory 121 may be used to store computer-executable program codes including instructions.
  • the internal memory 121 may include an area for storing programs and an area for storing data.
  • the stored program area can store an operating system, at least one application program required by a function (such as a sound playing function, an image playing function, etc.) and the like.
  • the storage data area can store data created during the use of the electronic device 100 (such as audio data, phonebook, etc.) and the like.
  • the internal memory 121 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, flash memory device, universal flash storage (universal flash storage, UFS) and the like.
  • the processor 110 executes various functional applications and data processing of the electronic device 100 by executing instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor.
  • the electronic device 100 can implement audio functions through the audio module 170 , the speaker 170A, the receiver 170B, the microphone 170C, the earphone interface 170D, and the application processor. Such as music playback, recording, etc.
  • the audio module 170 is used to convert digital audio information into analog audio signal output, and is also used to convert analog audio input into digital audio signal.
  • the audio module 170 may also be used to encode and decode audio signals.
  • the audio module 170 may be set in the processor 110 , or some functional modules of the audio module 170 may be set in the processor 110 .
  • Speaker 170A, also referred to as a "horn", is used to convert audio electrical signals into sound signals.
  • Electronic device 100 can listen to music through speaker 170A, or listen to hands-free calls.
  • Receiver 170B, also called an "earpiece", is used to convert audio electrical signals into sound signals.
  • the receiver 170B can be placed close to the human ear to receive the voice.
  • the microphone 170C, also called a "mike" or "mic", is used to convert sound signals into electrical signals. When making a phone call or sending a voice message, the user can put his mouth close to the microphone 170C to make a sound, and input the sound signal to the microphone 170C.
  • the electronic device 100 may be provided with at least one microphone 170C. In some other embodiments, the electronic device 100 may be provided with two microphones 170C, which may also implement a noise reduction function in addition to collecting sound signals. In some other embodiments, the electronic device 100 can also be provided with three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, and realize directional recording functions, etc.
  • the earphone interface 170D is used for connecting wired earphones.
  • the earphone interface 170D can be a USB interface 130, or a 3.5mm open mobile terminal platform (OMTP) standard interface, or a cellular telecommunications industry association of the USA (CTIA) standard interface.
  • the keys 190 include a power key, a volume key and the like.
  • the key 190 may be a mechanical key. It can also be a touch button.
  • the electronic device 100 can receive key input and generate key signal input related to user settings and function control of the electronic device 100 .
  • the motor 191 can generate a vibrating reminder.
  • the indicator 192 can be an indicator light, and can be used to indicate charging status, power change, and can also be used to indicate messages, missed calls, notifications, and the like.
  • the SIM card interface 195 is used for connecting a SIM card.
  • Fig. 5 shows another exemplary structure of an electronic device.
  • the electronic device includes: a processor 501 , a memory 502 , and a transceiver 503 .
  • the transceiver 503 is used for the electronic device to interact with other devices (such as the electronic device 100).
  • the transceiver 503 may be a device based on a communication protocol such as Wi-Fi, Bluetooth, or another protocol.
  • the structure of the server may refer to the structure shown in FIG. 5 , which will not be repeated here.
  • the electronic device or server may include more or fewer components than shown in the illustration, or combine some components, or split some components, or replace some components, or have a different arrangement of components.
  • the illustrated components can be realized in hardware, software or a combination of software and hardware.
  • the mobile phone includes a first model, and each smart home device includes a second model.
  • the first model is trained and deployed on mobile phones by mobile phone manufacturers.
  • the first model can be used to obtain multi-dimensional feature information corresponding to speech information.
  • the weight of the first model is usually fixed, and the first model does not need to be updated frequently.
  • the second model is self-trained by the manufacturers of each smart home device.
  • the second model can be used to convert the multi-dimensional feature information corresponding to the voice information into corresponding operation information.
  • the manufacturer of each smart home device can update it according to actual usage needs. That is to say, when a new device is added later, it is usually only necessary for the manufacturer of the new device to retrain the second model used for identifying operation information (such as classifying control commands) in the new device, and the mobile phone manufacturer does not need to frequently update the first model used for extracting feature information, which can reduce model training and maintenance costs for manufacturers such as the mobile phone manufacturer.
  • since the second model only involves identifying the operation information of a specific device (such as classifying control instructions), the model is relatively small, which makes it convenient to train and update.
  • updating the second model includes: updating weights of the second model.
  • the user inputs voice information "turn up the volume of the TV" to the mobile phone.
  • when the mobile phone detects the voice information input by the user, it can input the voice information into the first model, and the first model outputs the feature information of the voice information.
  • the feature information of the speech information may be output in a form such as but not limited to a feature matrix.
  • the mobile phone can send the feature information to each smart home device connected to the mobile phone (such as a desk lamp, an air conditioner, and a TV as shown in FIG. 6 ).
  • after a smart home device receives the feature information from the mobile phone, it can input the feature information into the second model, and the second model outputs the operation information corresponding to the voice information. As shown in FIG. 6, the TV can recognize, according to the second model, that the operation information (the classification result of the control command) is to turn up the volume. In this way, the TV can perform the operation corresponding to the recognized operation information, that is, turn up the volume.
  • the desk lamp, however, cannot recognize the operation information, or the operation information output by the desk lamp through the second model does not match the operation information (control instructions) that the desk lamp itself can execute.
  • according to the operation information (such as a control instruction) output by the second model, the desk lamp can determine that the user's voice information is not used to control itself, so the desk lamp does not need to respond to the user's voice information. Similarly, the air conditioner does not have to respond to the user's voice information.
  • in the prior art, the mobile phone needs to complete the whole process from feature information extraction to operation information recognition, which leads to a large amount of computation on the mobile phone and low voice control efficiency. In this embodiment of the present application, the feature information extraction process is decoupled from the operation information identification process: feature information extraction is performed by the mobile phone, and operation information identification is performed by each smart home device. Compared with the prior art, the mobile phone no longer performs operation information recognition, so the amount of computation is reduced, which can increase the running speed of the mobile phone and thereby improve the efficiency of voice control.
  • the voice control method provided by the embodiment of the present application includes:
  • the mobile phone detects that a user inputs voice information.
  • the voice information input by the user is "turn up the volume of the TV".
  • the mobile phone converts the voice information into characteristic information.
  • voice information is an analog signal.
  • the mobile phone needs to convert it into a digital signal through a coding model and extract feature information. Subsequently, other devices can recognize the operation information corresponding to the voice information based on the extracted feature information.
  • the feature information refers to distinguishable components obtained from the speech information; these distinguishable speech components can accurately describe the difference between one piece of speech and other speech.
  • the distinguishable components in the speech information include, but are not limited to, the sound spectrum and the phonemes of the sound spectrum.
  • the phonemes of the sound spectrum include, but are not limited to, formants in the sound spectrum.
  • the feature information in this embodiment of the present application is not limited to the items listed above. Owing to space limitations, this embodiment of the present application does not exhaustively list all possible feature information; any information that can play a role in identifying speech can be called feature information.
  • the mobile phone includes a first model, and the mobile phone can input voice information into the first model, and the first model calculates and outputs feature information corresponding to the voice information.
  • the training method of the first model may refer to the following embodiments.
  • the first model can be implemented as an encoding model (also called an encoding module, an encoding neural network, or other names), and the name does not constitute a limitation on the encoding model.
  • the encoding model can be regarded as a functional module on the mobile phone, and the functional module is used to convert the information corresponding to the voice information into the feature information of the voice.
  • the first model may also integrate other models, such as a VAD model.
  • the first model may also integrate other functions, and this embodiment of the present application does not limit whether the first model integrates other functions and specific types of other functions.
  • the second model in the embodiment of the present application can be implemented as a decoding model (also called a decoding module, a decoding neural network, or other names).
  • the decoding model can also be regarded as a functional module in a device (such as a TV), and the functional module is used to convert speech feature information into corresponding operation information.
  • the first model may also be referred to as a first model file, or by other names.
  • the second model may also be referred to as a second model file, or other names.
  • the name does not constitute a limitation on the first model and the second model.
  • the mobile phone broadcasts feature information.
  • each device connected to the mobile phone such as the TV receives feature information from the mobile phone.
  • the mobile phone does not perform the operation of identifying the operation information; instead, the devices controlled by the mobile phone complete the identification of the operation information. Therefore, the mobile phone does not know which device the user's voice information is intended to control, and it needs to broadcast the feature information to all connected devices. After another device recognizes the operation information corresponding to the feature information, it can judge whether the user's voice information is intended to control its own operation; if so, it responds to the user's voice information and performs the operation corresponding to the voice information, and if not, it neither responds to the user's voice information nor performs the operation corresponding to the voice information. A schematic sketch of this broadcast-and-decide flow is given below.
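  • As a rough illustration of this broadcast-and-decide flow, the following Python sketch sends the feature matrix as a UDP broadcast on the local network and lets each device classify it and act only when the result matches an operation it supports. The port number, JSON payload layout, the execute() stub, and the set of supported operations are all illustrative assumptions; the application does not prescribe any transport or message format.

```python
import json
import socket

import numpy as np

BROADCAST_ADDR = ("255.255.255.255", 50007)   # hypothetical port for feature frames


def broadcast_features(feature_matrix: np.ndarray) -> None:
    """Mobile phone side: broadcast the feature information of one utterance to all connected devices."""
    payload = json.dumps({"features": feature_matrix.tolist()}).encode("utf-8")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    sock.sendto(payload, BROADCAST_ADDR)
    sock.close()


def execute(operation: str) -> None:
    """Stand-in for the device actually performing the target operation."""
    print("executing:", operation)


def device_loop(second_model, supported_operations: set) -> None:
    """Device side: receive feature information, recognize the operation information, respond only if it matches."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", BROADCAST_ADDR[1]))
    while True:
        data, _ = sock.recvfrom(1 << 16)
        features = np.asarray(json.loads(data.decode("utf-8"))["features"])
        operation = second_model(features)        # second model: features -> operation information
        if operation in supported_operations:     # e.g. "turn_up_volume" on the TV
            execute(operation)
        # otherwise the voice information was not meant for this device and is ignored
```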
  • the TV converts the feature information into operation information corresponding to the TV.
  • the television includes a second model.
  • the TV receives the voice feature information (such as the voice feature matrix) from the mobile phone, it inputs the voice feature information into the second model, and the second model determines and outputs the operation information corresponding to the voice information.
  • the TV inputs speech feature information (such as a feature matrix) into the second model, and the second model calculates and determines that the operation information corresponding to the feature information is "turn up the volume of the TV".
  • the television responds to the operation information, and performs an operation corresponding to the operation information.
  • the TV recognizes that the operation information corresponding to the user's voice information is "turn up the volume of the TV", and since this operation information is operation information matched by the TV, the TV can respond to the operation information and perform the target operation corresponding to it, that is, turn up the volume.
  • the voice control method of the embodiment of the present application includes:
  • the mobile phone detects the voice information, and inputs the voice information into the VAD model.
  • the user inputs the voice information "turn up the volume of the TV" to the mobile phone, and after detecting the voice information, the mobile phone inputs the voice information into the VAD model.
  • the VAD model detects the human voice information in the voice information, and inputs the human voice information in the voice information into the coding model of the mobile phone.
  • the mobile phone may also collect other sounds in the environment when collecting the voice information. Therefore, in order to reduce the amount of data processing in the subsequent calculation process and avoid interference from environmental noise, the mobile phone can use the VAD model to distinguish the human voice information from the non-human voice information (noise) in the collected voice information.
  • the VAD model may be any type of model capable of performing speech classification tasks.
  • the collected original voice information can be divided into multiple segments (frames), such as 20 ms or 25 ms frames, and the voice information is input into the VAD model, and the VAD outputs the classification result of the voice information.
  • the VAD model outputs a classification result of whether each sub-frame belongs to human voice or non-human voice, and takes the speech belonging to human voice as an input of the subsequent coding model.
  • the VAD model can also be regarded as a functional module on the mobile phone, and the functional module has the function of recognizing human voice and non-human voice.
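  • A minimal sketch of this sub-frame classification is shown below. A simple energy threshold stands in for the VAD model (the embodiment allows any model capable of the speech classification task), and the 20 ms frame length, 16 kHz sampling rate, and threshold value are assumptions made only for illustration.

```python
import numpy as np

FRAME_MS = 20          # sub-frame length suggested in the text (20 ms or 25 ms)
SAMPLE_RATE = 16000    # assumed sampling rate


def split_into_frames(signal: np.ndarray, sample_rate: int = SAMPLE_RATE,
                      frame_ms: int = FRAME_MS) -> np.ndarray:
    """Divide the raw voice signal into fixed-length sub-frames."""
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(signal) // frame_len
    return signal[: n_frames * frame_len].reshape(n_frames, frame_len)


def vad_model(frame: np.ndarray, energy_threshold: float = 1e-3) -> bool:
    """Stand-in VAD: label a frame as human voice if its energy exceeds a threshold."""
    return float(np.mean(frame ** 2)) > energy_threshold


def keep_human_voice(signal: np.ndarray) -> np.ndarray:
    """Return only the sub-frames classified as human voice, the input of the coding model."""
    frames = split_into_frames(signal)
    voiced = [f for f in frames if vad_model(f)]
    return np.concatenate(voiced) if voiced else np.empty(0)
```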
  • the encoding model outputs feature information corresponding to the voice information according to the human voice information in the voice information.
  • the coding model divides the human voice information into multiple frames, and for each obtained frame, the coding model extracts the feature information of the human voice information.
  • the encoding model can convert the extracted feature information into a feature vector.
  • preprocessing includes, but is not limited to, dividing the speech information into multiple sub-frames. Afterwards, the following operations are performed for each sub-frame (a sketch of this per-frame pipeline is given after this list):
  • the spectrum corresponding to the sub-frame is obtained by fast Fourier transform (FFT), and the obtained spectrum is processed by the Mel filter bank to obtain the Mel spectrum corresponding to the sub-frame.
  • the linear natural frequency spectrum can be converted into a Mel frequency spectrum that reflects the characteristics of human hearing.
  • cepstrum analysis is performed on the Mel spectrum corresponding to the sub-frame to obtain the corresponding mel-frequency cepstral coefficients (MFCC), which can be used as the feature information corresponding to the speech information of the sub-frame.
  • the feature information of each sub-frame of the speech can be combined to obtain feature information (such as a feature vector) corresponding to the speech information.
  • the encoding model may use other methods for extracting feature information, and is not limited to the methods listed above.
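  • The per-sub-frame FFT / Mel filter bank / cepstrum steps listed above can be sketched with plain NumPy as follows. The FFT size, the number of Mel filters, and the number of cepstral coefficients are typical values chosen for illustration rather than values specified by the application.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular Mel filters spanning 0 Hz .. Nyquist."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank


def mfcc_of_frame(frame, sample_rate=16000, n_fft=512, n_filters=26, n_coeffs=13):
    """One sub-frame -> FFT power spectrum -> Mel spectrum -> log -> DCT -> MFCC vector."""
    spectrum = np.abs(np.fft.rfft(frame, n=n_fft)) ** 2            # FFT step
    mel_spec = mel_filterbank(n_filters, n_fft, sample_rate) @ spectrum
    log_mel = np.log(mel_spec + 1e-10)
    n = np.arange(n_filters)                                       # cepstrum analysis (DCT-II)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * n + 1) / (2 * n_filters))
    return dct @ log_mel                                           # first n_coeffs MFCCs


def feature_matrix(frames):
    """Stack per-frame MFCCs into the feature information for the whole utterance."""
    return np.stack([mfcc_of_frame(f) for f in frames])
```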
  • the communication module of the mobile phone obtains feature information corresponding to the voice information.
  • the communication module is used to support the mobile phone to communicate with other electronic devices.
  • the communication module can be connected to a network via wireless communication or wired communication to communicate with other personal terminals or a network server.
  • the wireless communication may employ at least one of cellular communication protocols such as 5G, Long Term Evolution (LTE), Long Term Evolution-Advanced (LTE-A), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Universal Mobile Telecommunications System (UMTS), Wireless Broadband (WiBro) or Global System for Mobile Communications (GSM).
  • Wireless communications may include, for example, short-range communications.
  • the short-range communication may include at least one of wireless fidelity (Wi-Fi), Bluetooth, near field communication (NFC), magnetic stripe transmission (MST), or GNSS.
  • the processing module (such as a processor) in the mobile phone can obtain the output result of the above encoding model, that is, obtain the feature information corresponding to the voice information, and send the feature information corresponding to the voice information to the communication module.
  • the following step S205 is executed by the communication module of the mobile phone.
  • the communication module of the mobile phone broadcasts feature information corresponding to the voice information.
  • the communication module of the TV receives the feature information of the voice from the mobile phone.
  • the decoding model of the television obtains feature information corresponding to the voice information.
  • after receiving the feature information corresponding to the voice information from the mobile phone, the communication module of the TV sends the feature information to the processing module of the TV, and the processing module inputs the feature information into the decoding model.
  • the decoding model of the television outputs the operation information corresponding to the feature information according to the feature information corresponding to the voice information.
  • the decoding model may be a model for performing classification tasks, and its output content is operation information corresponding to the speech information.
  • the decoding model in the embodiment of this application is different from the decoder in traditional ASR.
  • the decoder in traditional ASR converts the feature information corresponding to the voice information into text, and subsequent functional modules then convert the text into the corresponding operation information; in contrast, the decoding model in this embodiment of the present application can directly convert and classify the feature information corresponding to the voice information into the corresponding operation information. It can be seen that the conversion efficiency of the decoding model in this embodiment of the present application is higher. A minimal sketch of such a classifier is given below.
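  • The contrast can be made concrete with a toy classifier: instead of decoding the features into text, the second model maps the feature matrix directly to an operation label. The layer sizes, the pooling step, and the operation labels below (including an explicit "others" class for voice information not meant for this device) are illustrative assumptions.

```python
import numpy as np

# Hypothetical operation classes for a TV; the last class means "not for this device".
OPERATIONS = ["turn_up_volume", "turn_down_volume", "power_off", "others"]


class DecodingModel:
    """Second model: feature information -> operation information (a classification task)."""

    def __init__(self, feature_dim: int, hidden: int = 64):
        rng = np.random.default_rng(0)
        self.w1 = rng.normal(0.0, 0.1, (feature_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.1, (hidden, len(OPERATIONS)))
        self.b2 = np.zeros(len(OPERATIONS))

    def __call__(self, feature_matrix: np.ndarray) -> str:
        x = feature_matrix.mean(axis=0)                 # pool per-frame features, shape (feature_dim,)
        h = np.tanh(x @ self.w1 + self.b1)
        logits = h @ self.w2 + self.b2
        return OPERATIONS[int(np.argmax(logits))]       # operation information, no text step


# Usage: the TV feeds the received feature matrix straight into the classifier.
# model = DecodingModel(feature_dim=13)
# operation = model(received_feature_matrix)
```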
  • after the TV receives the feature information (such as a feature matrix) corresponding to the voice information from the mobile phone, it outputs the operation information (such as a control command) "turn up the volume of the TV" through the second model (such as a decoding model), and, according to the operation information, performs the corresponding operation, that is, turns up the volume.
  • after the air conditioner receives the feature information corresponding to the voice information from the mobile phone, it outputs, through the second model (such as the decoding model), the operation information "other (others) type of operation information" corresponding to the voice information; this operation information (such as a control instruction) indicates that the user's voice information is not used to control the air conditioner. Then, the air conditioner does not perform a corresponding operation according to the operation information. Similarly, after the desk lamp receives the feature information corresponding to the voice information, it outputs the corresponding operation information according to the feature information and determines that no corresponding operation needs to be performed.
  • the training of the decoding neural network can be handed over to various third-party manufacturers. Different vendors can train their own decoding neural networks. On the one hand, there is no need to frequently retrain the coding neural network, which greatly reduces the development cost of new equipment. On the other hand, since the mobile phone side only performs the feature information extraction part of speech recognition and does not perform the operation information recognition process, the computing load and power consumption of the mobile phone can be reduced, the computing speed can be increased, and the delay of the speech recognition process can be reduced.
  • the training methods of the above-mentioned first model and the second model are introduced as follows.
  • the first model is a model obtained by training based on at least one first sample data, the first sample data includes: first voice information, and feature information of the first voice information is known.
  • the second model is a model trained based on at least one second sample data, the second sample data includes: first feature information, and the operation information corresponding to the first feature information is known.
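  • The two kinds of training samples can be written down explicitly, as in the sketch below. The field names and types are illustrative; the point is that a first sample pairs voice information with known feature information, while a second sample pairs feature information with known operation information.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class FirstSampleData:
    """Sample used to train the first model (feature extraction)."""
    first_voice_info: np.ndarray     # first voice information, e.g. a waveform or its frames
    known_feature_info: np.ndarray   # the feature information of this voice sample, known in advance


@dataclass
class SecondSampleData:
    """Sample used to train the second model (operation information recognition)."""
    first_feature_info: np.ndarray   # first feature information, e.g. a feature vector or matrix
    known_operation_info: int        # the corresponding operation information, e.g. a class index
```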
  • Fig. 9 exemplarily shows a training method of the first model.
  • the training samples include voice information whose operation information is known (i.e., the first voice information).
  • each training sample also includes a label of the voice data, which is used to represent the operation information corresponding to the voice information; by training on multiple such samples, a model that can be used to extract feature information and identify the operation information corresponding to the voice information can be obtained.
  • the model can output the operation information corresponding to the voice information.
  • the trained model includes 32 layers of neurons.
  • the L1-L16 layers are used to extract the feature information corresponding to the voice information
  • the L17-L32 layers are used to identify the operation information corresponding to the voice information.
  • a neuron in a certain layer can be connected with one or more neurons in the next layer and output corresponding signals through the connections.
  • (1) in Figure 9 shows the weights corresponding to the connections between some neurons in the trained model; for example, the connection between the first neuron of the L1 layer and the first neuron of the L2 layer corresponds to the weight w11, the connection between the first neuron of the L1 layer and the second neuron of the L2 layer corresponds to the weight w12, and so on.
  • in order to improve the recognition accuracy of the model, the model can be evaluated and tested.
  • if the recognition rate of the model reaches a certain threshold, it indicates that the model has been trained.
  • if the recognition rate of the model is low, the model can continue to be trained until its recognition accuracy reaches a certain threshold.
  • the training process of the model can be performed on the device side (such as a terminal such as a mobile phone) or on the cloud side (such as a server).
  • Training can be offline training or online training.
  • the embodiment of the present application does not limit the specific training method of the model.
  • the first model used to extract feature information includes the 16 layers L1-L16. After voice data (also called voice information) is input to the first model, the first model can output the feature vector (also called feature information) corresponding to the voice data.
  • the encoder in the embodiment of this application is the encoder part of the trained encoder-decoder model; this is equivalent to pulling the encoder part out of an encoder-decoder model to form the first model, as sketched below.
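  • The following PyTorch-style sketch illustrates this "pull the encoder out" idea for the 32-layer example of Figure 9: the full network is trained end to end on (voice information, operation label) samples, and layers L1-L16 are then detached as the first model while the remaining layers form the second model. The layer widths, the use of nn.Sequential, and the number of operation classes are assumptions made only for the sketch.

```python
import torch
import torch.nn as nn

WIDTH, NUM_CLASSES = 128, 10   # assumed feature width and number of operation classes


def make_layer(width: int) -> nn.Module:
    return nn.Sequential(nn.Linear(width, width), nn.ReLU())


# Full 32-layer network trained with (voice information, operation label) samples.
full_model = nn.Sequential(
    *[make_layer(WIDTH) for _ in range(32)],
    nn.Linear(WIDTH, NUM_CLASSES),
)

# ... train full_model end to end here ...

# First model: layers L1-L16, used on the phone to turn voice data into feature vectors.
first_model = nn.Sequential(*list(full_model.children())[:16])

# Second model: the remaining layers, used on each device to recognize operation information.
second_model = nn.Sequential(*list(full_model.children())[16:])

with torch.no_grad():
    voice = torch.randn(1, WIDTH)            # placeholder for encoded voice input
    features = first_model(voice)            # feature information (feature vector)
    operation_logits = second_model(features)
```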
  • a training and usage method of the second model is exemplarily shown in FIG. 10 .
  • the first model is obtained first.
  • the mobile phone may upload the first model to the server.
  • other devices can obtain the first model from the server, and train the second model according to the first model.
  • the device may obtain the first model in the mobile phone in other ways, and this embodiment of the present application does not limit the specific method for the device to obtain the first model.
  • the output of the first model is used as an input of the second model to form a neural network for training.
  • the input of the first model is used as the input of the whole neural network
  • the output of the second model is taken as the output of the whole neural network.
  • the weight of the first model remains unchanged.
  • the output of the first model, that is, the feature information corresponding to the speech information, is used as the training sample, and the second model can be trained according to this training sample.
  • the trained second model has the function of outputting operation information according to the input feature information. Taking the television recognizing the operation information corresponding to the voice information through the second model as an example, as shown in (2) of FIG. 10, the television inputs the feature information into the second model, and the second model then outputs the operation information corresponding to the voice information (such as turning up the volume of the TV).
  • the device may also train the second model independently, that is, train the second model without obtaining the first model.
  • in this case, the training samples are also feature vectors corresponding to the speech information (an example of the first feature information), and the second model is obtained by training on these samples.
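  • A sketch of the composite training described above (the fixed first model feeding the trainable second model, with only the second model's weights updated) is given below in PyTorch. The stand-in first model, the layer sizes, the Adam optimizer, and the cross-entropy loss are illustrative assumptions.

```python
import torch
import torch.nn as nn

FEATURE_DIM, NUM_OPS = 128, 4   # assumed feature width and number of operation classes

# First model obtained from the phone manufacturer (e.g. downloaded via the server);
# here it is a stand-in module whose weights stay fixed during training.
first_model = nn.Sequential(nn.Linear(FEATURE_DIM, FEATURE_DIM), nn.ReLU())
for p in first_model.parameters():
    p.requires_grad_(False)      # the weight of the first model remains unchanged

# Second model trained by the device (e.g. TV) manufacturer.
second_model = nn.Sequential(nn.Linear(FEATURE_DIM, 64), nn.ReLU(), nn.Linear(64, NUM_OPS))
optimizer = torch.optim.Adam(second_model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()


def train_step(voice_batch: torch.Tensor, op_labels: torch.Tensor) -> float:
    """voice_batch: encoded voice samples; op_labels: known operation information."""
    with torch.no_grad():
        features = first_model(voice_batch)      # feature information from the fixed encoder
    logits = second_model(features)
    loss = loss_fn(logits, op_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)


# Usage with synthetic data:
# train_step(torch.randn(8, FEATURE_DIM), torch.randint(0, NUM_OPS, (8,)))
```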
  • distributed task processing scenarios include, but are not limited to: remote meeting scenarios (including but not limited to real-time translation scenarios), face recognition verification scenarios.
  • the face recognition model may be at least split into a first model and a second model.
  • a camera module (such as a camera) of the mobile phone includes a first model.
  • the first model is used to extract feature information of a human face in a human face image.
  • the second model is included in, for example, a processing module of the mobile phone.
  • the second model is used to output the recognition result of the human face according to the feature information of the human face.
  • Figure 12 shows an exemplary flow of the method of the embodiment of the present application in the face recognition scenario, the flow includes the following steps:
  • the camera module collects a face image input by a user.
  • the camera module inputs the face image into the first model, and the first model outputs feature information of the face image.
  • the camera module transmits the feature information of the face image to the processing module.
  • the processing module inputs the feature information of the face image into the second model, and the second model outputs a recognition result of the face.
  • the processing module judges whether the face is a legal face according to the recognition result of the face. If yes, execute S306; if not, execute S307.
  • for example, in the payment scenario, the user inputs a face image, and the mobile phone executes the payment operation when, by means of the first model in the camera module and the second model in the processing module, it judges that the face is a legitimate face.
  • in the screen unlocking scenario, the user inputs a face image, and the mobile phone unlocks the screen when the first model in the camera module and the second model in the processing module judge that the face is a legitimate face.
  • the above description takes the camera as a module on the mobile phone as an example.
  • the camera where the first model is located may also be in a module independent of the mobile phone, and the mobile phone includes the second model.
  • the external camera of the mobile phone can complete the face recognition process together with the mobile phone, and since the model has been split into the first model and the second model, the efficiency of face recognition can be improved.
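  • The same split can be sketched for face recognition: the camera module runs the first model to turn the face image into a feature vector, and the processing module runs the second model to judge whether the face belongs to a legitimate (enrolled) user. The placeholder embedding, cosine-similarity comparison, and threshold below are assumptions for illustration only.

```python
import numpy as np


def first_model(face_image: np.ndarray) -> np.ndarray:
    """Camera-module side (stand-in): map a face image to a fixed-length feature vector."""
    flat = face_image.astype(float).ravel()
    return flat[:128] / 255.0                      # placeholder embedding


def second_model(face_feature: np.ndarray, enrolled_templates, threshold: float = 0.8) -> bool:
    """Processing-module side: judge whether the face feature matches a legitimate (enrolled) user."""
    for template in enrolled_templates:
        denom = np.linalg.norm(face_feature) * np.linalg.norm(template) + 1e-10
        if float(face_feature @ template) / denom >= threshold:
            return True                            # legitimate face: e.g. execute payment / unlock screen
    return False                                   # not a legitimate face: do not execute the operation


# Usage: the camera module sends first_model(image) to the processing module,
# which calls second_model(feature, enrolled_templates) and acts on the result.
```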
  • a device can split one or more models used to perform one or more tasks into multiple sub-models, and deploy multiple sub-models in multiple modules of the device,
  • the model execution load of a single module is shared by means of the plurality of modules.
  • the embodiment of the present application does not limit the specific splitting method of the model, nor does it limit which modules are distributed and deployed after the model is split into multiple sub-models.
  • the existing model can be split into at least the first model and the second model, the first model is run on the speaking device, and the second model is run on the receiving device.
  • Figure 13 shows an exemplary flow of the method in the embodiment of the present application in the remote conference translation scenario, and the flow includes the following steps:
  • the audio collection module of mobile phone A collects the first voice information in the source language (namely, the first language), and inputs the first voice information into the first model of mobile phone A.
  • the audio collection module includes but is not limited to a microphone.
  • the first voice information in the source language is English "this meeting is”
  • the audio collection module of mobile phone A collects the English voice information of the speaker, and inputs the English voice information into the first model.
  • the first model extracts feature information of the first voice information.
  • the feature information of the English speech information is extracted, that is, the feature information corresponding to the English speech "this meeting is”.
  • the communication module of the mobile phone A obtains feature information of the first voice information.
  • the communication module of mobile phone A obtains the characteristic information of the first voice information from the first model, or the communication module of mobile phone A obtains the characteristic information of the first voice information from the processing module.
  • the communication module of the mobile phone A sends the characteristic information of the first voice information to the communication module of the mobile phone B.
  • the second model of the mobile phone B obtains feature information of the first voice information.
  • the second model obtains feature information of the first voice information from the communication module.
  • the processing module inputs the characteristic information into the second model, that is, the second model obtains the characteristic information of the first voice information from the processing module.
  • the second model determines subtitle information and/or second voice information in the target language (second language) corresponding to the first voice information according to the feature information of the first voice information.
  • the first language is different from or the same as the second language.
  • after receiving the feature information of the first voice information, mobile phone B can automatically input the feature information into the second model and, through the second model, translate the feature information corresponding to the English voice information into Chinese subtitles. For example, according to the English "this meeting is", the corresponding Chinese operation information (such as a control instruction) for "this meeting is" is output.
  • the second model may also output the recognition result of the operation information in English according to the feature information corresponding to the English voice information, for example, output the corresponding English operation information "this meeting is".
  • the processing module of the mobile phone B controls to display the subtitle information in the second language, and/or play the second voice information in the second language.
  • the processing module controls the display module to display translated Chinese subtitles "this meeting is”, and the display module can also display English subtitles "this meeting is”.
  • the processing module controls the audio output module (such as a speaker) to play the translated Chinese voice "this meeting is”, and the speaker can also play the English voice "this meeting is”.
  • mobile phone B only needs to run the process from source-language feature information to translation results, and does not need to run the process from source-language voice information to source-language feature information (that is, extracting feature information), which reduces the computing load of mobile phone B and can improve translation efficiency. A schematic sketch of mobile phone B's half of the pipeline is given below.
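  • The sketch below shows, under assumptions, only mobile phone B's half of this remote-conference pipeline: it receives the feature information of the English speech, lets a stand-in second model produce source- and target-language subtitles, and then displays and/or plays them. The fixed example phrase, the placeholder translation step, and the display/play stubs are not part of the application.

```python
import numpy as np


def translate_text(text: str) -> str:
    """Placeholder for the translation part of the second model (e.g. English to Chinese)."""
    return "<target-language subtitle for: " + text + ">"


def second_model(feature_matrix: np.ndarray) -> dict:
    """Stand-in decoding model on phone B: feature information -> subtitle information."""
    source_subtitle = "this meeting is"            # recognized source-language text (fixed here)
    return {"source": source_subtitle, "target": translate_text(source_subtitle)}


def display_subtitles(target: str, source: str) -> None:
    print("subtitle:", target, "|", source)        # stand-in for the display module


def play_speech(text: str) -> None:
    print("playing synthesized speech for:", text)  # stand-in for the audio output module


def on_features_received(feature_matrix: np.ndarray) -> None:
    """Phone B: decode the received features, show subtitles, optionally play translated speech."""
    result = second_model(feature_matrix)
    display_subtitles(result["target"], result["source"])
    play_speech(result["target"])
```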
  • the first model is a model trained based on at least one first sample data
  • the first sample data includes: first voice information
  • the feature information of the first voice information is already known
  • the second model is a model trained based on at least one second sample data
  • the second sample data includes: first feature information
  • the operation information corresponding to the first feature information is known .
  • more complex parametric models (including but not limited to machine learning models) or non-parametric models can be split into multiple sub-models of smaller granularity, and the multiple sub-models can be run separately in different modules of the same device, or run in different devices in the same network (such as the above-mentioned voice recognition scenario), or run in multiple devices in different networks (such as the above-mentioned remote conference scenario).
  • the embodiment of the present application does not limit the splitting granularity, splitting method, and which modules or devices are deployed after the splitting of the sub-models, which can be flexibly determined according to the characteristics of the scene and device type.
  • this embodiment of the present application can be applied to the scene of bone voiceprint recognition.
  • bone voiceprint technology, one of the biometric technologies, performs relatively well in terms of recognition rate, speed, and convenience.
  • the principle of identifying a person's identity is as follows: the person's voice information is collected, and the legality of the person's identity is verified according to the voice information. Since everyone's bone structure is unique, the echo of sound reflected between the bones is also unique. The echoes of sound reflected between the bones can be called bone voiceprints. Similar to the principle that fingerprints can be used to identify different people, bone voiceprints can be used to identify different users.
  • the model for bone voiceprint recognition can be split into two parts, one part (the first model) is set in such as a Bluetooth headset, and the other part (the second model ) is set in the phone.
  • the headset collects the user's voice information (for example, the user says "unlock the screen"), extracts the feature information of the voice signal (also called voice information) through the first model, and sends the feature information to the mobile phone.
  • the second model on the mobile phone identifies whether the voice is the voice of a legitimate user, and if so, the corresponding operation (such as unlocking the screen) is performed. A brief sketch of this split is given below.
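  • A brief sketch of the bone-voiceprint split, under assumptions analogous to the face-recognition sketch above: the earphone-side first model reduces the picked-up voice to a voiceprint feature vector and hands it to the phone, whose second model scores it against an enrolled template. The feature reduction, scoring rule, and threshold are illustrative only.

```python
import numpy as np


def headset_first_model(voice_signal: np.ndarray, dim: int = 64) -> np.ndarray:
    """Earphone side (stand-in): compress the voice signal into a bone-voiceprint feature vector."""
    trimmed = voice_signal[: len(voice_signal) - len(voice_signal) % dim]
    return trimmed.reshape(-1, dim).mean(axis=0)     # crude fixed-length summary


def phone_second_model(feature: np.ndarray, enrolled: np.ndarray, threshold: float = 0.9) -> bool:
    """Phone side: decide whether the voiceprint feature belongs to the legitimate user."""
    denom = np.linalg.norm(feature) * np.linalg.norm(enrolled) + 1e-10
    score = float(feature @ enrolled) / denom
    return score >= threshold                        # if True, perform the operation (e.g. unlock the screen)
```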
  • FIG. 14 shows the flow of the distributed voice control method provided by the embodiment of the present application. The method is applied to the first terminal, and the method includes:
  • in response to the voice information input by the user, the first terminal inputs the voice information into a first model, and obtains feature information corresponding to the voice information through the first model.
  • the first model exists in the first terminal
  • the second model exists in the second terminal.
  • the mobile phone receives the voice information "turn up the volume of the TV" input by the user, and outputs the feature information (that is, the feature matrix) of the voice information.
  • the first terminal sends feature information to the second terminal, so that the second terminal inputs the feature information into the second model, determines operation information corresponding to the voice information through the second model, and performs corresponding operations according to the operation information.
  • the second terminal includes a desk lamp, an air conditioner, and a TV connected to the mobile phone.
  • the mobile phone acquires the characteristic information corresponding to the voice information, it broadcasts the characteristic information to the desk lamp, the air conditioner, and the TV.
  • the desk lamp, the air conditioner, and the TV recognize the operation information (such as control instructions) through the second model. If the operation information identified by the TV matches the TV, the TV performs the target operation corresponding to the operation information "turn up the volume of the TV", that is, increases its own playback volume.
  • some steps in the method embodiments may be equivalently replaced by other possible steps.
  • some steps in the method embodiments may be optional, and may be deleted in some usage scenarios.
  • other possible steps may be added in the method embodiments.
  • the apparatus may be the above-mentioned electronic device (such as a folding screen mobile phone).
  • the apparatus may include: a display screen, memory and one or more processors.
  • the display screen, memory and processor are coupled.
  • the memory is used to store computer program code comprising computer instructions.
  • when the processor executes the computer instructions, the electronic device can execute the various functions or steps performed by the mobile phone in the foregoing method embodiments.
  • for the structure of the electronic device, reference may be made to the electronic device shown in FIG. 4 or FIG. 5.
  • the core structure of the electronic device may be represented as the structure shown in FIG. 15 , and the core structure may include: a processing module 1301 , an input module 1302 , a storage module 1303 , and a display module 1304 .
  • the components of FIG. 15 are exemplary only, and the electronic device may include more or fewer components than shown, or combine certain components, or separate certain components, or arrange different components.
  • the illustrated components can be realized in hardware, software or a combination of software and hardware.
  • the processing module 1301 may include at least one of a central processing unit (CPU), an application processor (Application Processor, AP) or a communication processor (Communication Processor, CP).
  • the processing module 1301 may perform operations or data processing related to control and/or communication of at least one of other elements of the user electronic device.
  • the processing module 1301 can be configured to control the content displayed on the main screen according to a certain trigger condition, or to determine what is displayed on the screen according to preset rules.
  • the processing module 1301 is also used to process the input instruction or data, and determine the display style according to the processed data.
  • the processing module 1301 is configured to, in response to the voice information input by the user, input the voice information into the first model and obtain the feature information corresponding to the voice information through the first model.
  • the processing module 1301 is configured to input the feature information into the second model, and pass the second model determining operation information corresponding to the voice information;
  • a processing module configured to perform corresponding operations according to the operation information.
  • the second terminal performs corresponding operations according to the operation information, including:
  • if it is determined that the operation information corresponding to the voice information is operation information matched by the second terminal, the second terminal performs the target operation according to the operation information corresponding to the voice information; and/or, if it is determined that the operation information corresponding to the voice information is not operation information matched by the second terminal, the second terminal discards the operation information.
  • the input module 1302 is configured to obtain instructions or data input by the user, and transmit the obtained instructions or data to other modules of the electronic device.
  • the input mode of the input module 1302 may include touch, gesture, approaching the screen, etc., and may also be voice input.
  • the input module may be, for example, a screen of the electronic device, which acquires the user's input operations, generates input signals according to the acquired input operations, and transmits the input signals to the processing module 1301.
  • the storage module 1303 may include a volatile memory and/or a nonvolatile memory.
  • the storage module is used to store at least one related instruction or data among other modules of the user terminal equipment, specifically, the storage module can store the first model and the second model.
  • the display module 1304 may include, for example, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, an Organic Light Emitting Diode (OLED) display, a Micro Electro Mechanical System (MEMS) display or an electronic paper display. Used to display user-viewable content (eg, text, images, videos, icons, symbols, etc.).
  • the structure shown in FIG. 15 may further include an output module (not shown in FIG. 15 ).
  • the output module can be used to output information, for example, to play and output voice information.
  • Output modules include but are not limited to modules such as speakers.
  • the structure shown in FIG. 15 may also include a communication module 1305, which is used to support the electronic device to communicate with other electronic devices.
  • the communication module can be connected to a network via wireless communication or wired communication to communicate with other personal terminals or a network server.
  • the wireless communication may employ at least one of cellular communication protocols such as 5G, Long Term Evolution (LTE), Long Term Evolution-Advanced (LTE-A), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Universal Mobile Telecommunications System (UMTS), Wireless Broadband (WiBro) or Global System for Mobile Communications (GSM).
  • Wireless communications may include, for example, short-range communications.
  • the short-range communication may include at least one of wireless fidelity (Wi-Fi), Bluetooth, near field communication (NFC), magnetic stripe transmission (MST), or GNSS.
  • the communication module 1305 is configured to send the characteristic information to the second terminal.
  • sending the feature information to the second terminal includes: broadcasting the feature information.
  • the communication module 1305 is configured to receive feature information corresponding to the voice information from the first terminal.
  • the embodiment of the present application also provides a chip system, as shown in FIG. 16 , the chip system includes at least one processor 1401 and at least one interface circuit 1402 .
  • the processor 1401 and the interface circuit 1402 may be interconnected through wires.
  • interface circuit 1402 may be used to receive signals from other devices, such as memory of an electronic device.
  • the interface circuit 1402 may be used to send signals to other devices (such as the processor 1401).
  • the interface circuit 1402 can read instructions stored in the memory, and send the instructions to the processor 1401 .
  • when the instructions are executed by the processor 1401, the electronic device may be made to execute the various steps in the foregoing embodiments.
  • the chip system may also include other discrete devices, which is not specifically limited in this embodiment of the present application.
  • the embodiment of the present application also provides a computer storage medium, the computer storage medium includes computer instructions, and when the computer instructions are run on the above-mentioned electronic device, the electronic device is made to perform various functions or steps performed by the mobile phone in the above-mentioned method embodiment.
  • the embodiment of the present application also provides a computer program product, which, when the computer program product is run on the computer, causes the computer to execute various functions or steps performed by the mobile phone in the above method embodiments.
  • the disclosed devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of modules or units is only a logical function division. In actual implementation, there may be other division methods.
  • multiple units or components may be combined or integrated into another device, or some features may be omitted or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • a unit described as a separate component may or may not be physically separated, and a component shown as a unit may be one physical unit or multiple physical units, which may be located in one place or distributed to multiple different places. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • an integrated unit is realized in the form of a software function unit and sold or used as an independent product, it can be stored in a readable storage medium.
  • the technical solutions of the embodiments of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for enabling a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods in the various embodiments of the present application.
  • the aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Abstract

A distributed voice control method and an electronic device, relating to the field of terminal technologies, capable of improving the efficiency of voice control. The method includes: a first terminal, in response to voice information input by a user, inputs the voice information into a first model and obtains, through the first model, feature information corresponding to the voice information; the first terminal sends the feature information to a second terminal, so that the second terminal inputs the feature information into a second model, determines, through the second model, operation information corresponding to the voice information, and performs a corresponding operation according to the operation information, where the first model exists in the first terminal and the second model exists in the second terminal.

Description

Distributed voice control method and electronic device
This application claims priority to Chinese Patent Application No. 202111234615.7, filed with the China National Intellectual Property Administration on October 22, 2021 and entitled "Distributed voice control method and electronic device", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of terminal technologies, and in particular, to a distributed voice control method and an electronic device.
Background
With the popularization of smart devices, more and more users can use various smart devices in various smart scenarios. The smart scenarios include voice control scenarios. In a voice control scenario, other devices in the distributed voice control system can be voice-controlled through a certain electronic device. For example, in the scenario shown in Fig. 1, the user inputs the voice information "turn on the TV" to the mobile phone; the mobile phone parses the operation information represented by the voice information (that is, the user wants to turn on the TV), generates a control signal, and sends the control signal to the TV so as to control the TV to turn on.
In some solutions, the mobile phone can parse the user's voice information by means of a machine learning model. However, since different devices may come from different manufacturers, when a device of a new type or from a new manufacturer establishes a wireless connection with the mobile phone, the mobile phone manufacturer usually needs to retrain the machine learning model so that the model can correctly parse the voice information used to control the device of the new type or from the new manufacturer. It can be seen that, in the prior art, frequently retraining the model leads to a large development workload for the mobile phone manufacturer, which needs to continuously retrain and maintain the entire model at a later stage. Moreover, the model running on the mobile phone is complex and heavily loaded, resulting in high processing latency and low voice control efficiency.
Summary
This application provides a distributed voice control method and an electronic device, which can improve the efficiency of voice control.
In order to achieve the above objective, the embodiments of this application provide the following technical solutions:
A first aspect provides a distributed voice control method, which can be applied to a first terminal or a component capable of implementing the functions of the first terminal (such as a chip system). The first terminal, in response to voice information input by a user, inputs the voice information into a first model and obtains, through the first model, feature information corresponding to the voice information, where the first model exists in the first terminal; the first terminal sends the feature information to a second terminal, so that the second terminal inputs the feature information into a second model, determines, through the second model, operation information corresponding to the voice information, and performs a corresponding operation according to the operation information, where the second model exists in the second terminal.
Compared with the prior art, in which the first terminal (such as a mobile phone) needs to complete the process from speech feature extraction to operation information recognition, leading to a large amount of computation on the first terminal and low voice control efficiency, the technical solution of this application decouples the extraction of feature information from the recognition of operation information in voice control scenarios such as smart home devices. For example, a complete model used for voice control can be split into at least a first model and a second model. The first model exists in the first terminal, and the first terminal can extract, through the first model, the feature information corresponding to the voice information. The second model exists in the second terminal, and the second terminal (such as each smart home device controlled by the mobile phone) can recognize the operation information through the second model. Since the first terminal no longer performs all the steps of voice control, for example, no longer performs the operation information recognition, the amount of computation is reduced, the running speed of the first terminal can be increased, and the efficiency of voice control can thereby be improved.
在一种可能的设计中,第一模型是基于至少一个第一样本数据训练得到的模型, 第一样本数据包括:第一语音信息,第一语音信息的特征信息是已知的,和/或,
第二模型是基于至少一个第二样本数据训练得到的模型,第二样本数据包括:第一特征信息,第一特征信息对应的操作信息是已知的。
在一种可能的设计中,第一终端、至少一个第二终端在同一局域网中;
或者,第一终端、至少一个第二终端在不同局域网中。
在一种可能的设计中,第一终端向第二终端发送特征信息,包括:第一终端向第二终端广播特征信息。
在一种可能的设计中,语音信息对应的特征信息,包括语音信息对应的声谱、声谱的音素。
第二方面提供一种分布式语音控制方法,方法包括:
第二终端从第一终端接收语音信息对应的特征信息;特征信息是第一终端将语音信息输入第一模型,并通过第一模型获得的,所述第一模型存在于所述第一终端;
第二终端将特征信息输入第二模型,并通过第二模型确定语音信息对应的操作信息,所述第二模型存在于所述第二终端;
第二终端根据操作信息执行相应操作。
在一种可能的设计中,第二终端根据操作信息执行相应操作,包括:
若确定语音信息对应的操作信息为第二终端匹配的操作信息,则第二终端根据语音信息对应的操作信息执行目标操作;和/或,
若确定语音信息对应的操作信息不是第二终端匹配的操作信息,则第二终端丢弃操作信息。
在一种可能的设计中,第一模型是基于至少一个第一样本数据训练得到的模型,第一样本数据包括:第一语音信息,第一语音信息的特征信息是已知的;和/或,第二模型是基于至少一个第二样本数据训练得到的模型,第二样本数据包括:第一特征信息,第一特征信息对应的操作信息是已知的。
在一种可能的设计中,第一终端与第二终端在同一局域网中,或者,第一终端与第二终端在不同局域网中。
在一种可能的设计中,语音信息对应的特征信息,包括语音信息对应的声谱、声谱的音素。
第三方面提供一种语音识别方法,可以应用于第一终端或实现第一终端功能的组件(比如芯片系统)中。以第一终端实现该方法为例,该方法包括:
第一终端接收用户输入的第一语言的第一语音信息;
所述第一终端响应于所述第一语音信息,将所述第一语音信息输入第一模型,并通过所述第一模型获得所述第一语音信息对应的特征信息;所述第一模型存在于所述第一终端;
所述第一终端向第二终端发送所述特征信息,以使得所述第二终端将所述特征信息输入第二模型,并通过所述第二模型确定所述第一语音信息对应的字幕信息,所述第二模型存在于所述第二终端。
在一种可能的设计中,所述第一模型是基于至少一个第一样本数据训练得到的模型,第一样本数据包括:第一语音信息,所述第一语音信息的特征信息是已知的;和/ 或,所述第二模型是基于至少一个第二样本数据训练得到的模型,第二样本数据包括:第一特征信息,所述第一特征信息对应的操作信息是已知的。
在一种可能的设计中,所述字幕信息为第二语言的字幕信息。
在一种可能的设计中,所述第一语言与所述第二语言不同。
该方法可应用在语音转字幕的场景中,比如远程会议中,第二终端可能需要将使用第一终端的说话者的语音信息生成字幕,并显示在屏幕上,以便于更清晰的了解、获知使用第一终端的说话者的讲话内容。进一步的,在第二终端开启语音翻译功能的情况下,第二终端可以根据第一语音信息的特征信息,将使用第一终端的说话者的第一语音信息(比如英文的语音信息)翻译为相应语种(比如中文)的字幕,进而能够让使用第二终端的用户更加了解对端说话者的讲话含义。
此外,由于语音转字幕的操作由第一终端和第二终端共同实现,第一终端无需负责将语音信息转化为相应操作信息,因此,第一终端的计算量有所降低,能够提升第一终端的运行速度,进而提升语音转字幕的效率。
在一种可能的设计中,所述第二终端包括开启语音翻译功能的终端。
在一种可能的设计中,所述第一终端向第二终端发送所述特征信息,包括:所述第一终端广播所述特征信息。
第四方面提供一种语音识别方法,可以应用于第二终端或实现第二终端功能的组件(比如芯片系统)中。以第二终端实现该方法为例,该方法包括:
第二终端接收第一语音信息对应的特征信息;所述第一语音信息为第一语言的语音信息;
所述第二终端将所述特征信息输入第二模型,并通过所述第二模型确定所述第一语音信息对应的字幕信息;所述第二模型存在于所述第二终端。
在一种可能的设计中,所述第一模型是基于至少一个第一样本数据训练得到的模型,第一样本数据包括:第一语音信息,所述第一语音信息的特征信息是已知的;和/或,所述第二模型是基于至少一个第二样本数据训练得到的模型,第二样本数据包括:第一特征信息,所述第一特征信息对应的操作信息是已知的。
在一种可能的设计中,所述字幕信息为第二语言的字幕信息。
在一种可能的设计中,所述第一语言与所述第二语言不同。
在一种可能的设计中,所述方法还包括:确定所述第一语音信息对应的第二语言的第二语音信息,并播放第二语言的第二语音信息。其中,所述第一语言与所述第二语言不同。类似于同声传译,该方案中,第二终端可以将使用第一终端的说话者的第一语音信息(英文语音信息)翻译为第二语音信息(中文语音信息),并播放第二语音信息,同时还可以显示相应语种的字幕(比如中文字幕)。或者,第二终端也可以播放双语种的语音信息,同时显示双语种的字幕。或者,第二终端播放单语种的语音信息,显示双语种的字幕信息,或者,第二终端播放双语种的语音信息,显示单语种的字幕信息。本申请的技术方案对此不做限制。
在一种可能的设计中,所述第二终端包括开启语音翻译功能的终端。
在一种可能的设计中,所述第一终端向第二终端发送所述特征信息,包括:所述第二终端广播所述特征信息。
第五方面提供一种第一终端,包括:
处理模块,用于响应于用户输入的语音信息,将语音信息输入第一模型,并通过第一模型获得语音信息对应的特征信息;所述第一模型存在于所述第一终端;
通信模块,用于向第二终端发送特征信息,以使得所述第二终端将特征信息输入第二模型,并通过第二模型确定语音信息对应的操作信息,以及根据操作信息执行相应操作,所述第二模型存在于所述第二终端。
在一种可能的设计中,第一模型是基于至少一个第一样本数据训练得到的模型,第一样本数据包括:第一语音信息,第一语音信息的特征信息是已知的;和/或,第二模型是基于至少一个第二样本数据训练得到的模型,第二样本数据包括:第一特征信息,第一特征信息对应的操作信息是已知的。
在一种可能的设计中,第一终端、至少一个第二终端在同一局域网中;
或者,第一终端、至少一个第二终端在不同局域网中。
在一种可能的设计中,通信模块,用于向第二终端发送特征信息,包括:第一终端广播特征信息。
在一种可能的设计中,语音信息对应的特征信息,包括语音信息对应的声谱、声谱的音素。
第六方面提供一种第二终端,包括:
通信模块,用于从第一终端接收语音信息对应的特征信息;特征信息是第一终端将语音信息输入第一模型,并通过第一模型获得的;所述第一模型存在于所述第一终端;
处理模块,用于将特征信息输入第二模型,并通过第二模型确定语音信息对应的操作信息;所述第二模型存在于所述第二终端;
处理模块,用于根据操作信息执行相应操作。
在一种可能的设计中,第二终端根据操作信息执行相应操作,包括:
若确定语音信息对应的操作信息为第二终端匹配的操作信息,则第二终端根据语音信息对应的操作信息执行目标操作;和/或,若确定语音信息对应的操作信息不是第二终端匹配的操作信息,则第二终端丢弃操作信息。
在一种可能的设计中,第一模型是基于至少一个第一样本数据训练得到的模型,第一样本数据包括:第一语音信息,第一语音信息的特征信息是已知的;和/或,第二模型是基于至少一个第二样本数据训练得到的模型,第二样本数据包括:第一特征信息,第一特征信息对应的操作信息是已知的。
在一种可能的设计中,第一终端与第二终端在同一局域网中,或者,第一终端与第二终端在不同局域网中。
在一种可能的设计中,语音信息对应的特征信息,包括语音信息对应的声谱、声谱的音素。
第七方面提供一种第一终端,包括:
输入模块,用于接收用户输入的第一语言的第一语音信息;
处理模块,用于响应于所述第一语音信息,将所述第一语音信息输入第一模型,并通过所述第一模型获得所述第一语音信息对应的特征信息;所述第一模型存在于所 述第一终端;
通信模块,用于向第二终端发送所述特征信息,以使得所述第二终端中将所述特征信息输入第二模型,并通过所述第二模型确定所述第一语音信息对应的字幕信息;所述第二模型存在于所述第二终端。
在一种可能的设计中,所述第一模型是基于至少一个第一样本数据训练得到的模型,第一样本数据包括:第一语音信息,所述第一语音信息的特征信息是已知的;和/或,所述第二模型是基于至少一个第二样本数据训练得到的模型,第二样本数据包括:第一特征信息,所述第一特征信息对应的操作信息是已知的。
在一种可能的设计中,所述字幕信息为第二语言的字幕信息。
在一种可能的设计中,所述第一语言与所述第二语言不同。
在一种可能的设计中,所述第二终端包括开启语音翻译功能的终端。
在一种可能的设计中,所述通信模块,用于向所述第二终端发送所述特征信息,包括:广播所述特征信息。
第八方面提供一种第二终端,包括:
输入模块,用于接收第一语音信息对应的特征信息;所述第一语音信息为第一语言的语音信息;所述特征信息是第一终端将第一语音信息输入第一模型,并通过第一模型获得的;所述第一模型存在于所述第一终端;
处理模块,用于将所述特征信息输入第二模型,并通过所述第二模型确定所述第一语音信息对应的字幕信息,所述第二模型存在于所述第二终端。
在一种可能的设计中,所述第一模型是基于至少一个第一样本数据训练得到的模型,第一样本数据包括:第一语音信息,所述第一语音信息的特征信息是已知的;和/或,所述第二模型是基于至少一个第二样本数据训练得到的模型,第二样本数据包括:第一特征信息,所述第一特征信息对应的操作信息是已知的。
在一种可能的设计中,所述字幕信息为第二语言的字幕信息。
在一种可能的设计中,所述第一语言与所述第二语言不同。
在一种可能的设计中,所述处理模块,还用于确定所述第一语音信息对应的第二语言的第二语音信息;
输出模块,用于播放第二语言的第二语音信息。其中,所述第一语言与所述第二语言不同。
在一种可能的设计中,所述第二终端包括开启语音翻译功能的终端。
在一种可能的设计中,所述通信模块,用于向所述第二终端发送所述特征信息,包括:广播所述特征信息。
第九方面提供一种电子设备,该电子设备具有实现如上述任意方面及其中任一种可能的实现方式中的分布式语音控制方法的功能。该功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。该硬件或软件包括一个或多个与上述功能相对应的模块。
第十方面提供一种计算机可读存储介质,包括计算机指令,当计算机指令在电子设备上运行时,使得电子设备执行如上述任意方面及其中任一种可能的实现方式中任一项的分布式语音控制方法。
第十一方面提供一种计算机程序产品,当计算机程序产品在电子设备上运行时,使得电子设备执行如任意方面及其中任一种可能的实现方式中任一项的分布式语音控制方法。
第十二方面提供一种电路系统,电路系统包括处理电路,处理电路被配置为执行如上述任意方面及其中任一种可能的实现方式中的分布式语音控制方法。
第十三方面提供一种第一终端,包括:显示屏;一个或多个处理器;一个或多个存储器;存储器存储有一个或多个程序,当一个或者多个程序被处理器执行时,使得第一终端执行上述任一方面任一的方法。
第十四方面提供一种第二终端,包括:显示屏;一个或多个处理器;一个或多个存储器;存储器存储有一个或多个程序,当一个或者多个程序被处理器执行时,使得第二终端执行如上述任一方面任一设计的方法。
第十五方面提供一种芯片系统,包括至少一个处理器和至少一个接口电路,至少一个接口电路用于执行收发功能,并将指令发送给至少一个处理器,当至少一个处理器执行指令时,至少一个处理器执行如上述任意方面及其中任一种可能的实现方式中的分布式语音控制方法。
Brief Description of the Drawings
Fig. 1 is a schematic flowchart of a voice control method according to an embodiment of this application;
Fig. 2A and Fig. 2B are schematic flowcharts of voice control methods according to embodiments of this application;
Fig. 3 is a schematic diagram of the architecture of a system according to an embodiment of this application;
Fig. 4 and Fig. 5 are schematic structural diagrams of electronic devices according to embodiments of this application;
Fig. 6 to Fig. 8 are schematic flowcharts of voice control methods according to embodiments of this application;
Fig. 9 is a schematic diagram of a training method of a first model according to an embodiment of this application;
Fig. 10 is a schematic diagram of a training method of a second model according to an embodiment of this application;
Fig. 11 and Fig. 12 are schematic flowcharts of face recognition methods according to embodiments of this application;
Fig. 13 is a schematic flowchart of a voice information translation method according to an embodiment of this application;
Fig. 14 is a schematic flowchart of a voice control method according to an embodiment of this application;
Fig. 15 is a schematic diagram of an apparatus according to an embodiment of this application;
Fig. 16 is a schematic diagram of a chip system according to an embodiment of this application.
具体实施方式
图2A示出了现有的一种语音识别流程,以用户通过手机语音控制电视调高音量为例,手机将语音信息“调高电视的音量”输入语音边界检测(voice activity detection,VAD)模型,由VAD模型截取语音中的人声(speech),并将语音中的人声作为自动语音识别(automatic speech recognition,ASR)模型的输入。ASR模型将输入的声音信号转换为文字并输出。文字经过自然语言理解(natural language understanding,NLU)模型或正则匹配,转换为文字对应的用户操作信息。之后,手机根据用户操作信息(即调高电视的音量),生成控制信号,并将控制信号发送给电视,电视根据控制信号调高音量。
在图2A对应的实现方式中,如果有新类型的设备(比如与手机属于不同厂商的设备)与手机建立连接,那么,考虑到手机与新设备之间的兼容性等因素,手 机厂商可以重新训练用于语音识别的NLU模型或更新正则匹配。重新训练的NLU模型或正则匹配可以打包在用于实现语音控制的应用程序(比如用于管控智能家居的应用程序)的安装包中,以便用户可以通过更新应用程序等方式下载新版本应用程序到手机上,进而通过新应用程序使用相关模型处理人工智能任务(比如语音识别任务)。示例性的,手机当前与电视、音箱连接。手机可以通过智能家居APP控制电视、音箱。手机检测到新类型的设备(比如智能台灯)与自身建立连接,并将检测到新类型设备的消息上报至服务器。手机厂商通过服务器获知有新类型的设备与手机建立连接后,重新训练NLU模型。手机厂商训练好模型后,可以将训练好的模型打包在智能家居APP的安装包中,并将更新的智能家居APP存储在服务器中。用户可以下载更新的智能家居APP到手机中,手机并通过该更新的APP,控制网络中新增的智能台灯,比如,用户可以通过语音信息控制智能台灯开启、关闭、调节智能台灯的亮度。
从用户的角度,目前的技术方案中,手机厂商频繁训练模型,意味着用户需要频繁更新应用,用户体验感差。从手机角度,目前的技术方案中,手机处理语音识别任务,通常需要完成包括操作信息识别在内的任务,致使手机的负载较高,处理延时也较高,语音控制的效率较低。
图2B给出了现有的另一种语音识别方案。该方案中,使用口语理解(spoken language understanding,SLU)模型替代上述ASR模型以及NLU模型(或正则匹配)。SLU模型可以直接将声音信号转换为用户操作信息。该方案,虽然能够直接将声音信号转换为用户操作信息,但是,在检测到有新类型设备与手机建立连接时,仍需重新训练SLU模型,后期的模型维护成本仍然较高。其次,随着与手机建立连接的设备的类型、数目增多,SLU模型需要识别的操作信息越来越多,需要复杂的模型结构支撑,导致手机运行速度慢。此外,SLU模型需要精确的将语音命令作为输入,日常闲聊时容易产生误识别。
上述图2A、图2B的技术方案,手机端均需完成包括操作信息识别在内的繁多任务,使得手机的负载高,并且,每次检测到新类型设备与手机建立连接时,手机厂商均需重新开发训练新的神经网络,以匹配新类型设备。可见,现有的语音识别方案中,手机的负载均较高,且处理延时也较高,导致语音控制的效率较低。
为了提升语音控制的效率,本申请实施例提供一种语音识别方法。该方法可适用于需要进行语音控制的系统中。如图3所示,为本申请实施例提供的一种系统架构的示例图,该系统包括一个或多个电子设备,比如电子设备100和电子设备200(比如智能家居设备1-3)。
其中,电子设备之间可以建立连接关系。可选的,设备之间建立连接的方式包括但不限于如下一种或多种:通过扫描二维码或条形码建立通信连接、通过无线保真(wireless fidelity,Wi-Fi)协议,蓝牙等通信协议建立连接、通过近距离通信服务(nearby service)建立连接。建立通信连接之后,设备之间可以进行数据和/或信令传输。本申请实施例并不限制电子设备之间建立连接的方式。
在一些场景中,可以由某个设备对连接的其他设备进行语音控制。以通过手机语音控制智能家居设备为例,用户向手机100输入语音信息“调高电视的音量”, 手机100提取该语音信息对应的特征信息,并将特征信息发送给手机100连接的智能家居设备1-3,由智能家居设备1-3对特征信息进行处理,得到该语音信息对应的操作信息,并根据语音信息对应的操作信息判断是否需要进行响应。应理解,操作信息包括但不限于操作指令,控制指令。可选的,操作信息还包括智能家居设备根据特征信息得到的分类结果,比如,对不同的操作指令进行分类。智能家居设备可以根据操作信息执行相应操作。不同类型的操作信息(比如不同的控制指令),用于控制智能家居设备执行不同的操作。
具体的,智能家居设备3(电视)接收特征信息后,对特征信息进行处理,确定语音信息对应的操作信息是“想要调高电视的音量”,并执行该操作信息对应的操作,即调高音量。智能家居设备1(台灯)从手机100接收特征信息之后,对特征信息进行处理,得到语音信息对应的操作信息,并根据操作信息确定不执行相应操作。可选的,台灯丢弃操作信息。类似的,智能家居设备2(空调)也不执行相应操作。该过程中,识别操作信息(操作指令)的步骤是由各智能家居设备完成,无需在手机中完成,从而能够降低手机的计算量,提升智能语音控制流程的效率。
可选的,手机提取语音信息对应的特征信息,可以实现为:手机将语音信息对应的语音信息输入第一模型,由第一模型输出语音信息对应的特征信息。第一模型,用于将语音信息转化为对应的特征信息。
可选的,智能家居设备对来自手机的特征信息进行处理,得到语音信息对应的操作信息(比如控制指令),可以实现为:智能家居设备将来自手机的特征信息输入第二模型,由第二模型输出语音信息对应的操作信息。第二模型,用于将特征信息转化为对应的操作信息。第一模型、第二模型以及特征信息等将在下文中给出详细介绍。
在本申请实施例中,上述电子设备还可以称为终端。
可选的,该系统还包括一个或多个服务器300。服务器可以与电子设备建立连接。在一些实施例中,电子设备之间可以通过服务器进行连接。比如,在图1所示系统中,手机100可通过服务器300对智能家居设备进行远程控制。
在一些实施例中,第一模型、第二模型可以由服务器300训练,服务器300训练完第一模型、第二模型后,可以将训练好的第一模型、第二模型下发至各个终端。在另一些实施例中,第一模型、第二模型可以由终端训练,比如由手机训练。
可选的,第一模型、第二模型可以是基于任意算法得到的模型,比如可以是基于神经网络的模型,可以为卷积神经网络(Convolutional Neural Networks,CNN)、循环神经网络(Recurrent Neural Network,RNN)、深度神经网络(Deep Neural Networks,DNN)、多层感知器(Multi-Layer Perceptron,MLP)和梯度提升树(Gradient Boosting Decison Tree,GBDT)中的一种或多种的组合。
示例性的,上述电子设备100、电子设备200可以为手机、平板电脑、个人计算机(personal computer,PC)、个人数字助理(personal digital assistant,PDA)、智能手表、上网本、可穿戴电子设备、增强现实技术(augmented reality, AR)设备、虚拟现实(virtual reality,VR)设备、车载设备、智能汽车、智能音响、机器人、耳机、摄像头等可以用于语音控制或被语音控制的设备,本申请对该电子设备100、电子设备200的具体形式不做特殊限制。
本申请的说明书以及附图中的术语“第一”和“第二”等是用于区别不同的对象,或者用于区别对同一对象的不同处理。“第一”、“第二”等字样可以对功能和作用基本相同的相同项或相似项进行区分。例如,第一设备和第二设备仅仅是为了区分不同的设备,并不对其先后顺序进行限定。本领域技术人员可以理解“第一”、“第二”等字样并不对数量和执行次序进行限定,并且“第一”、“第二”等字样也并不限定一定不同。“至少一个”是指一个或者多个,“多个”是指两个或两个以上。“和/或”,描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B的情况,其中A,B可以是单数或者复数。字符“/”一般表示前后关联对象是一种“或”的关系。“以下至少一项(个)”或其类似表达,是指的这些项中的任意组合,包括单项(个)或复数项(个)的任意组合。例如,a,b,或c中的至少一项(个),可以表示:a,b,c,a-b,a-c,b-c,或a-b-c,其中a,b,c可以是单个,也可以是多个。
此外,本申请的描述中所提到的术语“包括”和“具有”以及它们的任何变形,操作信息在于覆盖不排他的包含。例如包含了一系列步骤或单元的过程、方法、系统、产品或设备没有限定于已列出的步骤或单元,而是可选地还包括其他没有列出的步骤或单元,或可选地还包括对于这些过程、方法、产品或设备固有的其它步骤或单元。
需要说明的是,本申请实施例中,“示例性的”或者“例如”等词用于表示作例子、例证或说明。本申请实施例中被描述为“示例性的”或者“例如”的任何实施例或设计方案不应被解释为比其它实施例或设计方案更优选或更具优势。确切而言,使用“示例性的”或者“例如”等词旨在以具体方式呈现相关概念。
本申请的说明书以及附图中“的(英文:of)”、“相应的(英文:corresponding,relevant)”和“对应的(英文:corresponding)”有时可以混用,应当指出的是,在不强调其区别时,其所要表达的含义是一致的。
以电子设备100为手机为例,图4示出了电子设备100的结构示意图。
电子设备100可以包括处理器110,外部存储器接口120,内部存储器121,通用串行总线(universal serial bus,USB)接口130,充电管理模块140,电源管理模块141,电池142,天线1,天线2,移动通信模块150,无线通信模块160,音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,传感器模块180,按键190,马达191,指示器192,摄像头193,显示屏194,以及用户标识模块(subscriber identification module,SIM)卡接口195等。其中传感器模块180可以包括压力传感器180A,陀螺仪传感器180B,气压传感器180C,磁传感器180D,加速度传感器180E,距离传感器180F,接近光传感器180G,指纹传感器180H,温度传感器180J,触摸传感器180K,环境光传感器180L,骨传导传感器180M等。
可以理解的是,本发明实施例示意的结构并不构成对电子设备100的具体限定。在本申请另一些实施例中,电子设备100可以包括比图示更多或更少的部件,或者组合某些部件,或者拆分某些部件,或者不同的部件布置。图示的部件可以以硬件,软件或软件和硬件的组合实现。
处理器110可以包括一个或多个处理单元,例如:处理器110可以包括应用处理器(application processor,AP),调制解调处理器,图形处理器(graphics processing unit,GPU),图像信号处理器(image signal processor,ISP),控制器,视频编解码器,数字信号处理器(digital signal processor,DSP),基带处理器,和/或神经网络处理器(neural-network processing unit,NPU)等。其中,不同的处理单元可以是独立的器件,也可以集成在一个或多个处理器中。
控制器可以根据指令操作码和时序信号,产生操作控制信号,完成取指令和执行指令的控制。
处理器110中还可以设置存储器,用于存储指令和数据。在一些实施例中,处理器110中的存储器为高速缓冲存储器。该存储器可以保存处理器110刚用过或循环使用的指令或数据。如果处理器110需要再次使用该指令或数据,可从所述存储器中直接调用。避免了重复存取,减少了处理器110的等待时间,因而提高了系统的效率。
在一些实施例中,处理器110可以包括一个或多个接口。
在本申请的一些实施例中,电子设备100对语音信息进行处理,得到特征信息的过程,以及电子设备200对来自电子设备100的特征信息进行处理,得到语音信息对应的操作信息的过程中涉及的部分或全部数据处理也可在电子设备100中的处理器110中实现。电子设备100也称为第一终端,电子设备200也称为第二终端。
可以理解的是,本发明实施例示意的各模块间的接口连接关系,只是示意性说明,并不构成对电子设备100的结构限定。在本申请另一些实施例中,电子设备100也可以采用上述实施例中不同的接口连接方式,或多种接口连接方式的组合。
充电管理模块140用于从充电器接收充电输入。其中,充电器可以是无线充电器,也可以是有线充电器。在一些有线充电的实施例中,充电管理模块140可以通过USB接口130接收有线充电器的充电输入。在一些无线充电的实施例中,充电管理模块140可以通过电子设备100的无线充电线圈接收无线充电输入。充电管理模块140为电池142充电的同时,还可以通过电源管理模块141为电子设备供电。
电源管理模块141用于连接电池142,充电管理模块140与处理器110。电源管理模块141接收电池142和/或充电管理模块140的输入,为处理器110,内部存储器121,显示屏194,摄像头193,和无线通信模块160等供电。
电子设备100的无线通信功能可以通过天线1,天线2,移动通信模块150,无线通信模块160,调制解调处理器以及基带处理器等实现。
天线1和天线2用于发射和接收电磁波信号。电子设备100中的每个天线可用于覆盖单个或多个通信频带。不同的天线还可以复用,以提高天线的利用率。例如:可以将天线1复用为无线局域网的分集天线。在另外一些实施例中,天线可以和调谐开关结合使用。
移动通信模块150可以提供应用在电子设备100上的包括2G/3G/4G/5G/6G等无线通信的解决方案。移动通信模块150可以包括至少一个滤波器,开关,功率放大器,低噪声放大器(low noise amplifier,LNA)等。移动通信模块150可以由天线1接收电磁波,并对接收的电磁波进行滤波,放大等处理,传送至调制解调处理器进行解调。移动通信模块150还可以对经调制解调处理器调制后的信号放大,经天线1转为电磁波辐射出去。在一些实施例中,移动通信模块150的至少部分功能模块可以被设置于处理器110中。在一些实施例中,移动通信模块150的至少部分功能模块可以与处理器110的至少部分模块被设置在同一个器件中。
调制解调处理器可以包括调制器和解调器。其中,调制器用于将待发送的低频基带信号调制成中高频信号。解调器用于将接收的电磁波信号解调为低频基带信号。随后解调器将解调得到的低频基带信号传送至基带处理器处理。低频基带信号经基带处理器处理后,被传递给应用处理器。应用处理器通过音频设备(不限于扬声器170A,受话器170B等)输出声音信号,或通过显示屏194显示图像或视频。在一些实施例中,调制解调处理器可以是独立的器件。在另一些实施例中,调制解调处理器可以独立于处理器110,与移动通信模块150或其他功能模块设置在同一个器件中。
无线通信模块160可以提供应用在电子设备100上的包括无线局域网(wireless local area networks,WLAN)(如无线保真(wireless fidelity,Wi-Fi)网络),蓝牙(bluetooth,BT),全球导航卫星系统(global navigation satellite system,GNSS),调频(frequency modulation,FM),近距离无线通信技术(near field communication,NFC),红外技术(infrared,IR)等无线通信的解决方案。无线通信模块160可以是集成至少一个通信处理模块的一个或多个器件。无线通信模块160经由天线2接收电磁波,将电磁波信号调频以及滤波处理,将处理后的信号发送到处理器110。无线通信模块160还可以从处理器110接收待发送的信号,对其进行调频,放大,经天线2转为电磁波辐射出去。
在一些实施例中,电子设备100的天线1和移动通信模块150耦合,天线2和无线通信模块160耦合,使得电子设备100可以通过无线通信技术与网络以及其他设备通信。
电子设备100通过GPU,显示屏194,以及应用处理器等实现显示功能。GPU为图像处理的微处理器,连接显示屏194和应用处理器。GPU用于执行数学和几何计算,用于图形渲染。处理器110可包括一个或多个GPU,其执行程序指令以生成或改变显示信息。
显示屏194用于显示图像,视频等。显示屏194包括显示面板。在一些实施例中,电子设备100可以包括1个或N个显示屏194,N为大于1的正整数。
电子设备100可以通过ISP,摄像头193,视频编解码器,GPU,显示屏194以及应用处理器等实现拍摄功能。
ISP用于处理摄像头193反馈的数据。例如,拍照时,打开快门,光线通过镜头被传递到摄像头感光元件上,光信号转换为电信号,摄像头感光元件将所述电信号传递给ISP处理,转化为肉眼可见的图像。ISP还可以对图像的噪点,亮度,肤色进行算法优化。ISP还可以对拍摄场景的曝光,色温等参数优化。在一些实施例中,ISP可以设置在摄像头193中。
摄像头193用于捕获静态图像或视频。物体通过镜头生成光学图像投射到感光元件。感光元件把光信号转换成电信号,之后将电信号传递给ISP转换成数字图像信号。ISP将数字图像信号输出到DSP加工处理。DSP将数字图像信号转换成标准的RGB,YUV等格式的图像信号。在一些实施例中,电子设备100可以包括1个或N个摄像头193,N为大于1的正整数。
数字信号处理器用于处理数字信号,除了可以处理数字图像信号,还可以处理其他数字信号。例如,当电子设备100在频点选择时,数字信号处理器用于对频点能量进行傅里叶变换等。
视频编解码器用于对数字视频压缩或解压缩。电子设备100可以支持一种或多种视频编解码器。这样,电子设备100可以播放或录制多种编码格式的视频,例如:动态图像专家组(moving picture experts group,MPEG)1,MPEG2,MPEG3,MPEG4等。
NPU为神经网络(neural-network,NN)计算处理器,通过借鉴生物神经网络结构,例如借鉴人脑神经元之间传递模式,对输入信息快速处理,还可以不断的自学习。通过NPU可以实现电子设备100的智能认知等应用,例如:图像识别,人脸识别,语音识别,文本理解等。
外部存储器接口120可以用于连接外部存储卡,例如Micro SD卡,实现扩展电子设备100的存储能力。外部存储卡通过外部存储器接口120与处理器110通信,实现数据存储功能。
内部存储器121可以用于存储计算机可执行程序代码,所述可执行程序代码包括指令。内部存储器121可以包括存储程序区和存储数据区。其中,存储程序区可存储操作系统,至少一个功能所需的应用程序(比如声音播放功能,图像播放功能等)等。存储数据区可存储电子设备100使用过程中所创建的数据(比如音频数据,电话本等)等。此外,内部存储器121可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件,闪存器件,通用闪存存储器(universal flash storage,UFS)等。处理器110通过运行存储在内部存储器121的指令,和/或存储在设置于处理器中的存储器的指令,执行电子设备100的各种功能应用以及数据处理。
电子设备100可以通过音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,以及应用处理器等实现音频功能。例如音乐播放,录音等。
音频模块170用于将数字音频信息转换成模拟音频信号输出,也用于将模拟音频输入转换为数字音频信号。音频模块170还可以用于对音频信号编码和解码。在一些实施例中,音频模块170可以设置于处理器110中,或将音频模块170的部分功能模块设置于处理器110中。
扬声器170A,也称“喇叭”,用于将音频电信号转换为声音信号。电子设备100可以通过扬声器170A收听音乐,或收听免提通话。
受话器170B,也称“听筒”,用于将音频电信号转换成声音信号。当电子设备100接听电话或语音信息时,可以通过将受话器170B靠近人耳接听语音。
麦克风170C,也称“话筒”,“传声器”,用于将声音信号转换为电信号。当拨打电话或发送语音信息时,用户可以通过人嘴靠近麦克风170C发声,将声音信号输入到麦克风170C。电子设备100可以设置至少一个麦克风170C。在另一些实施例中,电子设备100可以设置两个麦克风170C,除了采集声音信号,还可以实现降噪功能。在另一些实施例中,电子设备100还可以设置三个,四个或更多麦克风170C,实现采集声音信号,降噪,还可以识别声音来源,实现定向录音功能等。
耳机接口170D用于连接有线耳机。耳机接口170D可以是USB接口130,也可以是3.5mm的开放移动电子设备平台(open mobile terminal platform,OMTP)标准接口,美国蜂窝电信工业协会(cellular telecommunications industry association of the USA,CTIA)标准接口。
按键190包括开机键,音量键等。按键190可以是机械按键。也可以是触摸式按键。电子设备100可以接收按键输入,产生与电子设备100的用户设置以及功能控制有关的键信号输入。
马达191可以产生振动提示。
指示器192可以是指示灯,可以用于指示充电状态,电量变化,也可以用于指示消息,未接来电,通知等。
SIM卡接口195用于连接SIM卡。
示例性的,上述仅以电子设备100举例说明本申请实施例中电子设备的结构,但并不构成对电子设备结构、形态的限制。本申请实施例对电子设备的结构、形态不做限制。示例性的,图5示出了电子设备的另一种示例性结构。如图5所示,电子设备包括:处理器501、存储器502、收发器503。处理器501、存储器502的实现可参见电子设备100的处理器、存储器的实现。收发器503,用于电子设备与其他设备(比如电子设备100)交互。收发器503可以是基于诸如Wi-Fi、蓝牙或其他通信协议的器件。
可选的,服务器的结构可参见图5所示结构,这里不再赘述。
在本申请另一些实施例中,电子设备或服务器可以包括比图示更多或更少的部件,或者组合某些部件,或者拆分某些部件,或者替换某些部件,或者不同的部件布置。图示的部件可以以硬件,软件或软件和硬件的组合实现。
以下实施例中所涉及的技术方案均可以在具有如图4、图5所示结构的装置中实现。
示例性的,以智能家居场景为例,如图6,手机中包括第一模型,各智能家居设备中包括第二模型。其中,第一模型由手机厂商训练部署在手机。第一模型,可用于获取语音信息对应的多维度的特征信息。本申请实施例中,第一模型的权重通常固定,无需频繁更新第一模型。
第二模型由各智能家居设备的厂商自行训练。第二模型,可用于将语音信息对应的多维度的特征信息转换为相应的操作信息。本申请实施例中,对于第二模型,各智能家居设备的厂商可以根据实际使用需要进行更新。也就是说,在后期新增设备的情况下,通常只需该新增设备的厂商重新训练该新增设备中用于识别操作信息(比如对控制指令进行分类)的第二模型,手机厂商无需频繁更新用于识别特征信息的第一模型,如此,能够降低诸如手机等厂商的模型训练、维护成本。并且,由于第二模型只涉及到识别具体设备的操作信息(比如对控制指令进行分类),模型较小,方便训练更新。
可选的,更新第二模型包括:更新第二模型的权重。
需要说明的是,由于不同的智能家居设备可能来自不同设备厂商,各厂商训练第二模型采用的算法可能不同,因此,不同智能家居设备上的第二模型可能不同。
在图6所示场景中,用户向手机输入语音信息“调高电视的音量”,手机检测到用户输入的语音信息后,可以将语音信息输入第一模型,由第一模型输出语音信息的特征信息。可选的,语音信息的特征信息可以以诸如但不限于特征矩阵形式输出。获得特征信息后,手机可以将特征信息发送给与手机连接的各智能家居设备(比如图6所示台灯、空调、电视)。
智能家居设备从手机接收特征信息之后,可以将特征信息输入第二模型,由第二模型输出语音信息对应的操作信息。如图6所示,电视能够根据第二模型识别出操作信息(控制指令的分类结果)是调高音量,这样一来,电视能够根据识别出的操作信息,执行该操作信息对应的操作,即调整音量。台灯无法识别操作信息,或者说台灯通过第二模型输出的操作信息与自身不匹配,或者说台灯通过第二模型输出的操作信息与自身可执行的操作信息(控制指令)不匹配,如此,台灯根据第二模型输出的操作信息(比如控制指令)可确定,用户的语音信息不是用于控制自身的,台灯不必响应用户的语音信息。类似的,空调不必响应用户的语音信息。
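下面用一段Python示意图6中受控设备(比如电视、台灯、空调)收到特征信息后的处理逻辑。其中second_model、supported_ops、device等均为假设的占位对象,仅用于说明“匹配则执行、不匹配则不响应”的判断过程:

```python
def handle_feature(feature_matrix, second_model, supported_ops, device):
    # 第二模型将特征信息分类为操作信息(比如控制指令的分类结果)
    operation = second_model.predict(feature_matrix)
    if operation in supported_ops:
        device.execute(operation)   # 比如电视识别出"调高音量",执行对应操作
    else:
        pass                        # 操作信息与自身不匹配,丢弃,不响应该语音信息
```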
现有技术中,手机需完成从特征信息提取到操作信息识别的全过程,导致手机的计算量大、语音控制的效率低。与之相比,在上述智能家居设备的语音控制场景中,将特征信息提取与操作信息识别过程解耦:特征信息提取流程由手机执行,操作信息识别由各智能家居设备执行。相较于现有技术,手机不再执行操作信息识别的操作,计算量有所降低,能够提升手机的运行速度,进而提高语音控制的效率。
如下,介绍本申请实施例的语音控制过程中各设备之间的具体交互。如图7所示,以用户通过手机语音控制调高电视的音量为例,本申请实施例提供的语音控制方法包括:
S101、手机检测到用户输入语音信息。
示例性的,用户输入的语音信息是“调高电视的音量”。
S102、手机将语音信息转化为特征信息。
通常,语音信息属于模拟信号,手机需要通过编码模型将其转化成数字信号,并提取特征信息,后续,其他设备能够根据提取的特征信息识别语音信息对应的操作信息。
其中,特征信息指从语音信息中得到的具有辨识性的成分,通过这些具有辨识性的语音成分可以准确描述一段语音与其他语音之间的区别。可选的,语音信息中具有辨识性的成分包括但不限于声谱、声谱的要素。声谱的要素包括但不限于声谱中的共振峰。本申请实施例的特征信息不局限于上述列举的几种,本申请实施例限于篇幅不再穷举全部特征信息,凡是语音中能够起到辨识作用的信息,都可称为特征信息。
作为一种可能的实现方式,手机包括第一模型,手机可以将语音信息输入第一模型,由第一模型计算并输出语音信息对应的特征信息。其中,第一模型的训练方式,可参见下述实施例。
本申请实施例中,第一模型可以实现为编码模型(还可以称为编码模块,编码神经网络,或有其他名称),名称不构成对编码模型的限制。编码模型可视为手机上的功能模块,该功能模块用于将语音信息对应的信息转化为语音的特征信息。
或者,可选的,第一模型中除了包括编码模型,还可以集成其他模型,比如VAD模型。可选的,第一模型还可以集成其他功能,本申请实施例对第一模型是否集成其他功能以及其他功能的具体类型不做限定。
类似的,本申请实施例中的第二模型可以实现为解码模型(还可以称为解码模块,解码神经网络,或有其他名称)。解码模型也可视为设备(比如电视)中的功能模块,该功能模块用于将语音的特征信息转化为相应的操作信息。
可选的,第二模型中除了包括解码模型,还可以集成其他模型或模块,本申请实施例对第二模型是否集成其他功能以及其他功能的具体类型不做限定。
本申请实施例中,第一模型,还可以称为第一模型文件,或其他名称。第二模型,还可以称为第二模型文件,或其他名称。名称并不构成对第一模型、第二模型的限制。
S103、手机广播特征信息。
相应的,电视等手机连接的各设备从手机接收特征信息。
在本申请实施例中,手机不执行操作信息识别的操作,而是由手机控制的各设备完成操作信息识别,因此,手机并不知道用户的语音信息是用于控制哪个设备。那么,手机需向所连接的全部设备广播特征信息,由其他设备各自识别特征信息对应的操作信息后,判断用户的语音信息是否用于控制自身执行操作,若是,则响应用户的语音信息,执行语音信息对应的操作,若否,则不响应用户的语音信息,不执行语音信息对应的操作。
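本申请实施例不限制广播特征信息所采用的具体传输协议。作为一个假设性的示意,下面给出一种通过UDP广播发送特征信息的Python实现,其中端口号、报文格式均为示例:

```python
import json
import socket

def broadcast_features(features, port=50000):
    # features 为可序列化的特征信息,比如特征矩阵转换成的嵌套列表
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)   # 允许UDP广播
    payload = json.dumps({"type": "voice_feature", "data": features}).encode("utf-8")
    sock.sendto(payload, ("255.255.255.255", port))              # 向局域网内所有连接的设备广播
    sock.close()
```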
S104、电视将特征信息转化为电视对应的操作信息。
作为一种可能的实现方式,电视包括第二模型。电视从手机接收到语音的特征信息(比如语音的特征矩阵)之后,将语音的特征信息输入第二模型,由第二模型确定并输出语音信息对应的操作信息。示例性的,在图6所示场景中,电视将语音的特征信息(比如特征矩阵)输入第二模型,第二模型计算并确定该特征信息对应的操作信息为“调高电视的音量”。
S105、电视响应操作信息,执行操作信息对应的操作。
示例性的,仍如图6所示场景,电视识别出用户的语音信息对应的操作信息是“调高电视的音量”,且该操作信息是电视匹配的操作信息之后,可响应该操作信息,执行该操作信息对应的目标操作,即调高音量。
接下来,结合设备内部的功能模块阐述语音控制方法中设备之间的交互。如图8所示,本申请实施例的语音控制方法包括:
S201、手机检测到语音信息,并将语音信息输入VAD模型。
示例性的,用户向手机输入语音信息“调高电视的音量”,手机检测到该语音信息之后,将该语音信息输入VAD模型。
S202、VAD模型检测语音信息中的人声信息,并将语音信息中的人声信息输入手机的编码模型。
考虑到手机采集用户输入的语音信息时,可能也同时采集了环境中的其他声音,因此,为了降低后续计算过程的数据处理量,同时避免环境噪音的干扰,手机可以通过VAD模型识别所采集语音信息中的人声信息和非人声信息(noise)。其中,VAD模型可以是能够执行语音分类任务的任意类型模型。
可选的,可以将采集的原始语音信息切分成多个段(分帧),例如20ms或者25ms的分帧,并将语音信息输入VAD模型,由VAD输出语音信息的分类结果。可选地,VAD模型输出各分帧属于人声或非人声的分类结果,并将属于人声的语音作为后续编码模型的输入。
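下面给出一段示意代码,说明“分帧+逐帧人声/非人声分类”的处理方式。其中vad_model为假设的分类器对象,帧长等参数仅为示例:

```python
import numpy as np

def split_frames(samples, sample_rate, frame_ms=20):
    # 将原始语音切分为例如20ms的分帧
    frame_len = int(sample_rate * frame_ms / 1000)
    n = len(samples) // frame_len
    return np.reshape(samples[: n * frame_len], (n, frame_len))

def keep_speech(frames, vad_model):
    # vad_model.is_speech 对每个分帧输出其是否属于人声;仅保留人声分帧作为后续编码模型的输入
    speech = [f for f in frames if vad_model.is_speech(f)]
    return np.concatenate(speech) if speech else np.array([])
```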
其中,本申请实施例涉及的VAD模型的训练过程,可参见现有技术,这里不再赘述。
本申请实施例中,VAD模型也可视为手机上的功能模块,该功能模块具有识别人声和非人声的功能。
S203、编码模型根据语音信息中的人声信息,输出语音信息对应的特征信息。
可选的,编码模型将人声信息划分成多个帧,对于得到的每一帧,编码模型按照一定规则(比如但不限于体现人耳听觉特性的梅尔频率倒谱系数(mel frequency cepstrum coefficient,MFCC)规则),提取人声信息的特征信息。可选的,编码模型可以将提取的特征信息转换成特征向量。
示例性的,给出编码模型提取特征信息的方式。首先,对语音信息进行预处理。预处理包括但不限于:将语音信息划分为多个分帧。之后,对每一个分帧,执行下述操作:
通过快速傅里叶变换(fast fourier transform,FFT)得到分帧对应的频谱,并通过Mel滤波器组对得到的频谱进行处理,得到分帧对应的Mel频谱。如此,能够将线性的自然频谱转换为体现人类听觉特性的Mel频谱。接下来,对该分帧对应的Mel频谱进行倒谱分析,获得对应的MFCC,该MFCC可作为该分帧的语音信息对应的特征信息。
在获得语音的每个分帧的特征信息之后,可以组合各分帧的特征信息,得到语音信息对应的特征信息(比如特征向量)。
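作为一个示意,下面给出一种基于开源librosa库计算MFCC特征的实现,其内部即按照“分帧、FFT、Mel滤波、倒谱分析”的流程计算。采样率、帧长、MFCC维数等参数均为示例,并不代表实施例中编码模型的实际实现:

```python
import numpy as np
import librosa

def extract_features(speech, sample_rate=16000, frame_ms=25, hop_ms=10, n_mfcc=13):
    n_fft = int(sample_rate * frame_ms / 1000)   # 每个分帧的采样点数
    hop = int(sample_rate * hop_ms / 1000)       # 相邻分帧之间的步长
    mfcc = librosa.feature.mfcc(y=np.asarray(speech, dtype=np.float32), sr=sample_rate,
                                n_mfcc=n_mfcc, n_fft=n_fft, hop_length=hop)
    # 组合各分帧的特征,得到语音信息对应的特征矩阵(帧数 x n_mfcc)
    return mfcc.T
```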
需要说明的是,编码模型提取特征信息的方法还可以为其他,并不局限于上述列举的方式。
S204、手机的通信模块获得语音信息对应的特征信息。
可选的,通信模块用于支持手机与其他电子设备通信。例如,通信模块可经由无线通信或有线通信连接到网络,以与其他个人终端或网络服务器进行通信。无线通信可采用蜂窝通信协议中的至少一个,诸如,5G、长期演进(LTE)、高级长期演进(LTE-A)、码分多址(CDMA)、宽带码分多址(WCDMA)、通用移动通信系统(UMTS)、无线宽带(WiBro)或全球移动通信系统(GSM)。无线通信可包括例如短距通信。短距通信可包括无线保真(Wi-Fi)、蓝牙、近场通信(NFC)、磁条传输(MST)或GNSS中的至少一个。
作为一种可能的实现方式,手机中的处理模块(比如处理器)可以获得上述编码模块的输出结果,即获得语音信息对应的特征信息,并可将语音信息对应的特征信息发给通信模块,由手机的通信模块执行步骤下述S205。
S205、手机的通信模块广播语音信息对应的特征信息。
相应的,电视的通信模块从手机接收语音的特征信息。
S206、电视的解码模型获得语音信息对应的特征信息。
作为一种可能的实现方式,电视的通信模块从手机接收语音信息对应的特征信息之后,向电视的处理模块发送该特征信息,处理模块将该特征信息输入解码模型。
S207、电视的解码模型根据语音信息对应的特征信息,输出该特征信息对应的操作信息。
可选的,解码模型可以是用于执行分类任务的模型,其输出内容为语音信息对应的操作信息。
需要注意的是,本申请实施例中的解码模型与传统ASR中的decoder不同,传统ASR中的decoder可以将语音信息对应的特征信息转换为文字,再由后续功能模块将文字转化为对应的操作信息,本申请实施例中的解码模型可以将语音信息对应的特征信息直接分类为对应的操作信息。可见,本申请实施例中的解码模型的转换效率更高。
S208、判断解码模型输出的操作信息是否为电视匹配的操作信息。若是,则执行下述步骤S209,若否,则执行S210。
S209、响应该操作信息,执行该操作信息对应的操作。
示例性的,仍如图6所示场景,电视从手机接收语音信息对应的特征信息(比如特征矩阵)之后,通过第二模型(比如解码模型)输出操作信息(比如控制指令)“调高电视的音量”,并根据该操作信息,执行该操作信息对应的操作,即调高音量。
S210、不响应该操作信息,不执行该操作信息对应的操作。
示例性的,仍如图6所示场景,空调从手机接收语音信息对应的特征信息之后,通过第二模型(比如解码模型)输出语音信息对应的操作信息“其他(others)类型的操作信息”,该操作信息(比如控制指令)表示用户的语音信息不是用于控制空调的。那么,空调根据该操作信息,不执行对应的操作。类似的,台灯接收到语音信息对应的特征信息之后,根据特征信息输出对应的操作信息,并确定无需执行相应操作。
通过在语音信息的控制设备(比如手机)端包括编码神经网络,在语音信息的受控设备(比如家居设备)端包括解码神经网络,也就是说,能够将解码神经网络的训练交由各个第三方厂商执行。不同厂商可训练各自的解码神经网络。一方面,无须频繁训练编码神经网络,极大降低了新增设备后的开发成本,另一方面,由于手机侧仅执行语音识别中的提取特征信息的操作,而不再执行操作信息识别的流程,因此,可以降低手机的运算量以及功耗,提升运算速度,进而降低语音识别流程的延时。
如下,介绍上述第一模型、第二模型的训练方法。所述第一模型是基于至少一个第一样本数据训练得到的模型,第一样本数据包括:第一语音信息,所述第一语音信息的特征信息是已知的。所述第二模型是基于至少一个第二样本数据训练得到的模型,第二样本数据包括:第一特征信息,所述第一特征信息对应的操作信息是已知的。
图9示例性给出了第一模型的一种训练方法。如图9的(1)所示,首先训练用于识别操作信息的模型,需要提供N(N为正整数)个训练样本,训练样本包括操作信息已知的语音信息(即第一语音信息)。语音数据的类型可以为多个,以保证语料足够丰富,提升识别准确率。可选的,训练样本还包括语音数据的标签,用于表征语音信息对应的操作信息,对多个样本进行训练即可得到能够用于提取特征信息以及识别语音信息对应的操作信息的模型。该模型能够输出语音信息对应的操作信息。
如图9的(1)所述训练模型的场景中,训练的模型包括32层神经元。其中,L1-L16层用于提取语音信息对应的特征信息,L17-L32层用于识别语音信息对应的操作信息。对于某层的神经元来说,其可以与下一层中的一个或多个神经元连接,并通过连接输出相应信号。如图9的(1)示出了所训练模型中部分神经元之间连接对应的权重,比如,L1层的第一个神经元与L2层的第一个神经元之间的连接,对应权重w11,L1层的第一个神经元与L2层的第二个神经元之间的连接,对应权重w12,以此类推。
可选的,为了提升模型的识别准确率,可以对模型进行评估、测试。当模型的识别率达到一定阈值,说明该模型已训练好。当模型的识别率较低,可以继续训练模型,直至模型的识别准确率达到一定阈值。
可选的,模型的训练过程可以在端侧(比如手机等终端)或云侧(比如服务器)。训练可以是离线训练或在线训练。本申请实施例对模型的具体训练方式不做限制。
如图9的(2)所示,在训练好用于提取特征信息以及识别操作信息(比如对控制指令进行分类)的完整模型之后,从该完整模型中移除用于识别操作信息(比如识别控制指令)的L17-L32层对应的部分,即可得到用于提取特征信息的模型。如图9的(2)所示,用于提取特征信息的第一模型包括L1-L16这16个层,在对第一模型输入语音数据(也称为语音信息)之后,第一模型可以输出语音数据对应的特征向量(也称为特征信息)。
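以下给出一个基于PyTorch的极简示意,用于说明“先训练包含L1-L32层的完整模型,再抽取前16层作为第一模型”的做法。层数、各层维度、操作信息类别数等均为示例性假设,训练循环从略:

```python
import torch.nn as nn

def build_full_model(input_dim=400, hidden_dim=256, num_ops=50, n_layers=32):
    # 构建32层网络:前面各层逐层提取特征,最后一层输出操作信息的分类结果
    layers, dim = [], input_dim
    for _ in range(n_layers - 1):
        layers += [nn.Linear(dim, hidden_dim), nn.ReLU()]
        dim = hidden_dim
    layers.append(nn.Linear(dim, num_ops))
    return nn.Sequential(*layers)

full_model = build_full_model()
# ……使用操作信息已知的语音样本训练full_model,直至识别准确率达到阈值(训练过程从略)……

# 训练完成后,取前16个线性层(每个线性层后各跟一个ReLU)作为第一模型,用于提取特征信息
first_model = nn.Sequential(*list(full_model.children())[: 16 * 2])
```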
总结来说,以第一模型为encoder,第二模型为decoder为例,本申请实施例的encoder是训练好的encoder-decoder模型中的encoder部分,相当于将一个encoder-decoder模型中的encoder部分抽离出来,形成第一模型。
如图10示例性给出了第二模型的一种训练以及使用方法。如图10的(1)所示,在训练第二模型之前,先获得第一模型。作为一种可能的实现方式,若用于提取语音信息对应的特征信息的第一模型由手机训练,则手机可以将第一模型上传至服务器。后续,其他设备可以从服务器获得第一模型,并根据第一模型训练第二模型。或者,设备可以通过其他方式获得手机中的第一模型,本申请实施例并不限制设备获得第一模型的具体方式。
作为一种可能的实现方式,训练第二模型时,将第一模型的输出作为第二模型的输入,形成神经网络进行训练。其中,第一模型的输入作为整个神经网络的输入,第二模型的输出作为整个神经网络的输出,训练过程中,第一模型的权重保持不变。比如,如图10的(1),可以将第一模型的输出(即语音信息对应的特征信息)作为训练样本,并根据训练样本训练第二模型。训练好的第二模型,具有根据输入的特征信息输出操作信息的功能。以电视通过第二模型识别语音信息对应的操作信息为例,如图10的(2)所示,电视可以将操作信息未知的语音信息对应的特征信息(比如从手机接收的特征向量)输入第二模型,进而由第二模型输出语音信息对应的操作信息(比如调高电视的音量)。
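以下同样给出一个基于PyTorch的极简示意,沿用上一段示意中得到的first_model(输出维度假设为256),说明“第一模型权重保持不变、仅训练第二模型”的训练方式。第二模型的结构、学习率、类别数等均为示例性假设:

```python
import torch
import torch.nn as nn

num_device_ops = 8                          # 该设备支持的操作信息数目(含"其他"类),仅为示例
for p in first_model.parameters():
    p.requires_grad = False                 # 训练第二模型时,第一模型的权重保持不变

second_model = nn.Sequential(               # 第二模型:特征信息 -> 操作信息的分类结果
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, num_device_ops),
)
optimizer = torch.optim.Adam(second_model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(speech_batch, op_labels):
    with torch.no_grad():
        features = first_model(speech_batch)    # 第一模型输出特征信息,不参与反向传播
    loss = criterion(second_model(features), op_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```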
在另一些实施例中,设备也可以单独训练第二模型,即在无需获得第一模型的情况下,训练第二模型。在该实现方式中,训练样本同样是语音信息对应的特征向量(第一特征信息的一种示例),对训练样本进行训练得到第二模型。
不局限于语音控制场景,其他分布式任务处理场景均可适用该方法。示例性的,分布式任务处理场景包括但不限于:远程会议场景(包括但不限于实时翻译场景)、人脸识别验证场景。
在人脸识别验证场景,以通过手机进行人脸识别为例,示例性的,如图11所示,可以将人脸识别模型至少拆分为第一模型和第二模型。手机的摄像模组(比如摄像头)包括第一模型。第一模型用于提取人脸图像中的人脸的特征信息。手机的处理模块(比如处理器)中包括第二模型。第二模型用于根据人脸的特征信息输出人脸的识别结果。
如图12示出了在人脸识别场景中本申请实施例方法的示例性流程,该流程包括如下步骤:
S301、摄像模组采集用户输入的人脸图像。
S302、摄像模组将人脸图像输入第一模型,第一模型输出人脸图像的特征信息。
S303、摄像模组将人脸图像的特征信息传递给处理模块。
S304、处理模块将人脸图像的特征信息输入第二模型,第二模型输出人脸的识别结果。
S305、处理模块根据人脸的识别结果,判断人脸是否为合法人脸。若是,则执行S306,若否,则执行S307。
S306、执行人脸信息对应的操作。
示例性的,在支付场景中,用户输入人脸图像,手机通过摄像模组中的第一模型、处理模块中的第二模型判断人脸为合法人脸时,将执行支付操作。在屏幕解锁场景中,用户输入人脸图像,手机通过摄像模组中的第一模型、处理模块中的第二模型判断人脸为合法人脸时,将解锁屏幕。
S307、不执行人脸信息相应的操作。
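作为一个假设性示意,下面的代码概括了图12中摄像模组与处理模块的分工。其中first_model.extract、second_model.registered_embedding以及相似度阈值均为示例性假设,实际的第二模型可以是任意人脸识别模型:

```python
import numpy as np

def camera_module(face_image, first_model):
    # 摄像模组中的第一模型:人脸图像 -> 人脸的特征信息(比如特征向量)
    return first_model.extract(face_image)

def processing_module(face_feature, second_model, threshold=0.8):
    # 处理模块中的第二模型:根据人脸特征信息输出识别结果,这里以与已注册特征的余弦相似度为例
    registered = second_model.registered_embedding
    score = float(np.dot(face_feature, registered) /
                  (np.linalg.norm(face_feature) * np.linalg.norm(registered)))
    return score >= threshold   # True:合法人脸,执行支付/解锁等操作;False:不执行
```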
上述以摄像头为手机上的模块为例进行说明,在另一些场景中,第一模型所在的摄像头还可以在独立于手机之外的模块中,手机中包括第二模型。如此,手机的外部摄像头可以和手机共同完成人脸识别过程,并且,由于已经将模型拆分为第一模型和第二模型,因此,可以提升人脸识别的效率。
类似的,在其他分布式智慧场景中,设备可以将用于执行一个或多个任务的一个或多个模型拆分为多个子模型,并将多个子模型部署在该设备的多个模块中,借助该多个模块分担单个模块的模型运行负载。本申请实施例并不限制模型的具体拆分方式,也不限制模型被拆分为多个子模型后具体分布式的部署在哪些模块中。
在远程会议实时翻译场景,可以将现有的模型至少拆分为第一模型和第二模型,在说话方设备运行第一模型,在接听方设备运行第二模型。如图13示出了在远程会议翻译场景中本申请实施例方法的示例性流程,该流程包括如下步骤:
S401、手机A的音频采集模块采集源语言(即第一语言)的第一语音信息,并将第一语音信息输入手机A的第一模型。
可选的,音频采集模块包括但不限于麦克风。以英译中为例,源语言的第一语音信息为英文“this meeting is”,手机A的音频采集模块采集说话者的该英文语音信息,并将英文语音信息输入第一模型。
S402、第一模型提取第一语音信息的特征信息。
示例性的,提取英文语音信息的特征信息,即英文语音“this meeting is”对应的特征信息。
S403、手机A的通信模块获得第一语音信息的特征信息。
作为一种可能的实现方式,手机A的通信模块从第一模型获得第一语音信息的特征信息,或者,手机A的通信模块从处理模块获得第一语音信息的特征信息。
S404、手机A的通信模块向手机B的通信模块发送第一语音信息的特征信息。
S405、手机B的第二模型获得第一语音信息的特征信息。
作为一种可能的实现方式,第二模型从通信模块获得第一语音信息的特征信息。或者,处理模块将特征信息输入第二模型,即第二模型从处理模块获得第一语音信息的特征信息。
S406、第二模型根据第一语音信息的特征信息,确定第一语音信息对应的目标语言(第二语言)的字幕信息和/或第二语音信息。
可选的,所述第一语言与所述第二语言不同或相同。
示例性的,在手机B开启语音翻译(比如英译中)功能的场景中,手机B在接收到第一语音信息的特征信息之后,可自动将特征信息输入第二模型,并通过第二模型,将英文语音信息对应的特征信息,翻译为中文字幕。比如,根据英文语音“this meeting is”对应的特征信息,输出对应的中文字幕“此次会议是”。
再比如,在手机B没有开启跨语种翻译功能的场景中,第二模型根据英文语音信息对应的特征信息,输出英文的识别结果,比如,输出对应的英文字幕“this meeting is”。
S407、手机B的处理模块控制显示第二语言的字幕信息,和/或播放第二语言的第二语音信息。
示例性的,处理模块控制显示模块显示翻译的中文字幕“此次会议是”,显示模块还可以显示英文字幕“this meeting is”。再示例性的,处理模块控制音频输出模块(比如扬声器)播放翻译的中文语音“此次会议是”,扬声器还可以播放英文语音“this meeting is”。
在远程会议翻译场景中,手机B仅需运行源语言的特征信息-翻译结果的过程,无需运行源语言的语音信息-源语言的特征信息(即提取特征信息)这一阶段的流程,降低了手机B的运算量,能够提升翻译效率。
远程会议场景、人脸识别场景中第一模型、第二模型的训练方法可参见图9、图10的模型训练方法,这里不再赘述。在一种可能的设计中,所述第一模型是基于至少一个第一样本数据训练得到的模型,第一样本数据包括:第一语音信息,所述第一语音信息的特征信息是已知的,和/或,所述第二模型是基于至少一个第二样本数据训练得到的模型,第二样本数据包括:第一特征信息,所述第一特征信息对应的操作信息是已知的。
可见,通过本申请实施例中的技术方案,可以将较为复杂的参数模型(包括但不限于机器学习模型)、非参数模型拆分为更小粒度的多个子模型,并且将多个子模型分别运行在同一设备的不同模块中,或者,分别运行在同一组网中的不同设备(如上述语音识别场景)中,或者,分别运行在不同组网的多个设备(如上述远程会议场景)中。如此,能够降低单个模块或设备上的运算量,从而提升整个任务处理流程的处理效率。其中,本申请实施例对子模型的拆分粒度、拆分方式、以及拆分后部署在哪些模块或设备不做限定,可以按照场景、设备类型等特点灵活确定。
并且,上述仅列举了几种可能的应用场景,本申请实施例的技术方案还可以应用在其他场景中,限于篇幅,这里不再穷举所有可能场景。示例性的,本申请实施例可以应用在骨声纹识别场景。目前,生物识别技术之一的骨声纹技术,在识别率、速度、便捷程度方面均较高。其识别人物身份的原理是:采集人物的语音信息,并根据语音信息对人物身份的合法性进行验证。其中,由于每个人的骨结构都是独一无二的,所以声音在骨骼间的反射回音也是独一无二的。声音在骨骼间的反射回音可称为骨声纹,与指纹可以用来辨识不同人物的原理类似,骨声纹可以用来识别不同用户的身份。
本申请实施例中,在骨声纹识别场景中,可以将用于骨声纹识别的模型拆分为两部分,其中一部分(第一模型)设置在诸如蓝牙耳机中,另外部分(第二模型)设置在手机中。耳机采集到用户的声音信息(比如用户输入“解锁屏幕”)后,可通过设置的第一模型提取声音信号(也称为语音信息)的特征信息,并将该特征信息发送给手机,手机通过第二模型识别声音是否为合法用户的声音,若是,则执行相应操作(比如解锁屏幕)。
图14示出了本申请实施例提供的分布式语音控制方法的流程。该方法应用于第一终端,该方法包括:
S1401、第一终端响应于用户输入的语音信息,将语音信息输入第一模型,并通过第一模型获得语音信息对应的特征信息。
其中,第一模型存在于第一终端,第二模型存在于第二终端。
示例性的,以第一终端为手机为例,如图6所示,手机接收用户输入的语音 信息“调高电视的音量”,并通过第一模型输出该语音信息的特征信息(即特征矩阵)。
S1402、第一终端向第二终端发送特征信息,以使得第二终端将特征信息输入第二模型,并通过第二模型确定语音信息对应的操作信息,以及根据操作信息执行相应操作。
仍以图6为例,第二终端包括与手机连接的台灯、空调、电视。手机在获取到语音信息对应的特征信息后,将特征信息广播给台灯、空调、电视。台灯、空调、电视通过第二模型识别操作信息(比如控制指令)。其中,电视识别的操作信息与电视匹配,则电视执行“调高电视音量”这一操作信息对应的目标操作,即调高自身的播放音量。
需要说明的是,上述各方法实施例的流程中的一些操作任选地被组合,并且/或者一些操作的顺序任选地被改变。并且,各流程的步骤之间的执行顺序仅是示例性的,并不构成对步骤之间执行顺序的限制,各步骤之间还可以是其他执行顺序。并非旨在表明所述执行次序是可以执行这些操作的唯一次序。本领域的普通技术人员会想到多种方式来对本文所述的操作进行重新排序。另外,应当指出的是,结合本文所述的其他方法(例如,图7对应的方法、图8对应的方法)所描述的其他过程的细节,同样以类似的方式适用于上文结合图12所述的方法。
或者,方法实施例中的某些步骤可等效替换成其他可能的步骤。或者,方法实施例中的某些步骤可以是可选的,在某些使用场景中可以删除。或者,可以在方法实施例中增加其他可能的步骤。
本申请另一些实施例提供了一种装置,该装置可以是上述电子设备(比如折叠屏手机)。该装置可以包括:显示屏、存储器和一个或多个处理器。该显示屏、存储器和处理器耦合。该存储器用于存储计算机程序代码,该计算机程序代码包括计算机指令。当处理器执行计算机指令时,电子设备可执行上述方法实施例中手机执行的各个功能或者步骤。该电子设备的结构可以参考图4或图5所示的电子设备。
其中,该电子设备的核心结构可以表示为图15所示的结构,该核心结构可包括:处理模块1301、输入模块1302、存储模块1303、显示模块1304。图15的组件仅是示例性的,电子设备可以包括比图示更多或更少的部件,或者组合某些部件,或者拆分某些部件,或者不同的部件布置。图示的部件可以以硬件,软件或软件和硬件的组合实现。
处理模块1301,可包括中央处理器(CPU)、应用处理器(Application Processor,AP)或通信处理器(Communication Processor,CP)中的至少一个。处理模块1301可执行与用户电子设备的其他元件中的至少一个的控制和/或通信相关的操作或数据处理。具体地,处理模块1301可用于根据一定的触发条件,控制主屏上显示的内容。或者根据预设规则确定屏幕上显示的内容。处理模块1301还用于将输入的指令或数据进行处理,并根据处理后的数据确定显示样式。
在本申请实施例中,若图15所示结构为第一电子设备(第一终端)或芯片系统,处理模块1301,用于响应于用户输入的语音信息,将所述语音信息输入第一模型,并通过所述第一模型获得所述语音信息对应的特征信息。
在本申请实施例中,若图15所示结构为第二电子设备(第二终端)或芯片系统,处理模块1301,用于将所述特征信息输入第二模型,并通过所述第二模型确定所述语音信息对应的操作信息;
处理模块,用于根据所述操作信息执行相应操作。
在一种可能的设计中,所述第二终端根据所述操作信息执行相应操作,包括:
若确定所述语音信息对应的操作信息为所述第二终端匹配的操作信息,则所述第二终端根据所述语音信息对应的操作信息执行目标操作,和/或,若确定所述语音信息对应的操作信息不是所述第二终端匹配的操作信息,则所述第二终端丢弃所述操作信息。
输入模块1302,用于获取用户输入的指令或数据,并将获取到的指令或数据传输到电子设备的其他模块。具体地说,输入模块1302的输入方式可以包括触摸、手势、接近屏幕等,也可以是语音输入。例如,输入模块可以是电子设备的屏幕,获取用户的输入操作并根据获取到的输入操作生成输入信号,将输入信号传输至处理模块1301。
存储模块1303,可包括易失性存储器和/或非易失性存储器。存储模块用于存储用户终端设备的其他模块中的至少一个相关的指令或数据,具体地说,存储模块可存储第一模型、第二模型。
显示模块1304,可包括例如液晶显示器(LCD)、发光二极管(LED)显示器、有机发光二极管(OLED)显示器、微机电系统(MEMS)显示器或电子纸显示器。用于显示用户可观看的内容(例如,文本、图像、视频、图标、符号等)。
可选的,图15所示结构还可以包括输出模块(未在图15中示出)。输出模块可用于输出信息。示例性的,播放、输出语音信息。输出模块包括但不限于扬声器等模块。
可选的,图15所示结构还可以包括通信模块1305,用于支持电子设备与其他电子设备通信。例如,通信模块可经由无线通信或有线通信连接到网络,以与其他个人终端或网络服务器进行通信。无线通信可采用蜂窝通信协议中的至少一个,诸如,5G、长期演进(LTE)、高级长期演进(LTE-A)、码分多址(CDMA)、宽带码分多址(WCDMA)、通用移动通信系统(UMTS)、无线宽带(WiBro)或全球移动通信系统(GSM)。无线通信可包括例如短距通信。短距通信可包括无线保真(Wi-Fi)、蓝牙、近场通信(NFC)、磁条传输(MST)或GNSS中的至少一个。
在本申请实施例中,若图15所示结构为第一电子设备或芯片系统,通信模块1305,用于向所述第二终端发送所述特征信息。
可选的,向所述第二终端发送所述特征信息,包括:广播所述特征信息。
在本申请实施例中,若图15所示结构为第二电子设备或芯片系统,通信模块1305,用于从所述第一终端接收语音信息对应的特征信息。
需要说明的是,本申请方法实施例中的各步骤的描述均可援引到装置对应的模块,这里不再赘述。
本申请实施例还提供一种芯片系统,如图16所示,该芯片系统包括至少一个处理器1401和至少一个接口电路1402。处理器1401和接口电路1402可通过线路互联。例如,接口电路1402可用于从其它装置(例如电子设备的存储器)接收信号。又例如,接口电路1402可用于向其它装置(例如处理器1401)发送信号。示例性的,接口电路1402可读取存储器中存储的指令,并将该指令发送给处理器1401。当指令被处理器1401执行时,可使得电子设备执行上述实施例中的各个步骤。当然,该芯片系统还可以包含其他分立器件,本申请实施例对此不作具体限定。
本申请实施例还提供一种计算机存储介质,该计算机存储介质包括计算机指令,当计算机指令在上述电子设备上运行时,使得该电子设备执行上述方法实施例中手机执行的各个功能或者步骤。
本申请实施例还提供一种计算机程序产品,当计算机程序产品在计算机上运行时,使得计算机执行上述方法实施例中手机执行的各个功能或者步骤。
通过以上实施方式的描述,所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。
在本申请所提供的几个实施例中,应该理解到,所揭露的装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅是示意性的,例如,模块或单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个装置,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是一个物理单元或多个物理单元,即可以位于一个地方,或者也可以分布到多个不同地方。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个可读取存储介质中。基于这样的理解,本申请实施例的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该软件产品存储在一个存储介质中,包括若干指令用以使得一个设备(可以是单片机,芯片等)或处理器(processor)执行本申请各个实施例方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(read only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上内容,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何在本申请揭露的技术范围内的变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (10)

  1. 一种分布式语音控制方法,其特征在于,所述方法包括:
    第一终端响应于用户输入的语音信息,将所述语音信息输入第一模型,并通过所述第一模型获得所述语音信息对应的特征信息,所述第一模型存在于所述第一终端;
    所述第一终端向第二终端发送所述特征信息,以使得所述第二终端将所述特征信息输入第二模型,并通过所述第二模型确定所述语音信息对应的操作信息,以及根据所述操作信息执行相应操作,所述第二模型存在于所述第二终端。
  2. 根据权利要求1所述的方法,其特征在于,
    所述第一模型是基于至少一个第一样本数据训练得到的模型,第一样本数据包括:第一语音信息,所述第一语音信息的特征信息是已知的;和/或,
    所述第二模型是基于至少一个第二样本数据训练得到的模型,第二样本数据包括:第一特征信息,所述第一特征信息对应的操作信息是已知的。
  3. 根据权利要求1或2所述的方法,其特征在于,所述第一终端向第二终端发送所述特征信息,包括:所述第一终端广播所述特征信息。
  4. 一种分布式语音控制方法,其特征在于,所述方法包括:
    第二终端从第一终端接收语音信息对应的特征信息;所述特征信息是所述第一终端将所述语音信息输入第一模型,并通过所述第一模型获得的,所述第一模型存在于所述第一终端;
    所述第二终端将所述特征信息输入第二模型,并通过所述第二模型确定所述语音信息对应的操作信息,所述第二模型存在于所述第二终端;
    所述第二终端根据所述操作信息执行相应操作。
  5. 根据权利要求4所述的方法,其特征在于,所述第二终端根据所述操作信息执行相应操作,包括:
    若确定所述语音信息对应的操作信息为所述第二终端匹配的操作信息,则所述第二终端根据所述语音信息对应的操作信息执行目标操作;和/或,
    若确定所述语音信息对应的操作信息不是所述第二终端匹配的操作信息,则所述第二终端丢弃所述操作信息。
  6. 根据权利要求4或5所述的方法,其特征在于,
    所述第一模型是基于至少一个第一样本数据训练得到的模型,第一样本数据包括:第一语音信息,所述第一语音信息的特征信息是已知的;和/或,
    所述第二模型是基于至少一个第二样本数据训练得到的模型,第二样本数据包括:第一特征信息,所述第一特征信息对应的操作信息是已知的。
  7. 一种第一终端,其特征在于,包括:
    显示屏;
    一个或多个处理器;
    一个或多个存储器;
    所述存储器存储有一个或多个程序,当所述一个或者多个程序被所述处理器执行时,使得所述第一终端执行如权利要求1至3中任一项所述的方法。
  8. 一种第二终端,其特征在于,包括:
    显示屏;
    一个或多个处理器;
    一个或多个存储器;
    所述存储器存储有一个或多个程序,当所述一个或者多个程序被所述处理器执行时,使得所述第二终端执行如权利要求4至6中任一项所述的方法。
  9. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质存储有计算机指令,当所述计算机指令在终端上运行时,使得所述终端执行如权利要求1至3中任一项所述的方法,或者,执行如权利要求4至6中任一项所述的方法。
  10. 一种计算机程序产品,其特征在于,当所述计算机程序产品在终端上运行时,使得所述终端执行如权利要求1至3中任一项所述的方法,或者,执行如权利要求4至6中任一项的方法。
PCT/CN2022/116804 2021-10-22 2022-09-02 分布式语音控制方法及电子设备 WO2023065854A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111234615.7A CN116030790A (zh) 2021-10-22 2021-10-22 分布式语音控制方法及电子设备
CN202111234615.7 2021-10-22

Publications (1)

Publication Number Publication Date
WO2023065854A1 true WO2023065854A1 (zh) 2023-04-27

Family

ID=86058787

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/116804 WO2023065854A1 (zh) 2021-10-22 2022-09-02 分布式语音控制方法及电子设备

Country Status (2)

Country Link
CN (1) CN116030790A (zh)
WO (1) WO2023065854A1 (zh)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104681025A (zh) * 2013-11-26 2015-06-03 现代摩比斯株式会社 利用语音识别的命令执行系统及其工作方法
CN106537493A (zh) * 2015-09-29 2017-03-22 深圳市全圣时代科技有限公司 语音识别系统及方法、客户端设备及云端服务器
CN112384933A (zh) * 2018-09-19 2021-02-19 国际商业机器公司 编码器-解码器存储器增强神经网络架构
CN109949808A (zh) * 2019-03-15 2019-06-28 上海华镇电子科技有限公司 兼容普通话和方言的语音识别家电控制系统和方法
CN110503952A (zh) * 2019-07-29 2019-11-26 北京搜狗科技发展有限公司 一种语音处理方法、装置和电子设备
KR20210103372A (ko) * 2020-02-13 2021-08-23 고려대학교 산학협력단 대화형 뇌-컴퓨터 인터페이스 기반 스마트 홈 제어 방법 및 서버
CN111841007A (zh) * 2020-07-29 2020-10-30 网易(杭州)网络有限公司 游戏的控制方法、装置、设备和存储介质
CN113205802A (zh) * 2021-05-10 2021-08-03 芜湖美的厨卫电器制造有限公司 语音识别模型的更新方法、家用电器及服务器

Also Published As

Publication number Publication date
CN116030790A (zh) 2023-04-28

Similar Documents

Publication Publication Date Title
US20220310095A1 (en) Speech Detection Method, Prediction Model Training Method, Apparatus, Device, and Medium
US11257493B2 (en) Vision-assisted speech processing
WO2021008538A1 (zh) 语音交互方法及相关装置
WO2020073248A1 (zh) 一种人机交互的方法及电子设备
CN114141230A (zh) 电子设备及其语音识别方法和介质
CN113539290B (zh) 语音降噪方法和装置
CN114242037A (zh) 一种虚拟人物生成方法及其装置
CN113488042B (zh) 一种语音控制方法及电子设备
CN114691839A (zh) 一种意图槽位识别方法
CN115312068B (zh) 语音控制方法、设备及存储介质
CN114090986A (zh) 一种公用设备上识别用户的方法及电子设备
WO2023065854A1 (zh) 分布式语音控制方法及电子设备
WO2023040658A1 (zh) 语音交互方法及电子设备
WO2022007757A1 (zh) 跨设备声纹注册方法、电子设备及存储介质
CN114999496A (zh) 音频传输方法、控制设备及终端设备
CN115083401A (zh) 语音控制方法及装置
CN115731923A (zh) 命令词响应方法、控制设备及装置
CN114238554A (zh) 一种文本标注提取方法
CN116030817B (zh) 语音唤醒方法、设备及存储介质
CN116052648B (zh) 一种语音识别模型的训练方法、使用方法及训练系统
WO2023078221A1 (zh) 语言翻译方法及电子设备
WO2023142757A1 (zh) 语音识别方法、电子设备及计算机可读存储介质
CN113903325B (zh) 文本转3d音频的方法及装置
WO2023231936A1 (zh) 一种语音交互方法及终端
WO2023098412A1 (zh) 字幕控制方法、电子设备及计算机可读存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22882469

Country of ref document: EP

Kind code of ref document: A1