WO2019233228A1 - Electronic device and device control method - Google Patents

Electronic device and device control method

Info

Publication number
WO2019233228A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
preset
voiceprint feature
processing unit
integrated circuit
Prior art date
Application number
PCT/CN2019/085554
Other languages
English (en)
Chinese (zh)
Inventor
陈岩
Original Assignee
Oppo广东移动通信有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Oppo广东移动通信有限公司 filed Critical Oppo广东移动通信有限公司
Publication of WO2019233228A1

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/06 - Decision making techniques; Pattern matching strategies
    • G10L17/08 - Use of distortion metrics or a particular distance between probe pattern and reference templates
    • G10L17/22 - Interactive procedures; Man-machine interfaces
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G10L2015/223 - Execution procedure of a spoken command
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Definitions

  • the present application relates to the technical field of electronic devices, and in particular, to an electronic device and a device control method.
  • The application of voice recognition technology in electronic devices is becoming more and more widespread.
  • Through voice recognition, voice control of electronic devices can be achieved. For example, users can speak specific voice instructions to control electronic devices to take pictures or play music.
  • an embodiment of the present application provides an electronic device that includes a central processing unit and an application-specific integrated circuit chip, and the power consumption of the application-specific integrated circuit chip is less than the power consumption of the central processor.
  • the application-specific integrated circuit chip is configured to obtain an external audio signal
  • the application-specific integrated circuit chip is further configured to perform a recognition operation on the audio signal to obtain a recognition result
  • the application-specific integrated circuit chip is further configured to send instruction information indicating completion of the recognition operation to the central processing unit;
  • the central processing unit is configured to extract the recognition result from the application-specific integrated circuit chip according to the instruction information, and execute a target operation corresponding to the recognition result.
  • an embodiment of the present application provides a method for controlling a device, which is applied to an electronic device.
  • the electronic device includes a central processing unit and an application-specific integrated circuit chip, and the power consumption of the application-specific integrated circuit chip is less than that of the central processing unit.
  • the device control method includes:
  • the application-specific integrated circuit chip acquires an external audio signal
  • the application-specific integrated circuit chip recognizes the audio signal to obtain a recognition result
  • the application-specific integrated circuit chip sends indication information of recognition completion to the central processing unit;
  • the central processing unit extracts the recognition result from the application-specific integrated circuit chip according to the instruction information, and performs a target operation corresponding to the recognition result.
  • FIG. 1 is a first schematic structural diagram of an electronic device according to an embodiment of the present application.
  • FIG. 2 is a second schematic structural diagram of an electronic device according to an embodiment of the present application.
  • FIG. 3 is a third schematic structural diagram of an electronic device according to an embodiment of the present application.
  • FIG. 4 is a fourth schematic structural diagram of an electronic device according to an embodiment of the present application.
  • FIG. 5 is a schematic flowchart of a device control method according to an embodiment of the present application.
  • FIG. 6 is a detailed flowchart of identifying an audio signal by an application specific integrated circuit chip in the embodiment of the present application.
  • FIG. 7 is a detailed flowchart of a target operation performed by a central processing unit according to an embodiment of the present application.
  • Reference to "an embodiment" herein means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application.
  • The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor to independent or alternative embodiments that are mutually exclusive of other embodiments. It is explicitly and implicitly understood by those skilled in the art that the embodiments described herein may be combined with other embodiments.
  • the electronic device 100 includes an application-specific integrated circuit chip 101 and a central processing unit 102, and the power consumption of the application-specific integrated circuit chip 101 is less than the power consumption of the central processing unit 102.
  • The ASIC chip 101 is used to obtain an external audio signal, perform a recognition operation on the acquired audio signal to obtain a recognition result, and send instruction information indicating completion of the recognition operation to the central processing unit 102.
  • The ASIC chip 101 in the embodiment of the present application is an application-specific integrated circuit designed for audio recognition; compared with the general-purpose central processing unit 102, it offers higher audio recognition efficiency and lower power consumption.
  • the ASIC chip 101 and the central processing unit 102 establish a data communication connection through a communication bus
  • The application-specific integrated circuit chip 101 can obtain external audio signals in many different ways. For example, when no microphone is externally connected to the electronic device, the application-specific integrated circuit chip 101 can collect the sound emitted by an external speaker through a built-in microphone of the electronic device (not shown in FIG. 1) to obtain an external audio signal; when a microphone is externally connected to the electronic device, the ASIC chip 101 can collect external sound through that external microphone to obtain an external audio signal.
  • When the ASIC chip 101 collects external audio signals through the microphone, if the microphone is an analog microphone, an analog audio signal will be collected. In that case the ASIC chip 101 needs to sample the analog audio signal and convert it into a digital audio signal, for example by sampling at a frequency of 16 kHz.
  • If the microphone is a digital microphone, the ASIC chip 101 directly collects a digital audio signal through the digital microphone, and no conversion is needed; a capture sketch follows below.
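  • As an illustration only (not part of the patent), a minimal host-side sketch of capturing a 16 kHz digital audio signal using the python-sounddevice library; the two-second duration and the default input device are assumptions:

```python
import numpy as np
import sounddevice as sd  # any 16 kHz capture API would do

SAMPLE_RATE = 16_000  # 16 kHz sampling frequency, as in the embodiment

def capture_audio(seconds: float = 2.0) -> np.ndarray:
    """Record `seconds` of mono audio and return 16-bit PCM samples."""
    frames = int(seconds * SAMPLE_RATE)
    recording = sd.rec(frames, samplerate=SAMPLE_RATE, channels=1, dtype="int16")
    sd.wait()  # block until the capture buffer is filled
    return recording[:, 0]

if __name__ == "__main__":
    audio = capture_audio()
    print(f"captured {audio.size} samples at {SAMPLE_RATE} Hz")
```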
  • After obtaining an external audio signal, the application-specific integrated circuit chip 101 performs a recognition operation on the acquired audio signal according to a pre-configured recognition mode to obtain a recognition result.
  • For example, when the recognition mode of the ASIC chip 101 is configured as gender recognition, the chip, upon recognizing the acquired audio signal, extracts feature information capable of characterizing gender from the audio signal, recognizes the gender of the speaker of the audio signal according to the extracted feature information, and obtains a recognition result indicating whether the speaker is male or female.
  • For another example, when the recognition mode of the application-specific integrated circuit chip 101 is configured to identify the environment type (a subway car scene, a bus carriage scene, an office scene, etc.), the chip, upon recognizing the acquired audio signal, extracts feature information capable of characterizing the environment scene from the audio signal, identifies the current environment scene based on the extracted feature information, and obtains a recognition result describing the type of the current environment scene.
  • After completing the recognition operation on the audio signal and obtaining the recognition result, the application-specific integrated circuit chip 101 sends instruction information indicating the completion of the recognition operation to the central processing unit 102.
  • The function of the instruction information is to inform the central processing unit 102 that the ASIC chip 101 has completed the recognition operation on the audio signal and that the recognition result can be extracted from the ASIC chip 101.
  • the foregoing indication information may be sent in the form of an interrupt signal.
  • the central processing unit 102 is configured to extract the foregoing recognition result from the ASIC chip 101 according to the received instruction information, and execute a target operation corresponding to the foregoing recognition result.
  • That is, the central processing unit 102, according to the instruction information, extracts from the application-specific integrated circuit chip 101 the recognition result that the chip obtained by recognizing the audio signal.
  • After extracting the recognition result of the audio signal, the central processing unit 102 further performs a target operation corresponding to the recognition result.
  • For example, when the application-specific integrated circuit chip 101 is configured for gender recognition, if the recognition result "the speaker is male" is extracted, the theme mode of the operating system is switched to a masculine theme mode; if the recognition result "the speaker is female" is extracted, the theme mode of the operating system is switched to a feminine theme mode.
  • For another example, when the application-specific integrated circuit chip 101 is configured for environment type recognition, if the recognition result "office scene" is extracted, the prompt mode of the operating system is switched to silent mode, and if the recognition result "bus scene" is extracted, the prompt mode of the operating system is switched to a vibration-plus-ringing mode, and so on; a sketch of such a result-to-operation mapping follows below.
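  • For illustration only, a minimal sketch of how the result-to-operation correspondence described above might be expressed; the result strings and operation callables are hypothetical, not from the patent:

```python
from typing import Callable

# Hypothetical target operations; real ones would call into the operating system.
def set_theme(style: str) -> None:
    print(f"switching operating-system theme to {style}")

def set_prompt_mode(mode: str) -> None:
    print(f"switching prompt mode to {mode}")

# Correspondence between recognition results and target operations.
TARGET_OPERATIONS: dict[str, Callable[[], None]] = {
    "speaker_male": lambda: set_theme("masculine"),
    "speaker_female": lambda: set_theme("feminine"),
    "office_scene": lambda: set_prompt_mode("silent"),
    "bus_scene": lambda: set_prompt_mode("vibration+ring"),
}

def execute_target_operation(recognition_result: str) -> None:
    """Look up and run the preset operation for an extracted recognition result."""
    operation = TARGET_OPERATIONS.get(recognition_result)
    if operation is not None:
        operation()
```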
  • the electronic device in the embodiment of the present application includes a central processing unit 102 and an application-specific integrated circuit chip 101.
  • The low-power application-specific integrated circuit chip 101 obtains the external audio signal and performs the recognition operation on the acquired audio signal.
  • It then sends instruction information indicating completion of the recognition operation to the central processing unit 102, and the central processing unit 102 extracts the recognition result from the ASIC chip 101 according to the instruction information and executes the target operation corresponding to the recognition result. The audio recognition task of the central processing unit 102 is thus offloaded to the lower-power application-specific integrated circuit chip 101, and the central processing unit 102 performs the corresponding processing according to the chip's recognition result.
  • This manner of having the ASIC chip 101 cooperate with the central processing unit 102 to perform voice control of the electronic device can reduce the power consumption the electronic device needs to implement voice control.
  • the ASIC chip 101 includes a micro control unit 1011, a pre-processing unit 1012, and an algorithm unit 1013.
  • the pre-processing unit 1012 is configured to extract the Mel frequency cepstrum coefficient of the audio signal using the Mel frequency cepstrum coefficient algorithm under the control of the micro control unit 1011;
  • the algorithm unit 1013 is configured to perform keyword recognition on the Mel frequency cepstrum coefficient using a deep neural network algorithm under the control of the micro control unit 1011, to obtain candidate keywords and the confidence of the candidate keywords.
  • The micro control unit 1011 first obtains an external audio signal through a microphone. For example, when no microphone is externally connected to the electronic device, the micro control unit 1011 can collect external sound through a built-in microphone of the electronic device (not shown in FIG. 2) to obtain an external audio signal. For another example, when a microphone is externally connected to the electronic device, the micro control unit 1011 can collect external sound through that external microphone to obtain an external audio signal.
  • When the micro control unit 1011 collects external audio signals through the microphone, if the microphone is an analog microphone, an analog audio signal will be collected, and the micro control unit 1011 needs to sample the analog audio signal and convert it into a digital audio signal, for example by sampling at a frequency of 16 kHz; if the microphone is a digital microphone, the micro control unit 1011 directly collects a digital audio signal through the digital microphone, and no conversion is needed.
  • After obtaining an external audio signal, the micro control unit 1011 generates first control information and sends the first control information to the pre-processing unit 1012.
  • After receiving the first control information from the micro control unit 1011, the pre-processing unit 1012 uses the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient of the audio signal according to the first control information. After extracting the Mel frequency cepstrum coefficient, the pre-processing unit 1012 sends first feedback information to the micro control unit 1011.
  • After receiving the first feedback information from the pre-processing unit 1012, the micro control unit 1011 determines that the pre-processing unit 1012 has extracted the Mel frequency cepstrum coefficient of the audio signal, and at this time generates second control information and sends it to the algorithm unit 1013.
  • After receiving the second control information from the micro control unit 1011, the algorithm unit 1013 uses the built-in deep neural network algorithm to perform keyword recognition on the aforementioned Mel frequency cepstrum coefficients (keyword recognition detects whether a predefined word appears in the speech corresponding to the audio signal), obtaining candidate keywords and the confidence of the candidate keywords. After the keyword recognition is completed, the algorithm unit 1013 sends second feedback information to the micro control unit 1011.
  • After receiving the second feedback information from the algorithm unit 1013, the micro control unit 1011 determines that the algorithm unit 1013 has completed keyword recognition, and takes the candidate keywords identified by the algorithm unit 1013 and their confidence as the recognition result of this recognition operation on the audio signal. This control/feedback handshake is sketched below.
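  • A simplified, hypothetical sketch of this control/feedback handshake among the three units, with the MFCC extraction and the deep-neural-network keyword spotting stubbed out; all names and shapes are illustrative assumptions:

```python
import numpy as np

def extract_mfcc(audio: np.ndarray) -> np.ndarray:
    """Stub for the pre-processing unit's MFCC extraction."""
    return np.zeros((100, 13))  # placeholder: 100 frames x 13 coefficients

def dnn_keyword_spot(mfcc: np.ndarray) -> tuple[str, float]:
    """Stub for the algorithm unit's keyword spotting (keyword, confidence)."""
    return "wake_word", 0.93  # placeholder output

class AsicChip:
    """Models the MCU -> pre-processing unit -> algorithm unit control flow."""

    def recognize(self, audio: np.ndarray) -> tuple[str, float]:
        # MCU sends first control information to the pre-processing unit.
        mfcc = extract_mfcc(audio)        # first feedback: coefficients ready
        # MCU sends second control information to the algorithm unit.
        result = dnn_keyword_spot(mfcc)   # second feedback: recognition done
        # MCU would now raise an interrupt so the CPU can extract `result`.
        return result
```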
  • In this embodiment, the ASIC chip 101 further includes a memory 1014 for storing the acquired audio signal, the identified candidate keywords and their confidence levels, and intermediate data generated by the pre-processing unit 1012 and the algorithm unit 1013 during execution.
  • For example, the micro control unit 1011 stores the audio signal obtained through the microphone in the memory 1014; the pre-processing unit 1012, under the control of the micro control unit 1011, uses the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient of the audio signal stored in the memory 1014 and stores the extracted coefficient in the memory 1014; and the algorithm unit 1013, under the control of the micro control unit 1011, uses the built-in deep neural network algorithm to perform keyword recognition on the Mel frequency cepstrum coefficient stored in the memory 1014, obtaining candidate keywords and their confidence, which are likewise stored in the memory 1014.
  • the ASIC chip 101 further includes a cache memory 1015 for buffering data stored in the memory 1014 and data retrieved from the memory 1014.
  • the cache memory 1015 has a smaller storage space than the memory 1014, but has a higher speed.
  • the cache memory 1015 can improve the processing efficiency of the preprocessing unit 1012 and the algorithm unit 1013.
  • For example, when the pre-processing unit 1012 extracts Mel frequency cepstrum coefficients from the audio signal, accessing data directly from the memory 1014 requires waiting a certain period of time, whereas the cache memory 1015 can hold a portion of the data that the pre-processing unit 1012 has just used or reuses frequently. If the pre-processing unit 1012 needs that data again, it can be fetched directly from the cache memory 1015, which avoids repeated memory accesses, reduces the waiting time of the pre-processing unit 1012, and thus improves its processing efficiency.
  • In this embodiment, the pre-processing unit 1012 pre-processes the audio signal before extracting its Mel frequency cepstrum coefficient, and then uses the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient of the pre-processed audio signal.
  • Specifically, after receiving the first control information from the micro control unit 1011, the pre-processing unit 1012 first performs pre-emphasis and windowing on the audio signal.
  • Pre-emphasis increases the energy of the high-frequency part of the audio signal.
  • For the spectrum of a voiced audio signal, the energy of the low-frequency part is often higher than that of the high-frequency part: the spectral energy attenuates by about 20 dB for every tenfold increase in frequency, and the noise background of the circuit when the microphone collects the audio signal further raises the energy of the low-frequency part.
  • Windowing is applied because audio signals are generally non-stationary and their statistical characteristics are not fixed; over a sufficiently short period, however, the signal can be considered stationary, and extracting such short segments is called windowing.
  • the window is described by three parameters: window length (in milliseconds), offset, and shape.
  • Each windowed audio signal is called a frame
  • the duration of each frame in milliseconds is called the frame length
  • the distance between the left borders of two adjacent frames is called the frame shift.
  • a Hamming window, whose edges taper smoothly to zero, may be used for the windowing processing.
  • After pre-emphasis and windowing are completed, the pre-processing unit 1012 can use the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient of the audio signal.
  • The process of extracting the Mel frequency cepstrum coefficient by the pre-processing unit 1012 is roughly as follows: using the non-linear characteristics of human hearing, the frequency spectrum of the audio signal is converted into a non-linear spectrum based on the Mel frequency and then transformed to the cepstral domain, yielding the Mel frequency cepstrum coefficient; a sketch follows below.
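  • For illustration (not the patent's implementation), a sketch of pre-emphasis, framing, and Hamming windowing in NumPy, with the final Mel frequency cepstrum coefficient step delegated to the librosa library; the frame length, frame shift, pre-emphasis factor, and coefficient count are assumed values:

```python
import numpy as np
import librosa  # assumed available for the final MFCC step

SAMPLE_RATE = 16_000

def pre_emphasis(signal: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """Boost the high-frequency part: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_and_window(signal: np.ndarray,
                     frame_ms: float = 25.0,
                     shift_ms: float = 10.0) -> np.ndarray:
    """Split the signal into overlapping frames and apply a Hamming window."""
    frame_len = int(SAMPLE_RATE * frame_ms / 1000)    # frame length in samples
    frame_shift = int(SAMPLE_RATE * shift_ms / 1000)  # frame shift in samples
    if len(signal) < frame_len:                       # pad very short signals
        signal = np.pad(signal, (0, frame_len - len(signal)))
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    window = np.hamming(frame_len)  # edges taper smoothly toward zero
    return np.stack([
        signal[i * frame_shift: i * frame_shift + frame_len] * window
        for i in range(n_frames)
    ])

def mfcc(signal: np.ndarray, n_mfcc: int = 13) -> np.ndarray:
    """Mel frequency cepstrum coefficients of the pre-emphasized signal."""
    emphasized = pre_emphasis(signal.astype(np.float32))
    return librosa.feature.mfcc(y=emphasized, sr=SAMPLE_RATE, n_mfcc=n_mfcc)
```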
  • In this embodiment, the pre-processing unit 1012 is further configured to extract a voiceprint feature of the audio signal before pre-processing the audio signal, determine whether the voiceprint feature matches a preset voiceprint feature, and pre-process the audio signal when the voiceprint feature matches the preset voiceprint feature.
  • The voiceprint feature is mainly determined by two factors. The first is the size of the acoustic cavities, specifically the throat, nasal cavity, and oral cavity; the shape, size, and position of these organs determine the tension of the vocal cords and the range of sound frequencies. Therefore, even when different people say the same thing, the frequency distributions of their voices differ; some voices, for example, sound low and sonorous.
  • the second factor that determines the characteristics of the voiceprint is the manner in which the vocal organs are manipulated.
  • The vocal organs include the lips, teeth, tongue, soft palate, and diaphragm muscles, and their interaction produces clear speech. The way they cooperate is acquired incidentally through a person's interactions with the people around them: in the process of learning to speak, by imitating the speech of different people nearby, a person gradually forms his or her own voiceprint characteristics.
  • the preprocessing unit 1012 first extracts the voiceprint characteristics of the audio signal.
  • After acquiring the voiceprint feature of the audio signal, the preprocessing unit 1012 further compares the acquired voiceprint feature with a preset voiceprint feature to determine whether the voiceprint feature matches the preset voiceprint feature.
  • the preset voiceprint feature may be a voiceprint feature previously recorded by the owner, and it is determined whether the acquired voiceprint feature matches the preset voiceprint feature, that is, whether the speaker of the audio signal is the owner.
  • If they match, the pre-processing unit 1012 determines that the speaker of the audio signal is the owner, and then proceeds to pre-process the audio signal and extract the Mel frequency cepstrum coefficient.
  • In this embodiment, the pre-processing unit 1012 is further configured to obtain the similarity between the aforementioned voiceprint feature and the preset voiceprint feature, determine whether the obtained similarity is greater than or equal to a first preset similarity, and, when it is, determine that the acquired voiceprint feature matches the preset voiceprint feature.
  • That is, when determining whether the acquired voiceprint feature matches the preset voiceprint feature, the pre-processing unit 1012 may obtain the similarity between the voiceprint feature (that is, the voiceprint feature extracted from the aforementioned audio signal) and the preset voiceprint feature, and determine whether the obtained similarity is greater than or equal to the first preset similarity (set according to actual needs, for example 95%). If the obtained similarity is greater than or equal to the first preset similarity, the acquired voiceprint feature is determined to match the preset voiceprint feature; if it is less than the first preset similarity, the acquired voiceprint feature is determined not to match the preset voiceprint feature. A similarity sketch follows below.
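  • The patent does not specify how the similarity is computed; one common choice, shown here purely as an assumption, is cosine similarity between fixed-length voiceprint embedding vectors:

```python
import numpy as np

FIRST_PRESET_SIMILARITY = 0.95  # example threshold from the embodiment

def voiceprint_similarity(feature: np.ndarray, preset: np.ndarray) -> float:
    """Cosine similarity between two voiceprint embeddings, rescaled to [0, 1]."""
    cos = float(np.dot(feature, preset) /
                (np.linalg.norm(feature) * np.linalg.norm(preset)))
    return (cos + 1.0) / 2.0  # map [-1, 1] onto [0, 1]

def matches_preset(feature: np.ndarray, preset: np.ndarray) -> bool:
    return voiceprint_similarity(feature, preset) >= FIRST_PRESET_SIMILARITY
```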
  • If they do not match, the preprocessing unit 1012 determines that the speaker of the current audio signal is not the owner, and sends third feedback information to the micro control unit 1011.
  • After receiving the third feedback information from the pre-processing unit 1012, the micro control unit 1011 deletes the acquired audio signal and continues to acquire external audio signals; the audio signals are not processed further until an audio signal from the owner is obtained.
  • For how to perform the preprocessing and the extraction of the Mel frequency cepstrum coefficient, reference may be made to the relevant descriptions of the foregoing embodiments; details are not repeated here.
  • In this embodiment, the pre-processing unit 1012 is further configured to obtain current location information when the obtained similarity is less than the first preset similarity but greater than or equal to a second preset similarity, determine from the location information whether the device is currently within a preset position range, and, when it is, determine that the aforementioned voiceprint feature matches the preset voiceprint feature.
  • Because the characteristics of the voiceprint are closely related to the physiological characteristics of the human body, in daily life, if the user catches a cold, his voice becomes hoarse and his voiceprint characteristics change accordingly. In that case, even if the acquired audio signal was spoken by the owner, the pre-processing unit 1012 will not be able to recognize it. There are many other situations that can likewise cause the pre-processing unit 1012 to fail to identify the owner, which are not enumerated here.
  • Therefore, when the preprocessing unit 1012 finishes judging the similarity of the voiceprint feature, if the similarity between the acquired voiceprint feature and the preset voiceprint feature is less than the first preset similarity, it further judges whether the similarity is greater than or equal to the second preset similarity (the second preset similarity is configured to be smaller than the first preset similarity, and those skilled in the art can choose an appropriate value according to actual needs; for example, when the first preset similarity is set to 95%, the second preset similarity may be set to 75%).
  • If the similarity is greater than or equal to the second preset similarity, the preprocessing unit 1012 further obtains current location information.
  • For example, the pre-processing unit 1012 may send a location acquisition request to the positioning module of the electronic device (different positioning technologies, such as satellite positioning or base-station positioning, may be used to obtain the current location information) and instruct the positioning module to return the current location information.
  • After obtaining the current location information, the pre-processing unit 1012 determines from it whether the device is currently within a preset position range.
  • The preset position range can be configured as the position ranges the owner commonly occupies, such as the owner's home and workplace.
  • When the device is currently within the preset position range, the preprocessing unit 1012 determines that the acquired voiceprint feature matches the preset voiceprint feature and recognizes the speaker of the audio signal as the owner; this tiered decision is sketched below.
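  • Combining the two thresholds with the location fallback, a hypothetical decision routine; the threshold values follow the examples above, and the location check is assumed to be supplied by the positioning module:

```python
FIRST_PRESET_SIMILARITY = 0.95   # strict voiceprint-only threshold
SECOND_PRESET_SIMILARITY = 0.75  # relaxed threshold used with the location check

def is_owner(similarity: float, in_preset_location: bool) -> bool:
    """Tiered voiceprint decision with a location-based fallback."""
    if similarity >= FIRST_PRESET_SIMILARITY:
        return True   # the voiceprint alone is conclusive
    if similarity >= SECOND_PRESET_SIMILARITY and in_preset_location:
        return True   # e.g. a hoarse voice, but the device is at a usual place
    return False      # no match; the MCU would delete this audio signal
```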
  • In this embodiment, the central processing unit 102 is further configured to: when the confidence of the candidate keyword reaches a preset confidence level, use the candidate keyword as the target keyword of the audio signal; determine, according to the correspondence between preset keywords and preset operations, the preset operation corresponding to the target keyword as the target operation; and perform the target operation.
  • After extracting the identified candidate keyword and its confidence from the ASIC chip 101 according to the instruction information of the ASIC chip 101, the central processing unit 102 first determines whether the confidence reaches a preset confidence level (which can be set by a person skilled in the art according to actual needs, for example 90%).
  • If it does, the central processing unit 102 uses the candidate keyword as the target keyword of the audio signal.
  • the central processing unit 102 determines the preset operation corresponding to the target keyword as the target operation according to the correspondence between the preset keyword and the preset operation.
  • The correspondence between keywords and preset operations can be set according to actual needs. For example, the preset operation corresponding to the keyword "Little Europe, Little Europe" can be set to "wake the operating system"; thus, when the target keyword is "Little Europe, Little Europe" and the operating system is currently in a sleep state, the central processing unit 102 wakes up the operating system. A confidence-gated sketch follows below.
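  • A minimal host-side sketch of this confidence gate and keyword-to-operation lookup; the keyword string and the operation name are illustrative stand-ins:

```python
PRESET_CONFIDENCE = 0.90  # example value from the embodiment

# Hypothetical correspondence between preset keywords and preset operations.
KEYWORD_OPERATIONS = {
    "little europe, little europe": "wake_operating_system",
}

def handle_recognition(candidate_keyword: str, confidence: float) -> str | None:
    """Return the target operation if the candidate keyword is trustworthy."""
    if confidence < PRESET_CONFIDENCE:
        return None  # confidence below the preset level; ignore the candidate
    return KEYWORD_OPERATIONS.get(candidate_keyword.lower())
```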
  • An embodiment of the present application provides a device control method applied to an electronic device, wherein the electronic device includes a central processing unit and an application-specific integrated circuit chip, and the power consumption of the application-specific integrated circuit chip is less than that of the central processing unit. The device control method includes:
  • the application-specific integrated circuit chip acquires an external audio signal
  • the application-specific integrated circuit chip recognizes the audio signal to obtain a recognition result
  • the application-specific integrated circuit chip sends indication information of recognition completion to the central processing unit;
  • the central processing unit extracts the recognition result from the application-specific integrated circuit chip according to the instruction information, and performs a target operation corresponding to the recognition result.
  • the application-specific integrated circuit chip includes a micro control unit, a pre-processing unit, and an algorithm unit.
  • the application-specific integrated circuit chip identifies the audio signal and obtains a recognition result, including:
  • the preprocessing unit extracts a Mel frequency cepstrum coefficient of the audio signal using a Mel frequency cepstrum coefficient algorithm according to the control of the micro control unit;
  • the algorithm unit uses a deep neural network algorithm to perform keyword recognition on the Mel frequency cepstrum coefficient, and obtains candidate keywords and the confidence level of the candidate keywords as the recognition result.
  • the performing a target operation corresponding to the recognition result includes:
  • the central processing unit uses the candidate keyword as a target keyword of the audio signal, determines, according to the correspondence between the preset keyword and the preset operation, the preset operation corresponding to the target keyword as the target operation, and performs the target operation.
  • the application-specific integrated circuit chip further includes a memory
  • the device control method further includes:
  • the memory stores the audio signal, the candidate keywords, the confidence level, and intermediate data generated by the preprocessing unit and the algorithm unit during execution.
  • the application-specific integrated circuit chip further includes a cache memory
  • the device control method further includes:
  • the cache memory caches data stored in the memory and data fetched from the memory.
  • the method before using the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient of the audio signal, the method further includes:
  • the pre-processing unit pre-processes the audio signal. After pre-processing the audio signal, a Mel frequency cepstrum coefficient algorithm is used to extract a Mel frequency cepstrum coefficient of the audio signal.
  • the method before using the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient of the audio signal, the method further includes:
  • the pre-processing unit extracts a voiceprint feature of the audio signal, determines whether the voiceprint feature matches a preset voiceprint feature, and pre-processes the audio signal when the voiceprint feature matches the preset voiceprint feature.
  • determining whether the voiceprint feature matches a preset voiceprint feature includes:
  • the preprocessing unit obtains a similarity between the voiceprint feature and the preset voiceprint feature, determines whether the similarity is greater than or equal to a first preset similarity, and, when the similarity is greater than or equal to the first preset similarity, determines that the voiceprint feature matches the preset voiceprint feature.
  • the device control method provided in the embodiment of the present application further includes:
  • when the similarity is less than the first preset similarity but greater than or equal to a second preset similarity, the preprocessing unit obtains current position information, determines from the position information whether the device is currently within a preset position range, and, when it is, determines that the voiceprint feature matches the preset voiceprint feature.
  • the device control method provided in the embodiment of the present application further includes:
  • when the voiceprint feature does not match the preset voiceprint feature, the preprocessing unit instructs the micro control unit to delete the audio signal.
  • an embodiment of the present application further provides a device control method.
  • the device control method is executed by an electronic device provided in the embodiment of the present application.
  • The electronic device includes an application-specific integrated circuit chip 101 and a central processing unit 102, and the power consumption of the application-specific integrated circuit chip 101 is smaller than that of the central processing unit 102. Please refer to FIG. 5.
  • the device control method includes:
  • the application specific integrated circuit chip 101 obtains an external audio signal.
  • The ASIC chip 101 in the embodiment of the present application is an application-specific integrated circuit designed for audio recognition; compared with the general-purpose central processing unit 102, it offers higher audio recognition efficiency and lower power consumption.
  • the ASIC chip 101 and the central processing unit 102 establish a data communication connection through a communication bus
  • The ASIC chip 101 can obtain external audio signals in many different ways. For example, when no microphone is externally connected to the electronic device, the ASIC chip 101 can collect the sound emitted by an external speaker through a built-in microphone of the electronic device to obtain an external audio signal; when a microphone is externally connected to the electronic device, the ASIC chip 101 can collect external sound through that external microphone to obtain an external audio signal.
  • When the ASIC chip 101 collects external audio signals through the microphone, if the microphone is an analog microphone, an analog audio signal will be collected. In that case the ASIC chip 101 needs to sample the analog audio signal and convert it into a digital audio signal, for example by sampling at a frequency of 16 kHz.
  • If the microphone is a digital microphone, the ASIC chip 101 directly collects a digital audio signal through the digital microphone, and no conversion is needed.
  • the application-specific integrated circuit chip 101 performs a recognition operation on the acquired audio signal to obtain a recognition result.
  • After obtaining an external audio signal, the application-specific integrated circuit chip 101 performs a recognition operation on the acquired audio signal according to a pre-configured recognition mode to obtain a recognition result.
  • For example, when the recognition mode of the ASIC chip 101 is configured as gender recognition, the chip, upon recognizing the acquired audio signal, extracts feature information capable of characterizing gender from the audio signal, recognizes the gender of the speaker of the audio signal according to the extracted feature information, and obtains a recognition result indicating whether the speaker is male or female.
  • For another example, when the recognition mode of the application-specific integrated circuit chip 101 is configured to identify the environment type (a subway car scene, a bus carriage scene, an office scene, etc.), the chip, upon recognizing the acquired audio signal, extracts feature information capable of characterizing the environment scene from the audio signal, identifies the current environment scene based on the extracted feature information, and obtains a recognition result describing the type of the current environment scene.
  • the application specific integrated circuit chip 101 sends instruction information indicating completion of the identification operation to the central processing unit 102.
  • After completing the recognition operation on the audio signal and obtaining the recognition result, the application-specific integrated circuit chip 101 sends instruction information indicating the completion of the recognition operation to the central processing unit 102.
  • The function of the instruction information is to inform the central processing unit 102 that the ASIC chip 101 has completed the recognition operation on the audio signal and that the recognition result can be extracted from the ASIC chip 101.
  • the foregoing indication information may be sent in the form of an interrupt signal.
  • the central processing unit 102 extracts the foregoing recognition result from the ASIC chip 101 according to the received instruction information, and performs a target operation corresponding to the foregoing recognition result.
  • That is, the central processing unit 102, according to the instruction information, extracts from the application-specific integrated circuit chip 101 the recognition result that the chip obtained by recognizing the audio signal.
  • After extracting the recognition result of the audio signal, the central processing unit 102 further performs a target operation corresponding to the recognition result.
  • For example, when the application-specific integrated circuit chip 101 is configured for gender recognition, if the recognition result "the speaker is male" is extracted, the theme mode of the operating system is switched to a masculine theme mode; if the recognition result "the speaker is female" is extracted, the theme mode of the operating system is switched to a feminine theme mode.
  • For another example, when the application-specific integrated circuit chip 101 is configured for environment type recognition, if the recognition result "office scene" is extracted, the prompt mode of the operating system is switched to silent mode, and if the recognition result "bus scene" is extracted, the prompt mode of the operating system is switched to a vibration-plus-ringing mode, and so on.
  • In the embodiment of the present application, the low-power application-specific integrated circuit chip 101 first obtains an external audio signal, performs a recognition operation on the acquired audio signal to obtain a recognition result, and sends instruction information indicating completion of the recognition operation to the central processing unit 102.
  • The central processing unit 102 then extracts the recognition result from the ASIC chip 101 according to the instruction information and performs the target operation corresponding to the recognition result. The audio recognition task of the central processing unit 102 is thus offloaded to the lower-power application-specific integrated circuit chip 101, and the central processing unit 102 performs the corresponding processing according to the chip's recognition result.
  • This manner of having the ASIC chip 101 cooperate with the central processing unit 102 to perform voice control of the electronic device can reduce the power consumption the electronic device needs to implement voice control; the overall flow is sketched below.
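  • Tying the steps of FIG. 5 together, a hypothetical end-to-end simulation of this ASIC/CPU division of labor; it reuses the illustrative helpers sketched earlier (capture_audio, AsicChip, handle_recognition), all of which are assumptions rather than the patent's implementation:

```python
def device_control_loop() -> None:
    """Simulated flow: the ASIC acquires and recognizes; the CPU acts on results."""
    asic = AsicChip()
    while True:
        audio = capture_audio()                      # ASIC: acquire external audio
        keyword, confidence = asic.recognize(audio)  # ASIC: recognition operation
        # The ASIC would raise an interrupt here; the CPU then extracts the result.
        operation = handle_recognition(keyword, confidence)
        if operation is not None:
            print(f"CPU executes target operation: {operation}")
```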
  • In some embodiments, the ASIC chip 101 includes a micro control unit 1011, a pre-processing unit 1012, and an algorithm unit 1013. Referring to FIG. 6, the step in which the ASIC chip 101 performs a recognition operation on the acquired audio signal to obtain a recognition result includes:
  • the preprocessing unit 1012 uses the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient of the audio signal under the control of the micro control unit 1011;
  • the algorithm unit 1013 uses the deep neural network algorithm to perform keyword recognition on the Mel frequency cepstrum coefficients under the control of the micro control unit 1011, and obtains the candidate keywords and the confidence of the candidate keywords.
  • the micro control unit 1011 first obtains an external audio signal through a microphone. For example, when the electronic device is not externally connected with a microphone, the micro control unit 1011 can collect external sound through a built-in microphone (not shown in FIG. 2) of the electronic device to obtain an external audio signal. For example, when a microphone is externally connected to the electronic device, the micro control unit 1011 can collect external sound through the microphone externally connected to the electronic device to obtain an external audio signal.
  • When the micro control unit 1011 collects external audio signals through the microphone, if the microphone is an analog microphone, an analog audio signal will be collected, and the micro control unit 1011 needs to sample the analog audio signal and convert it into a digital audio signal, for example by sampling at a frequency of 16 kHz; if the microphone is a digital microphone, the micro control unit 1011 directly collects a digital audio signal through the digital microphone, and no conversion is needed.
  • After obtaining an external audio signal, the micro control unit 1011 generates first control information and sends the first control information to the pre-processing unit 1012.
  • After receiving the first control information from the micro control unit 1011, the pre-processing unit 1012 uses the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient of the audio signal according to the first control information. After extracting the Mel frequency cepstrum coefficient, the pre-processing unit 1012 sends first feedback information to the micro control unit 1011.
  • After receiving the first feedback information from the pre-processing unit 1012, the micro control unit 1011 determines that the pre-processing unit 1012 has extracted the Mel frequency cepstrum coefficient of the audio signal, and at this time generates second control information and sends it to the algorithm unit 1013.
  • After receiving the second control information from the micro control unit 1011, the algorithm unit 1013 uses the built-in deep neural network algorithm to perform keyword recognition on the aforementioned Mel frequency cepstrum coefficients (keyword recognition detects whether a predefined word appears in the speech corresponding to the audio signal), obtaining candidate keywords and the confidence of the candidate keywords. After the keyword recognition is completed, the algorithm unit 1013 sends second feedback information to the micro control unit 1011.
  • After receiving the second feedback information from the algorithm unit 1013, the micro control unit 1011 determines that the algorithm unit 1013 has completed keyword recognition, and takes the candidate keywords identified by the algorithm unit 1013 and their confidence as the recognition result of this recognition operation on the audio signal.
  • the ASIC chip 101 further includes a memory 1014.
  • The memory 1014 can be used to store the acquired audio signal, the identified candidate keywords and their confidence levels, and intermediate data generated by the pre-processing unit 1012 and the algorithm unit 1013 during execution.
  • For example, the micro control unit 1011 stores the audio signal obtained through the microphone in the memory 1014; the pre-processing unit 1012, under the control of the micro control unit 1011, uses the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient of the audio signal stored in the memory 1014 and stores the extracted coefficient in the memory 1014; and the algorithm unit 1013, under the control of the micro control unit 1011, uses the built-in deep neural network algorithm to perform keyword recognition on the Mel frequency cepstrum coefficient stored in the memory 1014, obtaining candidate keywords and their confidence, which are likewise stored in the memory 1014.
  • the ASIC chip 101 further includes a cache memory 1015, which can be used to cache data stored in the memory 1014 and data retrieved from the memory 1014.
  • the cache memory 1015 has a smaller storage space than the memory 1014, but has a higher speed.
  • the cache memory 1015 can improve the processing efficiency of the preprocessing unit 1012 and the algorithm unit 1013.
  • For example, when the pre-processing unit 1012 extracts Mel frequency cepstrum coefficients from the audio signal, accessing data directly from the memory 1014 requires waiting a certain period of time, whereas the cache memory 1015 can hold a portion of the data that the pre-processing unit 1012 has just used or reuses frequently. If the pre-processing unit 1012 needs that data again, it can be fetched directly from the cache memory 1015, which avoids repeated memory accesses, reduces the waiting time of the pre-processing unit 1012, and thus improves its processing efficiency.
  • the central processing unit 102 executes a target operation corresponding to the foregoing recognition result, including:
  • the central processing unit 102 uses the candidate keywords as target keywords of the audio signal when the confidence of the candidate keywords reaches a preset confidence level;
  • the central processing unit 102 determines the preset operation corresponding to the target keyword as the target operation according to the correspondence between the preset keyword and the preset operation, and executes the target operation.
  • After extracting the identified candidate keyword and its confidence from the ASIC chip 101 according to the instruction information of the ASIC chip 101, the central processing unit 102 first determines whether the confidence reaches a preset confidence level (which can be set by a person skilled in the art according to actual needs, for example 90%).
  • If it does, the central processing unit 102 uses the candidate keyword as the target keyword of the audio signal.
  • the central processing unit 102 determines the preset operation corresponding to the target keyword as the target operation according to the correspondence between the preset keyword and the preset operation.
  • The correspondence between keywords and preset operations can be set according to actual needs. For example, the preset operation corresponding to the keyword "Little Europe, Little Europe" can be set to "wake the operating system"; thus, when the target keyword is "Little Europe, Little Europe" and the operating system is currently in a sleep state, the central processing unit 102 wakes up the operating system.
  • the method further includes:
  • the pre-processing unit 1012 pre-processes the audio signal
  • After the preprocessing unit 1012 finishes preprocessing the audio signal, it uses the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient of the audio signal.
  • Specifically, after receiving the first control information from the micro control unit 1011, the pre-processing unit 1012 first performs pre-emphasis and windowing on the audio signal.
  • Pre-emphasis increases the energy of the high-frequency part of the audio signal.
  • For the spectrum of a voiced audio signal, the energy of the low-frequency part is often higher than that of the high-frequency part: the spectral energy attenuates by about 20 dB for every tenfold increase in frequency, and the noise background of the circuit when the microphone collects the audio signal further raises the energy of the low-frequency part.
  • Windowing is applied because audio signals are generally non-stationary and their statistical characteristics are not fixed; over a sufficiently short period, however, the signal can be considered stationary, and extracting such short segments is called windowing.
  • the window is described by three parameters: window length (in milliseconds), offset, and shape.
  • Each windowed audio signal is called a frame
  • the duration of each frame in milliseconds is called the frame length
  • the distance between the left borders of two adjacent frames is called the frame shift.
  • a Hamming window, whose edges taper smoothly to zero, may be used for the windowing processing.
  • After pre-emphasis and windowing are completed, the pre-processing unit 1012 can use the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient of the audio signal.
  • The process of extracting the Mel frequency cepstrum coefficient by the pre-processing unit 1012 is roughly as follows: using the non-linear characteristics of human hearing, the frequency spectrum of the audio signal is converted into a non-linear spectrum based on the Mel frequency and then transformed to the cepstral domain, yielding the Mel frequency cepstrum coefficient.
  • the method before the step of preprocessing the audio signal by the preprocessing unit 1012, the method further includes:
  • the preprocessing unit 1012 extracts the voiceprint features of the audio signal
  • the preprocessing unit 1012 determines whether the extracted voiceprint features match the preset voiceprint features
  • the pre-processing unit 1012 pre-processes the aforementioned audio signal when the extracted voiceprint features match the preset voiceprint features.
  • The voiceprint feature is mainly determined by two factors. The first is the size of the acoustic cavities, specifically the throat, nasal cavity, and oral cavity; the shape, size, and position of these organs determine the tension of the vocal cords and the range of sound frequencies. Therefore, even when different people say the same thing, the frequency distributions of their voices differ; some voices, for example, sound low and sonorous.
  • the second factor that determines the characteristics of the voiceprint is the manner in which the vocal organs are manipulated.
  • The vocal organs include the lips, teeth, tongue, soft palate, and diaphragm muscles, and their interaction produces clear speech. The way they cooperate is acquired incidentally through a person's interactions with the people around them: in the process of learning to speak, by imitating the speech of different people nearby, a person gradually forms his or her own voiceprint characteristics.
  • the preprocessing unit 1012 first extracts the voiceprint characteristics of the audio signal.
  • After acquiring the voiceprint feature of the audio signal, the preprocessing unit 1012 further compares the acquired voiceprint feature with a preset voiceprint feature to determine whether the voiceprint feature matches the preset voiceprint feature.
  • the preset voiceprint feature may be a voiceprint feature previously recorded by the owner, and it is determined whether the acquired voiceprint feature matches the preset voiceprint feature, that is, whether the speaker of the audio signal is the owner.
  • If they match, the pre-processing unit 1012 determines that the speaker of the audio signal is the owner, and then proceeds to pre-process the audio signal and extract the Mel frequency cepstrum coefficient.
  • the step of the pre-processing unit 1012 determining whether the extracted voiceprint features match the preset voiceprint features includes:
  • the preprocessing unit 1012 obtains the similarity between the aforementioned voiceprint feature and the preset voiceprint feature
  • the preprocessing unit 1012 determines whether the obtained similarity is greater than or equal to the first preset similarity
  • when the similarity is greater than or equal to the first preset similarity, the preprocessing unit 1012 determines that the obtained voiceprint feature matches the preset voiceprint feature.
  • That is, when determining whether the acquired voiceprint feature matches the preset voiceprint feature, the pre-processing unit 1012 may obtain the similarity between the voiceprint feature (that is, the voiceprint feature extracted from the aforementioned audio signal) and the preset voiceprint feature, and determine whether the obtained similarity is greater than or equal to the first preset similarity (set according to actual needs, for example 95%). If the obtained similarity is greater than or equal to the first preset similarity, the acquired voiceprint feature is determined to match the preset voiceprint feature; if it is less than the first preset similarity, the acquired voiceprint feature is determined not to match the preset voiceprint feature.
  • if the voiceprint features do not match, the pre-processing unit 1012 determines that the speaker of the current audio signal is not the owner and sends third feedback information to the micro control unit 1011.
  • after receiving the third feedback information from the pre-processing unit 1012, the micro control unit 1011 deletes the acquired audio signal and continues to acquire external audio signals; the audio signals are not processed further until an audio signal from the owner is obtained.
  • for how to perform the pre-processing and the extraction of the Mel-frequency cepstral coefficients, reference may be made to the relevant descriptions of the foregoing embodiments; details are not repeated here. A sketch of one typical MFCC extraction pipeline follows.
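Since the referenced embodiments are not part of this section, the sketch below shows one common MFCC pipeline rather than the patent's exact steps: a pre-emphasis filter followed by MFCC computation via the librosa library. The 0.97 pre-emphasis coefficient, 16 kHz sample rate, and 13 coefficients are conventional but assumed values.

```python
# Hedged sketch: a typical pre-processing + MFCC extraction pipeline.
import numpy as np
import librosa

def extract_mfcc(audio: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    # Pre-emphasis boosts high frequencies, a common pre-processing step.
    emphasized = np.append(audio[0], audio[1:] - 0.97 * audio[:-1])
    # 13 MFCCs per frame is a conventional choice for speech tasks.
    return librosa.feature.mfcc(y=emphasized, sr=sample_rate, n_mfcc=13)
```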
  • the method further includes:
  • the pre-processing unit 1012 obtains the current position information when the aforementioned similarity is less than the first preset similarity and greater than or equal to the second preset similarity;
  • the pre-processing unit 1012 determines, according to the obtained position information, whether it is currently within a preset position range.
  • the characteristics of the voiceprint are closely related to the physiological state of the human body. In daily life, if the user catches a cold, his or her voice becomes hoarse and the voiceprint characteristics change accordingly. In this case, even if the acquired audio signal was spoken by the owner, the pre-processing unit 1012 may fail to recognize it. There are many other situations that can prevent the pre-processing unit 1012 from identifying the owner, which are not enumerated here.
  • therefore, after the pre-processing unit 1012 finishes judging the similarity of the voiceprint feature, if the similarity between the acquired voiceprint feature and the preset voiceprint feature is less than the first preset similarity, it further judges whether that similarity is greater than or equal to the second preset similarity (the second preset similarity is configured to be smaller than the first preset similarity and can be chosen by those skilled in the art according to actual needs; for example, when the first preset similarity is set to 95%, the second preset similarity may be set to 75%).
  • if the similarity is greater than or equal to the second preset similarity, the pre-processing unit 1012 further obtains the current location information.
  • in specific implementation, the pre-processing unit 1012 may send a position acquisition request to the positioning module of the electronic device (which may use different positioning technologies, such as satellite positioning or base-station positioning, to obtain the current position) and instruct the positioning module to return the current position information.
  • the pre-processing unit 1012 determines whether it is currently within a preset position range according to the position information.
  • the preset position range can be configured as a range the owner commonly occupies, such as the owner's home or workplace.
  • if the device is currently within the preset position range, the pre-processing unit 1012 determines that the acquired voiceprint feature matches the preset voiceprint feature and recognizes the speaker of the audio signal as the owner; a sketch combining the two similarity thresholds with this location fallback is given below.
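The sketch below combines the two-threshold decision with the location fallback described above. The geofence test, threshold values, and position representation are all assumptions; a real device would query its positioning module (satellite or base-station based) for the current position.

```python
# Hedged sketch: owner verification with a second, location-gated threshold.
from dataclasses import dataclass

FIRST_PRESET_SIMILARITY = 0.95   # confident voiceprint match
SECOND_PRESET_SIMILARITY = 0.75  # borderline match, e.g., a hoarse voice

@dataclass
class Position:
    latitude: float
    longitude: float

def within_preset_range(pos, preset_positions, tolerance_deg=0.01):
    """Crude geofence: True if pos is near any commonly used location."""
    return any(
        abs(pos.latitude - p.latitude) <= tolerance_deg
        and abs(pos.longitude - p.longitude) <= tolerance_deg
        for p in preset_positions
    )

def speaker_is_owner(similarity, current_pos, preset_positions):
    if similarity >= FIRST_PRESET_SIMILARITY:
        return True  # voiceprint alone is conclusive
    if similarity >= SECOND_PRESET_SIMILARITY:
        # Borderline match: accept only within the owner's usual locations.
        return within_preset_range(current_pos, preset_positions)
    return False
```

For example, with the thresholds above, a similarity of 0.80 measured at the owner's home would be accepted, while the same similarity measured elsewhere would be rejected.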

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Telephone Function (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Disclosed are an electronic device and a device control method. The electronic device comprises a central processing unit and an application-specific integrated circuit chip. The method comprises the following steps: the application-specific integrated circuit chip acquires an external audio signal (101); the application-specific integrated circuit chip performs a recognition operation on the acquired audio signal to obtain a recognition result (102); the application-specific integrated circuit chip sends, to the central processing unit, indication information indicating that the recognition operation is completed (103); the central processing unit retrieves the recognition result from the application-specific integrated circuit chip according to the received indication information, and executes a target operation corresponding to the recognition result (104).
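A minimal sketch of the abstract's four-step flow, using a Python thread to stand in for the central processing unit and a plain function call for the application-specific integrated circuit chip; the queue-based indication channel and all function names are hypothetical.

```python
# Hedged sketch: ASIC performs recognition (steps 101-103), CPU acts (104).
import queue
import threading

indication_channel: queue.Queue = queue.Queue()
result_register: dict = {}

def recognize(audio_signal):
    return "wake_word_detected"  # hypothetical recognition result

def execute_target_operation(result):
    print(f"executing target operation for: {result}")  # hypothetical action

def asic_task(audio_signal):
    # Steps 101-102: acquire the external audio signal and recognize it.
    result_register["recognition"] = recognize(audio_signal)
    # Step 103: indicate to the CPU that recognition is complete.
    indication_channel.put("recognition_done")

def cpu_task():
    # Step 104: on receiving the indication information, retrieve the
    # recognition result and execute the corresponding target operation.
    if indication_channel.get() == "recognition_done":
        execute_target_operation(result_register["recognition"])

cpu = threading.Thread(target=cpu_task)
cpu.start()
asic_task(b"\x00\x01")  # simulated external audio signal
cpu.join()
```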
PCT/CN2019/085554 2018-06-08 2019-05-05 Dispositif électronique, et procédé de commande de dispositif WO2019233228A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810589643.2 2018-06-08
CN201810589643.2A CN108711429B (zh) 2018-06-08 2018-06-08 电子设备及设备控制方法

Publications (1)

Publication Number Publication Date
WO2019233228A1 true WO2019233228A1 (fr) 2019-12-12

Family

ID=63871448

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/085554 WO2019233228A1 (fr) 2018-06-08 2019-05-05 Dispositif électronique, et procédé de commande de dispositif

Country Status (2)

Country Link
CN (1) CN108711429B (fr)
WO (1) WO2019233228A1 (fr)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108711429B (zh) * 2018-06-08 2021-04-02 Oppo广东移动通信有限公司 电子设备及设备控制方法
CN109636937A (zh) * 2018-12-18 2019-04-16 深圳市沃特沃德股份有限公司 语音考勤方法、装置及终端设备
CN110223687B (zh) * 2019-06-03 2021-09-28 Oppo广东移动通信有限公司 指令执行方法、装置、存储介质及电子设备
CN110310645A (zh) * 2019-07-02 2019-10-08 上海迥灵信息技术有限公司 智能控制系统的语音控制方法、装置和存储介质
CN111508475B (zh) * 2020-04-16 2022-08-09 五邑大学 一种机器人唤醒的语音关键词识别方法、装置及存储介质
CN113744117A (zh) * 2020-05-29 2021-12-03 Oppo广东移动通信有限公司 多媒体处理芯片、电子设备及动态图像处理方法
CN113352987B (zh) * 2021-05-31 2022-10-25 亿咖通(湖北)技术有限公司 一种控制车机警告音的方法及系统
CN115527373B (zh) * 2022-01-05 2023-07-14 荣耀终端有限公司 乘车工具识别方法及装置

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103700368A (zh) * 2014-01-13 2014-04-02 联想(北京)有限公司 用于语音识别的方法、语音识别装置和电子设备
CN104143327A (zh) * 2013-07-10 2014-11-12 腾讯科技(深圳)有限公司 一种声学模型训练方法和装置
US20140372112A1 (en) * 2013-06-18 2014-12-18 Microsoft Corporation Restructuring deep neural network acoustic models
CN105488227A (zh) * 2015-12-29 2016-04-13 惠州Tcl移动通信有限公司 一种电子设备及其基于声纹特征处理音频文件的方法
CN106228240A (zh) * 2016-07-30 2016-12-14 复旦大学 基于fpga的深度卷积神经网络实现方法
CN106560891A (zh) * 2015-10-06 2017-04-12 三星电子株式会社 使用声学建模的语音识别设备和方法
CN107735803A (zh) * 2015-06-25 2018-02-23 微软技术许可有限责任公司 用于深度学习应用的存储器带宽管理
US10089979B2 (en) * 2014-09-16 2018-10-02 Electronics And Telecommunications Research Institute Signal processing algorithm-integrated deep neural network-based speech recognition apparatus and learning method thereof
CN108711429A (zh) * 2018-06-08 2018-10-26 Oppo广东移动通信有限公司 电子设备及设备控制方法

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005181510A (ja) * 2003-12-17 2005-07-07 Toshiba Corp Icボイスリピータ
CN102905029A (zh) * 2012-10-17 2013-01-30 广东欧珀移动通信有限公司 一种手机及智能语音寻找手机的方法
CN103474071A (zh) * 2013-09-16 2013-12-25 重庆邮电大学 嵌入式便携语音控制器及语音识别的智能家居系统
CN105575395A (zh) * 2014-10-14 2016-05-11 中兴通讯股份有限公司 语音唤醒方法及装置、终端及其处理方法
CN106940998B (zh) * 2015-12-31 2021-04-16 阿里巴巴集团控股有限公司 一种设定操作的执行方法及装置
CN106250751B (zh) * 2016-07-18 2019-09-17 青岛海信移动通信技术股份有限公司 一种移动设备及调整体征信息检测阈值的方法

Also Published As

Publication number Publication date
CN108711429A (zh) 2018-10-26
CN108711429B (zh) 2021-04-02

Similar Documents

Publication Publication Date Title
WO2019233228A1 (fr) Dispositif électronique, et procédé de commande de dispositif
US11423904B2 (en) Method and system of audio false keyphrase rejection using speaker recognition
CN107799126B (zh) 基于有监督机器学习的语音端点检测方法及装置
CN110021307B (zh) 音频校验方法、装置、存储介质及电子设备
CN109272991B (zh) 语音交互的方法、装置、设备和计算机可读存储介质
WO2019242414A1 (fr) Procédé et appareil de traitement vocal, support d'informations et dispositif électronique
CN105206271A (zh) 智能设备的语音唤醒方法及实现所述方法的系统
KR20160005050A (ko) 키워드 검출을 위한 적응적 오디오 프레임 프로세싱
CN110223687B (zh) 指令执行方法、装置、存储介质及电子设备
CN111210829A (zh) 语音识别方法、装置、系统、设备和计算机可读存储介质
CN111145763A (zh) 一种基于gru的音频中的人声识别方法及系统
WO2014173325A1 (fr) Procédé et dispositif de reconnaissance de gutturophonie
US11437022B2 (en) Performing speaker change detection and speaker recognition on a trigger phrase
WO2023088083A1 (fr) Procédé et appareil d'amélioration de la parole
CN114067782A (zh) 音频识别方法及其装置、介质和芯片系统
WO2022199405A1 (fr) Procédé et appareil de commande vocale
WO2017177629A1 (fr) Procédé et dispositif de reconnaissance vocale de conversation éloignée
WO2022007846A1 (fr) Procédé d'amélioration de la qualité de la parole, dispositif, système et support de stockage
US11290802B1 (en) Voice detection using hearable devices
CN115206306A (zh) 语音交互方法、装置、设备及系统
WO2019041871A1 (fr) Procédé et dispositif de reconnaissance d'objet vocal
WO2020073839A1 (fr) Procédé, appareil et système de réveil vocal et dispositif électronique
CN108337620A (zh) 一种语音控制的扩音器及其控制方法
CN112017662B (zh) 控制指令确定方法、装置、电子设备和存储介质
CN208337877U (zh) 一种语音控制的扩音器

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19815256

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19815256

Country of ref document: EP

Kind code of ref document: A1