WO2019233228A1 - Electronic device and device control method - Google Patents


Info

Publication number
WO2019233228A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
preset
voiceprint feature
processing unit
integrated circuit
Prior art date
Application number
PCT/CN2019/085554
Other languages
French (fr)
Chinese (zh)
Inventor
陈岩 (Chen Yan)
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp., Ltd. (Oppo广东移动通信有限公司)
Priority date
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Publication of WO2019233228A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 Execution procedure of a spoken command
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/06 Decision making techniques; Pattern matching strategies
    • G10L 17/08 Use of distortion metrics or a particular distance between probe pattern and reference templates
    • G10L 17/22 Interactive procedures; Man-machine interfaces
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques in which the extracted parameters are the cepstrum

Definitions

  • the present application relates to the technical field of electronic devices, and in particular, to an electronic device and a device control method.
  • the use of voice recognition technology in electronic devices is becoming more and more widespread.
  • with voice recognition, voice control of electronic devices can be achieved. For example, users can speak specific voice instructions to control an electronic device to take pictures or play music.
  • an embodiment of the present application provides an electronic device that includes a central processing unit and an application-specific integrated circuit chip, and the power consumption of the application-specific integrated circuit chip is less than the power consumption of the central processing unit.
  • the application-specific integrated circuit chip is configured to obtain an external audio signal
  • the application-specific integrated circuit chip is further configured to perform a recognition operation on the audio signal to obtain a recognition result
  • the application-specific integrated circuit chip is further configured to send instruction information indicating completion of the identification operation to the central processing unit;
  • the central processing unit is configured to extract the recognition result from the application-specific integrated circuit chip according to the instruction information, and execute a target operation corresponding to the recognition result.
  • an embodiment of the present application provides a method for controlling a device, which is applied to an electronic device.
  • the electronic device includes a central processing unit and an application-specific integrated circuit chip, and the power consumption of the application-specific integrated circuit chip is less than that of the central processing unit.
  • the device control method includes:
  • the application-specific integrated circuit chip acquires an external audio signal
  • the application-specific integrated circuit chip recognizes the audio signal to obtain a recognition result
  • the application-specific integrated circuit chip sends identification completion indication information to the central processing unit;
  • the central processing unit extracts the recognition result from the application-specific integrated circuit chip according to the instruction information, and performs a target operation corresponding to the recognition result.
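The four-step handoff above can be sketched as a minimal simulation. All class names, method names, and the string-based indication below are illustrative assumptions for the sketch, not structures defined in the patent.

```python
# Hypothetical sketch of the ASIC/CPU handoff described above.

class AsicChip:
    """Low-power chip: acquires audio, recognizes it, stores the result."""

    def __init__(self):
        self._result = None

    def acquire_and_recognize(self, audio_signal, recognizer):
        # Steps 1-2: acquire the external audio signal and run recognition.
        self._result = recognizer(audio_signal)
        return "recognition_complete"          # step 3: indication to the CPU

    def read_result(self):
        # The CPU extracts the stored result over the communication bus.
        return self._result


class CentralProcessingUnit:
    """Executes the target operation corresponding to the recognition result."""

    def __init__(self, operations):
        self._operations = operations          # recognition result -> operation

    def on_indication(self, indication, asic):
        if indication == "recognition_complete":
            result = asic.read_result()        # step 4a: extract the result
            return self._operations.get(result, "ignore")  # step 4b: target op


asic = AsicChip()
cpu = CentralProcessingUnit({"wake_word": "wake_operating_system"})
indication = asic.acquire_and_recognize("raw-pcm", lambda s: "wake_word")
print(cpu.on_indication(indication, asic))     # -> wake_operating_system
```

The design point of the scheme is that the CPU stays idle until the indication arrives; only then does it touch the chip.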
  • FIG. 1 is a first schematic structural diagram of an electronic device according to an embodiment of the present application.
  • FIG. 2 is a second schematic structural diagram of an electronic device according to an embodiment of the present application.
  • FIG. 3 is a third schematic structural diagram of an electronic device according to an embodiment of the present application.
  • FIG. 4 is a fourth schematic structural diagram of an electronic device according to an embodiment of the present application.
  • FIG. 5 is a schematic flowchart of a device control method according to an embodiment of the present application.
  • FIG. 6 is a detailed flowchart of identifying an audio signal by an application specific integrated circuit chip in the embodiment of the present application.
  • FIG. 7 is a detailed flowchart of a target operation performed by a central processing unit according to an embodiment of the present application.
  • an embodiment herein means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application.
  • the appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor are they independent or alternative embodiments that are mutually exclusive with other embodiments. It is understood, both explicitly and implicitly, by those skilled in the art that the embodiments described herein may be combined with other embodiments.
  • the electronic device 100 includes an application-specific integrated circuit chip 101 and a central processing unit 102, and the power consumption of the application-specific integrated circuit chip 101 is less than the power consumption of the central processing unit 102.
  • the ASIC chip 101 is used to obtain an external audio signal, perform a recognition operation on the acquired audio signal, obtain a recognition result, and send instruction information indicating completion of the recognition operation to the central processing unit 102.
  • the ASIC chip 101 in the embodiment of the present application is an ASIC designed for the purpose of audio recognition. Compared with the general-purpose central processing unit 102, the ASIC chip has higher audio recognition efficiency and lower power consumption.
  • the ASIC chip 101 and the central processing unit 102 establish a data communication connection through a communication bus
  • the application-specific integrated circuit chip 101 can obtain external audio signals in many different ways. For example, when no microphone is externally connected to the electronic device, the application-specific integrated circuit chip 101 can collect external sound through the electronic device's built-in microphone (not shown in FIG. 1) to obtain an external audio signal; when a microphone is externally connected to the electronic device, the ASIC chip 101 may collect external sound through that external microphone to obtain an external audio signal.
  • when the ASIC chip 101 collects external audio signals through the microphone, if the microphone is an analog microphone, an analog audio signal will be collected.
  • the ASIC chip 101 then needs to sample the analog audio signal and convert it into a digital audio signal, for example by sampling at a frequency of 16 kHz.
  • if the microphone is a digital microphone, the ASIC chip 101 will directly collect the digital audio signal through the digital microphone, with no conversion required.
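As a rough illustration of the digitization step, the sketch below samples a synthetic "analog" tone at the 16 kHz rate mentioned above and quantizes it to signed 16-bit values. The tone frequency, duration, and bit depth are assumptions chosen for the example.

```python
import math

SAMPLE_RATE = 16_000      # 16 kHz sampling frequency, as in the embodiment
DURATION_S = 0.01         # 10 ms of audio (illustrative)

def sample_tone(freq_hz, duration_s, rate=SAMPLE_RATE):
    """Digitize a pure tone: sample at `rate`, then quantize to 16 bits."""
    n_samples = int(rate * duration_s)
    analog = (math.sin(2 * math.pi * freq_hz * t / rate) for t in range(n_samples))
    # Quantize each sample to a signed 16-bit integer.
    return [int(round(a * 32767)) for a in analog]

pcm = sample_tone(440, DURATION_S)
print(len(pcm))           # 160 samples = 16 kHz * 10 ms
```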
  • the application specific integrated circuit chip 101 After obtaining an external audio signal, the application specific integrated circuit chip 101 performs a recognition operation on the acquired audio signal according to a pre-configured recognition mode to obtain a recognition result.
  • the recognition mode of the ASIC chip 101 is configured as gender recognition
  • when the ASIC chip 101 recognizes the acquired audio signal, it extracts feature information capable of characterizing gender from the audio signal, recognizes the gender of the speaker of the audio signal according to the extracted feature information, and obtains a recognition result of whether the speaker is male or female.
  • the recognition mode of the application-specific integrated circuit chip 101 is configured to identify the environment type (a subway car scene, a bus carriage scene, an office scene, etc.)
  • when the application-specific integrated circuit chip 101 recognizes the acquired audio signal, it extracts feature information capable of characterizing the environment scene from the audio signal, identifies the current environment scene based on the extracted feature information, and obtains a recognition result describing the type of the current environment scene.
  • after completing the recognition operation on the audio signal and obtaining the recognition result, the application-specific integrated circuit chip 101 sends instruction information indicating completion of the recognition operation to the central processing unit 102.
  • the function of the instruction information is to inform the central processing unit 102 that the ASIC chip 101 has completed the recognition operation on the audio signal, and that the recognition result can now be extracted from the ASIC chip 101.
  • the foregoing indication information may be sent in the form of an interrupt signal.
  • the central processing unit 102 is configured to extract the foregoing recognition result from the ASIC chip 101 according to the received instruction information, and execute a target operation corresponding to the foregoing recognition result.
  • according to the instruction information, the central processing unit 102 extracts from the application-specific integrated circuit chip 101 the recognition result obtained by the chip's recognition of the audio signal.
  • after extracting the recognition result of the audio signal, the central processing unit 102 further performs a target operation corresponding to the recognition result.
  • when the application-specific integrated circuit chip 101 is configured for gender recognition, if the recognition result "the speaker is male" is extracted, the theme mode of the operating system is switched to a masculine theme mode; if the recognition result "the speaker is female" is extracted, the theme mode of the operating system is switched to a feminine theme mode.
  • when the application-specific integrated circuit chip 101 is configured for environment type recognition, if the recognition result "office scene" is extracted, the prompt mode of the operating system is switched to silent mode; if the recognition result "bus scene" is extracted, the prompt mode of the operating system is switched to a vibration-plus-ring mode, and so on.
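The result-to-operation mappings in the two examples above amount to a simple lookup table. The result labels and mode names below are placeholders for the sketch, not identifiers from the patent.

```python
# Illustrative mapping from a recognition result to a system setting change.
RESULT_ACTIONS = {
    "speaker_male":   ("theme_mode",  "masculine"),
    "speaker_female": ("theme_mode",  "feminine"),
    "office_scene":   ("prompt_mode", "silent"),
    "bus_scene":      ("prompt_mode", "vibration_and_ring"),
}

def apply_recognition_result(result, system_state):
    """Apply the target operation corresponding to the extracted result."""
    setting, value = RESULT_ACTIONS[result]
    system_state[setting] = value
    return system_state

state = {"theme_mode": "default", "prompt_mode": "ring"}
print(apply_recognition_result("office_scene", state))  # prompt_mode -> silent
```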
  • the electronic device in the embodiment of the present application includes a central processing unit 102 and an application-specific integrated circuit chip 101.
  • the application-specific integrated circuit chip 101 with low power consumption obtains external audio signals, and performs identification operations on the acquired audio signals.
  • the application-specific integrated circuit chip 101 then sends the instruction information indicating completion of the recognition operation to the central processor 102, and the central processor 102 extracts the recognition result from the ASIC chip 101 according to the instruction information and executes the target operation corresponding to the recognition result. In this way, the audio recognition task is offloaded from the central processing unit 102 to the lower-power application-specific integrated circuit chip 101, and the corresponding processing is performed by the central processing unit 102 according to the recognition result of the application-specific integrated circuit chip 101.
  • the manner in which the ASIC cooperates with the central processing unit 102 to perform voice control on the electronic device can reduce the power consumption of the electronic device to implement voice control.
  • the ASIC chip 101 includes a micro control unit 1011, a pre-processing unit 1012, and an algorithm unit 1013.
  • the pre-processing unit 1012 is configured to extract the Mel frequency cepstrum coefficient of the audio signal using the Mel frequency cepstrum coefficient algorithm according to the control of the micro control unit 1011;
  • the algorithm unit 1013 is configured to perform keyword recognition on the Mel frequency cepstrum coefficient using a deep neural network algorithm according to the control of the micro control unit 1011 to obtain candidate keywords and the confidence of the candidate keywords.
  • the micro control unit 1011 first obtains external audio signals through a microphone. For example, when the electronic device is not externally connected with a microphone, the micro control unit 1011 can collect external sounds through a built-in microphone (not shown in FIG. 2) of the electronic device. An external audio signal is obtained. For another example, when a microphone is externally connected to the electronic device, the micro control unit 1011 can collect external sound through the microphone externally connected to the electronic device to obtain an external audio signal.
  • when the micro control unit 1011 collects external audio signals through the microphone, if the microphone is an analog microphone, an analog audio signal will be collected.
  • the micro control unit 1011 then needs to sample the analog audio signal and convert it into a digital audio signal, for example by sampling at a frequency of 16 kHz; if the microphone is a digital microphone, the micro control unit 1011 will directly collect the digital audio signal through the digital microphone, with no conversion required.
  • after obtaining an external audio signal, the micro control unit 1011 generates first control information and sends the first control information to the pre-processing unit 1012.
  • after receiving the first control information from the micro control unit 1011, the pre-processing unit 1012 uses the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient of the audio signal according to the first control information. After extracting the Mel frequency cepstrum coefficient of the audio signal, the pre-processing unit 1012 sends first feedback information to the micro control unit 1011.
  • after receiving the first feedback information from the pre-processing unit 1012, the micro control unit 1011 determines that the pre-processing unit 1012 has extracted the Mel frequency cepstrum coefficient of the audio signal, and at this point generates second control information and sends it to the algorithm unit 1013.
  • after receiving the second control information from the micro control unit 1011, the algorithm unit 1013 uses its built-in deep neural network algorithm to perform keyword recognition on the aforementioned Mel frequency cepstrum coefficients (keyword recognition detects whether a predefined word appears in the speech corresponding to the audio signal), obtaining candidate keywords and the confidence of the candidate keywords. After the keyword recognition is completed and the candidate keywords and their confidence levels are obtained, the algorithm unit 1013 sends second feedback information to the micro control unit 1011.
  • after receiving the second feedback information from the algorithm unit 1013, the micro control unit 1011 determines that the algorithm unit 1013 has completed keyword recognition, and takes the candidate keywords identified by the algorithm unit 1013 and their confidence levels as the recognition result of this recognition operation on the audio signal.
  • the ASIC chip 101 further includes a memory 1014 for storing the acquired audio signals, the identified candidate keywords, the confidence levels, and the intermediate data generated by the pre-processing unit 1012 and the algorithm unit 1013 during execution.
  • the micro control unit 1011 stores the audio signal obtained through the microphone in the memory 1014; under the control of the micro control unit 1011, the pre-processing unit 1012 uses the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient of the audio signal stored in the memory 1014 and stores the extracted coefficient in the memory 1014; under the control of the micro control unit 1011, the algorithm unit 1013 uses the built-in deep neural network algorithm to perform keyword recognition on the Mel frequency cepstrum coefficient stored in the memory 1014, obtains the candidate keywords and their confidence levels, and stores the obtained candidate keywords and confidence levels in the memory 1014.
  • the ASIC chip 101 further includes a cache memory 1015 for buffering data stored in the memory 1014 and data retrieved from the memory 1014.
  • the cache memory 1015 has a smaller storage space than the memory 1014, but has a higher speed.
  • the cache memory 1015 can improve the processing efficiency of the preprocessing unit 1012 and the algorithm unit 1013.
  • when the pre-processing unit 1012 extracts Mel frequency cepstrum coefficients from the audio signal, accessing data directly from the memory 1014 requires waiting for a certain period of time, whereas the cache memory 1015 can retain part of the data that the pre-processing unit 1012 has just used or uses repeatedly. If the pre-processing unit 1012 needs that data again, it can be called directly from the cache memory 1015. This avoids repeated accesses to the memory, reduces the waiting time of the pre-processing unit 1012, and thus improves its processing efficiency.
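A toy model of this cache behaviour is sketched below, assuming (purely for illustration) a least-recently-used eviction policy and a four-entry cache; the patent specifies neither.

```python
from collections import OrderedDict

class CachedMemory:
    """Small fast cache in front of a larger, slower memory."""

    def __init__(self, backing, cache_size=4):
        self._backing = backing           # stands in for memory 1014
        self._cache = OrderedDict()       # stands in for cache memory 1015
        self._cache_size = cache_size
        self.slow_reads = 0               # reads that had to wait on memory

    def read(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)  # cache hit: no waiting time
            return self._cache[key]
        self.slow_reads += 1              # cache miss: fetch from memory
        value = self._backing[key]
        self._cache[key] = value
        if len(self._cache) > self._cache_size:
            self._cache.popitem(last=False)  # evict least recently used entry
        return value

mem = CachedMemory({f"frame{i}": i for i in range(8)})
for _ in range(3):                        # the unit reuses the same two frames
    mem.read("frame0"); mem.read("frame1")
print(mem.slow_reads)                     # -> 2: only the first reads hit memory
```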
  • the pre-processing unit 1012 pre-processes the audio signal before using the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient; after the audio signal has been pre-processed, the Mel frequency cepstrum coefficient algorithm is used to extract the Mel frequency cepstrum coefficient of the audio signal.
  • after receiving the first control information from the micro control unit 1011, the pre-processing unit 1012 first performs pre-emphasis and windowing on the audio signal.
  • pre-emphasis increases the energy of the high-frequency part of the audio signal.
  • in the spectrum of a voice signal, the energy in the low-frequency part is often higher than the energy in the high-frequency part.
  • the spectrum energy is attenuated by roughly 20 dB per decade of frequency, and the noise background of the circuit when the microphone collects audio signals further increases the energy of the low-frequency part.
  • windowing: because audio signals are generally non-stationary, their statistical characteristics are not fixed; over a sufficiently short period of time, however, the signal can be considered stable. Cutting the signal into such short segments is called windowing.
  • the window is described by three parameters: window length (in milliseconds), offset, and shape.
  • Each windowed audio signal is called a frame
  • the duration of each frame in milliseconds is called the frame length
  • the distance between the left borders of two adjacent frames is called the frame shift.
  • a Hamming window, whose edges smoothly decay to 0, may be used for the windowing processing.
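The pre-emphasis and windowing steps above can be sketched as follows. The 0.97 pre-emphasis coefficient and the 25 ms frame length with 10 ms frame shift at 16 kHz are common defaults assumed for this example, not values stated in the patent.

```python
import math

def pre_emphasis(signal, coeff=0.97):
    # Boost high frequencies: y[n] = x[n] - coeff * x[n-1]
    return [signal[0]] + [signal[n] - coeff * signal[n - 1]
                          for n in range(1, len(signal))]

def hamming(length):
    # Hamming window: smooths each frame's edges down toward 0.
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def frames(signal, frame_len, frame_shift):
    # frame_len = window length in samples; frame_shift = distance between
    # the left borders of adjacent frames.
    win = hamming(frame_len)
    return [[s * w for s, w in zip(signal[start:start + frame_len], win)]
            for start in range(0, len(signal) - frame_len + 1, frame_shift)]

signal = [math.sin(0.1 * n) for n in range(400)]   # 25 ms of audio at 16 kHz
emphasized = pre_emphasis(signal)
framed = frames(emphasized, frame_len=400, frame_shift=160)  # 160 = 10 ms shift
print(len(framed), len(framed[0]))
```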
  • the pre-processing unit 1012 can use the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient of the audio signal.
  • the process of extracting the Mel frequency cepstrum coefficient by the pre-processing unit 1012 is roughly as follows: using the non-linear characteristics of human hearing, the frequency spectrum of the audio signal is converted into a non-linear spectrum based on the Mel frequency and then converted to the cepstrum domain, which yields the Mel frequency cepstrum coefficient.
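The non-linear frequency mapping underlying this step is commonly computed with the formula below; it is a standard choice in the speech-processing literature, not one quoted from the patent.

```python
import math

def hz_to_mel(f_hz):
    """Map linear frequency (Hz) to the perceptual mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse mapping, mel back to Hz."""
    return 700.0 * (10 ** (mel / 2595.0) - 1.0)

print(round(hz_to_mel(1000)))               # ~1000 mel at 1 kHz by construction
print(round(mel_to_hz(hz_to_mel(4000))))    # round-trips back to 4000 Hz
```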
  • the pre-processing unit 1012 is further configured to extract a voiceprint feature of the audio signal before pre-processing the audio signal, determine whether the voiceprint feature matches a preset voiceprint feature, and pre-process the audio signal only when the voiceprint feature matches the preset voiceprint feature.
  • the voiceprint feature is mainly determined by two factors. The first is the size of the acoustic cavities, specifically the throat, nasal cavity, and oral cavity. The shape, size, and position of these organs determine the tension of the vocal cords and the range of sound frequencies. Therefore, although different people may say the same thing, the frequency distribution of their voices differs: some sound low and deep, others loud and clear.
  • the second factor that determines the characteristics of the voiceprint is the manner in which the vocal organs are manipulated.
  • the vocal organs include the lips, teeth, tongue, soft palate, and diaphragm muscles, and their interaction produces clear speech. The way they cooperate is learned by each person, somewhat at random, through interactions with the people around them. In the process of learning to speak, by imitating the speech of different people around them, people gradually form their own voiceprint characteristics.
  • the preprocessing unit 1012 first extracts the voiceprint characteristics of the audio signal.
  • after acquiring the voiceprint feature of the audio signal, the preprocessing unit 1012 further compares the acquired voiceprint feature with a preset voiceprint feature to determine whether the voiceprint feature matches the preset voiceprint feature.
  • the preset voiceprint feature may be a voiceprint feature previously recorded by the owner; determining whether the acquired voiceprint feature matches the preset voiceprint feature therefore amounts to determining whether the speaker of the audio signal is the owner.
  • if they match, the pre-processing unit 1012 determines that the speaker of the audio signal is the owner, and then proceeds to pre-process the audio signal and extract the Mel frequency cepstrum coefficient.
  • the pre-processing unit 1012 is further configured to obtain the similarity between the aforementioned voiceprint feature and the preset voiceprint feature, determine whether the acquired similarity is greater than or equal to a first preset similarity, and, when the acquired similarity is greater than or equal to the first preset similarity, determine that the acquired voiceprint feature matches the preset voiceprint feature.
  • when determining whether the acquired voiceprint feature matches the preset voiceprint feature, the pre-processing unit 1012 may obtain the similarity between the voiceprint feature (that is, the voiceprint feature obtained from the aforementioned audio signal) and the preset voiceprint feature, and determine whether the obtained similarity is greater than or equal to the first preset similarity (set according to actual needs; for example, it can be set to 95%). If the acquired similarity is greater than or equal to the first preset similarity, it is determined that the acquired voiceprint feature matches the preset voiceprint feature; if the acquired similarity is less than the first preset similarity, it is determined that the acquired voiceprint feature does not match the preset voiceprint feature.
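A sketch of this first-threshold comparison, assuming cosine similarity as the metric (the patent does not fix a particular similarity measure) and the 95% example threshold:

```python
import math

FIRST_PRESET_SIMILARITY = 0.95   # example value from the text

def cosine_similarity(a, b):
    """Similarity between two voiceprint feature vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def matches_preset(voiceprint, preset, threshold=FIRST_PRESET_SIMILARITY):
    """True if the acquired voiceprint matches the preset voiceprint."""
    return cosine_similarity(voiceprint, preset) >= threshold

preset = [0.9, 0.1, 0.4]                         # hypothetical owner voiceprint
print(matches_preset([0.9, 0.1, 0.4], preset))   # identical feature -> True
print(matches_preset([0.1, 0.9, 0.2], preset))   # different speaker -> False
```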
  • if they do not match, the preprocessing unit 1012 determines that the speaker of the current audio signal is not the owner, and sends third feedback information to the micro control unit 1011.
  • after receiving the third feedback information from the pre-processing unit 1012, the micro control unit 1011 deletes the acquired audio signal and continues to acquire external audio signals; no audio signal is processed further until the owner's audio signal is obtained.
  • for how to perform the preprocessing and the extraction of the Mel frequency cepstrum coefficient, reference may be made to the relevant descriptions of the foregoing embodiments, and details are not repeated here.
  • the pre-processing unit 1012 is further configured to obtain current location information when the obtained similarity is less than the first preset similarity but greater than or equal to a second preset similarity, determine based on the location information whether the device is currently within a preset position range, and, when it is currently within the preset position range, determine that the aforementioned voiceprint feature matches the preset voiceprint feature.
  • because the characteristics of the voiceprint are closely related to the physiological characteristics of the human body, in daily life, if the user catches a cold, his or her voice will become hoarse and the voiceprint characteristics will change accordingly. In this case, even if the acquired audio signal is spoken by the owner, the pre-processing unit 1012 will not be able to recognize it. There are many other situations that can cause the pre-processing unit 1012 to fail to identify the owner, which are not repeated here.
  • after the preprocessing unit 1012 finishes judging the similarity of the voiceprint feature, if the similarity between the acquired voiceprint feature and the preset voiceprint feature is less than the first preset similarity, it further judges whether the similarity is greater than or equal to the second preset similarity (the second preset similarity is configured to be smaller than the first preset similarity and can be set by those skilled in the art according to actual needs; for example, when the first preset similarity is set to 95%, the second preset similarity may be set to 75%).
  • if the similarity is greater than or equal to the second preset similarity, the preprocessing unit 1012 further obtains the current location information.
  • the pre-processing unit 1012 may send a position acquisition request to the positioning module of the electronic device (different positioning technologies, such as satellite positioning or base station positioning, may be used to obtain the current position information) and instruct the positioning module to return the current position information.
  • the pre-processing unit 1012 determines whether it is currently within a preset position range according to the position information.
  • the preset position range can be configured as a common position range of the owner, such as home and company.
  • if the device is currently within the preset position range, the preprocessing unit 1012 determines that the acquired voiceprint feature matches the preset voiceprint feature, and recognizes the speaker of the audio signal as the owner.
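The two-threshold decision above can be condensed into a few lines. The 95% and 75% values are the examples from the text, and the location flag stands in for the positioning-module query, which is outside the scope of this sketch.

```python
FIRST_THRESHOLD = 0.95    # example first preset similarity
SECOND_THRESHOLD = 0.75   # example second preset similarity

def voiceprint_matches(similarity, in_preset_location):
    """Decide whether the acquired voiceprint matches the preset one."""
    if similarity >= FIRST_THRESHOLD:
        return True                        # confident match on voice alone
    if similarity >= SECOND_THRESHOLD and in_preset_location:
        return True                        # borderline voice, but a usual place
    return False                           # reject: delete signal, keep listening

print(voiceprint_matches(0.97, in_preset_location=False))  # -> True
print(voiceprint_matches(0.80, in_preset_location=True))   # -> True
print(voiceprint_matches(0.80, in_preset_location=False))  # -> False
```

The location check only widens the acceptance band; a similarity below the second threshold is rejected regardless of position.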
  • the central processing unit 102 is further configured to use the candidate keyword as the target keyword of the audio signal when the confidence of the candidate keyword reaches a preset confidence level, determine, according to the correspondence between preset keywords and preset operations, the preset operation corresponding to the target keyword as the target operation, and perform the target operation.
  • after extracting the identified "candidate keywords and the confidence of the candidate keywords" from the ASIC chip 101 according to the instruction information of the ASIC chip 101, the central processing unit 102 first determines whether the confidence of the candidate keywords reaches the preset confidence level (which can be set by a person skilled in the art according to actual needs; for example, it can be set to 90%).
  • if it does, the central processing unit 102 uses the candidate keyword as the target keyword of the audio signal.
  • the central processing unit 102 determines the preset operation corresponding to the target keyword as the target operation according to the correspondence between the preset keyword and the preset operation.
  • the correspondence between keywords and preset operations can be set according to actual needs. For example, the preset operation corresponding to the keyword "Little Europe, Little Europe" can be set to "wake the operating system", so that when the target keyword is "Little Europe, Little Europe" and the operating system is currently in a sleep state, the central processing unit 102 will wake up the operating system.
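The confidence check and keyword-to-operation lookup might be sketched as below. The 90% threshold and the wake-up operation are the examples given above, while the function and table names are assumptions for the sketch.

```python
PRESET_CONFIDENCE = 0.90   # example preset confidence level
PRESET_OPERATIONS = {
    # preset keyword -> preset operation
    "little europe, little europe": "wake_operating_system",
}

def target_operation(candidate_keyword, confidence):
    """Return the target operation, or None if the confidence is too low
    or no preset operation corresponds to the keyword."""
    if confidence < PRESET_CONFIDENCE:
        return None                             # not confident enough: ignore
    target_keyword = candidate_keyword.lower()  # candidate becomes the target
    return PRESET_OPERATIONS.get(target_keyword)

print(target_operation("Little Europe, Little Europe", 0.96))
print(target_operation("Little Europe, Little Europe", 0.40))  # None
```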
  • An embodiment of the present application provides a device control method applied to an electronic device, wherein the electronic device includes a central processing unit and an application-specific integrated circuit chip, and the power consumption of the application-specific integrated circuit chip is less than that of the central processing unit. The device control method includes:
  • the application-specific integrated circuit chip acquires an external audio signal
  • the application-specific integrated circuit chip recognizes the audio signal to obtain a recognition result
  • the application-specific integrated circuit chip sends identification completion indication information to the central processing unit;
  • the central processing unit extracts the recognition result from the application-specific integrated circuit chip according to the instruction information, and performs a target operation corresponding to the recognition result.
  • the application-specific integrated circuit chip includes a micro control unit, a pre-processing unit, and an algorithm unit.
  • the application-specific integrated circuit chip identifies the audio signal and obtains a recognition result, including:
  • the preprocessing unit extracts a Mel frequency cepstrum coefficient of the audio signal using a Mel frequency cepstrum coefficient algorithm according to the control of the micro control unit;
  • the algorithm unit uses a deep neural network algorithm to perform keyword recognition on the Mel frequency cepstrum coefficient, and obtains candidate keywords and the confidence level of the candidate keywords as the recognition result.
  • the performing a target operation corresponding to the recognition result includes:
  • the central processing unit uses the candidate keyword as a target keyword of the audio signal and, according to the correspondence between the preset keyword and the preset operation, determines the preset operation corresponding to the target keyword as the target operation, and performs the target operation.
  • the application-specific integrated circuit chip further includes a memory
  • the device control method further includes:
  • the memory stores the audio signal, the candidate keywords, the confidence level, and intermediate data generated by the preprocessing unit and the algorithm unit during execution.
  • the application-specific integrated circuit chip further includes a cache memory
  • the device control method further includes:
  • the cache memory caches data stored in the memory and data fetched from the memory.
  • before using the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient of the audio signal, the method further includes:
  • the pre-processing unit pre-processes the audio signal; after pre-processing the audio signal, the Mel frequency cepstrum coefficient algorithm is used to extract the Mel frequency cepstrum coefficient of the audio signal.
  • before using the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient of the audio signal, the method further includes:
  • the pre-processing unit extracts a voiceprint feature of the audio signal, determines whether the voiceprint feature matches a preset voiceprint feature, and pre-processes the audio signal when the voiceprint feature matches the preset voiceprint feature.
  • determining whether the voiceprint feature matches a preset voiceprint feature includes:
  • the preprocessing unit obtains a similarity between the voiceprint feature and the preset voiceprint feature, determines whether the similarity is greater than or equal to a first preset similarity, and when the similarity is greater than or equal to the first preset similarity, determines that the voiceprint feature matches the preset voiceprint feature.
  • the device control method provided in the embodiment of the present application further includes:
  • the preprocessing unit obtains current position information and determines, according to the position information, whether the electronic device is currently within a preset position range; when it is currently within the preset position range, it is determined that the voiceprint feature matches the preset voiceprint feature.
  • the device control method provided in the embodiment of the present application further includes:
  • the preprocessing unit instructs the micro-control unit to delete the audio signal.
  • an embodiment of the present application further provides a device control method.
  • the device control method is executed by an electronic device provided in the embodiment of the present application.
  • the electronic device includes an application-specific integrated circuit chip 101 and a central processing unit 102, and the power consumption of the application-specific integrated circuit chip 101 is smaller than the power consumption of the central processing unit 102. Please refer to FIG. 5.
  • the device control method includes:
  • the application specific integrated circuit chip 101 obtains an external audio signal.
  • the ASIC chip 101 in the embodiment of the present application is an ASIC designed for the purpose of audio recognition; compared with the general-purpose central processing unit 102, the ASIC chip has higher audio recognition efficiency and lower power consumption.
  • the ASIC chip 101 and the central processing unit 102 establish a data communication connection through a communication bus
  • the ASIC chip 101 can obtain external audio signals in many different ways. For example, when the electronic device is not externally connected to a microphone, the ASIC chip 101 may collect external sound through the built-in microphone of the electronic device to obtain an external audio signal; when a microphone is externally connected to the electronic device, the ASIC chip 101 may collect external sound through the external microphone of the electronic device to obtain an external audio signal.
  • when the ASIC chip 101 collects external audio signals through the microphone, if the microphone is an analog microphone, an analog audio signal will be collected.
  • in this case, the ASIC chip 101 needs to sample the analog audio signal and convert it into a digitized audio signal; for example, it can be sampled at a sampling frequency of 16 kHz.
  • if the microphone is a digital microphone, the ASIC chip 101 will directly collect the digitized audio signal through the digital microphone without conversion.
  • the application-specific integrated circuit chip 101 performs a recognition operation on the acquired audio signal to obtain a recognition result.
  • after obtaining an external audio signal, the application-specific integrated circuit chip 101 performs a recognition operation on the acquired audio signal according to a pre-configured recognition mode to obtain a recognition result.
  • for example, the recognition mode of the ASIC chip 101 may be configured as gender recognition.
  • when the ASIC chip 101 recognizes the acquired audio signal, it extracts feature information capable of characterizing gender from the audio signal and, according to the extracted feature information, recognizes the gender of the speaker of the audio signal, obtaining a recognition result of whether the speaker is male or female.
  • for another example, the recognition mode of the application-specific integrated circuit chip 101 may be configured to identify the environment type (a subway car scene, a bus carriage scene, an office scene, etc.).
  • when the application-specific integrated circuit chip 101 recognizes the acquired audio signal, it extracts feature information capable of characterizing the environment scene from the audio signal, identifies the current environment scene based on the extracted feature information, and obtains a recognition result describing the type of the current environment scene.
  • the application specific integrated circuit chip 101 sends instruction information indicating completion of the identification operation to the central processing unit 102.
  • after completing the recognition operation on the audio signal and obtaining the recognition result, the application-specific integrated circuit chip 101 sends instruction information indicating the completion of the recognition operation to the central processing unit 102.
  • the function of the instruction information is to inform the central processing unit 102 that the ASIC chip 101 has completed the recognition operation on the audio signal and that the recognition result can be extracted from the ASIC chip 101.
  • the foregoing indication information may be sent in the form of an interrupt signal.
  • the central processing unit 102 extracts the foregoing recognition result from the ASIC chip 101 according to the received instruction information, and performs a target operation corresponding to the foregoing recognition result.
  • the central processing unit 102 extracts, from the application-specific integrated circuit chip 101 according to the instruction information, the recognition result obtained by identifying the audio signal.
  • after extracting the recognition result of the audio signal, the central processing unit 102 further performs a target operation corresponding to the recognition result.
  • for example, when the application-specific integrated circuit chip 101 is configured for gender recognition, if the recognition result "the speaker is male" is extracted, the theme mode of the operating system is switched to a masculine theme mode; if the recognition result "the speaker is female" is extracted, the theme mode of the operating system is switched to a feminine theme mode.
  • for another example, when the application-specific integrated circuit chip 101 is configured for environment type recognition, if the recognition result "office scene" is extracted, the prompt mode of the operating system is switched to the silent mode, and if the recognition result "bus scene" is extracted, the prompt mode of the operating system is switched to a vibration-plus-ringing mode, and the like.
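  Illustratively, the dispatch described above amounts to a lookup from recognition result to operation. This is a minimal sketch; the result strings and mode names below are hypothetical stand-ins for whatever identifiers an implementation would actually use.

```python
# Hypothetical mapping from recognition results to operations, following the
# gender-theme and environment-prompt examples in the text.
THEME_MODES = {"male": "masculine_theme", "female": "feminine_theme"}
PROMPT_MODES = {"office_scene": "silent", "bus_scene": "vibrate_and_ring"}

def target_operation_for(result):
    """Return the operation for a recognition result, or None if none is configured."""
    return THEME_MODES.get(result) or PROMPT_MODES.get(result)
```

  Keeping the mapping in data rather than in branching code makes it easy to reconfigure which recognition results trigger which operations.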
  • in the embodiment of the present application, the low-power application-specific integrated circuit chip 101 first obtains an external audio signal, performs a recognition operation on the acquired audio signal to obtain a recognition result, and sends instruction information indicating that the recognition operation is completed to the central processing unit 102; the central processing unit 102 then extracts the recognition result from the ASIC chip 101 according to the instruction information and performs a target operation corresponding to the recognition result. The audio recognition task is thus offloaded from the central processing unit 102 to the application-specific integrated circuit chip 101 with lower power consumption, and the central processing unit 102 performs the corresponding processing according to the recognition result of the application-specific integrated circuit chip 101.
  • this manner, in which the ASIC cooperates with the central processing unit 102 to perform voice control on the electronic device, reduces the power consumption required by the electronic device to implement voice control.
  • the ASIC chip 101 includes a micro control unit 1011, a pre-processing unit 1012, and an algorithm unit 1013. Referring to FIG. 6, the step of the ASIC chip 101 performing a recognition operation on the acquired audio signal to obtain a recognition result includes:
  • the preprocessing unit 1012 uses the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient of the audio signal according to the control of the micro control unit 1011;
  • the algorithm unit 1013 uses the deep neural network algorithm to perform keyword recognition on the Mel frequency cepstrum coefficients according to the control of the micro control unit 1011, and obtains the candidate keywords and the confidence of the candidate keywords.
  • the micro control unit 1011 first obtains an external audio signal through a microphone. For example, when the electronic device is not externally connected with a microphone, the micro control unit 1011 can collect external sound through a built-in microphone (not shown in FIG. 2) of the electronic device to obtain an external audio signal. For example, when a microphone is externally connected to the electronic device, the micro control unit 1011 can collect external sound through the microphone externally connected to the electronic device to obtain an external audio signal.
  • when the micro control unit 1011 collects external audio signals through the microphone, if the microphone is an analog microphone, an analog audio signal will be collected.
  • in this case, the micro control unit 1011 needs to sample the analog audio signal and convert it into a digitized audio signal, for example at a sampling frequency of 16 kHz; in addition, if the microphone is a digital microphone, the micro control unit 1011 will directly collect the digitized audio signal through the digital microphone without conversion.
  • after obtaining an external audio signal, the micro control unit 1011 generates first control information, and sends the first control information to the pre-processing unit 1012.
  • after receiving the first control information from the micro control unit 1011, the pre-processing unit 1012 uses the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient of the audio signal according to the first control information; after extracting the Mel frequency cepstrum coefficient of the audio signal, the pre-processing unit 1012 sends first feedback information to the micro control unit 1011.
  • after receiving the first feedback information from the pre-processing unit 1012, the micro control unit 1011 determines that the pre-processing unit 1012 has extracted the Mel frequency cepstrum coefficient of the audio signal, and at this time generates second control information and sends it to the algorithm unit 1013.
  • after receiving the second control information from the micro control unit 1011, the algorithm unit 1013 uses the built-in deep neural network algorithm to perform keyword recognition on the aforementioned Mel frequency cepstrum coefficients (keyword recognition detects whether a predefined word appears in the speech corresponding to the audio signal) to obtain candidate keywords and the confidence of the candidate keywords. After the keyword recognition is completed and the candidate keywords and their confidence are obtained, the algorithm unit 1013 sends second feedback information to the micro control unit 1011.
  • after receiving the second feedback information from the algorithm unit 1013, the micro control unit 1011 determines that the algorithm unit 1013 has completed keyword recognition, and uses the candidate keywords identified by the algorithm unit 1013 and the confidence level of the candidate keywords as the recognition result of this recognition operation on the audio signal.
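  The control/feedback handshake described in the preceding steps can be sketched as a minimal simulation. The classes, the stand-in feature computation, and the ("wake_word", 0.93) result below are hypothetical placeholders, not the patent's actual hardware units or algorithms.

```python
# Sketch of the handshake between the micro control unit (MCU), the
# pre-processing unit, and the algorithm unit described above.
class PreprocessingUnit:
    def extract_mfcc(self, audio):
        # Placeholder for the Mel frequency cepstrum coefficient algorithm.
        return [sum(audio) / len(audio)]

class AlgorithmUnit:
    def recognize(self, mfcc):
        # Placeholder for the deep neural network keyword spotter:
        # returns a (candidate keyword, confidence) pair.
        return ("wake_word", 0.93)

class MicroControlUnit:
    def __init__(self):
        self.pre = PreprocessingUnit()
        self.alg = AlgorithmUnit()

    def run(self, audio):
        # First control information -> MFCC extraction -> first feedback.
        mfcc = self.pre.extract_mfcc(audio)
        # Second control information -> keyword recognition -> second feedback.
        keyword, confidence = self.alg.recognize(mfcc)
        # The (keyword, confidence) pair is the recognition result.
        return keyword, confidence
```

  The MCU only sequences the two specialized units; neither unit starts work until it receives the corresponding control information, mirroring the control/feedback exchange in the text.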
  • the ASIC chip 101 further includes a memory 1014.
  • the memory 1014 can be used to store the acquired audio signal, the identified candidate keywords, the confidence level, and intermediate data generated during the execution of the preprocessing unit 1012 and the algorithm unit 1013.
  • the micro control unit 1011 stores the audio signal obtained through the microphone in the memory 1014; the pre-processing unit 1012, according to the control of the micro control unit 1011, uses the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient of the audio signal stored in the memory 1014 and stores the extracted Mel frequency cepstrum coefficient in the memory 1014; the algorithm unit 1013, according to the control of the micro control unit 1011, uses the built-in deep neural network algorithm to perform keyword recognition on the Mel frequency cepstrum coefficient stored in the memory 1014 to obtain candidate keywords and the confidence of the candidate keywords, and stores the obtained candidate keywords and their confidence in the memory 1014.
  • the ASIC chip 101 further includes a cache memory 1015, which can be used to cache data stored in the memory 1014 and data retrieved from the memory 1014.
  • the cache memory 1015 has a smaller storage space than the memory 1014, but has a higher speed.
  • the cache memory 1015 can improve the processing efficiency of the preprocessing unit 1012 and the algorithm unit 1013.
  • for example, when the pre-processing unit 1012 extracts Mel frequency cepstrum coefficients from the audio signal, accessing data directly from the memory 1014 requires waiting for a certain period of time, whereas the cache memory 1015 can save a part of the data that the pre-processing unit 1012 has just used or recycled; if the pre-processing unit 1012 needs that part of the data again, it can be called directly from the cache memory 1015. This avoids repeated accesses to the memory and reduces the waiting time of the pre-processing unit 1012, thereby improving its processing efficiency.
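  The benefit of keeping recently used data close at hand can be illustrated with a minimal least-recently-used cache. This is a sketch of the general caching idea, not the chip's actual cache design.

```python
from collections import OrderedDict

# Illustrative least-recently-used (LRU) cache: recently used entries stay
# available without a slower round trip to main memory.
class TinyLRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None  # cache miss: caller must fetch from memory
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
```

  An LRU policy works well here because signal-processing stages tend to reuse the data they touched most recently.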
  • the central processing unit 102 executes a target operation corresponding to the foregoing recognition result, including:
  • the central processing unit 102 uses the candidate keywords as target keywords of the audio signal when the confidence of the candidate keywords reaches a preset confidence level;
  • the central processing unit 102 determines the preset operation corresponding to the target keyword as the target operation according to the correspondence between the preset keyword and the preset operation, and executes the target operation.
  • After extracting the recognized candidate keywords and their confidence level from the ASIC chip 101 according to the instruction information of the ASIC chip 101, the central processing unit 102 first determines whether the confidence level of the candidate keywords reaches a preset confidence level (which can be set by a person skilled in the art according to actual needs, for example, 90%).
  • If the preset confidence level is reached, the central processing unit 102 uses the candidate keywords as the target keywords of the audio signal.
  • The central processing unit 102 then determines the preset operation corresponding to the target keyword as the target operation according to the correspondence between preset keywords and preset operations.
  • The correspondence between keywords and preset operations can be set according to actual needs. For example, the preset operation corresponding to the keyword "Little Europe, Little Europe" can be set to "wake the operating system"; when the target keyword is "Little Europe, Little Europe" and the operating system is currently in a sleep state, the central processing unit 102 wakes the operating system.
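  The confidence check and keyword dispatch described above can be sketched as follows; the 90% threshold and the keyword-to-operation mapping follow the examples given in the text, and the operation name is a hypothetical identifier.

```python
# Sketch of the CPU-side decision: accept the candidate keyword only when its
# confidence reaches the preset confidence level, then look up the preset operation.
PRESET_CONFIDENCE = 0.90
PRESET_OPERATIONS = {"Little Europe, Little Europe": "wake_operating_system"}

def target_operation(candidate_keyword, confidence):
    if confidence < PRESET_CONFIDENCE:
        return None  # candidate rejected: confidence too low
    return PRESET_OPERATIONS.get(candidate_keyword)
```

  Rejecting low-confidence candidates before dispatch prevents spurious wake-ups from background speech.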
  • the method further includes:
  • the pre-processing unit 1012 pre-processes the audio signal
  • after the preprocessing unit 1012 finishes preprocessing the audio signal, it uses the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient of the audio signal.
  • after receiving the first control information from the micro control unit 1011, the pre-processing unit 1012 first performs pre-emphasis and windowing on the audio signal.
  • pre-emphasis increases the energy of the high-frequency part of the audio signal.
  • for speech signals, the energy in the low-frequency part is often higher than the energy in the high-frequency part: the spectrum energy is attenuated by about 20 dB for every tenfold increase in frequency, and the noise background of the circuit when the microphone collects audio signals further increases the energy of the low-frequency part.
  • as for windowing: audio signals are generally non-stationary and their statistical characteristics are not fixed, but within a relatively short period of time the signal can be considered stable; taking such a short segment for analysis is called windowing.
  • the window is described by three parameters: window length (in milliseconds), offset, and shape.
  • each windowed audio signal is called a frame.
  • the duration in milliseconds of each frame is called the frame length.
  • the distance between the left borders of two adjacent frames is called the frame shift.
  • a Hamming window, whose edges taper smoothly toward zero, may be used for the windowing processing.
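  A minimal sketch of the pre-emphasis and framing/windowing steps using NumPy. The 0.97 pre-emphasis coefficient and the 25 ms frame length / 10 ms frame shift are common defaults assumed here for illustration; the text itself only fixes the 16 kHz sampling rate.

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    # Boost high frequencies: y[n] = x[n] - alpha * x[n-1]
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_and_window(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    frame_len = int(sample_rate * frame_ms / 1000)    # samples per frame
    frame_shift = int(sample_rate * shift_ms / 1000)  # samples between frame starts
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    frames = np.stack([signal[i * frame_shift : i * frame_shift + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)  # Hamming window tapers the frame edges
```

  At 16 kHz, a 25 ms frame is 400 samples and a 10 ms shift is 160 samples, so adjacent frames overlap substantially, which keeps the short-time spectra smooth from frame to frame.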
  • after pre-emphasis and windowing, the pre-processing unit 1012 can use the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient of the audio signal.
  • the process of extracting the Mel frequency cepstrum coefficient by the pre-processing unit 1012 is roughly as follows: using the non-linear characteristics of human hearing, the frequency spectrum of the audio signal is converted into a non-linear spectrum based on the Mel frequency and then transformed to the cepstrum domain, which yields the Mel frequency cepstrum coefficient.
  • before the step of preprocessing the audio signal by the preprocessing unit 1012, the method further includes:
  • the preprocessing unit 1012 extracts the voiceprint features of the audio signal
  • the preprocessing unit 1012 determines whether the extracted voiceprint features match the preset voiceprint features
  • the pre-processing unit 1012 pre-processes the aforementioned audio signal when the extracted voiceprint features match the preset voiceprint features.
  • the voiceprint feature is mainly determined by two factors. The first is the size of the acoustic cavities, specifically including the throat, nasal cavity, and oral cavity; the shape, size, and position of these organs determine the tension of the vocal cords and the range of sound frequencies. Therefore, although different people may say the same thing, the frequency distribution of their sound differs, with some voices sounding low and resonant.
  • the second factor that determines the voiceprint characteristics is the manner in which the vocal organs are manipulated.
  • the vocal organs include the lips, teeth, tongue, soft palate, and diaphragm muscles, and their interaction produces clear speech; the way they cooperate is learned by people in their interactions with those around them. In the process of learning to speak, by imitating the speech of different people around them, people gradually form their own voiceprint characteristics.
  • the preprocessing unit 1012 first extracts the voiceprint characteristics of the audio signal.
  • after acquiring the voiceprint feature of the audio signal, the preprocessing unit 1012 further compares the acquired voiceprint feature with a preset voiceprint feature to determine whether the voiceprint feature matches the preset voiceprint feature.
  • the preset voiceprint feature may be a voiceprint feature previously recorded by the owner; determining whether the acquired voiceprint feature matches the preset voiceprint feature is thus determining whether the speaker of the audio signal is the owner.
  • when the voiceprint features match, the pre-processing unit 1012 determines that the speaker of the audio signal is the owner, and then further pre-processes the audio signal and extracts the Mel frequency cepstrum coefficient.
  • the step of the pre-processing unit 1012 determining whether the extracted voiceprint features match the preset voiceprint features includes:
  • the preprocessing unit 1012 obtains the similarity between the aforementioned voiceprint feature and the preset voiceprint feature
  • the preprocessing unit 1012 determines whether the obtained similarity is greater than or equal to the first preset similarity
  • when the obtained similarity is greater than or equal to the first preset similarity, the preprocessing unit 1012 determines that the obtained voiceprint feature matches the preset voiceprint feature.
  • when determining whether the acquired voiceprint feature matches the preset voiceprint feature, the pre-processing unit 1012 may obtain the similarity between the acquired voiceprint feature (that is, the voiceprint feature obtained from the aforementioned audio signal) and the preset voiceprint feature, and determine whether the obtained similarity is greater than or equal to the first preset similarity (set according to actual needs, for example, 95%). If the acquired similarity is greater than or equal to the first preset similarity, it is determined that the acquired voiceprint feature matches the preset voiceprint feature; if the acquired similarity is less than the first preset similarity, it is determined that the acquired voiceprint feature does not match the preset voiceprint feature.
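  The similarity comparison can be sketched as follows. Cosine similarity is assumed here as the metric, since the text does not specify how the similarity between voiceprint feature vectors is computed; the 95% threshold follows the example in the text.

```python
import numpy as np

FIRST_PRESET_SIMILARITY = 0.95  # example threshold from the text

def similarity(feature, preset_feature):
    # Cosine similarity between two voiceprint feature vectors (assumed metric).
    a = np.asarray(feature, dtype=float)
    b = np.asarray(preset_feature, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def voiceprint_matches(feature, preset_feature):
    return similarity(feature, preset_feature) >= FIRST_PRESET_SIMILARITY
```

  Any bounded similarity measure would fit the same structure; only the threshold calibration would change.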
  • when the voiceprint features do not match, the preprocessing unit 1012 determines that the speaker of the current audio signal is not the owner, and sends third feedback information to the micro control unit 1011.
  • after receiving the third feedback information from the pre-processing unit 1012, the micro control unit 1011 deletes the acquired audio signal and continues to acquire external audio signals; no audio signal is processed further until the owner's audio signal is obtained.
  • for how to perform the preprocessing and the extraction of the Mel frequency cepstrum coefficient, reference may be made to the relevant descriptions of the foregoing embodiments, and details are not described herein again.
  • the method further includes:
  • the pre-processing unit 1012 obtains the current position information when the aforementioned similarity is less than the first preset similarity and greater than or equal to the second preset similarity;
  • the pre-processing unit 1012 determines whether it is currently within a preset position range according to the obtained position information
  • because the characteristics of the voiceprint are closely related to the physiological characteristics of the human body, in daily life, if the user catches a cold, his voice will become hoarse and the voiceprint characteristics will change accordingly. In this case, even if the acquired audio signal is spoken by the owner, the pre-processing unit 1012 will not be able to recognize it. There are many other situations that can cause the pre-processing unit 1012 to fail to identify the owner, which will not be repeated here.
  • therefore, after the preprocessing unit 1012 finishes judging the similarity of the voiceprint feature, if the similarity between the acquired voiceprint feature and the preset voiceprint feature is less than the first preset similarity, it further judges whether the similarity is greater than or equal to the second preset similarity (the second preset similarity is configured to be smaller than the first preset similarity and can be appropriately set by those skilled in the art according to actual needs; for example, when the first preset similarity is set to 95%, the second preset similarity may be set to 75%).
  • if the similarity is greater than or equal to the second preset similarity, the preprocessing unit 1012 further obtains the current location information.
  • for example, the pre-processing unit 1012 may send a position acquisition request to the positioning module of the electronic device (different positioning technologies, such as satellite positioning or base station positioning, may be used to obtain the current position information) and instruct the positioning module to return the current position information.
  • the pre-processing unit 1012 determines whether it is currently within a preset position range according to the position information.
  • the preset position range can be configured as a common position range of the owner, such as home and company.
  • if it is currently within the preset position range, the preprocessing unit 1012 determines that the acquired voiceprint feature matches the preset voiceprint feature, and recognizes the speaker of the audio signal as the owner.
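  The two-tier decision described above (a strict similarity threshold, plus a relaxed threshold that only passes from a trusted location) can be sketched as follows. The thresholds follow the text's examples; the location names are hypothetical.

```python
# Sketch of the two-tier owner check: a high-similarity match passes directly,
# a mid-similarity match passes only when the device is in a preset position range.
FIRST_PRESET_SIMILARITY = 0.95
SECOND_PRESET_SIMILARITY = 0.75
PRESET_POSITION_RANGE = {"home", "company"}  # example trusted locations

def speaker_is_owner(similarity, current_position):
    if similarity >= FIRST_PRESET_SIMILARITY:
        return True
    if similarity >= SECOND_PRESET_SIMILARITY:
        return current_position in PRESET_POSITION_RANGE
    return False
```

  The location check acts as a second factor: it lets a hoarse-voiced owner through at home or at the company while still rejecting mid-similarity strangers elsewhere.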


Abstract

An electronic device and a device control method. The electronic device comprises a central processing unit and an application-specific integrated circuit chip. The method comprises: the application-specific integrated circuit chip acquires an external audio signal (101); the application-specific integrated circuit chip performs an identification operation on the acquired audio signal to obtain an identification result (102); the application-specific integrated circuit chip sends indication information indicating that the identification operation is completed to the central processing unit (103); the central processing unit extracts the identification result from the application-specific integrated circuit chip according to the received indication information, and performs a target operation corresponding to the identification result (104).

Description

Electronic device and device control method
This application claims priority to a Chinese patent application filed with the Chinese Patent Office on June 08, 2018, with application number 201810589643.2 and invention name "Electronic Device and Device Control Method", the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the technical field of electronic devices, and in particular to an electronic device and a device control method.
Background
At present, voice recognition technology is applied more and more widely in electronic devices. Using voice recognition technology, voice control of electronic devices can be achieved; for example, users can speak specific voice instructions to control an electronic device to take pictures or play music.
Summary of the invention
In a first aspect, an embodiment of the present application provides an electronic device, the electronic device including a central processing unit and an application-specific integrated circuit chip, the power consumption of the application-specific integrated circuit chip being less than the power consumption of the central processing unit, wherein:
the application-specific integrated circuit chip is configured to obtain an external audio signal;
the application-specific integrated circuit chip is further configured to perform a recognition operation on the audio signal to obtain a recognition result;
the application-specific integrated circuit chip is further configured to send instruction information indicating completion of the recognition operation to the central processing unit;
the central processing unit is configured to extract the recognition result from the application-specific integrated circuit chip according to the instruction information, and to execute a target operation corresponding to the recognition result.
In a second aspect, an embodiment of the present application provides a device control method applied to an electronic device, the electronic device including a central processing unit and an application-specific integrated circuit chip, the power consumption of the application-specific integrated circuit chip being less than the power consumption of the central processing unit. The device control method includes:
the application-specific integrated circuit chip acquires an external audio signal;
the application-specific integrated circuit chip recognizes the audio signal to obtain a recognition result;
the application-specific integrated circuit chip sends instruction information indicating completion of the recognition to the central processing unit;
the central processing unit extracts the recognition result from the application-specific integrated circuit chip according to the instruction information, and performs a target operation corresponding to the recognition result.
附图说明BRIEF DESCRIPTION OF THE DRAWINGS
为了更清楚地说明本申请实施例中的技术方案，下面将对实施例描述中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图仅仅是本申请的一些实施例，对于本领域技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他的附图。In order to explain the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
图1为本申请实施例提供的电子设备的第一结构示意图。FIG. 1 is a first schematic structural diagram of an electronic device according to an embodiment of the present application.
图2是本申请实施例提供的电子设备的第二结构示意图。FIG. 2 is a second schematic structural diagram of an electronic device according to an embodiment of the present application.
图3是本申请实施例提供的电子设备的第三结构示意图。FIG. 3 is a third schematic structural diagram of an electronic device according to an embodiment of the present application.
图4是本申请实施例提供的电子设备的第四结构示意图。FIG. 4 is a fourth schematic structural diagram of an electronic device according to an embodiment of the present application.
图5是本申请实施例提供的设备控制方法的流程示意图。FIG. 5 is a schematic flowchart of a device control method according to an embodiment of the present application.
图6是本申请实施例中专用集成电路芯片对音频信号进行识别的细化流程示意图。FIG. 6 is a detailed flowchart of identifying an audio signal by an application specific integrated circuit chip in the embodiment of the present application.
图7是本申请实施例中央处理器执行目标操作的细化流程示意图。FIG. 7 is a detailed flowchart of a target operation performed by a central processing unit according to an embodiment of the present application.
具体实施方式DETAILED DESCRIPTION
应当理解,在本文中提及“实施例”意味着,结合实施例描述的特定特征、结构或特性可以包含在本申请的至少一个实施例中。在说明书中的各个位置出现该短语并不一定均是指相同的实施例,也不是与其它实施例互斥的独立的或备选的实施例。本领域技术人员显式地和隐式地理解的是,本文所描述的实施例可以与其它实施例相结合。It should be understood that a reference to "an embodiment" herein means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of this phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are they independent or alternative embodiments that are mutually exclusive with other embodiments. It is clearly and implicitly understood by those skilled in the art that the embodiments described herein may be combined with other embodiments.
本申请实施例提供一种电子设备,请参照图1,电子设备100包括专用集成电路芯片101和中央处理器102,且专用集成电路芯片101的功耗小于中央处理器102的功耗,其中,An embodiment of the present application provides an electronic device. Referring to FIG. 1, the electronic device 100 includes an application-specific integrated circuit chip 101 and a central processing unit 102, and the power consumption of the application-specific integrated circuit chip 101 is less than the power consumption of the central processing unit 102.
专用集成电路芯片101用于获取外部的音频信号,对获取到的音频信号进行识别操作,得到识别结果,并发送指示识别操作完成的指示信息至中央处理器102。The ASIC chip 101 is used to obtain an external audio signal, perform a recognition operation on the acquired audio signal, obtain a recognition result, and send instruction information indicating completion of the recognition operation to the central processing unit 102.
需要说明的是，本申请实施例中的专用集成电路芯片101是以音频识别为目的而设计的专用集成电路，其相较于通用的中央处理器102，具有更高的音频识别效率以及更低的功耗。专用集成电路芯片101与中央处理器102通过通信总线建立数据通信连接。It should be noted that the application-specific integrated circuit chip 101 in the embodiment of the present application is an ASIC designed for audio recognition; compared with the general-purpose central processing unit 102, it offers higher audio recognition efficiency and lower power consumption. The application-specific integrated circuit chip 101 and the central processing unit 102 establish a data communication connection through a communication bus.
其中，专用集成电路芯片101可以通过多种不同方式来获取外部的音频信号，比如，在电子设备未外接麦克风时，专用集成电路芯片101可以通过电子设备内置的麦克风(图1未示出)对外部发音者发出的声音进行采集，得到外部的音频信号；又比如，在电子设备外接有麦克风时，专用集成电路芯片101可以通过电子设备外接的麦克风对外部声音进行采集，得到外部的音频信号。The application-specific integrated circuit chip 101 can obtain external audio signals in several different ways. For example, when no external microphone is connected to the electronic device, the chip 101 can collect the sound produced by an external speaker through the device's built-in microphone (not shown in FIG. 1) to obtain an external audio signal; as another example, when an external microphone is connected to the electronic device, the chip 101 can collect external sound through that external microphone to obtain the external audio signal.
其中，专用集成电路芯片101在通过麦克风采集外部的音频信号时，若麦克风为模拟麦克风，将采集到模拟的音频信号，专用集成电路芯片101需要对模拟的音频信号进行采样，将模拟的音频信号转换为数字化的音频信号，比如，可以以16KHz的采样频率进行采样；此外，若麦克风为数字麦克风，专用集成电路芯片101将通过数字麦克风直接采集到数字化的音频信号，无需进行转换。When the application-specific integrated circuit chip 101 collects an external audio signal through a microphone, if the microphone is an analog microphone, an analog audio signal is collected, and the chip 101 needs to sample the analog audio signal to convert it into a digitized audio signal, for example at a sampling rate of 16 kHz; if the microphone is a digital microphone, the chip 101 directly collects a digitized audio signal through the digital microphone, and no conversion is needed.
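Purely as an illustration (not part of the original disclosure), the analog-to-digital sampling step above can be sketched in Python; the 16 kHz rate comes from the text, while the test tone and function names are assumptions made for the example.

```python
import numpy as np

SAMPLE_RATE = 16000  # 16 kHz sampling rate, as mentioned in the embodiment

def sample_analog(analog_fn, duration_s, sample_rate=SAMPLE_RATE):
    """Digitize a continuous-time signal (given as a function of time in
    seconds) by evaluating it at uniformly spaced sample instants."""
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    return analog_fn(t)

# Example: a 440 Hz tone sampled for 10 ms yields 160 samples.
tone = sample_analog(lambda t: np.sin(2 * np.pi * 440 * t), 0.010)
```

A digital microphone would deliver an array like `tone` directly, which is why no conversion step is needed in that case.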
在获取到外部的音频信号之后,专用集成电路芯片101根据预先配置的识别模式,对获取到的音频信号进行识别操作,得到识别结果。After obtaining an external audio signal, the application specific integrated circuit chip 101 performs a recognition operation on the acquired audio signal according to a pre-configured recognition mode to obtain a recognition result.
比如,在专用集成电路芯片101的识别模式被配置为性别识别时,专用集成电路芯片101在对获取到的音频信号进行识别时,将从音频信号中提取出能够表征性别的特征信息,并根据提取出的特征信息,对音频信号的发音者的性别进行识别,得到该发音者为男、或为女的识别结果。For example, when the recognition mode of the ASIC chip 101 is configured as gender recognition, when the ASIC chip 101 recognizes the acquired audio signal, it extracts feature information capable of characterizing the gender from the audio signal, and according to The extracted feature information recognizes the gender of the speaker of the audio signal and obtains the recognition result of whether the speaker is male or female.
又比如，在专用集成电路芯片101的识别模式被配置为环境类型(地铁车厢场景、公交车厢场景、办公室场景等)识别时，专用集成电路芯片101在对获取到的音频信号进行识别时，将从音频信号中提取出能够表征环境场景的特征信息，并根据提取出的特征信息对当前所处的环境场景进行识别，得到用于描述当前环境场景类型的识别结果。As another example, when the recognition mode of the application-specific integrated circuit chip 101 is configured for environment-type recognition (subway carriage scene, bus carriage scene, office scene, etc.), the chip 101, when recognizing the acquired audio signal, extracts feature information that characterizes the environmental scene from the audio signal, recognizes the current environmental scene based on the extracted feature information, and obtains a recognition result describing the type of the current environmental scene.
在完成对音频信号的识别操作，并得到识别结果之后，专用集成电路芯片101发送指示识别操作完成的指示信息至中央处理器102，形象地说，该指示信息的作用在于告知中央处理器102，专用集成电路芯片101已经完成对音频信号的识别操作，可以从专用集成电路芯片101提取识别结果。其中，前述指示信息可以以中断信号的形式发送。After completing the recognition operation on the audio signal and obtaining the recognition result, the application-specific integrated circuit chip 101 sends instruction information indicating completion of the recognition operation to the central processing unit 102. In other words, the role of this instruction information is to inform the central processing unit 102 that the chip 101 has finished recognizing the audio signal, and that the recognition result can now be extracted from the chip 101. The foregoing instruction information may be sent in the form of an interrupt signal.
中央处理器102用于根据接收到的指示信息,从专用集成电路芯片101提取前述识别结果,并执行对应前述识别结果的目标操作。The central processing unit 102 is configured to extract the foregoing recognition result from the ASIC chip 101 according to the received instruction information, and execute a target operation corresponding to the foregoing recognition result.
相应的，中央处理器102在接收到来自专用集成电路芯片101的指示信息之后，根据该指示信息，从专用集成电路芯片101处提取专用集成电路芯片101对音频信号进行识别所得到的识别结果。Correspondingly, after receiving the instruction information from the application-specific integrated circuit chip 101, the central processing unit 102 extracts, from the chip 101 according to the instruction information, the recognition result obtained by the chip 101 from recognizing the audio signal.
在提取到音频信号的识别结果之后,中央处理器102进一步执行对应该识别结果的目标操作。After the recognition result of the audio signal is extracted, the central processing unit 102 further performs a target operation corresponding to the recognition result.
比如，在专用集成电路芯片101被配置为性别识别时，若提取到“发音者为男”的识别结果，则将操作系统的主题模式切换为男性化的主题模式，若提取到“发音者为女”的识别结果，则将操作系统的主题模式切换为女性化的主题模式。For example, when the application-specific integrated circuit chip 101 is configured for gender recognition, if the recognition result "the speaker is male" is extracted, the theme mode of the operating system is switched to a masculine theme mode; if the recognition result "the speaker is female" is extracted, the theme mode of the operating system is switched to a feminine theme mode.
又比如，在专用集成电路芯片101被配置为环境类型识别时，若提取到“办公室场景”的识别结果，则将操作系统的提示模式切换为静音模式，若提取到“公交车厢场景”的识别结果，则将操作系统的提示模式切换为振动+响铃模式等等。As another example, when the application-specific integrated circuit chip 101 is configured for environment-type recognition, if the recognition result "office scene" is extracted, the prompt mode of the operating system is switched to silent mode; if the recognition result "bus carriage scene" is extracted, the prompt mode of the operating system is switched to a vibration-plus-ring mode, and so on.
由上可知，本申请实施例的电子设备包括中央处理器102和专用集成电路芯片101，首先由功耗较低的专用集成电路芯片101获取外部的音频信号，对获取到的音频信号进行识别操作，得到识别结果，并发送指示识别操作完成的指示信息至中央处理器102，再由中央处理器102根据指示信息，从专用集成电路芯片101提取识别结果，并执行对应识别结果的目标操作。由此，将中央处理器102的音频识别任务分担至功耗较低的专用集成电路芯片101完成，并由中央处理器102根据专用集成电路芯片101的识别结果执行对应的目标操作，通过这种专用集成电路协同中央处理器102进行对电子设备语音控制的方式，能够降低电子设备实现语音控制的功耗。It can be seen from the above that the electronic device of the embodiment of the present application includes a central processing unit 102 and an application-specific integrated circuit chip 101. First, the lower-power application-specific integrated circuit chip 101 obtains an external audio signal, performs a recognition operation on it to obtain a recognition result, and sends instruction information indicating completion of the recognition operation to the central processing unit 102; the central processing unit 102 then extracts the recognition result from the chip 101 according to the instruction information and performs the target operation corresponding to the recognition result. In this way, the audio recognition task of the central processing unit 102 is offloaded to the lower-power application-specific integrated circuit chip 101, and the central processing unit 102 performs the corresponding target operation according to the chip's recognition result. This manner of voice control, in which the ASIC cooperates with the central processing unit 102, can reduce the power consumption of the electronic device in implementing voice control.
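The division of labor described above (ASIC recognizes, notifies, and holds the result; CPU extracts it and acts) can be sketched as follows. This is a minimal software stand-in, not the patented hardware; the class names, the "RECOGNITION_DONE" indication string, and the action mapping are all assumptions made for illustration.

```python
class AsicChip:
    """Stand-in for the low-power ASIC: runs recognition, stores the
    result, and sends an indication (like an interrupt) to the CPU."""
    def __init__(self, recognizer):
        self._recognizer = recognizer
        self._result = None

    def process(self, audio, notify_cpu):
        self._result = self._recognizer(audio)
        notify_cpu("RECOGNITION_DONE")  # the instruction information

    def extract_result(self):
        return self._result


class Cpu:
    """On receiving the indication, extracts the result from the ASIC
    and runs the target operation mapped to that result."""
    def __init__(self, asic, actions):
        self._asic = asic
        self._actions = actions

    def on_indication(self, indication):
        if indication == "RECOGNITION_DONE":
            result = self._asic.extract_result()
            action = self._actions.get(result)
            return action() if action else None


asic = AsicChip(lambda audio: "take_photo")
cpu = Cpu(asic, {"take_photo": lambda: "camera opened"})
outcome = []
asic.process(b"raw audio bytes", lambda ind: outcome.append(cpu.on_indication(ind)))
```

The key design point mirrored here is that the CPU never runs the recognizer itself; it only reacts to the completion indication.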
在一实施方式中,请参照图2,专用集成电路芯片101包括微控制单元1011、预处理单元1012以及算法单元1013,其中,In an embodiment, please refer to FIG. 2. The ASIC chip 101 includes a micro control unit 1011, a pre-processing unit 1012, and an algorithm unit 1013.
预处理单元1012用于根据微控制单元1011的控制,使用梅尔频率倒谱系数算法提取音频信号的梅尔频率倒谱系数;The pre-processing unit 1012 is configured to extract the Mel frequency cepstrum coefficient of the audio signal using the Mel frequency cepstrum coefficient algorithm according to the control of the micro control unit 1011;
算法单元1013用于根据微控制单元1011的控制,使用深度神经网络算法对梅尔频率倒谱系数进行关键词识别,得到候选关键词以及候选关键词的置信度。The algorithm unit 1013 is configured to perform keyword recognition on the Mel frequency cepstrum coefficient using a deep neural network algorithm according to the control of the micro control unit 1011 to obtain candidate keywords and the confidence of the candidate keywords.
其中,微控制单元1011首先通过麦克风获取到外部的音频信号,比如,在电子设备未外接麦克风时,微控制单元1011可以通过电子设备内置的麦克风(图2未示出)对外部声音进行采集,得到外部的音频信号;又比如,在电子设备外接有麦克风时,微控制单元1011可以通过电子设备外接的麦克风对外部声音进行采集,得到外部的音频信号。The micro control unit 1011 first obtains external audio signals through a microphone. For example, when the electronic device is not externally connected with a microphone, the micro control unit 1011 can collect external sounds through a built-in microphone (not shown in FIG. 2) of the electronic device. An external audio signal is obtained. For another example, when a microphone is externally connected to the electronic device, the micro control unit 1011 can collect external sound through the microphone externally connected to the electronic device to obtain an external audio signal.
其中，微控制单元1011在通过麦克风采集外部的音频信号时，若麦克风为模拟麦克风，将采集到模拟的音频信号，微控制单元1011需要对模拟的音频信号进行采样，将模拟的音频信号转换为数字化的音频信号，比如，可以以16KHz的采样频率进行采样；此外，若麦克风为数字麦克风，微控制单元1011将通过数字麦克风直接采集到数字化的音频信号，无需进行转换。When the micro control unit 1011 collects an external audio signal through a microphone, if the microphone is an analog microphone, an analog audio signal is collected, and the micro control unit 1011 needs to sample the analog audio signal to convert it into a digitized audio signal, for example at a sampling rate of 16 kHz; if the microphone is a digital microphone, the micro control unit 1011 directly collects a digitized audio signal through the digital microphone, and no conversion is needed.
在获取到外部的音频信号之后,微控制单元1011生成第一控制信息,将该第一控制信息发送至预处理单元1012。After obtaining an external audio signal, the micro control unit 1011 generates first control information, and sends the first control information to the pre-processing unit 1012.
预处理单元1012在接收到来自微控制单元1011的第一控制信息之后,根据该第一控制信息,使用梅尔频率倒谱系数算法提取音频信号的梅尔频率倒谱系数。在提取到音频信号的梅尔频率倒谱系数之后,预处理单元1012发送第一反馈信息至微控制单元1011。After receiving the first control information from the micro control unit 1011, the pre-processing unit 1012 uses the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient of the audio signal according to the first control information. After extracting the Mel frequency cepstrum coefficient of the audio signal, the pre-processing unit 1012 sends the first feedback information to the micro control unit 1011.
微控制单元1011在接收到来自预处理单元1012的第一反馈信息之后，确定预处理单元1012当前已经提取到音频信号的梅尔频率倒谱系数，此时生成第二控制信息，并将该第二控制信息发送至算法单元1013。After receiving the first feedback information from the pre-processing unit 1012, the micro control unit 1011 determines that the pre-processing unit 1012 has extracted the Mel-frequency cepstral coefficients of the audio signal; at this point it generates second control information and sends the second control information to the algorithm unit 1013.
算法单元1013在接收到来自微控制单元1011的第二控制信息之后，使用内置的深度神经网络算法，对前述梅尔频率倒谱系数进行关键词识别(关键词识别也即是检测音频信号对应的语音中是否出现预先定义的单词)，得到候选关键词以及候选关键词的置信度。在完成关键词识别并识别得到候选关键词以及候选关键词的置信度之后，算法单元1013发送第二反馈信息至微控制单元1011。After receiving the second control information from the micro control unit 1011, the algorithm unit 1013 uses a built-in deep neural network algorithm to perform keyword recognition on the aforementioned Mel-frequency cepstral coefficients (keyword recognition means detecting whether a predefined word appears in the speech corresponding to the audio signal), obtaining candidate keywords and their confidence levels. After the keyword recognition is completed and the candidate keywords and their confidence levels are obtained, the algorithm unit 1013 sends second feedback information to the micro control unit 1011.
微控制单元1011在接收到来自算法单元1013的第二反馈信息之后，确定算法单元1013已经完成关键词识别，将算法单元1013识别得到的候选关键词以及候选关键词的置信度作为此次对音频信号进行识别操作的识别结果。After receiving the second feedback information from the algorithm unit 1013, the micro control unit 1011 determines that the algorithm unit 1013 has completed keyword recognition, and takes the candidate keywords recognized by the algorithm unit 1013 and their confidence levels as the recognition result of this recognition operation on the audio signal.
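The two-stage handshake above (control information out, feedback back, for the pre-processing unit and then the algorithm unit) can be condensed into a short sketch. This is an illustrative simplification, not the original hardware protocol; `extract_mfcc` and `spot_keywords` are hypothetical stand-ins for the two units.

```python
def run_recognition(audio, extract_mfcc, spot_keywords):
    """Minimal sketch of the micro control unit's orchestration:
    stage 1 (first control info -> first feedback) extracts MFCCs,
    stage 2 (second control info -> second feedback) spots keywords."""
    mfcc = extract_mfcc(audio)                  # pre-processing unit's job
    keyword, confidence = spot_keywords(mfcc)   # algorithm unit's job
    return {"keyword": keyword, "confidence": confidence}

# Toy stand-ins for the two units, just to exercise the flow.
result = run_recognition(
    [0.1, 0.2, 0.3],
    extract_mfcc=lambda audio: [sum(audio)],
    spot_keywords=lambda mfcc: ("hello", 0.93),
)
```

The returned dictionary plays the role of the recognition result that the micro control unit later exposes to the CPU.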
在一实施方式中，请参照图3，专用集成电路芯片101还包括内存1014，用于存储获取到的音频信号、识别出的候选关键词、置信度以及预处理单元1012和算法单元1013在执行过程中产生的中间数据。In an embodiment, referring to FIG. 3, the application-specific integrated circuit chip 101 further includes a memory 1014 for storing the acquired audio signal, the recognized candidate keywords and their confidence levels, and the intermediate data generated by the pre-processing unit 1012 and the algorithm unit 1013 during execution.
比如，微控制单元1011将通过麦克风获取到的音频信号存储在内存1014中；预处理单元1012根据微控制单元1011的控制，使用梅尔频率倒谱系数算法提取内存1014中存储的音频信号的梅尔频率倒谱系数，并将提取出的梅尔频率倒谱系数存储在内存1014中；算法单元1013根据微控制单元1011的控制，使用内置的深度神经网络算法，对内存1014中存储的梅尔频率倒谱系数进行关键词识别，得到候选关键词以及候选关键词的置信度，将得到候选关键词以及候选关键词的置信度存储在内存1014中。For example, the micro control unit 1011 stores the audio signal obtained through the microphone in the memory 1014; under the control of the micro control unit 1011, the pre-processing unit 1012 uses the Mel-frequency cepstral coefficient algorithm to extract the Mel-frequency cepstral coefficients of the audio signal stored in the memory 1014 and stores the extracted coefficients in the memory 1014; under the control of the micro control unit 1011, the algorithm unit 1013 uses the built-in deep neural network algorithm to perform keyword recognition on the Mel-frequency cepstral coefficients stored in the memory 1014, obtains candidate keywords and their confidence levels, and stores them in the memory 1014.
在一实施方式中，请参照图4，专用集成电路芯片101还包括高速缓冲存储器1015，用于对存入内存1014的数据、从内存1014中取出的数据进行缓存。In an embodiment, referring to FIG. 4, the application-specific integrated circuit chip 101 further includes a cache memory 1015 for caching data written to and data read from the memory 1014.
其中,高速缓冲存储器1015相较于内存1014其存储空间较小,但速度更高,通过高速缓冲存储器1015可以提升预处理单元1012以及算法单元1013的处理效率。Among them, the cache memory 1015 has a smaller storage space than the memory 1014, but has a higher speed. The cache memory 1015 can improve the processing efficiency of the preprocessing unit 1012 and the algorithm unit 1013.
比如，预处理单元1012在对音频信号进行梅尔频率倒谱系数的提取时，当预处理单元1012直接从内存1014中存取数据时要等待一定时间周期，而高速缓冲存储器1015则可以保存预处理单元1012刚用过或循环使用的一部分数据，如果预处理单元1012需要再次使用该部分数据时可从高速缓冲存储器1015中直接调用，这样就避免了重复存取数据，减少了预处理单元1012的等待时间，从而提升了其处理效率。For example, when the pre-processing unit 1012 extracts Mel-frequency cepstral coefficients from the audio signal, accessing data directly from the memory 1014 requires waiting a certain period of time, whereas the cache memory 1015 can hold the portion of data that the pre-processing unit 1012 has just used or uses repeatedly; if the pre-processing unit 1012 needs that data again, it can be fetched directly from the cache memory 1015. This avoids repeated memory accesses and reduces the waiting time of the pre-processing unit 1012, thereby improving its processing efficiency.
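The caching benefit described above (reusing recently accessed data instead of going back to slow memory) can be illustrated in software with a memoizing cache. This is only an analogy to the hardware cache 1015; the frame-fetch function and call counter are invented for the example.

```python
from functools import lru_cache

calls = {"n": 0}  # counts how many times slow memory is actually touched

@lru_cache(maxsize=128)
def fetch_frame(index):
    """Stand-in for a slow read from main memory; the cache serves
    repeated requests for the same frame without re-reading memory."""
    calls["n"] += 1
    return index * 2  # placeholder "frame data"

fetch_frame(3)
fetch_frame(3)
fetch_frame(3)  # the second and third calls hit the cache
```

After three requests for frame 3, the slow path has run only once, mirroring how the cache spares the pre-processing unit repeated waits on memory.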
在一实施方式中，预处理单元1012在使用梅尔频率倒谱系数算法提取音频信号的梅尔频率倒谱系数之前，还对音频信号进行预处理，在完成对音频信号的预处理之后，使用梅尔频率倒谱系数算法提取音频信号的梅尔频率倒谱系数。In an embodiment, before extracting the Mel-frequency cepstral coefficients of the audio signal using the Mel-frequency cepstral coefficient algorithm, the pre-processing unit 1012 also pre-processes the audio signal; after the pre-processing is completed, it extracts the Mel-frequency cepstral coefficients of the audio signal using the Mel-frequency cepstral coefficient algorithm.
其中，预处理单元1012在接收到来自微控制单元1011的第一控制信息之后，首先对音频信号进行预加重和加窗等预处理。After receiving the first control information from the micro control unit 1011, the pre-processing unit 1012 first performs pre-processing such as pre-emphasis and windowing on the audio signal.
其中，预加重也即是增加音频信号高频部分的能量。对于音频信号的频谱来说，往往低频部分的能量高于高频部分的能量，每经过10倍Hz，频谱能量就会衰减20dB，而且由于麦克风在采集音频信号时电路本底噪声的影响，也会增加低频部分的能量，为使高频部分的能量和低频部分能量有相似的幅度，需要预加强采集到音频信号的高频能量。Pre-emphasis means boosting the energy of the high-frequency part of the audio signal. In the spectrum of an audio signal, the energy of the low-frequency part is often higher than that of the high-frequency part: for every tenfold increase in frequency (Hz), the spectral energy decays by 20 dB. Moreover, the circuit noise floor of the microphone during acquisition further increases the energy of the low-frequency part. To give the high-frequency part an amplitude similar to that of the low-frequency part, the high-frequency energy of the collected audio signal needs to be pre-emphasized.
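As an illustration (not taken from the original disclosure), pre-emphasis is commonly implemented as a first-order high-pass filter. The coefficient 0.97 below is a conventional choice assumed for the sketch, not a value given in the text.

```python
import numpy as np

def pre_emphasize(signal, coeff=0.97):
    """First-order high-pass: y[n] = x[n] - coeff * x[n-1].
    High frequencies pass through; slowly varying (low-frequency)
    content is strongly attenuated."""
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])

# A constant (pure low-frequency) signal is damped almost to zero.
flat = pre_emphasize([1.0, 1.0, 1.0])
```

After the first sample, the constant signal is reduced to 1 − 0.97 = 0.03, showing the low-frequency attenuation that balances the spectrum.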
由于音频信号一般是非平稳信号，其统计特性不是固定不变的，但在一段相当短的时间内，可以认为信号是平稳的，这就是加窗的依据。窗由三个参数来描述：窗长(单位毫秒)、偏移和形状。每一个加窗的音频信号叫做一帧，每一帧的毫秒数叫做帧长，相邻两帧左边界的距离叫帧移。本申请实施例中，可以使用边缘平滑降到0的汉明窗进行加窗处理。Since an audio signal is generally non-stationary, its statistical characteristics are not fixed; however, over a sufficiently short period of time the signal can be considered stationary, which is the basis for windowing. A window is described by three parameters: window length (in milliseconds), offset, and shape. Each windowed segment of the audio signal is called a frame, the duration of each frame in milliseconds is called the frame length, and the distance between the left boundaries of two adjacent frames is called the frame shift. In the embodiment of the present application, a Hamming window whose edges smoothly decay to zero may be used for windowing.
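The framing and Hamming-window step above can be sketched as follows; this is illustrative only. The 25 ms frame length and 10 ms frame shift are typical values assumed for the example, not values stated in the text.

```python
import numpy as np

def frame_and_window(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Slice the signal into overlapping frames (frame length and frame
    shift given in milliseconds) and apply a Hamming window to each,
    so every frame decays smoothly to zero at its edges."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)       # 160 samples at 16 kHz
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift)
    window = np.hamming(frame_len)
    return np.stack([signal[i * shift : i * shift + frame_len] * window
                     for i in range(n_frames)])

frames = frame_and_window(np.ones(16000))  # one second of audio
```

One second of 16 kHz audio yields 98 overlapping frames of 400 samples each, each already tapered by the window.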
在完成对音频信号的预处理之后,预处理单元1012即可使用梅尔频率倒谱系数算法提取音频信号的梅尔频率倒谱系数。其中,预处理单元1012提取梅尔频率倒谱系数的过程大致为:利用人耳听觉的非线性特性,将音频信号的频谱转换为基于梅尔频率的非线性频谱,再转换到倒谱域,由此得到梅尔频率倒谱系数。After pre-processing the audio signal, the pre-processing unit 1012 can use the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient of the audio signal. The process of extracting the Mel frequency cepstrum coefficient by the pre-processing unit 1012 is roughly: using the non-linear characteristics of human hearing to convert the frequency spectrum of the audio signal into a non-linear spectrum based on the Mel frequency, and then converting it to the cepstrum domain. This results in the Mel frequency cepstrum coefficient.
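A minimal MFCC computation for a single windowed frame can be sketched as below. This is a textbook-style simplification, not the chip's actual algorithm; the filter count (26), coefficient count (13), and FFT size (512) are common defaults assumed here, and the mel conversion formulas are the standard ones.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sample_rate=16000, n_filters=26, n_coeffs=13, n_fft=512):
    """Minimal MFCC for one windowed frame:
    power spectrum -> triangular mel filter bank -> log -> DCT-II."""
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2 / n_fft
    # Triangular mel filters spaced evenly on the mel scale, 0 Hz..Nyquist.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    log_energy = np.log(fbank @ power + 1e-10)
    # DCT-II of the log filter-bank energies, keeping the first n_coeffs.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), (2 * n + 1) / (2 * n_filters)))
    return dct @ log_energy

# A windowed 1 kHz tone frame (400 samples at 16 kHz) as test input.
coeffs = mfcc(np.hamming(400) * np.sin(2 * np.pi * 1000 * np.arange(400) / 16000))
```

The final DCT moves the log mel spectrum into the cepstral domain, matching the "convert to the cepstrum domain" step in the text.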
在一实施方式中，预处理单元1012还用于在对音频信号进行预处理之前，提取音频信号的声纹特征，判断该声纹特征是否与预设声纹特征匹配，并在该声纹特征与预设声纹特征匹配时，对音频信号进行预处理。In an embodiment, before pre-processing the audio signal, the pre-processing unit 1012 is further configured to extract a voiceprint feature of the audio signal, determine whether the voiceprint feature matches a preset voiceprint feature, and pre-process the audio signal when the voiceprint feature matches the preset voiceprint feature.
需要说明的是，在实际生活中，每个人说话时的声音都有自己的特点，熟悉的人之间，可以只听声音而相互辨别出来。这种声音的特点就是声纹特征，声纹特征主要由两个因素决定，第一个是声腔的尺寸，具体包括咽喉、鼻腔和口腔等，这些器官的形状、尺寸和位置决定了声带张力的大小和声音频率的范围。因此不同的人虽然说同样的话，但是声音的频率分布是不同的，听起来有的低沉有的洪亮。It should be noted that in everyday life, everyone's voice has its own characteristics; people who are familiar with each other can recognize one another by voice alone. These vocal characteristics are voiceprint features, which are determined mainly by two factors. The first is the size of the vocal cavity, including the throat, nasal cavity, and oral cavity; the shape, size, and position of these organs determine the tension of the vocal cords and the range of sound frequencies. Therefore, even when different people say the same thing, the frequency distributions of their voices differ: some sound deep, others resonant.
第二个决定声纹特征的因素是发声器官被操纵的方式，发声器官包括唇、齿、舌、软腭及腭肌肉等，他们之间相互作用就会产生清晰的语音。而他们之间的协作方式是人通过后天与周围人的交流中随机学习到的。人在学习说话的过程中，通过模拟周围不同人的说话方式，就会逐渐形成自己的声纹特征。The second factor determining voiceprint features is the way the articulators are manipulated. The articulators include the lips, teeth, tongue, soft palate, and palatal muscles, and their interaction produces clear speech. The way they cooperate is learned incidentally through interaction with the people around us: in the process of learning to speak, by imitating the speech of different people around them, people gradually form their own voiceprint features.
其中,预处理单元1012在接收到来自微控制单元1011的第一控制信息之后,首先提取音频信号的声纹特征。Wherein, after receiving the first control information from the micro control unit 1011, the preprocessing unit 1012 first extracts the voiceprint characteristics of the audio signal.
在获取到语音信息的声纹特征之后，预处理单元1012进一步将获取到的该声纹特征与预设声纹特征进行比对，以判断该声纹特征是否与预设声纹特征匹配。其中，预设声纹特征可以为机主预先录入的声纹特征，判断获取的音频信号的声纹特征是否与预设声纹特征匹配，也即是判断音频信号的发音者是否为机主。After obtaining the voiceprint feature of the voice information, the pre-processing unit 1012 further compares the obtained voiceprint feature with a preset voiceprint feature to determine whether they match. The preset voiceprint feature may be a voiceprint feature recorded in advance by the device owner; determining whether the voiceprint feature of the acquired audio signal matches the preset voiceprint feature thus amounts to determining whether the speaker of the audio signal is the owner.
在获取到的声纹特征与预设声纹特征匹配时,预处理单元1012确定音频信号的发音者为机主,此时进一步对音频信号进行预处理,并提取出梅尔频率倒谱系数,具体可参照以上相关描述,此处不再赘述。When the acquired voiceprint features match the preset voiceprint features, the pre-processing unit 1012 determines the speaker of the audio signal as the owner, and then further pre-processes the audio signal and extracts the Mel frequency cepstrum coefficient. For details, refer to the related descriptions above, and details are not described herein again.
在一实施方式中，预处理单元1012还用于获取前述声纹特征和预设声纹特征的相似度，判断获取到的相似度是否大于或等于第一预设相似度，并在获取到的相似度大于或等于第一预设相似度时，确定获取到的声纹特征与预设声纹特征匹配。In an embodiment, the pre-processing unit 1012 is further configured to obtain the similarity between the aforementioned voiceprint feature and the preset voiceprint feature, determine whether the obtained similarity is greater than or equal to a first preset similarity, and, when it is, determine that the obtained voiceprint feature matches the preset voiceprint feature.
其中，预处理单元1012在判断获取到的声纹特征是否与预设声纹特征匹配时，可以获取该声纹特征(即从前述音频信号所获取到的声纹特征)与预设声纹特征的相似度，并判断获取到的相似度是否大于或等于第一预设相似度(根据实际需要进行设置，比如，可以设置为95%)。若获取到的相似度大于或等于第一预设相似度，则确定获取到的声纹特征与预设声纹特征匹配；若获取到的相似度小于第一预设相似度，则确定获取到的声纹特征与预设声纹特征不匹配。When determining whether the obtained voiceprint feature matches the preset voiceprint feature, the pre-processing unit 1012 may obtain the similarity between this voiceprint feature (that is, the voiceprint feature extracted from the aforementioned audio signal) and the preset voiceprint feature, and determine whether the obtained similarity is greater than or equal to the first preset similarity (set according to actual needs; for example, it may be set to 95%). If the obtained similarity is greater than or equal to the first preset similarity, it is determined that the obtained voiceprint feature matches the preset voiceprint feature; if the obtained similarity is less than the first preset similarity, it is determined that the obtained voiceprint feature does not match the preset voiceprint feature.
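As an illustration of the threshold check above: the text does not specify how the similarity between two voiceprint features is computed, so cosine similarity is assumed here purely for the sketch, with the threshold playing the role of the first preset similarity (95% in the example).

```python
import numpy as np

def voiceprint_matches(feature, preset_feature, threshold=0.95):
    """Compare a voiceprint feature vector against the enrolled one.
    Cosine similarity is an illustrative choice (the patent does not
    fix a measure); match means similarity >= the first preset
    similarity. Returns (matched, similarity)."""
    a = np.asarray(feature, dtype=float)
    b = np.asarray(preset_feature, dtype=float)
    sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sim >= threshold, sim

match, sim = voiceprint_matches([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

An identical feature vector yields similarity 1.0 and a match; an orthogonal one yields similarity 0.0 and no match.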
此外,在获取到的声纹特征与预设声纹特征不匹配时,预处理单元1012确定当前音频信号的发音者不为机主,发送第三反馈信息至微控制单元1011。In addition, when the acquired voiceprint features do not match the preset voiceprint features, the preprocessing unit 1012 determines that the speaker of the current audio signal is not the owner, and sends third feedback information to the micro control unit 1011.
微控制单元1011在接收到来自预处理单元1012的第三反馈信息之后，删除获取到的音频信号，并继续获取外部的音频信号，直至获取到机主的音频信号时，才对该音频信号进行预处理以及梅尔频率倒谱系数的提取，其中，对于如何进行预处理以及梅尔频率倒谱系数的提取，可参照以上实施例的相关描述，此处不再赘述。After receiving the third feedback information from the pre-processing unit 1012, the micro control unit 1011 deletes the acquired audio signal and continues to acquire external audio signals; only when an audio signal from the owner is acquired are the pre-processing and the extraction of Mel-frequency cepstral coefficients performed on that audio signal. For how the pre-processing and the extraction of Mel-frequency cepstral coefficients are performed, reference may be made to the relevant descriptions of the above embodiments, which are not repeated here.
由此，通过这种基于声纹特征对发音者进行身份认证的方式，仅对机主发出的音频信号进行响应，能够避免执行非机主意愿的操作，可以提升机主的使用体验。In this way, by authenticating the speaker's identity based on voiceprint features and responding only to audio signals from the owner, operations not intended by the owner can be avoided, which improves the owner's experience.
在一实施方式中，预处理单元1012还用于在获取到的相似度小于第一预设相似度且大于或等于第二预设相似度时，获取当前的位置信息，根据该位置信息判断当前是否位于预设位置范围内，并在当前位于预设位置范围内时，确定前述声纹特征与预设声纹特征匹配。In an embodiment, the pre-processing unit 1012 is further configured to, when the obtained similarity is less than the first preset similarity but greater than or equal to a second preset similarity, obtain current location information, determine from the location information whether the device is currently within a preset location range, and, when it is, determine that the aforementioned voiceprint feature matches the preset voiceprint feature.
需要说明的是，由于声纹特征和人体的生理特征密切相关，在日常生活中，如果用户感冒发炎的话，其声音将变得沙哑，声纹特征也将随之发生变化。在这种情况下，即使获取到的音频信号由机主说出，预处理单元1012也将无法识别出。此外，还存在多种导致预处理单元1012无法识别出机主的情况，此处不再赘述。It should be noted that voiceprint features are closely related to the physiological characteristics of the human body. In daily life, if a user catches a cold and their throat becomes inflamed, their voice becomes hoarse and their voiceprint features change accordingly. In that case, even if the acquired audio signal is spoken by the owner, the pre-processing unit 1012 will fail to recognize it. There are also various other situations that can cause the pre-processing unit 1012 to fail to recognize the owner, which are not enumerated here.
为解决可能出现的、无法识别出机主的情况，预处理单元1012在完成对声纹特征相似度的判断之后，若获取到的声纹特征与预设声纹特征的相似度小于第一预设相似度，则进一步判断该相似度是否大于或等于第二预设相似度(该第二预设相似度配置为小于第一预设相似度，具体可由本领域技术人员根据实际需要取合适值，比如，在第一预设相似度被设置为95%时，可以将第二预设相似度设置为75%)。To handle such cases where the owner cannot be recognized, after completing the judgment on the voiceprint feature similarity, if the similarity between the obtained voiceprint feature and the preset voiceprint feature is less than the first preset similarity, the pre-processing unit 1012 further determines whether the similarity is greater than or equal to a second preset similarity (the second preset similarity is configured to be smaller than the first preset similarity, and a suitable value can be selected by those skilled in the art according to actual needs; for example, when the first preset similarity is set to 95%, the second preset similarity may be set to 75%).
在判断结果为是,也即是获取到的声纹特征与预设声纹特征的相似度小于第一预设相似度且大于或等于第二预设相似度时,预处理单元1012进一步获取到当前的位置信息。其中,预处理单元1012可以发送位置获取请求至电子设备的定位模组(可以采用卫星定位技术或者基站定位技术等不同的定位技术来获取到当前的位置信息),指示定位模组返回当前的位置信息。When the judgment result is yes, that is, the similarity between the acquired voiceprint feature and the preset voiceprint feature is less than the first preset similarity and greater than or equal to the second preset similarity, the preprocessing unit 1012 further obtains Current location information. The pre-processing unit 1012 may send a position acquisition request to the positioning module of the electronic device (different positioning technologies such as satellite positioning technology or base station positioning technology may be used to obtain the current position information), and instruct the positioning module to return to the current position. information.
在获取到当前的位置信息之后,预处理单元1012根据该位置信息判断当前是否位于预设位置范围内。其中,预设位置范围可以配置为机主的常用位置范围,比如家里和公司等。After acquiring the current position information, the pre-processing unit 1012 determines whether it is currently within a preset position range according to the position information. Among them, the preset position range can be configured as a common position range of the owner, such as home and company.
在当前位于预设位置范围内时,预处理单元1012确定获取到的声纹特征与预设声纹特征匹配,将音频信号的发音者识别为机主。When it is currently within the preset position range, the preprocessing unit 1012 determines that the acquired voiceprint feature matches the preset voiceprint feature, and recognizes the speaker of the audio signal as the owner.
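The two-threshold decision described above, with the location check rescuing borderline cases, can be sketched as follows. This is a minimal illustration: the threshold values (95%/75%) are the examples given in the text, while the function names and the boolean location flag are assumptions, since the document leaves the concrete similarity measure and positioning API open.

```python
# Hypothetical sketch of the two-threshold voiceprint matching logic.
FIRST_PRESET_SIMILARITY = 0.95
SECOND_PRESET_SIMILARITY = 0.75

def voiceprint_matches(similarity: float, in_preset_location: bool) -> bool:
    """Decide whether an acquired voiceprint matches the preset one."""
    if similarity >= FIRST_PRESET_SIMILARITY:
        return True   # direct match
    if similarity >= SECOND_PRESET_SIMILARITY and in_preset_location:
        return True   # borderline similarity rescued by the location check
    return False      # no match

# e.g. a hoarse owner at home: 0.80 similarity, within the preset range
hoarse_owner_at_home = voiceprint_matches(0.80, True)
```

The design intent is that a cold-afflicted owner whose similarity drops below the first threshold can still unlock the device in a trusted location, while a stranger (similarity below the second threshold) is rejected everywhere.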
In one embodiment, the central processing unit 102 is further configured to, when the confidence of the candidate keyword reaches a preset confidence level, take the candidate keyword as the target keyword of the audio signal, determine the preset operation corresponding to the target keyword as the target operation according to a preset mapping between keywords and preset operations, and perform that target operation.
Specifically, after extracting the recognized candidate keyword and its confidence from the application-specific integrated circuit (ASIC) chip 101 according to the indication information sent by the chip, the central processing unit 102 first checks whether the confidence of the candidate keyword reaches a preset confidence level (a suitable value may be chosen by a person skilled in the art as needed, for example 90%).
If the confidence of the candidate keyword reaches the preset confidence level, the central processing unit 102 takes the candidate keyword as the target keyword of the audio signal.
The central processing unit 102 then determines the preset operation corresponding to the target keyword as the target operation according to the preset mapping between keywords and operations. This mapping can be configured as needed; for example, the keyword "Xiao Ou, Xiao Ou" may be mapped to the preset operation "wake the operating system", so that when the target keyword is "Xiao Ou, Xiao Ou" and the operating system is currently asleep, the central processing unit 102 wakes it up.
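The confidence check and keyword-to-operation lookup performed by the central processing unit 102 can be sketched as below. The wake phrase, the 90% threshold, and the operation name are taken from the examples in the text; the function and dictionary names are illustrative assumptions.

```python
# Illustrative sketch of confidence gating plus keyword dispatch.
PRESET_CONFIDENCE = 0.90
KEYWORD_OPERATIONS = {
    "Xiao Ou, Xiao Ou": "wake_operating_system",
}

def target_operation(candidate_keyword: str, confidence: float):
    """Return the target operation for a recognized keyword, or None."""
    if confidence < PRESET_CONFIDENCE:
        return None  # recognition result not trusted
    # the candidate keyword becomes the target keyword of the audio signal
    return KEYWORD_OPERATIONS.get(candidate_keyword)
```

A candidate keyword below the threshold is discarded rather than dispatched, which is what prevents low-confidence recognitions from triggering operations.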
An embodiment of the present application provides a device control method applied to an electronic device, where the electronic device includes a central processing unit and an ASIC chip, and the power consumption of the ASIC chip is lower than that of the central processing unit. The device control method includes:
the ASIC chip acquires an external audio signal;
the ASIC chip recognizes the audio signal to obtain a recognition result;
the ASIC chip sends indication information that recognition is complete to the central processing unit;
the central processing unit extracts the recognition result from the ASIC chip according to the indication information and performs a target operation corresponding to the recognition result.
In one embodiment, the ASIC chip includes a micro control unit, a pre-processing unit, and an algorithm unit, and the step in which the ASIC chip recognizes the audio signal to obtain a recognition result includes:
the pre-processing unit, under control of the micro control unit, extracts the Mel-frequency cepstral coefficients (MFCCs) of the audio signal using an MFCC algorithm;
the algorithm unit, under control of the micro control unit, performs keyword recognition on the MFCCs using a deep neural network, obtaining a candidate keyword and its confidence as the recognition result.
In one embodiment, performing the target operation corresponding to the recognition result includes:
when the confidence reaches a preset confidence level, the central processing unit takes the candidate keyword as the target keyword of the audio signal, determines the preset operation corresponding to the target keyword as the target operation according to the preset mapping between keywords and preset operations, and performs the target operation.
In one embodiment, the ASIC chip further includes a memory, and the device control method further includes:
the memory stores the audio signal, the candidate keyword, the confidence, and intermediate data produced by the pre-processing unit and the algorithm unit during execution.
In one embodiment, the ASIC chip further includes a cache, and the device control method further includes:
the cache buffers data written to and read from the memory.
In one embodiment, before extracting the MFCCs of the audio signal using the MFCC algorithm, the method further includes:
the pre-processing unit pre-processes the audio signal and, once pre-processing is complete, extracts the MFCCs of the audio signal using the MFCC algorithm.
In one embodiment, before extracting the MFCCs of the audio signal using the MFCC algorithm, the method further includes:
the pre-processing unit extracts the voiceprint feature of the audio signal, determines whether the voiceprint feature matches a preset voiceprint feature, and pre-processes the audio signal when the voiceprint feature matches the preset voiceprint feature.
In one embodiment, determining whether the voiceprint feature matches the preset voiceprint feature includes:
the pre-processing unit obtains the similarity between the voiceprint feature and the preset voiceprint feature, determines whether the similarity is greater than or equal to a first preset similarity, and, if so, determines that the voiceprint feature matches the preset voiceprint feature.
In one embodiment, the device control method provided in the embodiments of the present application further includes:
when the similarity is below the first preset similarity but greater than or equal to a second preset similarity, the pre-processing unit obtains the current location information, determines from it whether the device is currently within a preset location range, and, if so, determines that the voiceprint feature matches the preset voiceprint feature.
In one embodiment, the device control method provided in the embodiments of the present application further includes:
when the voiceprint feature does not match the preset voiceprint feature, the pre-processing unit instructs the micro control unit to delete the audio signal.
Further, an embodiment of the present application provides a device control method executed by the electronic device provided in the embodiments of the present application. The electronic device includes an ASIC chip 101 and a central processing unit 102, where the power consumption of the ASIC chip 101 is lower than that of the central processing unit 102. Referring to FIG. 5, the device control method includes:
101. The ASIC chip 101 acquires an external audio signal.
It should be noted that the ASIC chip 101 in the embodiments of the present application is an application-specific integrated circuit designed for audio recognition; compared with the general-purpose central processing unit 102, it offers higher audio-recognition efficiency and lower power consumption. The ASIC chip 101 and the central processing unit 102 establish a data communication connection over a communication bus.
The ASIC chip 101 can acquire external audio signals in several ways. For example, when no external microphone is connected to the electronic device, the ASIC chip 101 may capture the sound produced by an external speaker through the device's built-in microphone (not shown in FIG. 1) to obtain the external audio signal; when an external microphone is connected, the ASIC chip 101 may capture external sound through that microphone instead.
When the ASIC chip 101 captures external audio through a microphone, if the microphone is analog it yields an analog audio signal, which the ASIC chip 101 must sample to convert into a digital audio signal, for example at a sampling rate of 16 kHz. If the microphone is digital, the ASIC chip 101 receives a digital audio signal directly and no conversion is needed.
102. The ASIC chip 101 performs a recognition operation on the acquired audio signal to obtain a recognition result.
After acquiring the external audio signal, the ASIC chip 101 recognizes it according to a pre-configured recognition mode and obtains a recognition result.
For example, when the recognition mode of the ASIC chip 101 is configured as gender recognition, the chip extracts feature information characterizing gender from the audio signal and, based on the extracted features, identifies the gender of the speaker, yielding a recognition result of male or female.
As another example, when the recognition mode of the ASIC chip 101 is configured as environment-type recognition (subway-car scene, bus scene, office scene, and so on), the chip extracts feature information characterizing the acoustic environment from the audio signal and, based on the extracted features, identifies the current environment, yielding a recognition result describing the current environment type.
103. The ASIC chip 101 sends indication information indicating that the recognition operation is complete to the central processing unit 102.
After completing the recognition operation on the audio signal and obtaining the recognition result, the ASIC chip 101 sends indication information signalling completion to the central processing unit 102. Intuitively, this indication tells the central processing unit 102 that the ASIC chip 101 has finished recognizing the audio signal and the recognition result can now be fetched from it. The indication information may be sent in the form of an interrupt signal.
104. The central processing unit 102 extracts the recognition result from the ASIC chip 101 according to the received indication information and performs the target operation corresponding to the recognition result.
Accordingly, after receiving the indication information from the ASIC chip 101, the central processing unit 102 extracts from the ASIC chip 101, according to that indication information, the recognition result the chip obtained by recognizing the audio signal.
After extracting the recognition result of the audio signal, the central processing unit 102 performs the target operation corresponding to that result.
For example, when the ASIC chip 101 is configured for gender recognition, if the extracted result is "speaker is male" the operating system's theme is switched to a masculine theme mode, and if the result is "speaker is female" it is switched to a feminine theme mode.
As another example, when the ASIC chip 101 is configured for environment-type recognition, if the extracted result is "office scene" the operating system's alert mode is switched to silent mode, and if the result is "bus scene" it is switched to vibrate-plus-ring mode, and so on.
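Mapping recognition results to system actions, as in the two examples above, amounts to a simple lookup. The result strings and mode names below are hypothetical stand-ins; the document does not fix a concrete encoding for recognition results.

```python
# A minimal sketch of dispatching a recognition result to a target operation.
RESULT_ACTIONS = {
    "speaker_male":   "theme:masculine",
    "speaker_female": "theme:feminine",
    "office_scene":   "prompt:silent",
    "bus_scene":      "prompt:vibrate+ring",
}

def perform_target_operation(recognition_result: str) -> str:
    # unknown results leave the current mode unchanged
    return RESULT_ACTIONS.get(recognition_result, "no-op")
```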
As can be seen from the above, in the electronic device of the embodiments of the present application, the low-power ASIC chip 101 first acquires the external audio signal, performs the recognition operation on it, obtains the recognition result, and sends indication information that recognition is complete to the central processing unit 102; the central processing unit 102 then extracts the recognition result from the ASIC chip 101 according to the indication information and performs the target operation corresponding to it. The audio-recognition workload of the central processing unit 102 is thus offloaded to the lower-power ASIC chip 101, with the central processing unit 102 performing the corresponding target operation based on the chip's recognition result. Having the ASIC cooperate with the central processing unit 102 in this way to voice-control the electronic device reduces the power the device consumes in implementing voice control.
In one embodiment, referring to FIG. 2, the ASIC chip 101 includes a micro control unit 1011, a pre-processing unit 1012, and an algorithm unit 1013. Referring to FIG. 6, the step in which the ASIC chip 101 recognizes the acquired audio signal to obtain a recognition result includes:
1021. The pre-processing unit 1012, under control of the micro control unit 1011, extracts the MFCCs of the audio signal using the MFCC algorithm.
1022. The algorithm unit 1013, under control of the micro control unit 1011, performs keyword recognition on the MFCCs using a deep neural network, obtaining a candidate keyword and its confidence.
The micro control unit 1011 first acquires the external audio signal through a microphone. For example, when no external microphone is connected to the electronic device, the micro control unit 1011 may capture external sound through the device's built-in microphone (not shown in FIG. 2) to obtain the external audio signal; when an external microphone is connected, it may capture external sound through that microphone instead.
When the micro control unit 1011 captures external audio through a microphone, if the microphone is analog it yields an analog audio signal, which the micro control unit 1011 must sample to convert into a digital audio signal, for example at a sampling rate of 16 kHz; if the microphone is digital, the micro control unit 1011 receives a digital audio signal directly and no conversion is needed.
After acquiring the external audio signal, the micro control unit 1011 generates first control information and sends it to the pre-processing unit 1012.
After receiving the first control information from the micro control unit 1011, the pre-processing unit 1012 extracts the MFCCs of the audio signal using the MFCC algorithm according to that control information. Once the MFCCs have been extracted, the pre-processing unit 1012 sends first feedback information to the micro control unit 1011.
After receiving the first feedback information from the pre-processing unit 1012, the micro control unit 1011 determines that the pre-processing unit 1012 has now extracted the MFCCs of the audio signal, generates second control information, and sends it to the algorithm unit 1013.
After receiving the second control information from the micro control unit 1011, the algorithm unit 1013 uses its built-in deep neural network to perform keyword recognition on the MFCCs (keyword recognition, that is, detecting whether predefined words occur in the speech corresponding to the audio signal), obtaining a candidate keyword and its confidence. Once keyword recognition is complete and the candidate keyword and its confidence have been obtained, the algorithm unit 1013 sends second feedback information to the micro control unit 1011.
After receiving the second feedback information from the algorithm unit 1013, the micro control unit 1011 determines that the algorithm unit 1013 has completed keyword recognition, and takes the candidate keyword and its confidence identified by the algorithm unit 1013 as the recognition result of this recognition operation on the audio signal.
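The control/feedback handshake above can be summarized as a two-stage pipeline driven by the micro control unit. The sketch below is an assumption-laden abstraction: the stage callables stand in for the MFCC and DNN units, and the control/feedback messages are collapsed into ordinary function calls and returns.

```python
# Sketch of MCU 1011 orchestrating the pre-processing and algorithm units.
def run_recognition(audio, extract_mfcc, spot_keyword):
    """Orchestrate the two-stage recognition the way MCU 1011 does."""
    # first control information -> pre-processing unit; first feedback back
    mfcc = extract_mfcc(audio)
    # second control information -> algorithm unit; second feedback back
    candidate, confidence = spot_keyword(mfcc)
    # the pair becomes the recognition result of this operation
    return {"keyword": candidate, "confidence": confidence}

result = run_recognition(
    audio=[0.0, 0.1, -0.1],
    extract_mfcc=lambda a: [sum(a)],               # placeholder stage
    spot_keyword=lambda m: ("Xiao Ou, Xiao Ou", 0.93),  # placeholder stage
)
```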
In addition, referring to FIG. 3, the ASIC chip 101 further includes a memory 1014, which can store the acquired audio signal, the recognized candidate keyword, the confidence, and intermediate data produced by the pre-processing unit 1012 and the algorithm unit 1013 during execution.
For example, the micro control unit 1011 stores the audio signal captured through the microphone in the memory 1014; the pre-processing unit 1012, under control of the micro control unit 1011, extracts the MFCCs of the audio signal stored in the memory 1014 using the MFCC algorithm and stores the extracted MFCCs back into the memory 1014; the algorithm unit 1013, under control of the micro control unit 1011, performs keyword recognition on the MFCCs stored in the memory 1014 using its built-in deep neural network, obtains the candidate keyword and its confidence, and stores them in the memory 1014.
Referring to FIG. 4, the ASIC chip 101 further includes a cache 1015, which can buffer data written to and read from the memory 1014.
The cache 1015 has less storage space than the memory 1014 but is faster; using the cache 1015 can improve the processing efficiency of the pre-processing unit 1012 and the algorithm unit 1013.
For example, when the pre-processing unit 1012 extracts MFCCs from the audio signal, accessing data directly from the memory 1014 incurs a wait of some duration, whereas the cache 1015 can hold the portion of data the pre-processing unit 1012 has just used or reuses cyclically. When the pre-processing unit 1012 needs that data again, it can fetch it directly from the cache 1015, avoiding repeated memory accesses and reducing the pre-processing unit 1012's wait time, which improves its processing efficiency.
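The benefit described above, where recently used or cyclically reused data skips the slow memory path, is the classic behavior of a small least-recently-used cache. The toy model below illustrates the idea; the capacity, eviction policy, and access pattern are made up for the sketch and are not specified by the document.

```python
# Toy LRU cache illustrating why a cache in front of memory cuts wait time.
from collections import OrderedDict

class TinyCache:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.data = OrderedDict()
        self.hits = self.misses = 0

    def read(self, address, slow_memory):
        if address in self.data:
            self.hits += 1                    # served from the fast cache
            self.data.move_to_end(address)
            return self.data[address]
        self.misses += 1                      # fall back to slow memory
        value = slow_memory[address]
        self.data[address] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)     # evict least recently used
        return value

memory = {addr: addr * 2 for addr in range(16)}
cache = TinyCache()
for addr in [0, 1, 0, 1, 0, 1]:               # cyclic reuse pattern
    cache.read(addr, memory)
```

After the loop, only the first access to each address misses; the four repeated accesses hit the cache, which is the effect the text attributes to the cache 1015.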
In one embodiment, referring to FIG. 7, the step in which the central processing unit 102 performs the target operation corresponding to the recognition result includes:
1041. When the confidence of the candidate keyword reaches the preset confidence level, the central processing unit 102 takes the candidate keyword as the target keyword of the audio signal.
1042. The central processing unit 102 determines the preset operation corresponding to the target keyword as the target operation according to the preset mapping between keywords and preset operations, and performs the target operation.
Specifically, after extracting the recognized candidate keyword and its confidence from the ASIC chip 101 according to the indication information sent by the chip, the central processing unit 102 first checks whether the confidence of the candidate keyword reaches the preset confidence level (a suitable value may be chosen by a person skilled in the art as needed, for example 90%).
If the confidence of the candidate keyword reaches the preset confidence level, the central processing unit 102 takes the candidate keyword as the target keyword of the audio signal.
The central processing unit 102 then determines the preset operation corresponding to the target keyword as the target operation according to the preset mapping between keywords and operations. This mapping can be configured as needed; for example, the keyword "Xiao Ou, Xiao Ou" may be mapped to the preset operation "wake the operating system", so that when the target keyword is "Xiao Ou, Xiao Ou" and the operating system is currently asleep, the central processing unit 102 wakes it up.
In one embodiment, before the step in which the pre-processing unit 1012 extracts the MFCCs of the audio signal using the MFCC algorithm, the method further includes:
(1) the pre-processing unit 1012 pre-processes the audio signal;
(2) after finishing pre-processing the audio signal, the pre-processing unit 1012 extracts its MFCCs using the MFCC algorithm.
Specifically, after receiving the first control information from the micro control unit 1011, the pre-processing unit 1012 first applies pre-processing such as pre-emphasis and windowing to the audio signal.
Pre-emphasis means boosting the energy of the high-frequency part of the audio signal. In the spectrum of an audio signal, the low-frequency part often carries more energy than the high-frequency part: spectral energy falls by about 20 dB per tenfold increase in frequency, and the circuit noise floor of the microphone during capture further raises the low-frequency energy. To give the high-frequency part an amplitude similar to that of the low-frequency part, the high-frequency energy of the captured audio signal needs to be pre-emphasized.
Audio signals are generally non-stationary, so their statistics are not fixed; within a sufficiently short interval, however, the signal can be regarded as stationary, which is the rationale for windowing. A window is described by three parameters: window length (in milliseconds), offset, and shape. Each windowed segment of the audio signal is called a frame; the duration of a frame in milliseconds is the frame length, and the distance between the left boundaries of two adjacent frames is the frame shift. In the embodiments of the present application, a Hamming window, whose edges taper smoothly toward zero, may be used for the windowing.
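The pre-emphasis and windowing steps above can be sketched in a few lines of NumPy. The 0.97 pre-emphasis coefficient and the 25 ms frame / 10 ms shift are common textbook choices, not values fixed by this document; only the 16 kHz rate appears in the text.

```python
# Minimal sketch of pre-emphasis and Hamming-window framing.
import numpy as np

def pre_emphasize(signal, coeff=0.97):
    # boost high frequencies: y[n] = x[n] - coeff * x[n-1]
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])

def frame_and_window(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    frame_len = int(sample_rate * frame_ms / 1000)   # window length
    shift = int(sample_rate * shift_ms / 1000)       # frame shift
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift)
    window = np.hamming(frame_len)                   # edge-tapered shape
    return np.stack([
        signal[i * shift: i * shift + frame_len] * window
        for i in range(n_frames)
    ])

x = pre_emphasize(np.random.randn(16000))            # one second at 16 kHz
frames = frame_and_window(x)                         # short quasi-stationary frames
```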
After pre-processing of the audio signal is complete, the pre-processing unit 1012 can extract its MFCCs using the MFCC algorithm. Roughly, the pre-processing unit 1012 extracts the MFCCs as follows: exploiting the non-linear characteristics of human hearing, the spectrum of the audio signal is converted into a non-linear, Mel-frequency-based spectrum and then transformed into the cepstral domain, yielding the MFCCs.
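The spectrum → Mel spectrum → cepstrum chain just described can be sketched with a simplified triangular Mel filter bank and a DCT. All constants (512-point FFT, 26 filters, 13 coefficients) are conventional illustrative choices, and the implementation is a rough sketch of the standard technique rather than the patent's specific algorithm.

```python
# Rough sketch of MFCC extraction for one pre-processed frame.
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters=26, n_fft=512, sample_rate=16000):
    # triangular filters spaced evenly on the Mel scale
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        fb[m - 1, bins[m - 1]:bins[m]] = np.linspace(
            0.0, 1.0, bins[m] - bins[m - 1], endpoint=False)
        fb[m - 1, bins[m]:bins[m + 1]] = np.linspace(
            1.0, 0.0, bins[m + 1] - bins[m], endpoint=False)
    return fb

def dct2(x):
    # DCT-II, the transform conventionally used for the cepstral step
    n = len(x)
    k = np.arange(n)
    return np.array([np.sum(x * np.cos(np.pi * (k + 0.5) * i / n))
                     for i in range(n)])

def mfcc(frame, n_coeffs=13):
    power = np.abs(np.fft.rfft(frame, 512)) ** 2   # linear power spectrum
    mel_energy = mel_filterbank() @ power          # Mel-warped spectrum
    return dct2(np.log(mel_energy + 1e-10))[:n_coeffs]  # cepstral domain
```

Applied to one 400-sample frame, this yields a 13-dimensional MFCC vector of the kind the algorithm unit 1013 would consume.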
In an embodiment, before the pre-processing unit 1012 pre-processes the audio signal, the method further includes:
(1) the pre-processing unit 1012 extracts a voiceprint feature of the audio signal;
(2) the pre-processing unit 1012 determines whether the extracted voiceprint feature matches a preset voiceprint feature;
(3) when the extracted voiceprint feature matches the preset voiceprint feature, the pre-processing unit 1012 pre-processes the aforementioned audio signal.
It should be noted that, in everyday life, each person's voice has its own characteristics, and people who know each other well can tell one another apart by voice alone. These characteristics constitute the voiceprint feature, which is determined mainly by two factors. The first is the size of the vocal cavities, including the throat, nasal cavity, and oral cavity; the shape, size, and position of these organs determine the tension of the vocal cords and the range of sound frequencies. Consequently, even when different people say the same words, the frequency distributions of their voices differ: some sound deep, others resonant.
The second factor determining the voiceprint feature is the manner in which the articulators are manipulated. The articulators include the lips, teeth, tongue, soft palate, and palatal muscles, and their interaction produces intelligible speech. The way they coordinate is learned incidentally through a person's interactions with the people around them; in the process of learning to speak, by imitating the speech of different people nearby, a person gradually forms his or her own voiceprint feature.
After receiving the first control information from the micro control unit 1011, the pre-processing unit 1012 first extracts the voiceprint feature of the audio signal.
After obtaining the voiceprint feature of the audio signal, the pre-processing unit 1012 further compares it with a preset voiceprint feature to determine whether the two match. The preset voiceprint feature may be a voiceprint feature recorded in advance by the device owner, so determining whether the voiceprint feature of the acquired audio signal matches the preset voiceprint feature amounts to determining whether the speaker of the audio signal is the owner.
When the acquired voiceprint feature matches the preset voiceprint feature, the pre-processing unit 1012 determines that the speaker of the audio signal is the owner, and then further pre-processes the audio signal and extracts the Mel-frequency cepstral coefficients; for details, refer to the related descriptions above, which are not repeated here.
In an embodiment, the step in which the pre-processing unit 1012 determines whether the extracted voiceprint feature matches the preset voiceprint feature includes:
(1) the pre-processing unit 1012 obtains the similarity between the aforementioned voiceprint feature and the preset voiceprint feature;
(2) the pre-processing unit 1012 determines whether the obtained similarity is greater than or equal to a first preset similarity;
(3) when the obtained similarity is greater than or equal to the first preset similarity, the pre-processing unit 1012 determines that the acquired voiceprint feature matches the preset voiceprint feature.
When determining whether the acquired voiceprint feature matches the preset voiceprint feature, the pre-processing unit 1012 may obtain the similarity between that voiceprint feature (i.e., the voiceprint feature extracted from the aforementioned audio signal) and the preset voiceprint feature, and determine whether the obtained similarity is greater than or equal to a first preset similarity (set according to actual needs; for example, it may be set to 95%). If the obtained similarity is greater than or equal to the first preset similarity, the acquired voiceprint feature is determined to match the preset voiceprint feature; if the obtained similarity is less than the first preset similarity, the acquired voiceprint feature is determined not to match the preset voiceprint feature.
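The threshold check above can be sketched as follows. The present application does not specify how the similarity between two voiceprint features is computed, so cosine similarity over feature vectors is assumed here purely for illustration:

```python
import numpy as np

def voiceprint_matches(feature: np.ndarray, preset_feature: np.ndarray,
                       first_threshold: float = 0.95):
    """Return (matched, similarity) for a voiceprint feature vector.

    Cosine similarity is an assumed stand-in for the unspecified
    similarity measure; 0.95 mirrors the 95% example in the text.
    """
    sim = float(np.dot(feature, preset_feature)
                / (np.linalg.norm(feature) * np.linalg.norm(preset_feature)))
    return sim >= first_threshold, sim
```
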
In addition, when the acquired voiceprint feature does not match the preset voiceprint feature, the pre-processing unit 1012 determines that the speaker of the current audio signal is not the owner and sends third feedback information to the micro control unit 1011.
After receiving the third feedback information from the pre-processing unit 1012, the micro control unit 1011 deletes the acquired audio signal and continues to acquire external audio signals; only when the owner's audio signal is obtained is that signal pre-processed and its Mel-frequency cepstral coefficients extracted. For how the pre-processing and the extraction of the Mel-frequency cepstral coefficients are performed, refer to the related descriptions of the foregoing embodiments, which are not repeated here.
Thus, by authenticating the speaker's identity based on the voiceprint feature and responding only to audio signals issued by the owner, operations not intended by the owner can be avoided, improving the owner's user experience.
In an embodiment, after the step in which the pre-processing unit 1012 determines whether the obtained similarity is greater than or equal to the first preset similarity, the method further includes:
(1) when the aforementioned similarity is less than the first preset similarity and greater than or equal to a second preset similarity, the pre-processing unit 1012 obtains current position information;
(2) the pre-processing unit 1012 determines, according to the obtained position information, whether the device is currently within a preset position range;
(3) when the device is currently within the preset position range, the pre-processing unit 1012 determines that the aforementioned voiceprint feature matches the preset voiceprint feature.
It should be noted that voiceprint features are closely related to the physiological characteristics of the human body. In daily life, if the user catches a cold and has an inflamed throat, the voice becomes hoarse and the voiceprint feature changes accordingly. In that case, even if the acquired audio signal is spoken by the owner, the pre-processing unit 1012 will be unable to recognize it. There are various other situations in which the pre-processing unit 1012 may fail to identify the owner, which are not enumerated here.
To handle such cases where the owner cannot be identified, after completing the judgment on the voiceprint similarity, if the similarity between the acquired voiceprint feature and the preset voiceprint feature is less than the first preset similarity, the pre-processing unit 1012 further determines whether the similarity is greater than or equal to a second preset similarity (the second preset similarity is configured to be smaller than the first preset similarity and may be set to a suitable value by those skilled in the art according to actual needs; for example, when the first preset similarity is set to 95%, the second preset similarity may be set to 75%).
When the judgment result is yes, that is, when the similarity between the acquired voiceprint feature and the preset voiceprint feature is less than the first preset similarity and greater than or equal to the second preset similarity, the pre-processing unit 1012 further obtains the current position information. The pre-processing unit 1012 may send a position acquisition request to the positioning module of the electronic device (which may use different positioning technologies, such as satellite positioning or base-station positioning, to obtain the current position information), instructing the positioning module to return the current position information.
After obtaining the current position information, the pre-processing unit 1012 determines, according to that position information, whether the device is currently within a preset position range. The preset position range may be configured as locations the owner frequents, such as home and the workplace.
When the device is currently within the preset position range, the pre-processing unit 1012 determines that the acquired voiceprint feature matches the preset voiceprint feature and recognizes the speaker of the audio signal as the owner.
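The two-tier decision described in this embodiment can be sketched as follows; the 95% and 75% thresholds mirror the examples given in the text, and `in_trusted_location` stands in for the positioning-module check, whose interface the present application does not specify:

```python
def authenticate(similarity: float, in_trusted_location: bool,
                 first_threshold: float = 0.95,
                 second_threshold: float = 0.75) -> bool:
    """Decide whether a voiceprint similarity counts as an owner match."""
    # High-confidence match: accept outright.
    if similarity >= first_threshold:
        return True
    # Borderline match (e.g. the owner's voice is hoarse from a cold):
    # accept only when the device is within a preset position range
    # such as home or the workplace.
    if similarity >= second_threshold and in_trusted_location:
        return True
    return False
```

The location check thus acts as a second authentication factor that loosens the similarity requirement only in places the owner frequents.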
The electronic device and device control method provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the descriptions of the above embodiments are intended only to help understand the method of the present application and its core ideas. Meanwhile, those skilled in the art may, following the ideas of the present application, make changes to the specific implementations and application scope. In summary, the content of this specification should not be construed as limiting the present application.

Claims (20)

  1. An electronic device, wherein the electronic device comprises an application-specific integrated circuit chip and a central processing unit, and the power consumption of the application-specific integrated circuit chip is less than that of the central processing unit, wherein:
    the application-specific integrated circuit chip is configured to acquire an external audio signal, perform a recognition operation on the audio signal to obtain a recognition result, and send, to the central processing unit, indication information indicating that the recognition operation is complete;
    the central processing unit is configured to extract the recognition result from the application-specific integrated circuit chip according to the indication information, and execute a target operation corresponding to the recognition result.
  2. The electronic device according to claim 1, wherein the application-specific integrated circuit chip comprises a micro control unit, a pre-processing unit, and an algorithm unit, wherein:
    the pre-processing unit is configured to extract, under the control of the micro control unit, Mel-frequency cepstral coefficients of the audio signal using a Mel-frequency cepstral coefficient algorithm;
    the algorithm unit is configured to perform, under the control of the micro control unit, keyword recognition on the Mel-frequency cepstral coefficients using a deep neural network algorithm to obtain a candidate keyword and a confidence of the candidate keyword.
  3. The electronic device according to claim 2, wherein the central processing unit is further configured to, when the confidence reaches a preset confidence, take the candidate keyword as a target keyword of the audio signal, determine, according to a preset correspondence between keywords and preset operations, the preset operation corresponding to the target keyword as the target operation, and execute the target operation.
  4. The electronic device according to claim 2, wherein the application-specific integrated circuit chip further comprises a memory configured to store the audio signal, the candidate keyword, the confidence, and intermediate data generated by the pre-processing unit and the algorithm unit during execution.
  5. The electronic device according to claim 4, wherein the application-specific integrated circuit chip further comprises a cache memory configured to cache data stored into the memory and data fetched from the memory.
  6. The electronic device according to claim 2, wherein the pre-processing unit is further configured to pre-process the audio signal and, after the pre-processing of the audio signal is complete, extract the Mel-frequency cepstral coefficients of the audio signal using the Mel-frequency cepstral coefficient algorithm.
  7. The electronic device according to claim 6, wherein the pre-processing unit is further configured to extract a voiceprint feature of the audio signal, determine whether the voiceprint feature matches a preset voiceprint feature, and pre-process the audio signal when the voiceprint feature matches the preset voiceprint feature.
  8. The electronic device according to claim 7, wherein the pre-processing unit is further configured to obtain a similarity between the voiceprint feature and the preset voiceprint feature, determine whether the similarity is greater than or equal to a first preset similarity, and determine that the voiceprint feature matches the preset voiceprint feature when the similarity is greater than or equal to the first preset similarity.
  9. The electronic device according to claim 8, wherein the pre-processing unit is further configured to, when the similarity is less than the first preset similarity and greater than or equal to a second preset similarity, obtain current position information, determine according to the position information whether the device is currently within a preset position range, and determine that the voiceprint feature matches the preset voiceprint feature when the device is currently within the preset position range.
  10. The electronic device according to claim 7, wherein the pre-processing unit is further configured to instruct the micro control unit to delete the audio signal when the voiceprint feature does not match the preset voiceprint feature.
  11. A device control method, applied to an electronic device, wherein the electronic device comprises a central processing unit and an application-specific integrated circuit chip, and the power consumption of the application-specific integrated circuit chip is less than that of the central processing unit, the device control method comprising:
    acquiring, by the application-specific integrated circuit chip, an external audio signal;
    recognizing, by the application-specific integrated circuit chip, the audio signal to obtain a recognition result;
    sending, by the application-specific integrated circuit chip, indication information of recognition completion to the central processing unit;
    extracting, by the central processing unit, the recognition result from the application-specific integrated circuit chip according to the indication information, and executing a target operation corresponding to the recognition result.
  12. The device control method according to claim 11, wherein the application-specific integrated circuit chip comprises a micro control unit, a pre-processing unit, and an algorithm unit, and the recognizing, by the application-specific integrated circuit chip, the audio signal to obtain a recognition result comprises:
    extracting, by the pre-processing unit under the control of the micro control unit, Mel-frequency cepstral coefficients of the audio signal using a Mel-frequency cepstral coefficient algorithm;
    performing, by the algorithm unit under the control of the micro control unit, keyword recognition on the Mel-frequency cepstral coefficients using a deep neural network algorithm to obtain a candidate keyword and a confidence of the candidate keyword as the recognition result.
  13. The device control method according to claim 12, wherein the executing a target operation corresponding to the recognition result comprises:
    when the confidence reaches a preset confidence, taking, by the central processing unit, the candidate keyword as a target keyword of the audio signal, determining, according to a preset correspondence between keywords and preset operations, the preset operation corresponding to the target keyword as the target operation, and executing the target operation.
  14. The device control method according to claim 12, wherein the application-specific integrated circuit chip further comprises a memory, and the device control method further comprises:
    storing, by the memory, the audio signal, the candidate keyword, the confidence, and intermediate data generated by the pre-processing unit and the algorithm unit during execution.
  15. The device control method according to claim 14, wherein the application-specific integrated circuit chip further comprises a cache memory, and the device control method further comprises:
    caching, by the cache memory, data stored into the memory and data fetched from the memory.
  16. The device control method according to claim 12, wherein before the extracting the Mel-frequency cepstral coefficients of the audio signal using the Mel-frequency cepstral coefficient algorithm, the method further comprises:
    pre-processing, by the pre-processing unit, the audio signal, and after the pre-processing of the audio signal is complete, extracting the Mel-frequency cepstral coefficients of the audio signal using the Mel-frequency cepstral coefficient algorithm.
  17. The device control method according to claim 16, wherein before the extracting the Mel-frequency cepstral coefficients of the audio signal using the Mel-frequency cepstral coefficient algorithm, the method further comprises:
    extracting, by the pre-processing unit, a voiceprint feature of the audio signal, determining whether the voiceprint feature matches a preset voiceprint feature, and pre-processing the audio signal when the voiceprint feature matches the preset voiceprint feature.
  18. The device control method according to claim 17, wherein the determining whether the voiceprint feature matches a preset voiceprint feature comprises:
    obtaining, by the pre-processing unit, a similarity between the voiceprint feature and the preset voiceprint feature, determining whether the similarity is greater than or equal to a first preset similarity, and determining that the voiceprint feature matches the preset voiceprint feature when the similarity is greater than or equal to the first preset similarity.
  19. The device control method according to claim 18, further comprising:
    when the similarity is less than the first preset similarity and greater than or equal to a second preset similarity, obtaining, by the pre-processing unit, current position information, determining according to the position information whether the device is currently within a preset position range, and determining that the voiceprint feature matches the preset voiceprint feature when the device is currently within the preset position range.
  20. The device control method according to claim 17, further comprising:
    instructing, by the pre-processing unit, the micro control unit to delete the audio signal when the voiceprint feature does not match the preset voiceprint feature.
PCT/CN2019/085554 2018-06-08 2019-05-05 Electronic device and device control method WO2019233228A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810589643.2 2018-06-08
CN201810589643.2A CN108711429B (en) 2018-06-08 2018-06-08 Electronic device and device control method

Publications (1)

Publication Number Publication Date
WO2019233228A1 true WO2019233228A1 (en) 2019-12-12

Family

ID=63871448

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/085554 WO2019233228A1 (en) 2018-06-08 2019-05-05 Electronic device and device control method

Country Status (2)

Country Link
CN (1) CN108711429B (en)
WO (1) WO2019233228A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108711429B (en) * 2018-06-08 2021-04-02 Oppo广东移动通信有限公司 Electronic device and device control method
CN109636937A (en) * 2018-12-18 2019-04-16 深圳市沃特沃德股份有限公司 Voice Work attendance method, device and terminal device
CN110223687B (en) * 2019-06-03 2021-09-28 Oppo广东移动通信有限公司 Instruction execution method and device, storage medium and electronic equipment
CN110310645A (en) * 2019-07-02 2019-10-08 上海迥灵信息技术有限公司 Sound control method, device and the storage medium of intelligence control system
CN111508475B (en) * 2020-04-16 2022-08-09 五邑大学 Robot awakening voice keyword recognition method and device and storage medium
CN113744117A (en) * 2020-05-29 2021-12-03 Oppo广东移动通信有限公司 Multimedia processing chip, electronic equipment and dynamic image processing method
CN113352987B (en) * 2021-05-31 2022-10-25 亿咖通(湖北)技术有限公司 Method and system for controlling warning tone of vehicle machine
CN115527373B (en) * 2022-01-05 2023-07-14 荣耀终端有限公司 Riding tool identification method and device

Citations (9)

Publication number Priority date Publication date Assignee Title
CN103700368A (en) * 2014-01-13 2014-04-02 联想(北京)有限公司 Speech recognition method, speech recognition device and electronic equipment
CN104143327A (en) * 2013-07-10 2014-11-12 腾讯科技(深圳)有限公司 Acoustic model training method and device
US20140372112A1 (en) * 2013-06-18 2014-12-18 Microsoft Corporation Restructuring deep neural network acoustic models
CN105488227A (en) * 2015-12-29 2016-04-13 惠州Tcl移动通信有限公司 Electronic device and method for processing audio file based on voiceprint features through same
CN106228240A (en) * 2016-07-30 2016-12-14 复旦大学 Degree of depth convolutional neural networks implementation method based on FPGA
CN106560891A (en) * 2015-10-06 2017-04-12 三星电子株式会社 Speech Recognition Apparatus And Method With Acoustic Modelling
CN107735803A (en) * 2015-06-25 2018-02-23 微软技术许可有限责任公司 Bandwidth of memory management for deep learning application
US10089979B2 (en) * 2014-09-16 2018-10-02 Electronics And Telecommunications Research Institute Signal processing algorithm-integrated deep neural network-based speech recognition apparatus and learning method thereof
CN108711429A (en) * 2018-06-08 2018-10-26 Oppo广东移动通信有限公司 Electronic equipment and apparatus control method

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
JP2005181510A (en) * 2003-12-17 2005-07-07 Toshiba Corp Ic voice repeater
CN102905029A (en) * 2012-10-17 2013-01-30 广东欧珀移动通信有限公司 Mobile phone and method for looking for mobile phone through intelligent voice
CN103474071A (en) * 2013-09-16 2013-12-25 重庆邮电大学 Embedded portable voice controller and intelligent housing system with voice recognition
CN105575395A (en) * 2014-10-14 2016-05-11 中兴通讯股份有限公司 Voice wake-up method and apparatus, terminal, and processing method thereof
CN106940998B (en) * 2015-12-31 2021-04-16 阿里巴巴集团控股有限公司 Execution method and device for setting operation
CN106250751B (en) * 2016-07-18 2019-09-17 青岛海信移动通信技术股份有限公司 A kind of mobile device and the method for adjusting sign information detection threshold value

Patent Citations (9)

Publication number Priority date Publication date Assignee Title
US20140372112A1 (en) * 2013-06-18 2014-12-18 Microsoft Corporation Restructuring deep neural network acoustic models
CN104143327A (en) * 2013-07-10 2014-11-12 腾讯科技(深圳)有限公司 Acoustic model training method and device
CN103700368A (en) * 2014-01-13 2014-04-02 联想(北京)有限公司 Speech recognition method, speech recognition device and electronic equipment
US10089979B2 (en) * 2014-09-16 2018-10-02 Electronics And Telecommunications Research Institute Signal processing algorithm-integrated deep neural network-based speech recognition apparatus and learning method thereof
CN107735803A (en) * 2015-06-25 2018-02-23 微软技术许可有限责任公司 Bandwidth of memory management for deep learning application
CN106560891A (en) * 2015-10-06 2017-04-12 三星电子株式会社 Speech Recognition Apparatus And Method With Acoustic Modelling
CN105488227A (en) * 2015-12-29 2016-04-13 惠州Tcl移动通信有限公司 Electronic device and method for processing audio file based on voiceprint features through same
CN106228240A (en) * 2016-07-30 2016-12-14 复旦大学 Degree of depth convolutional neural networks implementation method based on FPGA
CN108711429A (en) * 2018-06-08 2018-10-26 Oppo广东移动通信有限公司 Electronic equipment and apparatus control method

Also Published As

Publication number Publication date
CN108711429B (en) 2021-04-02
CN108711429A (en) 2018-10-26

Similar Documents

Publication Publication Date Title
WO2019233228A1 (en) Electronic device and device control method
US11823679B2 (en) Method and system of audio false keyphrase rejection using speaker recognition
CN107799126B (en) Voice endpoint detection method and device based on supervised machine learning
CN110364143B (en) Voice awakening method and device and intelligent electronic equipment
CN110021307B (en) Audio verification method and device, storage medium and electronic equipment
CN109272991B (en) Voice interaction method, device, equipment and computer-readable storage medium
CN108922525B (en) Voice processing method, device, storage medium and electronic equipment
CN105206271A (en) Intelligent equipment voice wake-up method and system for realizing method
KR20160005050A (en) Adaptive audio frame processing for keyword detection
CN110223687B (en) Instruction execution method and device, storage medium and electronic equipment
CN111145763A (en) GRU-based voice recognition method and system in audio
WO2014173325A1 (en) Gutturophony recognition method and device
US11437022B2 (en) Performing speaker change detection and speaker recognition on a trigger phrase
WO2023088083A1 (en) Speech enhancement method and apparatus
CN114067782A (en) Audio recognition method and device, medium and chip system thereof
CN115206306A (en) Voice interaction method, device, equipment and system
WO2017177629A1 (en) Far-talking voice recognition method and device
WO2022007846A1 (en) Speech enhancement method, device, system, and storage medium
US11290802B1 (en) Voice detection using hearable devices
WO2019041871A1 (en) Voice object recognition method and device
WO2022199405A1 (en) Voice control method and apparatus
CN108337620A (en) A kind of loudspeaker and its control method of voice control
CN208337877U (en) A kind of loudspeaker of voice control
CN114664303A (en) Continuous voice instruction rapid recognition control system
CN112118511A (en) Earphone noise reduction method and device, earphone and computer readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19815256

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19815256

Country of ref document: EP

Kind code of ref document: A1