WO2019233228A1 - Electronic device and device control method - Google Patents


Info

Publication number
WO2019233228A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
preset
voiceprint feature
processing unit
integrated circuit
Prior art date
Application number
PCT/CN2019/085554
Other languages
French (fr)
Chinese (zh)
Inventor
陈岩 (Chen Yan)
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp., Ltd. (Oppo广东移动通信有限公司)
Priority date
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Publication of WO2019233228A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 Execution procedure of a spoken command
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/06 Decision making techniques; Pattern matching strategies
    • G10L 17/08 Use of distortion metrics or a particular distance between probe pattern and reference templates
    • G10L 17/22 Interactive procedures; Man-machine interfaces
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques in which the extracted parameters are the cepstrum

Definitions

  • the present application relates to the technical field of electronic devices, and in particular, to an electronic device and a device control method.
  • the use of voice recognition technology in electronic devices is becoming more and more widespread.
  • with voice recognition, voice control of electronic devices can be achieved. For example, users can speak specific voice instructions to control an electronic device to take pictures or play music.
  • an embodiment of the present application provides an electronic device that includes a central processing unit and an application-specific integrated circuit chip, and the power consumption of the application-specific integrated circuit chip is less than the power consumption of the central processing unit.
  • the application-specific integrated circuit chip is configured to obtain an external audio signal
  • the application-specific integrated circuit chip is further configured to perform a recognition operation on the audio signal to obtain a recognition result
  • the application-specific integrated circuit chip is further configured to send instruction information indicating completion of the identification operation to the central processing unit;
  • the central processing unit is configured to extract the recognition result from the application-specific integrated circuit chip according to the instruction information, and execute a target operation corresponding to the recognition result.
  • an embodiment of the present application provides a method for controlling a device, which is applied to an electronic device.
  • the electronic device includes a central processing unit and an application-specific integrated circuit chip, and the power consumption of the application-specific integrated circuit chip is less than that of the central processing unit.
  • the device control method includes:
  • the application-specific integrated circuit chip acquires an external audio signal
  • the application-specific integrated circuit chip recognizes the audio signal to obtain a recognition result
  • the application-specific integrated circuit chip sends identification completion indication information to the central processing unit;
  • the central processing unit extracts the recognition result from the application-specific integrated circuit chip according to the instruction information, and performs a target operation corresponding to the recognition result.
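The four-step handoff above can be sketched as a minimal simulation. All class names, method names, and the string-based indication below are illustrative assumptions for the sketch, not structures defined in the patent.

```python
# Hypothetical sketch of the ASIC/CPU handoff described above.

class AsicChip:
    """Low-power chip: acquires audio, recognizes it, stores the result."""

    def __init__(self):
        self._result = None

    def acquire_and_recognize(self, audio_signal, recognizer):
        # Steps 1-2: acquire the external audio signal and run recognition.
        self._result = recognizer(audio_signal)
        return "recognition_complete"          # step 3: indication to the CPU

    def read_result(self):
        # The CPU extracts the stored result over the communication bus.
        return self._result


class CentralProcessingUnit:
    """Executes the target operation corresponding to the recognition result."""

    def __init__(self, operations):
        self._operations = operations          # recognition result -> operation

    def on_indication(self, indication, asic):
        if indication == "recognition_complete":
            result = asic.read_result()        # step 4a: extract the result
            return self._operations.get(result, "ignore")  # step 4b: target op


asic = AsicChip()
cpu = CentralProcessingUnit({"wake_word": "wake_operating_system"})
indication = asic.acquire_and_recognize("raw-pcm", lambda s: "wake_word")
print(cpu.on_indication(indication, asic))     # -> wake_operating_system
```

The design point of the scheme is that the CPU stays idle until the indication arrives; only then does it touch the chip.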
  • FIG. 1 is a first schematic structural diagram of an electronic device according to an embodiment of the present application.
  • FIG. 2 is a second schematic structural diagram of an electronic device according to an embodiment of the present application.
  • FIG. 3 is a third schematic structural diagram of an electronic device according to an embodiment of the present application.
  • FIG. 4 is a fourth schematic structural diagram of an electronic device according to an embodiment of the present application.
  • FIG. 5 is a schematic flowchart of a device control method according to an embodiment of the present application.
  • FIG. 6 is a detailed flowchart of identifying an audio signal by an application specific integrated circuit chip in the embodiment of the present application.
  • FIG. 7 is a detailed flowchart of a target operation performed by a central processing unit according to an embodiment of the present application.
  • an embodiment herein means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application.
  • the appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor are they independent or alternative embodiments that are mutually exclusive with other embodiments. It is understood, both explicitly and implicitly, by those skilled in the art that the embodiments described herein may be combined with other embodiments.
  • the electronic device 100 includes an application-specific integrated circuit chip 101 and a central processing unit 102, and the power consumption of the application-specific integrated circuit chip 101 is less than the power consumption of the central processing unit 102.
  • the ASIC chip 101 is used to obtain an external audio signal, perform a recognition operation on the acquired audio signal, obtain a recognition result, and send instruction information indicating completion of the recognition operation to the central processing unit 102.
  • the ASIC chip 101 in the embodiment of the present application is an ASIC designed for the purpose of audio recognition. Compared with the general-purpose central processing unit 102, the ASIC chip has higher audio recognition efficiency and lower power consumption.
  • the ASIC chip 101 and the central processing unit 102 establish a data communication connection through a communication bus
  • the application-specific integrated circuit chip 101 can obtain external audio signals in many different ways. For example, when no microphone is externally connected to the electronic device, the application-specific integrated circuit chip 101 can collect external sound through the electronic device's built-in microphone (not shown in FIG. 1) to obtain an external audio signal; when a microphone is externally connected to the electronic device, the ASIC chip 101 may collect external sound through that external microphone to obtain an external audio signal.
  • when the ASIC chip 101 collects external audio signals through the microphone, if the microphone is an analog microphone, an analog audio signal will be collected.
  • the ASIC chip 101 then needs to sample the analog audio signal and convert it into a digital audio signal, for example by sampling at a frequency of 16 kHz.
  • if the microphone is a digital microphone, the ASIC chip 101 will directly collect the digital audio signal through the digital microphone, with no conversion required.
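As a rough illustration of the digitization step, the sketch below samples a synthetic "analog" tone at the 16 kHz rate mentioned above and quantizes it to signed 16-bit values. The tone frequency, duration, and bit depth are assumptions chosen for the example.

```python
import math

SAMPLE_RATE = 16_000      # 16 kHz sampling frequency, as in the embodiment
DURATION_S = 0.01         # 10 ms of audio (illustrative)

def sample_tone(freq_hz, duration_s, rate=SAMPLE_RATE):
    """Digitize a pure tone: sample at `rate`, then quantize to 16 bits."""
    n_samples = int(rate * duration_s)
    analog = (math.sin(2 * math.pi * freq_hz * t / rate) for t in range(n_samples))
    # Quantize each sample to a signed 16-bit integer.
    return [int(round(a * 32767)) for a in analog]

pcm = sample_tone(440, DURATION_S)
print(len(pcm))           # 160 samples = 16 kHz * 10 ms
```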
  • the application specific integrated circuit chip 101 After obtaining an external audio signal, the application specific integrated circuit chip 101 performs a recognition operation on the acquired audio signal according to a pre-configured recognition mode to obtain a recognition result.
  • the recognition mode of the ASIC chip 101 is configured as gender recognition
  • when the ASIC chip 101 recognizes the acquired audio signal, it extracts feature information capable of characterizing gender from the audio signal, recognizes the gender of the speaker of the audio signal according to the extracted feature information, and obtains a recognition result of whether the speaker is male or female.
  • the recognition mode of the application-specific integrated circuit chip 101 is configured to identify the environment type (a subway car scene, a bus carriage scene, an office scene, etc.)
  • when the application-specific integrated circuit chip 101 recognizes the acquired audio signal, it extracts feature information capable of characterizing the environment scene from the audio signal, identifies the current environment scene based on the extracted feature information, and obtains a recognition result describing the type of the current environment scene.
  • after completing the recognition operation on the audio signal and obtaining the recognition result, the application-specific integrated circuit chip 101 sends instruction information indicating completion of the recognition operation to the central processing unit 102.
  • the function of the instruction information is to inform the central processing unit 102 that the ASIC chip 101 has completed the recognition operation on the audio signal, and that the recognition result can now be extracted from the ASIC chip 101.
  • the foregoing indication information may be sent in the form of an interrupt signal.
  • the central processing unit 102 is configured to extract the foregoing recognition result from the ASIC chip 101 according to the received instruction information, and execute a target operation corresponding to the foregoing recognition result.
  • according to the instruction information, the central processing unit 102 extracts from the application-specific integrated circuit chip 101 the recognition result obtained by the chip's recognition of the audio signal.
  • after extracting the recognition result of the audio signal, the central processing unit 102 further performs a target operation corresponding to the recognition result.
  • when the application-specific integrated circuit chip 101 is configured for gender recognition, if the recognition result "the speaker is male" is extracted, the theme mode of the operating system is switched to a masculine theme mode; if the recognition result "the speaker is female" is extracted, the theme mode of the operating system is switched to a feminine theme mode.
  • when the application-specific integrated circuit chip 101 is configured for environment type recognition, if the recognition result "office scene" is extracted, the prompt mode of the operating system is switched to silent mode; if the recognition result "bus scene" is extracted, the prompt mode of the operating system is switched to a vibration-plus-ring mode, and so on.
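The result-to-operation mappings in the two examples above amount to a simple lookup table. The result labels and mode names below are placeholders for the sketch, not identifiers from the patent.

```python
# Illustrative mapping from a recognition result to a system setting change.
RESULT_ACTIONS = {
    "speaker_male":   ("theme_mode",  "masculine"),
    "speaker_female": ("theme_mode",  "feminine"),
    "office_scene":   ("prompt_mode", "silent"),
    "bus_scene":      ("prompt_mode", "vibration_and_ring"),
}

def apply_recognition_result(result, system_state):
    """Apply the target operation corresponding to the extracted result."""
    setting, value = RESULT_ACTIONS[result]
    system_state[setting] = value
    return system_state

state = {"theme_mode": "default", "prompt_mode": "ring"}
print(apply_recognition_result("office_scene", state))  # prompt_mode -> silent
```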
  • the electronic device in the embodiment of the present application includes a central processing unit 102 and an application-specific integrated circuit chip 101.
  • the application-specific integrated circuit chip 101 with low power consumption obtains external audio signals, and performs identification operations on the acquired audio signals.
  • the application-specific integrated circuit chip 101 then sends the instruction information indicating completion of the recognition operation to the central processor 102, and the central processor 102 extracts the recognition result from the ASIC chip 101 according to the instruction information and executes the target operation corresponding to the recognition result. In this way, the audio recognition task is offloaded from the central processing unit 102 to the lower-power application-specific integrated circuit chip 101, and the corresponding processing is performed by the central processing unit 102 according to the recognition result of the application-specific integrated circuit chip 101.
  • the manner in which the ASIC cooperates with the central processing unit 102 to perform voice control on the electronic device can reduce the power consumption of the electronic device to implement voice control.
  • the ASIC chip 101 includes a micro control unit 1011, a pre-processing unit 1012, and an algorithm unit 1013.
  • the pre-processing unit 1012 is configured to extract the Mel frequency cepstrum coefficient of the audio signal using the Mel frequency cepstrum coefficient algorithm according to the control of the micro control unit 1011;
  • the algorithm unit 1013 is configured to perform keyword recognition on the Mel frequency cepstrum coefficient using a deep neural network algorithm according to the control of the micro control unit 1011 to obtain candidate keywords and the confidence of the candidate keywords.
  • the micro control unit 1011 first obtains external audio signals through a microphone. For example, when the electronic device is not externally connected with a microphone, the micro control unit 1011 can collect external sounds through a built-in microphone (not shown in FIG. 2) of the electronic device. An external audio signal is obtained. For another example, when a microphone is externally connected to the electronic device, the micro control unit 1011 can collect external sound through the microphone externally connected to the electronic device to obtain an external audio signal.
  • when the micro control unit 1011 collects external audio signals through the microphone, if the microphone is an analog microphone, an analog audio signal will be collected.
  • the micro control unit 1011 then needs to sample the analog audio signal and convert it into a digital audio signal, for example by sampling at a frequency of 16 kHz; if the microphone is a digital microphone, the micro control unit 1011 will directly collect the digital audio signal through the digital microphone, with no conversion required.
  • after obtaining an external audio signal, the micro control unit 1011 generates first control information and sends the first control information to the pre-processing unit 1012.
  • after receiving the first control information from the micro control unit 1011, the pre-processing unit 1012 uses the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient of the audio signal according to the first control information. After extracting the Mel frequency cepstrum coefficient of the audio signal, the pre-processing unit 1012 sends first feedback information to the micro control unit 1011.
  • after receiving the first feedback information from the pre-processing unit 1012, the micro control unit 1011 determines that the pre-processing unit 1012 has extracted the Mel frequency cepstrum coefficient of the audio signal, and at this point generates second control information and sends it to the algorithm unit 1013.
  • after receiving the second control information from the micro control unit 1011, the algorithm unit 1013 uses its built-in deep neural network algorithm to perform keyword recognition on the aforementioned Mel frequency cepstrum coefficients (keyword recognition detects whether a predefined word appears in the speech corresponding to the audio signal), obtaining candidate keywords and the confidence of the candidate keywords. After the keyword recognition is completed and the candidate keywords and their confidence levels are obtained, the algorithm unit 1013 sends second feedback information to the micro control unit 1011.
  • after receiving the second feedback information from the algorithm unit 1013, the micro control unit 1011 determines that the algorithm unit 1013 has completed keyword recognition, and takes the candidate keywords identified by the algorithm unit 1013 and their confidence levels as the recognition result of this recognition operation on the audio signal.
  • the ASIC chip 101 further includes a memory 1014 for storing the acquired audio signals, the identified candidate keywords, the confidence levels, and the intermediate data generated by the pre-processing unit 1012 and the algorithm unit 1013 during execution.
  • the micro control unit 1011 stores the audio signal obtained through the microphone in the memory 1014; under the control of the micro control unit 1011, the pre-processing unit 1012 uses the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient of the audio signal stored in the memory 1014 and stores the extracted coefficient in the memory 1014; under the control of the micro control unit 1011, the algorithm unit 1013 uses the built-in deep neural network algorithm to perform keyword recognition on the Mel frequency cepstrum coefficient stored in the memory 1014, obtains the candidate keywords and their confidence levels, and stores the obtained candidate keywords and confidence levels in the memory 1014.
  • the ASIC chip 101 further includes a cache memory 1015 for buffering data stored in the memory 1014 and data retrieved from the memory 1014.
  • the cache memory 1015 has a smaller storage space than the memory 1014, but has a higher speed.
  • the cache memory 1015 can improve the processing efficiency of the preprocessing unit 1012 and the algorithm unit 1013.
  • when the pre-processing unit 1012 extracts Mel frequency cepstrum coefficients from the audio signal, accessing data directly from the memory 1014 requires waiting for a certain period of time, whereas the cache memory 1015 can retain part of the data that the pre-processing unit 1012 has just used or uses repeatedly. If the pre-processing unit 1012 needs that data again, it can be called directly from the cache memory 1015. This avoids repeated accesses to the memory, reduces the waiting time of the pre-processing unit 1012, and thus improves its processing efficiency.
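A toy model of this cache behaviour is sketched below, assuming (purely for illustration) a least-recently-used eviction policy and a four-entry cache; the patent specifies neither.

```python
from collections import OrderedDict

class CachedMemory:
    """Small fast cache in front of a larger, slower memory."""

    def __init__(self, backing, cache_size=4):
        self._backing = backing           # stands in for memory 1014
        self._cache = OrderedDict()       # stands in for cache memory 1015
        self._cache_size = cache_size
        self.slow_reads = 0               # reads that had to wait on memory

    def read(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)  # cache hit: no waiting time
            return self._cache[key]
        self.slow_reads += 1              # cache miss: fetch from memory
        value = self._backing[key]
        self._cache[key] = value
        if len(self._cache) > self._cache_size:
            self._cache.popitem(last=False)  # evict least recently used entry
        return value

mem = CachedMemory({f"frame{i}": i for i in range(8)})
for _ in range(3):                        # the unit reuses the same two frames
    mem.read("frame0"); mem.read("frame1")
print(mem.slow_reads)                     # -> 2: only the first reads hit memory
```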
  • the pre-processing unit 1012 pre-processes the audio signal before using the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient; after the audio signal has been pre-processed, the Mel frequency cepstrum coefficient algorithm is used to extract the Mel frequency cepstrum coefficient of the audio signal.
  • after receiving the first control information from the micro control unit 1011, the pre-processing unit 1012 first performs pre-emphasis and windowing on the audio signal.
  • pre-emphasis increases the energy of the high-frequency part of the audio signal.
  • in the spectrum of a voice signal, the energy in the low-frequency part is often higher than the energy in the high-frequency part.
  • the spectrum energy is attenuated by roughly 20 dB per decade of frequency, and the noise background of the circuit when the microphone collects audio signals further increases the energy of the low-frequency part.
  • windowing: because audio signals are generally non-stationary, their statistical characteristics are not fixed; over a sufficiently short period of time, however, the signal can be considered stable. Cutting the signal into such short segments is called windowing.
  • the window is described by three parameters: window length (in milliseconds), offset, and shape.
  • Each windowed audio signal is called a frame
  • the duration of each frame in milliseconds is called the frame length
  • the distance between the left borders of two adjacent frames is called the frame shift.
  • a Hamming window, whose edges smoothly decay to 0, may be used for the windowing processing.
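The pre-emphasis and windowing steps above can be sketched as follows. The 0.97 pre-emphasis coefficient and the 25 ms frame length with 10 ms frame shift at 16 kHz are common defaults assumed for this example, not values stated in the patent.

```python
import math

def pre_emphasis(signal, coeff=0.97):
    # Boost high frequencies: y[n] = x[n] - coeff * x[n-1]
    return [signal[0]] + [signal[n] - coeff * signal[n - 1]
                          for n in range(1, len(signal))]

def hamming(length):
    # Hamming window: smooths each frame's edges down toward 0.
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def frames(signal, frame_len, frame_shift):
    # frame_len = window length in samples; frame_shift = distance between
    # the left borders of adjacent frames.
    win = hamming(frame_len)
    return [[s * w for s, w in zip(signal[start:start + frame_len], win)]
            for start in range(0, len(signal) - frame_len + 1, frame_shift)]

signal = [math.sin(0.1 * n) for n in range(400)]   # 25 ms of audio at 16 kHz
emphasized = pre_emphasis(signal)
framed = frames(emphasized, frame_len=400, frame_shift=160)  # 160 = 10 ms shift
print(len(framed), len(framed[0]))
```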
  • the pre-processing unit 1012 can use the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient of the audio signal.
  • the process of extracting the Mel frequency cepstrum coefficient by the pre-processing unit 1012 is roughly as follows: using the non-linear characteristics of human hearing, the frequency spectrum of the audio signal is converted into a non-linear spectrum based on the Mel frequency and then converted to the cepstrum domain, which yields the Mel frequency cepstrum coefficient.
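The non-linear frequency mapping underlying this step is commonly computed with the formula below; it is a standard choice in the speech-processing literature, not one quoted from the patent.

```python
import math

def hz_to_mel(f_hz):
    """Map linear frequency (Hz) to the perceptual mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse mapping, mel back to Hz."""
    return 700.0 * (10 ** (mel / 2595.0) - 1.0)

print(round(hz_to_mel(1000)))               # ~1000 mel at 1 kHz by construction
print(round(mel_to_hz(hz_to_mel(4000))))    # round-trips back to 4000 Hz
```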
  • the pre-processing unit 1012 is further configured to extract a voiceprint feature of the audio signal before pre-processing the audio signal, determine whether the voiceprint feature matches a preset voiceprint feature, and pre-process the audio signal only when the voiceprint feature matches the preset voiceprint feature.
  • the voiceprint feature is mainly determined by two factors. The first is the size of the acoustic cavities, specifically the throat, nasal cavity, and oral cavity. The shape, size, and position of these organs determine the tension of the vocal cords and the range of sound frequencies. Therefore, although different people may say the same thing, the frequency distribution of their voices differs: some sound low and deep, others loud and clear.
  • the second factor that determines the characteristics of the voiceprint is the manner in which the vocal organs are manipulated.
  • the vocal organs include the lips, teeth, tongue, soft palate, and diaphragm muscles, and their interaction produces clear speech. The way they cooperate is learned by each person, somewhat at random, through interactions with the people around them. In the process of learning to speak, by imitating the speech of different people around them, people gradually form their own voiceprint characteristics.
  • the preprocessing unit 1012 first extracts the voiceprint characteristics of the audio signal.
  • after acquiring the voiceprint feature of the audio signal, the preprocessing unit 1012 further compares the acquired voiceprint feature with a preset voiceprint feature to determine whether the voiceprint feature matches the preset voiceprint feature.
  • the preset voiceprint feature may be a voiceprint feature previously recorded by the owner; determining whether the acquired voiceprint feature matches the preset voiceprint feature therefore amounts to determining whether the speaker of the audio signal is the owner.
  • if they match, the pre-processing unit 1012 determines that the speaker of the audio signal is the owner, and then proceeds to pre-process the audio signal and extract the Mel frequency cepstrum coefficient.
  • the pre-processing unit 1012 is further configured to obtain the similarity between the aforementioned voiceprint feature and the preset voiceprint feature, determine whether the acquired similarity is greater than or equal to a first preset similarity, and, when the acquired similarity is greater than or equal to the first preset similarity, determine that the acquired voiceprint feature matches the preset voiceprint feature.
  • when determining whether the acquired voiceprint feature matches the preset voiceprint feature, the pre-processing unit 1012 may obtain the similarity between the voiceprint feature (that is, the voiceprint feature obtained from the aforementioned audio signal) and the preset voiceprint feature, and determine whether the obtained similarity is greater than or equal to the first preset similarity (set according to actual needs; for example, it can be set to 95%). If the acquired similarity is greater than or equal to the first preset similarity, it is determined that the acquired voiceprint feature matches the preset voiceprint feature; if the acquired similarity is less than the first preset similarity, it is determined that the acquired voiceprint feature does not match the preset voiceprint feature.
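A sketch of this first-threshold comparison, assuming cosine similarity as the metric (the patent does not fix a particular similarity measure) and the 95% example threshold:

```python
import math

FIRST_PRESET_SIMILARITY = 0.95   # example value from the text

def cosine_similarity(a, b):
    """Similarity between two voiceprint feature vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def matches_preset(voiceprint, preset, threshold=FIRST_PRESET_SIMILARITY):
    """True if the acquired voiceprint matches the preset voiceprint."""
    return cosine_similarity(voiceprint, preset) >= threshold

preset = [0.9, 0.1, 0.4]                         # hypothetical owner voiceprint
print(matches_preset([0.9, 0.1, 0.4], preset))   # identical feature -> True
print(matches_preset([0.1, 0.9, 0.2], preset))   # different speaker -> False
```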
  • if they do not match, the preprocessing unit 1012 determines that the speaker of the current audio signal is not the owner, and sends third feedback information to the micro control unit 1011.
  • after receiving the third feedback information from the pre-processing unit 1012, the micro control unit 1011 deletes the acquired audio signal and continues to acquire external audio signals; no audio signal is processed further until the owner's audio signal is obtained.
  • for how to perform the preprocessing and the extraction of the Mel frequency cepstrum coefficient, reference may be made to the relevant descriptions of the foregoing embodiments, and details are not repeated here.
  • the pre-processing unit 1012 is further configured to obtain current location information when the obtained similarity is less than the first preset similarity but greater than or equal to a second preset similarity, determine based on the location information whether the device is currently within a preset position range, and, when it is currently within the preset position range, determine that the aforementioned voiceprint feature matches the preset voiceprint feature.
  • because the characteristics of the voiceprint are closely related to the physiological characteristics of the human body, in daily life, if the user catches a cold, his or her voice will become hoarse and the voiceprint characteristics will change accordingly. In this case, even if the acquired audio signal is spoken by the owner, the pre-processing unit 1012 will not be able to recognize it. There are many other situations that can cause the pre-processing unit 1012 to fail to identify the owner, which are not repeated here.
  • after the preprocessing unit 1012 finishes judging the similarity of the voiceprint feature, if the similarity between the acquired voiceprint feature and the preset voiceprint feature is less than the first preset similarity, it further judges whether the similarity is greater than or equal to the second preset similarity (the second preset similarity is configured to be smaller than the first preset similarity and can be set by those skilled in the art according to actual needs; for example, when the first preset similarity is set to 95%, the second preset similarity may be set to 75%).
  • if the similarity is greater than or equal to the second preset similarity, the preprocessing unit 1012 further obtains the current location information.
  • the pre-processing unit 1012 may send a position acquisition request to the positioning module of the electronic device (different positioning technologies, such as satellite positioning or base station positioning, may be used to obtain the current position information) and instruct the positioning module to return the current position information.
  • the pre-processing unit 1012 determines whether it is currently within a preset position range according to the position information.
  • the preset position range can be configured as a common position range of the owner, such as home and company.
  • if the device is currently within the preset position range, the preprocessing unit 1012 determines that the acquired voiceprint feature matches the preset voiceprint feature, and recognizes the speaker of the audio signal as the owner.
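The two-threshold decision above can be condensed into a few lines. The 95% and 75% values are the examples from the text, and the location flag stands in for the positioning-module query, which is outside the scope of this sketch.

```python
FIRST_THRESHOLD = 0.95    # example first preset similarity
SECOND_THRESHOLD = 0.75   # example second preset similarity

def voiceprint_matches(similarity, in_preset_location):
    """Decide whether the acquired voiceprint matches the preset one."""
    if similarity >= FIRST_THRESHOLD:
        return True                        # confident match on voice alone
    if similarity >= SECOND_THRESHOLD and in_preset_location:
        return True                        # borderline voice, but a usual place
    return False                           # reject: delete signal, keep listening

print(voiceprint_matches(0.97, in_preset_location=False))  # -> True
print(voiceprint_matches(0.80, in_preset_location=True))   # -> True
print(voiceprint_matches(0.80, in_preset_location=False))  # -> False
```

The location check only widens the acceptance band; a similarity below the second threshold is rejected regardless of position.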
  • the central processing unit 102 is further configured to use the candidate keyword as the target keyword of the audio signal when the confidence of the candidate keyword reaches a preset confidence level, determine, according to the correspondence between preset keywords and preset operations, the preset operation corresponding to the target keyword as the target operation, and perform the target operation.
  • after extracting the identified "candidate keywords and the confidence of the candidate keywords" from the ASIC chip 101 according to the instruction information of the ASIC chip 101, the central processing unit 102 first determines whether the confidence of the candidate keywords reaches the preset confidence level (which can be set by a person skilled in the art according to actual needs; for example, it can be set to 90%).
  • if it does, the central processing unit 102 uses the candidate keyword as the target keyword of the audio signal.
  • the central processing unit 102 determines the preset operation corresponding to the target keyword as the target operation according to the correspondence between the preset keyword and the preset operation.
  • the correspondence between keywords and preset operations can be set according to actual needs. For example, the preset operation corresponding to the keyword "Little Europe, Little Europe" can be set to "wake the operating system", so that when the target keyword is "Little Europe, Little Europe" and the operating system is currently in a sleep state, the central processing unit 102 will wake up the operating system.
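The confidence check and keyword-to-operation lookup might be sketched as below. The 90% threshold and the wake-up operation are the examples given above, while the function and table names are assumptions for the sketch.

```python
PRESET_CONFIDENCE = 0.90   # example preset confidence level
PRESET_OPERATIONS = {
    # preset keyword -> preset operation
    "little europe, little europe": "wake_operating_system",
}

def target_operation(candidate_keyword, confidence):
    """Return the target operation, or None if the confidence is too low
    or no preset operation corresponds to the keyword."""
    if confidence < PRESET_CONFIDENCE:
        return None                             # not confident enough: ignore
    target_keyword = candidate_keyword.lower()  # candidate becomes the target
    return PRESET_OPERATIONS.get(target_keyword)

print(target_operation("Little Europe, Little Europe", 0.96))
print(target_operation("Little Europe, Little Europe", 0.40))  # None
```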
  • An embodiment of the present application provides a device control method applied to an electronic device, wherein the electronic device includes a central processing unit and an application-specific integrated circuit chip, and the power consumption of the application-specific integrated circuit chip is less than that of the central processing unit. The device control method includes:
  • the application-specific integrated circuit chip acquires an external audio signal
  • the application-specific integrated circuit chip recognizes the audio signal to obtain a recognition result
  • the application-specific integrated circuit chip sends identification completion indication information to the central processing unit;
  • the central processing unit extracts the recognition result from the application-specific integrated circuit chip according to the instruction information, and performs a target operation corresponding to the recognition result.
  • the application-specific integrated circuit chip includes a micro control unit, a pre-processing unit, and an algorithm unit.
  • the application-specific integrated circuit chip identifies the audio signal and obtains a recognition result, including:
  • the preprocessing unit extracts a Mel frequency cepstrum coefficient of the audio signal using a Mel frequency cepstrum coefficient algorithm according to the control of the micro control unit;
  • the algorithm unit uses a deep neural network algorithm to perform keyword recognition on the Mel frequency cepstrum coefficient, and obtains candidate keywords and the confidence level of the candidate keywords as the recognition result.
  • the performing a target operation corresponding to the recognition result includes:
  • the central processing unit uses the candidate keyword as a target keyword of the audio signal and, according to the correspondence between the preset keyword and the preset operation, determines the preset operation corresponding to the target keyword as the target operation, and performs the target operation.
  • the application-specific integrated circuit chip further includes a memory
  • the device control method further includes:
  • the memory stores the audio signal, the candidate keywords, the confidence level, and intermediate data generated by the preprocessing unit and the algorithm unit during execution.
  • the application-specific integrated circuit chip further includes a cache memory
  • the device control method further includes:
  • the cache memory caches data stored in the memory and data fetched from the memory.
  • before using the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient of the audio signal, the method further includes:
  • the pre-processing unit pre-processes the audio signal; after pre-processing the audio signal, the Mel frequency cepstrum coefficient algorithm is used to extract the Mel frequency cepstrum coefficient of the audio signal.
  • before using the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient of the audio signal, the method further includes:
  • the pre-processing unit extracts a voiceprint feature of the audio signal, determines whether the voiceprint feature matches a preset voiceprint feature, and pre-processes the audio signal when the voiceprint feature matches the preset voiceprint feature.
  • determining whether the voiceprint feature matches a preset voiceprint feature includes:
  • the preprocessing unit obtains a similarity between the voiceprint feature and the preset voiceprint feature, determines whether the similarity is greater than or equal to a first preset similarity, and when the similarity is greater than or equal to the first preset similarity, determines that the voiceprint feature matches the preset voiceprint feature.
  • the device control method provided in the embodiment of the present application further includes:
  • the preprocessing unit obtains current position information and determines, according to the position information, whether the electronic device is currently within a preset position range; when it is currently within the preset position range, it is determined that the voiceprint feature matches the preset voiceprint feature.
  • the device control method provided in the embodiment of the present application further includes:
  • the preprocessing unit instructs the micro-control unit to delete the audio signal.
  • an embodiment of the present application further provides a device control method.
  • the device control method is executed by an electronic device provided in the embodiment of the present application.
  • the electronic device includes an application-specific integrated circuit chip 101 and a central processing unit 102, and the power consumption of the application-specific integrated circuit chip 101 is smaller than the power consumption of the central processing unit 102. Please refer to FIG. 5.
  • the device control method includes:
  • the application specific integrated circuit chip 101 obtains an external audio signal.
  • the ASIC chip 101 in the embodiment of the present application is an ASIC designed for the purpose of audio recognition; compared with the general-purpose central processing unit 102, the ASIC chip has higher audio recognition efficiency and lower power consumption.
  • the ASIC chip 101 and the central processing unit 102 establish a data communication connection through a communication bus
  • the ASIC chip 101 can obtain external audio signals in many different ways. For example, when the electronic device is not externally connected to a microphone, the ASIC chip 101 may collect external sound through the built-in microphone of the electronic device to obtain an external audio signal; when a microphone is externally connected to the electronic device, the ASIC chip 101 may collect external sound through the external microphone of the electronic device to obtain an external audio signal.
  • when the ASIC chip 101 collects external audio signals through the microphone, if the microphone is an analog microphone, an analog audio signal will be collected.
  • in this case, the ASIC chip 101 needs to sample the analog audio signal and convert it into a digitized audio signal; for example, it can be sampled at a sampling frequency of 16 kHz.
  • if the microphone is a digital microphone, the ASIC chip 101 will directly collect the digitized audio signal through the digital microphone without conversion.
  • the application-specific integrated circuit chip 101 performs a recognition operation on the acquired audio signal to obtain a recognition result.
  • after obtaining an external audio signal, the application-specific integrated circuit chip 101 performs a recognition operation on the acquired audio signal according to a pre-configured recognition mode to obtain a recognition result.
  • for example, the recognition mode of the ASIC chip 101 may be configured as gender recognition.
  • when the ASIC chip 101 recognizes the acquired audio signal, it extracts feature information capable of characterizing gender from the audio signal and, according to the extracted feature information, recognizes the gender of the speaker of the audio signal, obtaining a recognition result of whether the speaker is male or female.
  • for another example, the recognition mode of the application-specific integrated circuit chip 101 may be configured to identify the environment type (a subway car scene, a bus carriage scene, an office scene, etc.).
  • when the application-specific integrated circuit chip 101 recognizes the acquired audio signal, it extracts feature information capable of characterizing the environment scene from the audio signal, identifies the current environment scene based on the extracted feature information, and obtains a recognition result describing the type of the current environment scene.
  • the application specific integrated circuit chip 101 sends instruction information indicating completion of the identification operation to the central processing unit 102.
  • after completing the recognition operation on the audio signal and obtaining the recognition result, the application-specific integrated circuit chip 101 sends instruction information indicating the completion of the recognition operation to the central processing unit 102.
  • the function of the instruction information is to inform the central processing unit 102 that the ASIC chip 101 has completed the recognition operation on the audio signal and that the recognition result can be extracted from the ASIC chip 101.
  • the foregoing indication information may be sent in the form of an interrupt signal.
  • the central processing unit 102 extracts the foregoing recognition result from the ASIC chip 101 according to the received instruction information, and performs a target operation corresponding to the foregoing recognition result.
  • the central processing unit 102 extracts, from the application-specific integrated circuit chip 101 according to the instruction information, the recognition result obtained by identifying the audio signal.
  • after extracting the recognition result of the audio signal, the central processing unit 102 further performs a target operation corresponding to the recognition result.
  • for example, when the application-specific integrated circuit chip 101 is configured for gender recognition, if the recognition result "the speaker is male" is extracted, the theme mode of the operating system is switched to a masculine theme mode; if the recognition result "the speaker is female" is extracted, the theme mode of the operating system is switched to a feminine theme mode.
  • for another example, when the application-specific integrated circuit chip 101 is configured for environment type recognition, if the recognition result "office scene" is extracted, the prompt mode of the operating system is switched to the silent mode, and if the recognition result "bus scene" is extracted, the prompt mode of the operating system is switched to a vibration-plus-ringing mode, and the like.
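  Illustratively, the dispatch described above amounts to a lookup from recognition result to operation. This is a minimal sketch; the result strings and mode names below are hypothetical stand-ins for whatever identifiers an implementation would actually use.

```python
# Hypothetical mapping from recognition results to operations, following the
# gender-theme and environment-prompt examples in the text.
THEME_MODES = {"male": "masculine_theme", "female": "feminine_theme"}
PROMPT_MODES = {"office_scene": "silent", "bus_scene": "vibrate_and_ring"}

def target_operation_for(result):
    """Return the operation for a recognition result, or None if none is configured."""
    return THEME_MODES.get(result) or PROMPT_MODES.get(result)
```

  Keeping the mapping in data rather than in branching code makes it easy to reconfigure which recognition results trigger which operations.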
  • in the embodiment of the present application, the low-power application-specific integrated circuit chip 101 first obtains an external audio signal, performs a recognition operation on the acquired audio signal to obtain a recognition result, and sends instruction information indicating that the recognition operation is completed to the central processing unit 102; the central processing unit 102 then extracts the recognition result from the ASIC chip 101 according to the instruction information and performs a target operation corresponding to the recognition result. The audio recognition task is thus offloaded from the central processing unit 102 to the application-specific integrated circuit chip 101 with lower power consumption, and the central processing unit 102 performs the corresponding processing according to the recognition result of the application-specific integrated circuit chip 101.
  • this manner, in which the ASIC cooperates with the central processing unit 102 to perform voice control on the electronic device, reduces the power consumption required by the electronic device to implement voice control.
  • the ASIC chip 101 includes a micro control unit 1011, a pre-processing unit 1012, and an algorithm unit 1013. Referring to FIG. 6, the step of the ASIC chip 101 performing a recognition operation on the acquired audio signal to obtain a recognition result includes:
  • the preprocessing unit 1012 uses the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient of the audio signal according to the control of the micro control unit 1011;
  • the algorithm unit 1013 uses the deep neural network algorithm to perform keyword recognition on the Mel frequency cepstrum coefficients according to the control of the micro control unit 1011, and obtains the candidate keywords and the confidence of the candidate keywords.
  • the micro control unit 1011 first obtains an external audio signal through a microphone. For example, when the electronic device is not externally connected with a microphone, the micro control unit 1011 can collect external sound through a built-in microphone (not shown in FIG. 2) of the electronic device to obtain an external audio signal. For example, when a microphone is externally connected to the electronic device, the micro control unit 1011 can collect external sound through the microphone externally connected to the electronic device to obtain an external audio signal.
  • when the micro control unit 1011 collects external audio signals through the microphone, if the microphone is an analog microphone, an analog audio signal will be collected.
  • in this case, the micro control unit 1011 needs to sample the analog audio signal and convert it into a digitized audio signal, for example at a sampling frequency of 16 kHz; in addition, if the microphone is a digital microphone, the micro control unit 1011 will directly collect the digitized audio signal through the digital microphone without conversion.
  • after obtaining an external audio signal, the micro control unit 1011 generates first control information, and sends the first control information to the pre-processing unit 1012.
  • after receiving the first control information from the micro control unit 1011, the pre-processing unit 1012 uses the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient of the audio signal according to the first control information; after extracting the Mel frequency cepstrum coefficient of the audio signal, the pre-processing unit 1012 sends first feedback information to the micro control unit 1011.
  • after receiving the first feedback information from the pre-processing unit 1012, the micro control unit 1011 determines that the pre-processing unit 1012 has extracted the Mel frequency cepstrum coefficient of the audio signal, and at this time generates second control information and sends it to the algorithm unit 1013.
  • after receiving the second control information from the micro control unit 1011, the algorithm unit 1013 uses the built-in deep neural network algorithm to perform keyword recognition on the aforementioned Mel frequency cepstrum coefficients (keyword recognition detects whether a predefined word appears in the speech corresponding to the audio signal) to obtain candidate keywords and the confidence of the candidate keywords. After the keyword recognition is completed and the candidate keywords and their confidence are obtained, the algorithm unit 1013 sends second feedback information to the micro control unit 1011.
  • after receiving the second feedback information from the algorithm unit 1013, the micro control unit 1011 determines that the algorithm unit 1013 has completed keyword recognition, and uses the candidate keywords identified by the algorithm unit 1013 and the confidence level of the candidate keywords as the recognition result of this recognition operation on the audio signal.
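  The control/feedback handshake described in the preceding steps can be sketched as a minimal simulation. The classes, the stand-in feature computation, and the ("wake_word", 0.93) result below are hypothetical placeholders, not the patent's actual hardware units or algorithms.

```python
# Sketch of the handshake between the micro control unit (MCU), the
# pre-processing unit, and the algorithm unit described above.
class PreprocessingUnit:
    def extract_mfcc(self, audio):
        # Placeholder for the Mel frequency cepstrum coefficient algorithm.
        return [sum(audio) / len(audio)]

class AlgorithmUnit:
    def recognize(self, mfcc):
        # Placeholder for the deep neural network keyword spotter:
        # returns a (candidate keyword, confidence) pair.
        return ("wake_word", 0.93)

class MicroControlUnit:
    def __init__(self):
        self.pre = PreprocessingUnit()
        self.alg = AlgorithmUnit()

    def run(self, audio):
        # First control information -> MFCC extraction -> first feedback.
        mfcc = self.pre.extract_mfcc(audio)
        # Second control information -> keyword recognition -> second feedback.
        keyword, confidence = self.alg.recognize(mfcc)
        # The (keyword, confidence) pair is the recognition result.
        return keyword, confidence
```

  The MCU only sequences the two specialized units; neither unit starts work until it receives the corresponding control information, mirroring the control/feedback exchange in the text.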
  • the ASIC chip 101 further includes a memory 1014.
  • the memory 1014 can be used to store the acquired audio signal, the identified candidate keywords, the confidence level, and intermediate data generated during the execution of the preprocessing unit 1012 and the algorithm unit 1013.
  • the micro control unit 1011 stores the audio signal obtained through the microphone in the memory 1014; the pre-processing unit 1012, according to the control of the micro control unit 1011, uses the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient of the audio signal stored in the memory 1014 and stores the extracted Mel frequency cepstrum coefficient in the memory 1014; the algorithm unit 1013, according to the control of the micro control unit 1011, uses the built-in deep neural network algorithm to perform keyword recognition on the Mel frequency cepstrum coefficient stored in the memory 1014 to obtain candidate keywords and the confidence of the candidate keywords, and stores the obtained candidate keywords and their confidence in the memory 1014.
  • the ASIC chip 101 further includes a cache memory 1015, which can be used to cache data stored in the memory 1014 and data retrieved from the memory 1014.
  • the cache memory 1015 has a smaller storage space than the memory 1014, but has a higher speed.
  • the cache memory 1015 can improve the processing efficiency of the preprocessing unit 1012 and the algorithm unit 1013.
  • for example, when the pre-processing unit 1012 extracts Mel frequency cepstrum coefficients from the audio signal, accessing data directly from the memory 1014 requires waiting for a certain period of time, whereas the cache memory 1015 can save a part of the data that the pre-processing unit 1012 has just used or recycled; if the pre-processing unit 1012 needs that part of the data again, it can be called directly from the cache memory 1015. This avoids repeated accesses to the memory and reduces the waiting time of the pre-processing unit 1012, thereby improving its processing efficiency.
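  The benefit of keeping recently used data close at hand can be illustrated with a minimal least-recently-used cache. This is a sketch of the general caching idea, not the chip's actual cache design.

```python
from collections import OrderedDict

# Illustrative least-recently-used (LRU) cache: recently used entries stay
# available without a slower round trip to main memory.
class TinyLRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None  # cache miss: caller must fetch from memory
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
```

  An LRU policy works well here because signal-processing stages tend to reuse the data they touched most recently.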
  • the central processing unit 102 executes a target operation corresponding to the foregoing recognition result, including:
  • the central processing unit 102 uses the candidate keywords as target keywords of the audio signal when the confidence of the candidate keywords reaches a preset confidence level;
  • the central processing unit 102 determines the preset operation corresponding to the target keyword as the target operation according to the correspondence between the preset keyword and the preset operation, and executes the target operation.
  • After extracting the recognized candidate keywords and their confidence level from the ASIC chip 101 according to the instruction information of the ASIC chip 101, the central processing unit 102 first determines whether the confidence level of the candidate keywords reaches a preset confidence level (which can be set by a person skilled in the art according to actual needs, for example, 90%).
  • If the preset confidence level is reached, the central processing unit 102 uses the candidate keywords as the target keywords of the audio signal.
  • The central processing unit 102 then determines the preset operation corresponding to the target keyword as the target operation according to the correspondence between preset keywords and preset operations.
  • The correspondence between keywords and preset operations can be set according to actual needs. For example, the preset operation corresponding to the keyword "Little Europe, Little Europe" can be set to "wake the operating system"; when the target keyword is "Little Europe, Little Europe" and the operating system is currently in a sleep state, the central processing unit 102 wakes the operating system.
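  The confidence check and keyword dispatch described above can be sketched as follows; the 90% threshold and the keyword-to-operation mapping follow the examples given in the text, and the operation name is a hypothetical identifier.

```python
# Sketch of the CPU-side decision: accept the candidate keyword only when its
# confidence reaches the preset confidence level, then look up the preset operation.
PRESET_CONFIDENCE = 0.90
PRESET_OPERATIONS = {"Little Europe, Little Europe": "wake_operating_system"}

def target_operation(candidate_keyword, confidence):
    if confidence < PRESET_CONFIDENCE:
        return None  # candidate rejected: confidence too low
    return PRESET_OPERATIONS.get(candidate_keyword)
```

  Rejecting low-confidence candidates before dispatch prevents spurious wake-ups from background speech.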
  • the method further includes:
  • the pre-processing unit 1012 pre-processes the audio signal
  • after the preprocessing unit 1012 finishes preprocessing the audio signal, it uses the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient of the audio signal.
  • after receiving the first control information from the micro control unit 1011, the pre-processing unit 1012 first performs pre-emphasis and windowing on the audio signal.
  • pre-emphasis increases the energy of the high-frequency part of the audio signal.
  • for speech signals, the energy in the low-frequency part is often higher than the energy in the high-frequency part: the spectrum energy is attenuated by about 20 dB for every tenfold increase in frequency, and the noise background of the circuit when the microphone collects audio signals further increases the energy of the low-frequency part.
  • as for windowing: audio signals are generally non-stationary and their statistical characteristics are not fixed, but within a relatively short period of time the signal can be considered stable; taking such a short segment for analysis is called windowing.
  • the window is described by three parameters: window length (in milliseconds), offset, and shape.
  • each windowed audio signal is called a frame.
  • the duration in milliseconds of each frame is called the frame length.
  • the distance between the left borders of two adjacent frames is called the frame shift.
  • a Hamming window, whose edges taper smoothly toward zero, may be used for the windowing processing.
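  A minimal sketch of the pre-emphasis and framing/windowing steps using NumPy. The 0.97 pre-emphasis coefficient and the 25 ms frame length / 10 ms frame shift are common defaults assumed here for illustration; the text itself only fixes the 16 kHz sampling rate.

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    # Boost high frequencies: y[n] = x[n] - alpha * x[n-1]
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_and_window(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    frame_len = int(sample_rate * frame_ms / 1000)    # samples per frame
    frame_shift = int(sample_rate * shift_ms / 1000)  # samples between frame starts
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    frames = np.stack([signal[i * frame_shift : i * frame_shift + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)  # Hamming window tapers the frame edges
```

  At 16 kHz, a 25 ms frame is 400 samples and a 10 ms shift is 160 samples, so adjacent frames overlap substantially, which keeps the short-time spectra smooth from frame to frame.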
  • after pre-emphasis and windowing, the pre-processing unit 1012 can use the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient of the audio signal.
  • the process of extracting the Mel frequency cepstrum coefficient by the pre-processing unit 1012 is roughly as follows: using the non-linear characteristics of human hearing, the frequency spectrum of the audio signal is converted into a non-linear spectrum based on the Mel frequency and then transformed to the cepstrum domain, which yields the Mel frequency cepstrum coefficient.
  • before the step of preprocessing the audio signal by the preprocessing unit 1012, the method further includes:
  • the preprocessing unit 1012 extracts the voiceprint features of the audio signal
  • the preprocessing unit 1012 determines whether the extracted voiceprint features match the preset voiceprint features
  • the pre-processing unit 1012 pre-processes the aforementioned audio signal when the extracted voiceprint features match the preset voiceprint features.
  • the voiceprint feature is mainly determined by two factors. The first is the size of the acoustic cavities, specifically including the throat, nasal cavity, and oral cavity; the shape, size, and position of these organs determine the tension of the vocal cords and the range of sound frequencies. Therefore, although different people may say the same thing, the frequency distribution of their sound differs, with some voices sounding low and resonant.
  • the second factor that determines the voiceprint characteristics is the manner in which the vocal organs are manipulated.
  • the vocal organs include the lips, teeth, tongue, soft palate, and diaphragm muscles, and their interaction produces clear speech; the way they cooperate is learned by people in their interactions with those around them. In the process of learning to speak, by imitating the speech of different people around them, people gradually form their own voiceprint characteristics.
  • the preprocessing unit 1012 first extracts the voiceprint characteristics of the audio signal.
  • after acquiring the voiceprint feature of the audio signal, the preprocessing unit 1012 further compares the acquired voiceprint feature with a preset voiceprint feature to determine whether the voiceprint feature matches the preset voiceprint feature.
  • the preset voiceprint feature may be a voiceprint feature previously recorded by the owner; determining whether the acquired voiceprint feature matches the preset voiceprint feature is thus determining whether the speaker of the audio signal is the owner.
  • when the voiceprint features match, the pre-processing unit 1012 determines that the speaker of the audio signal is the owner, and then further pre-processes the audio signal and extracts the Mel frequency cepstrum coefficient.
  • the step of the pre-processing unit 1012 determining whether the extracted voiceprint features match the preset voiceprint features includes:
  • the preprocessing unit 1012 obtains the similarity between the aforementioned voiceprint feature and the preset voiceprint feature
  • the preprocessing unit 1012 determines whether the obtained similarity is greater than or equal to the first preset similarity
  • when the obtained similarity is greater than or equal to the first preset similarity, the preprocessing unit 1012 determines that the obtained voiceprint feature matches the preset voiceprint feature.
  • when determining whether the acquired voiceprint feature matches the preset voiceprint feature, the pre-processing unit 1012 may obtain the similarity between the acquired voiceprint feature (that is, the voiceprint feature obtained from the aforementioned audio signal) and the preset voiceprint feature, and determine whether the obtained similarity is greater than or equal to the first preset similarity (set according to actual needs, for example, 95%). If the acquired similarity is greater than or equal to the first preset similarity, it is determined that the acquired voiceprint feature matches the preset voiceprint feature; if the acquired similarity is less than the first preset similarity, it is determined that the acquired voiceprint feature does not match the preset voiceprint feature.
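  The similarity comparison can be sketched as follows. Cosine similarity is assumed here as the metric, since the text does not specify how the similarity between voiceprint feature vectors is computed; the 95% threshold follows the example in the text.

```python
import numpy as np

FIRST_PRESET_SIMILARITY = 0.95  # example threshold from the text

def similarity(feature, preset_feature):
    # Cosine similarity between two voiceprint feature vectors (assumed metric).
    a = np.asarray(feature, dtype=float)
    b = np.asarray(preset_feature, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def voiceprint_matches(feature, preset_feature):
    return similarity(feature, preset_feature) >= FIRST_PRESET_SIMILARITY
```

  Any bounded similarity measure would fit the same structure; only the threshold calibration would change.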
  • when the voiceprint features do not match, the preprocessing unit 1012 determines that the speaker of the current audio signal is not the owner, and sends third feedback information to the micro control unit 1011.
  • after receiving the third feedback information from the pre-processing unit 1012, the micro control unit 1011 deletes the acquired audio signal and continues to acquire external audio signals; no audio signal is processed further until the owner's audio signal is obtained.
  • for how to perform the preprocessing and the extraction of the Mel frequency cepstrum coefficient, reference may be made to the relevant descriptions of the foregoing embodiments, and details are not described herein again.
  • the method further includes:
  • the pre-processing unit 1012 obtains the current position information when the aforementioned similarity is less than the first preset similarity and greater than or equal to the second preset similarity;
  • the pre-processing unit 1012 determines whether it is currently within a preset position range according to the obtained position information
  • because the characteristics of the voiceprint are closely related to the physiological characteristics of the human body, in daily life, if the user catches a cold, his voice will become hoarse and the voiceprint characteristics will change accordingly. In this case, even if the acquired audio signal is spoken by the owner, the pre-processing unit 1012 will not be able to recognize it. There are many other situations that can cause the pre-processing unit 1012 to fail to identify the owner, which will not be repeated here.
  • therefore, after the preprocessing unit 1012 finishes judging the similarity of the voiceprint feature, if the similarity between the acquired voiceprint feature and the preset voiceprint feature is less than the first preset similarity, it further judges whether the similarity is greater than or equal to the second preset similarity (the second preset similarity is configured to be smaller than the first preset similarity and can be appropriately set by those skilled in the art according to actual needs; for example, when the first preset similarity is set to 95%, the second preset similarity may be set to 75%).
  • if the similarity is greater than or equal to the second preset similarity, the preprocessing unit 1012 further obtains the current location information.
  • for example, the pre-processing unit 1012 may send a position acquisition request to the positioning module of the electronic device (different positioning technologies, such as satellite positioning or base station positioning, may be used to obtain the current position information) and instruct the positioning module to return the current position information.
  • the pre-processing unit 1012 determines whether it is currently within a preset position range according to the position information.
  • the preset position range can be configured as a common position range of the owner, such as home and company.
  • if it is currently within the preset position range, the preprocessing unit 1012 determines that the acquired voiceprint feature matches the preset voiceprint feature, and recognizes the speaker of the audio signal as the owner.
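  The two-tier decision described above (a strict similarity threshold, plus a relaxed threshold that only passes from a trusted location) can be sketched as follows. The thresholds follow the text's examples; the location names are hypothetical.

```python
# Sketch of the two-tier owner check: a high-similarity match passes directly,
# a mid-similarity match passes only when the device is in a preset position range.
FIRST_PRESET_SIMILARITY = 0.95
SECOND_PRESET_SIMILARITY = 0.75
PRESET_POSITION_RANGE = {"home", "company"}  # example trusted locations

def speaker_is_owner(similarity, current_position):
    if similarity >= FIRST_PRESET_SIMILARITY:
        return True
    if similarity >= SECOND_PRESET_SIMILARITY:
        return current_position in PRESET_POSITION_RANGE
    return False
```

  The location check acts as a second factor: it lets a hoarse-voiced owner through at home or at the company while still rejecting mid-similarity strangers elsewhere.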


Abstract

An electronic device and a device control method. The electronic device comprises a central processing unit and an application-specific integrated circuit chip. The method comprises: the application-specific integrated circuit chip acquires an external audio signal (101); the application-specific integrated circuit chip performs an identification operation on the acquired audio signal to obtain an identification result (102); the application-specific integrated circuit chip sends indication information indicating that the identification operation is completed to the central processing unit (103); the central processing unit extracts the identification result from the application-specific integrated circuit chip according to the received indication information, and performs a target operation corresponding to the identification result (104).

Description

Electronic device and device control method
This application claims priority to a Chinese patent application filed with the Chinese Patent Office on June 08, 2018, with application number 201810589643.2 and invention name "Electronic Device and Device Control Method", the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the technical field of electronic devices, and in particular to an electronic device and a device control method.
Background
At present, voice recognition technology is applied more and more widely in electronic devices. Using voice recognition technology, voice control of electronic devices can be achieved; for example, users can speak specific voice instructions to control an electronic device to take pictures or play music.
Summary of the invention
In a first aspect, an embodiment of the present application provides an electronic device, the electronic device including a central processing unit and an application-specific integrated circuit chip, the power consumption of the application-specific integrated circuit chip being less than the power consumption of the central processing unit, wherein:
the application-specific integrated circuit chip is configured to obtain an external audio signal;
the application-specific integrated circuit chip is further configured to perform a recognition operation on the audio signal to obtain a recognition result;
the application-specific integrated circuit chip is further configured to send instruction information indicating completion of the recognition operation to the central processing unit;
the central processing unit is configured to extract the recognition result from the application-specific integrated circuit chip according to the instruction information, and to execute a target operation corresponding to the recognition result.
In a second aspect, an embodiment of the present application provides a device control method applied to an electronic device, the electronic device including a central processing unit and an application-specific integrated circuit chip, the power consumption of the application-specific integrated circuit chip being less than the power consumption of the central processing unit. The device control method includes:
the application-specific integrated circuit chip acquires an external audio signal;
the application-specific integrated circuit chip recognizes the audio signal to obtain a recognition result;
the application-specific integrated circuit chip sends instruction information indicating completion of the recognition to the central processing unit;
the central processing unit extracts the recognition result from the application-specific integrated circuit chip according to the instruction information, and performs a target operation corresponding to the recognition result.
附图说明BRIEF DESCRIPTION OF THE DRAWINGS
为了更清楚地说明本申请实施例中的技术方案，下面将对实施例描述中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图仅仅是本申请的一些实施例，对于本领域技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他的附图。In order to explain the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
图1为本申请实施例提供的电子设备的第一结构示意图。FIG. 1 is a first schematic structural diagram of an electronic device according to an embodiment of the present application.
图2是本申请实施例提供的电子设备的第二结构示意图。FIG. 2 is a second schematic structural diagram of an electronic device according to an embodiment of the present application.
图3是本申请实施例提供的电子设备的第三结构示意图。FIG. 3 is a third schematic structural diagram of an electronic device according to an embodiment of the present application.
图4是本申请实施例提供的电子设备的第四结构示意图。FIG. 4 is a fourth schematic structural diagram of an electronic device according to an embodiment of the present application.
图5是本申请实施例提供的设备控制方法的流程示意图。FIG. 5 is a schematic flowchart of a device control method according to an embodiment of the present application.
图6是本申请实施例中专用集成电路芯片对音频信号进行识别的细化流程示意图。FIG. 6 is a detailed flowchart of identifying an audio signal by an application specific integrated circuit chip in the embodiment of the present application.
图7是本申请实施例中央处理器执行目标操作的细化流程示意图。FIG. 7 is a detailed flowchart of a target operation performed by a central processing unit according to an embodiment of the present application.
具体实施方式DETAILED DESCRIPTION
应当理解,在本文中提及“实施例”意味着,结合实施例描述的特定特征、结构或特性可以包含在本申请的至少一个实施例中。在说明书中的各个位置出现该短语并不一定均是指相同的实施例,也不是与其它实施例互斥的独立的或备选的实施例。本领域技术人员显式地和隐式地理解的是,本文所描述的实施例可以与其它实施例相结合。It should be understood that a reference to "an embodiment" herein means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of this phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are they independent or alternative embodiments that are mutually exclusive with other embodiments. It is clearly and implicitly understood by those skilled in the art that the embodiments described herein may be combined with other embodiments.
本申请实施例提供一种电子设备,请参照图1,电子设备100包括专用集成电路芯片101和中央处理器102,且专用集成电路芯片101的功耗小于中央处理器102的功耗,其中,An embodiment of the present application provides an electronic device. Referring to FIG. 1, the electronic device 100 includes an application-specific integrated circuit chip 101 and a central processing unit 102, and the power consumption of the application-specific integrated circuit chip 101 is less than the power consumption of the central processing unit 102.
专用集成电路芯片101用于获取外部的音频信号,对获取到的音频信号进行识别操作,得到识别结果,并发送指示识别操作完成的指示信息至中央处理器102。The ASIC chip 101 is used to obtain an external audio signal, perform a recognition operation on the acquired audio signal, obtain a recognition result, and send instruction information indicating completion of the recognition operation to the central processing unit 102.
需要说明的是，本申请实施例中的专用集成电路芯片101是以音频识别为目的而设计的专用集成电路，其相较于通用的中央处理器102，具有更高的音频识别效率以及更低的功耗。专用集成电路芯片101与中央处理器102通过通信总线建立数据通信连接。It should be noted that the application-specific integrated circuit chip 101 in the embodiment of the present application is an ASIC designed for audio recognition; compared with the general-purpose central processing unit 102, it offers higher audio recognition efficiency and lower power consumption. The application-specific integrated circuit chip 101 and the central processing unit 102 establish a data communication connection through a communication bus.
其中，专用集成电路芯片101可以通过多种不同方式来获取外部的音频信号，比如，在电子设备未外接麦克风时，专用集成电路芯片101可以通过电子设备内置的麦克风(图1未示出)对外部发音者发出的声音进行采集，得到外部的音频信号；又比如，在电子设备外接有麦克风时，专用集成电路芯片101可以通过电子设备外接的麦克风对外部声音进行采集，得到外部的音频信号。The application-specific integrated circuit chip 101 can obtain external audio signals in several different ways. For example, when no external microphone is connected to the electronic device, the chip 101 can collect the sound produced by an external speaker through the device's built-in microphone (not shown in FIG. 1) to obtain an external audio signal; as another example, when an external microphone is connected to the electronic device, the chip 101 can collect external sound through that external microphone to obtain the external audio signal.
其中，专用集成电路芯片101在通过麦克风采集外部的音频信号时，若麦克风为模拟麦克风，将采集到模拟的音频信号，专用集成电路芯片101需要对模拟的音频信号进行采样，将模拟的音频信号转换为数字化的音频信号，比如，可以以16KHz的采样频率进行采样；此外，若麦克风为数字麦克风，专用集成电路芯片101将通过数字麦克风直接采集到数字化的音频信号，无需进行转换。When the application-specific integrated circuit chip 101 collects an external audio signal through a microphone, if the microphone is an analog microphone, an analog audio signal is collected, and the chip 101 needs to sample the analog audio signal to convert it into a digitized audio signal, for example at a sampling rate of 16 kHz; if the microphone is a digital microphone, the chip 101 directly collects a digitized audio signal through the digital microphone, and no conversion is needed.
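Purely as an illustration (not part of the original disclosure), the analog-to-digital sampling step above can be sketched in Python; the 16 kHz rate comes from the text, while the test tone and function names are assumptions made for the example.

```python
import numpy as np

SAMPLE_RATE = 16000  # 16 kHz sampling rate, as mentioned in the embodiment

def sample_analog(analog_fn, duration_s, sample_rate=SAMPLE_RATE):
    """Digitize a continuous-time signal (given as a function of time in
    seconds) by evaluating it at uniformly spaced sample instants."""
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    return analog_fn(t)

# Example: a 440 Hz tone sampled for 10 ms yields 160 samples.
tone = sample_analog(lambda t: np.sin(2 * np.pi * 440 * t), 0.010)
```

A digital microphone would deliver an array like `tone` directly, which is why no conversion step is needed in that case.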
在获取到外部的音频信号之后,专用集成电路芯片101根据预先配置的识别模式,对获取到的音频信号进行识别操作,得到识别结果。After obtaining an external audio signal, the application specific integrated circuit chip 101 performs a recognition operation on the acquired audio signal according to a pre-configured recognition mode to obtain a recognition result.
比如,在专用集成电路芯片101的识别模式被配置为性别识别时,专用集成电路芯片101在对获取到的音频信号进行识别时,将从音频信号中提取出能够表征性别的特征信息,并根据提取出的特征信息,对音频信号的发音者的性别进行识别,得到该发音者为男、或为女的识别结果。For example, when the recognition mode of the ASIC chip 101 is configured as gender recognition, when the ASIC chip 101 recognizes the acquired audio signal, it extracts feature information capable of characterizing the gender from the audio signal, and according to The extracted feature information recognizes the gender of the speaker of the audio signal and obtains the recognition result of whether the speaker is male or female.
又比如，在专用集成电路芯片101的识别模式被配置为环境类型(地铁车厢场景、公交车厢场景、办公室场景等)识别时，专用集成电路芯片101在对获取到的音频信号进行识别时，将从音频信号中提取出能够表征环境场景的特征信息，并根据提取出的特征信息对当前所处的环境场景进行识别，得到用于描述当前环境场景类型的识别结果。As another example, when the recognition mode of the application-specific integrated circuit chip 101 is configured for environment-type recognition (subway carriage scene, bus carriage scene, office scene, etc.), the chip 101, when recognizing the acquired audio signal, extracts feature information that characterizes the environmental scene from the audio signal, recognizes the current environmental scene based on the extracted feature information, and obtains a recognition result describing the type of the current environmental scene.
在完成对音频信号的识别操作，并得到识别结果之后，专用集成电路芯片101发送指示识别操作完成的指示信息至中央处理器102，形象地说，该指示信息的作用在于告知中央处理器102，专用集成电路芯片101已经完成对音频信号的识别操作，可以从专用集成电路芯片101提取识别结果。其中，前述指示信息可以以中断信号的形式发送。After completing the recognition operation on the audio signal and obtaining the recognition result, the application-specific integrated circuit chip 101 sends instruction information indicating completion of the recognition operation to the central processing unit 102. In other words, the role of this instruction information is to inform the central processing unit 102 that the chip 101 has finished recognizing the audio signal, and that the recognition result can now be extracted from the chip 101. The foregoing instruction information may be sent in the form of an interrupt signal.
中央处理器102用于根据接收到的指示信息,从专用集成电路芯片101提取前述识别结果,并执行对应前述识别结果的目标操作。The central processing unit 102 is configured to extract the foregoing recognition result from the ASIC chip 101 according to the received instruction information, and execute a target operation corresponding to the foregoing recognition result.
相应的，中央处理器102在接收到来自专用集成电路芯片101的指示信息之后，根据该指示信息，从专用集成电路芯片101处提取专用集成电路芯片101对音频信号进行识别所得到的识别结果。Correspondingly, after receiving the instruction information from the application-specific integrated circuit chip 101, the central processing unit 102 extracts, from the chip 101 according to the instruction information, the recognition result obtained by the chip 101 from recognizing the audio signal.
在提取到音频信号的识别结果之后,中央处理器102进一步执行对应该识别结果的目标操作。After the recognition result of the audio signal is extracted, the central processing unit 102 further performs a target operation corresponding to the recognition result.
比如，在专用集成电路芯片101被配置为性别识别时，若提取到“发音者为男”的识别结果，则将操作系统的主题模式切换为男性化的主题模式，若提取到“发音者为女”的识别结果，则将操作系统的主题模式切换为女性化的主题模式。For example, when the application-specific integrated circuit chip 101 is configured for gender recognition, if the recognition result "the speaker is male" is extracted, the theme mode of the operating system is switched to a masculine theme mode; if the recognition result "the speaker is female" is extracted, the theme mode of the operating system is switched to a feminine theme mode.
又比如，在专用集成电路芯片101被配置为环境类型识别时，若提取到“办公室场景”的识别结果，则将操作系统的提示模式切换为静音模式，若提取到“公交车厢场景”的识别结果，则将操作系统的提示模式切换为振动+响铃模式等等。As another example, when the application-specific integrated circuit chip 101 is configured for environment-type recognition, if the recognition result "office scene" is extracted, the prompt mode of the operating system is switched to silent mode; if the recognition result "bus carriage scene" is extracted, the prompt mode of the operating system is switched to a vibration-plus-ring mode, and so on.
由上可知，本申请实施例的电子设备包括中央处理器102和专用集成电路芯片101，首先由功耗较低的专用集成电路芯片101获取外部的音频信号，对获取到的音频信号进行识别操作，得到识别结果，并发送指示识别操作完成的指示信息至中央处理器102，再由中央处理器102根据指示信息，从专用集成电路芯片101提取识别结果，并执行对应识别结果的目标操作。由此，将中央处理器102的音频识别任务分担至功耗较低的专用集成电路芯片101完成，并由中央处理器102根据专用集成电路芯片101的识别结果执行对应的目标操作，通过这种专用集成电路协同中央处理器102进行对电子设备语音控制的方式，能够降低电子设备实现语音控制的功耗。It can be seen from the above that the electronic device of the embodiment of the present application includes a central processing unit 102 and an application-specific integrated circuit chip 101. First, the lower-power application-specific integrated circuit chip 101 obtains an external audio signal, performs a recognition operation on it to obtain a recognition result, and sends instruction information indicating completion of the recognition operation to the central processing unit 102; the central processing unit 102 then extracts the recognition result from the chip 101 according to the instruction information and performs the target operation corresponding to the recognition result. In this way, the audio recognition task of the central processing unit 102 is offloaded to the lower-power application-specific integrated circuit chip 101, and the central processing unit 102 performs the corresponding target operation according to the chip's recognition result. This manner of voice control, in which the ASIC cooperates with the central processing unit 102, can reduce the power consumption of the electronic device in implementing voice control.
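The division of labor described above (ASIC recognizes, notifies, and holds the result; CPU extracts it and acts) can be sketched as follows. This is a minimal software stand-in, not the patented hardware; the class names, the "RECOGNITION_DONE" indication string, and the action mapping are all assumptions made for illustration.

```python
class AsicChip:
    """Stand-in for the low-power ASIC: runs recognition, stores the
    result, and sends an indication (like an interrupt) to the CPU."""
    def __init__(self, recognizer):
        self._recognizer = recognizer
        self._result = None

    def process(self, audio, notify_cpu):
        self._result = self._recognizer(audio)
        notify_cpu("RECOGNITION_DONE")  # the instruction information

    def extract_result(self):
        return self._result


class Cpu:
    """On receiving the indication, extracts the result from the ASIC
    and runs the target operation mapped to that result."""
    def __init__(self, asic, actions):
        self._asic = asic
        self._actions = actions

    def on_indication(self, indication):
        if indication == "RECOGNITION_DONE":
            result = self._asic.extract_result()
            action = self._actions.get(result)
            return action() if action else None


asic = AsicChip(lambda audio: "take_photo")
cpu = Cpu(asic, {"take_photo": lambda: "camera opened"})
outcome = []
asic.process(b"raw audio bytes", lambda ind: outcome.append(cpu.on_indication(ind)))
```

The key design point mirrored here is that the CPU never runs the recognizer itself; it only reacts to the completion indication.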
在一实施方式中,请参照图2,专用集成电路芯片101包括微控制单元1011、预处理单元1012以及算法单元1013,其中,In an embodiment, please refer to FIG. 2. The ASIC chip 101 includes a micro control unit 1011, a pre-processing unit 1012, and an algorithm unit 1013.
预处理单元1012用于根据微控制单元1011的控制,使用梅尔频率倒谱系数算法提取音频信号的梅尔频率倒谱系数;The pre-processing unit 1012 is configured to extract the Mel frequency cepstrum coefficient of the audio signal using the Mel frequency cepstrum coefficient algorithm according to the control of the micro control unit 1011;
算法单元1013用于根据微控制单元1011的控制,使用深度神经网络算法对梅尔频率倒谱系数进行关键词识别,得到候选关键词以及候选关键词的置信度。The algorithm unit 1013 is configured to perform keyword recognition on the Mel frequency cepstrum coefficient using a deep neural network algorithm according to the control of the micro control unit 1011 to obtain candidate keywords and the confidence of the candidate keywords.
其中,微控制单元1011首先通过麦克风获取到外部的音频信号,比如,在电子设备未外接麦克风时,微控制单元1011可以通过电子设备内置的麦克风(图2未示出)对外部声音进行采集,得到外部的音频信号;又比如,在电子设备外接有麦克风时,微控制单元1011可以通过电子设备外接的麦克风对外部声音进行采集,得到外部的音频信号。The micro control unit 1011 first obtains external audio signals through a microphone. For example, when the electronic device is not externally connected with a microphone, the micro control unit 1011 can collect external sounds through a built-in microphone (not shown in FIG. 2) of the electronic device. An external audio signal is obtained. For another example, when a microphone is externally connected to the electronic device, the micro control unit 1011 can collect external sound through the microphone externally connected to the electronic device to obtain an external audio signal.
其中，微控制单元1011在通过麦克风采集外部的音频信号时，若麦克风为模拟麦克风，将采集到模拟的音频信号，微控制单元1011需要对模拟的音频信号进行采样，将模拟的音频信号转换为数字化的音频信号，比如，可以以16KHz的采样频率进行采样；此外，若麦克风为数字麦克风，微控制单元1011将通过数字麦克风直接采集到数字化的音频信号，无需进行转换。When the micro control unit 1011 collects an external audio signal through a microphone, if the microphone is an analog microphone, an analog audio signal is collected, and the micro control unit 1011 needs to sample the analog audio signal to convert it into a digitized audio signal, for example at a sampling rate of 16 kHz; if the microphone is a digital microphone, the micro control unit 1011 directly collects a digitized audio signal through the digital microphone, and no conversion is needed.
在获取到外部的音频信号之后,微控制单元1011生成第一控制信息,将该第一控制信息发送至预处理单元1012。After obtaining an external audio signal, the micro control unit 1011 generates first control information, and sends the first control information to the pre-processing unit 1012.
预处理单元1012在接收到来自微控制单元1011的第一控制信息之后,根据该第一控制信息,使用梅尔频率倒谱系数算法提取音频信号的梅尔频率倒谱系数。在提取到音频信号的梅尔频率倒谱系数之后,预处理单元1012发送第一反馈信息至微控制单元1011。After receiving the first control information from the micro control unit 1011, the pre-processing unit 1012 uses the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient of the audio signal according to the first control information. After extracting the Mel frequency cepstrum coefficient of the audio signal, the pre-processing unit 1012 sends the first feedback information to the micro control unit 1011.
微控制单元1011在接收到来自预处理单元1012的第一反馈信息之后，确定预处理单元1012当前已经提取到音频信号的梅尔频率倒谱系数，此时生成第二控制信息，并将该第二控制信息发送至算法单元1013。After receiving the first feedback information from the pre-processing unit 1012, the micro control unit 1011 determines that the pre-processing unit 1012 has extracted the Mel-frequency cepstral coefficients of the audio signal; at this point it generates second control information and sends the second control information to the algorithm unit 1013.
算法单元1013在接收到来自微控制单元1011的第二控制信息之后，使用内置的深度神经网络算法，对前述梅尔频率倒谱系数进行关键词识别(关键词识别也即是检测音频信号对应的语音中是否出现预先定义的单词)，得到候选关键词以及候选关键词的置信度。在完成关键词识别并识别得到候选关键词以及候选关键词的置信度之后，算法单元1013发送第二反馈信息至微控制单元1011。After receiving the second control information from the micro control unit 1011, the algorithm unit 1013 uses a built-in deep neural network algorithm to perform keyword recognition on the aforementioned Mel-frequency cepstral coefficients (keyword recognition means detecting whether a predefined word appears in the speech corresponding to the audio signal), obtaining candidate keywords and their confidence levels. After the keyword recognition is completed and the candidate keywords and their confidence levels are obtained, the algorithm unit 1013 sends second feedback information to the micro control unit 1011.
微控制单元1011在接收到来自算法单元1013的第二反馈信息之后，确定算法单元1013已经完成关键词识别，将算法单元1013识别得到的候选关键词以及候选关键词的置信度作为此次对音频信号进行识别操作的识别结果。After receiving the second feedback information from the algorithm unit 1013, the micro control unit 1011 determines that the algorithm unit 1013 has completed keyword recognition, and takes the candidate keywords recognized by the algorithm unit 1013 and their confidence levels as the recognition result of this recognition operation on the audio signal.
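The two-stage handshake above (control information out, feedback back, for the pre-processing unit and then the algorithm unit) can be condensed into a short sketch. This is an illustrative simplification, not the original hardware protocol; `extract_mfcc` and `spot_keywords` are hypothetical stand-ins for the two units.

```python
def run_recognition(audio, extract_mfcc, spot_keywords):
    """Minimal sketch of the micro control unit's orchestration:
    stage 1 (first control info -> first feedback) extracts MFCCs,
    stage 2 (second control info -> second feedback) spots keywords."""
    mfcc = extract_mfcc(audio)                  # pre-processing unit's job
    keyword, confidence = spot_keywords(mfcc)   # algorithm unit's job
    return {"keyword": keyword, "confidence": confidence}

# Toy stand-ins for the two units, just to exercise the flow.
result = run_recognition(
    [0.1, 0.2, 0.3],
    extract_mfcc=lambda audio: [sum(audio)],
    spot_keywords=lambda mfcc: ("hello", 0.93),
)
```

The returned dictionary plays the role of the recognition result that the micro control unit later exposes to the CPU.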
在一实施方式中，请参照图3，专用集成电路芯片101还包括内存1014，用于存储获取到的音频信号、识别出的候选关键词、置信度以及预处理单元1012和算法单元1013在执行过程中产生的中间数据。In an embodiment, referring to FIG. 3, the application-specific integrated circuit chip 101 further includes a memory 1014 for storing the acquired audio signal, the recognized candidate keywords and their confidence levels, and the intermediate data generated by the pre-processing unit 1012 and the algorithm unit 1013 during execution.
比如，微控制单元1011将通过麦克风获取到的音频信号存储在内存1014中；预处理单元1012根据微控制单元1011的控制，使用梅尔频率倒谱系数算法提取内存1014中存储的音频信号的梅尔频率倒谱系数，并将提取出的梅尔频率倒谱系数存储在内存1014中；算法单元1013根据微控制单元1011的控制，使用内置的深度神经网络算法，对内存1014中存储的梅尔频率倒谱系数进行关键词识别，得到候选关键词以及候选关键词的置信度，将得到候选关键词以及候选关键词的置信度存储在内存1014中。For example, the micro control unit 1011 stores the audio signal obtained through the microphone in the memory 1014; under the control of the micro control unit 1011, the pre-processing unit 1012 uses the Mel-frequency cepstral coefficient algorithm to extract the Mel-frequency cepstral coefficients of the audio signal stored in the memory 1014 and stores the extracted coefficients in the memory 1014; under the control of the micro control unit 1011, the algorithm unit 1013 uses the built-in deep neural network algorithm to perform keyword recognition on the Mel-frequency cepstral coefficients stored in the memory 1014, obtains candidate keywords and their confidence levels, and stores them in the memory 1014.
在一实施方式中，请参照图4，专用集成电路芯片101还包括高速缓冲存储器1015，用于对存入内存1014的数据、从内存1014中取出的数据进行缓存。In an embodiment, referring to FIG. 4, the application-specific integrated circuit chip 101 further includes a cache memory 1015 for caching data written to and data read from the memory 1014.
其中,高速缓冲存储器1015相较于内存1014其存储空间较小,但速度更高,通过高速缓冲存储器1015可以提升预处理单元1012以及算法单元1013的处理效率。Among them, the cache memory 1015 has a smaller storage space than the memory 1014, but has a higher speed. The cache memory 1015 can improve the processing efficiency of the preprocessing unit 1012 and the algorithm unit 1013.
比如，预处理单元1012在对音频信号进行梅尔频率倒谱系数的提取时，当预处理单元1012直接从内存1014中存取数据时要等待一定时间周期，而高速缓冲存储器1015则可以保存预处理单元1012刚用过或循环使用的一部分数据，如果预处理单元1012需要再次使用该部分数据时可从高速缓冲存储器1015中直接调用，这样就避免了重复存取数据，减少了预处理单元1012的等待时间，从而提升了其处理效率。For example, when the pre-processing unit 1012 extracts Mel-frequency cepstral coefficients from the audio signal, accessing data directly from the memory 1014 requires waiting a certain period of time, whereas the cache memory 1015 can hold the portion of data that the pre-processing unit 1012 has just used or uses repeatedly; if the pre-processing unit 1012 needs that data again, it can be fetched directly from the cache memory 1015. This avoids repeated memory accesses and reduces the waiting time of the pre-processing unit 1012, thereby improving its processing efficiency.
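The caching benefit described above (reusing recently accessed data instead of going back to slow memory) can be illustrated in software with a memoizing cache. This is only an analogy to the hardware cache 1015; the frame-fetch function and call counter are invented for the example.

```python
from functools import lru_cache

calls = {"n": 0}  # counts how many times slow memory is actually touched

@lru_cache(maxsize=128)
def fetch_frame(index):
    """Stand-in for a slow read from main memory; the cache serves
    repeated requests for the same frame without re-reading memory."""
    calls["n"] += 1
    return index * 2  # placeholder "frame data"

fetch_frame(3)
fetch_frame(3)
fetch_frame(3)  # the second and third calls hit the cache
```

After three requests for frame 3, the slow path has run only once, mirroring how the cache spares the pre-processing unit repeated waits on memory.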
在一实施方式中，预处理单元1012在使用梅尔频率倒谱系数算法提取音频信号的梅尔频率倒谱系数之前，还对音频信号进行预处理，在完成对音频信号的预处理之后，使用梅尔频率倒谱系数算法提取音频信号的梅尔频率倒谱系数。In an embodiment, before extracting the Mel-frequency cepstral coefficients of the audio signal using the Mel-frequency cepstral coefficient algorithm, the pre-processing unit 1012 also pre-processes the audio signal; after the pre-processing is completed, it extracts the Mel-frequency cepstral coefficients of the audio signal using the Mel-frequency cepstral coefficient algorithm.
其中，预处理单元1012在接收到来自微控制单元1011的第一控制信息之后，首先对音频信号进行预加重和加窗等预处理。After receiving the first control information from the micro control unit 1011, the pre-processing unit 1012 first performs pre-processing such as pre-emphasis and windowing on the audio signal.
其中，预加重也即是增加音频信号高频部分的能量。对于音频信号的频谱来说，往往低频部分的能量高于高频部分的能量，每经过10倍Hz，频谱能量就会衰减20dB，而且由于麦克风在采集音频信号时电路本底噪声的影响，也会增加低频部分的能量，为使高频部分的能量和低频部分能量有相似的幅度，需要预加强采集到音频信号的高频能量。Pre-emphasis means boosting the energy of the high-frequency part of the audio signal. In the spectrum of an audio signal, the energy of the low-frequency part is often higher than that of the high-frequency part: for every tenfold increase in frequency (Hz), the spectral energy decays by 20 dB. Moreover, the circuit noise floor of the microphone during acquisition further increases the energy of the low-frequency part. To give the high-frequency part an amplitude similar to that of the low-frequency part, the high-frequency energy of the collected audio signal needs to be pre-emphasized.
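As an illustration (not taken from the original disclosure), pre-emphasis is commonly implemented as a first-order high-pass filter. The coefficient 0.97 below is a conventional choice assumed for the sketch, not a value given in the text.

```python
import numpy as np

def pre_emphasize(signal, coeff=0.97):
    """First-order high-pass: y[n] = x[n] - coeff * x[n-1].
    High frequencies pass through; slowly varying (low-frequency)
    content is strongly attenuated."""
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])

# A constant (pure low-frequency) signal is damped almost to zero.
flat = pre_emphasize([1.0, 1.0, 1.0])
```

After the first sample, the constant signal is reduced to 1 − 0.97 = 0.03, showing the low-frequency attenuation that balances the spectrum.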
由于音频信号一般是非平稳信号，其统计特性不是固定不变的，但在一段相当短的时间内，可以认为信号是平稳的，这就是加窗的依据。窗由三个参数来描述：窗长(单位毫秒)、偏移和形状。每一个加窗的音频信号叫做一帧，每一帧的毫秒数叫做帧长，相邻两帧左边界的距离叫帧移。本申请实施例中，可以使用边缘平滑降到0的汉明窗进行加窗处理。Since an audio signal is generally non-stationary, its statistical characteristics are not fixed; however, over a sufficiently short period of time the signal can be considered stationary, which is the basis for windowing. A window is described by three parameters: window length (in milliseconds), offset, and shape. Each windowed segment of the audio signal is called a frame, the duration of each frame in milliseconds is called the frame length, and the distance between the left boundaries of two adjacent frames is called the frame shift. In the embodiment of the present application, a Hamming window whose edges smoothly decay to zero may be used for windowing.
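The framing and Hamming-window step above can be sketched as follows; this is illustrative only. The 25 ms frame length and 10 ms frame shift are typical values assumed for the example, not values stated in the text.

```python
import numpy as np

def frame_and_window(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Slice the signal into overlapping frames (frame length and frame
    shift given in milliseconds) and apply a Hamming window to each,
    so every frame decays smoothly to zero at its edges."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)       # 160 samples at 16 kHz
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift)
    window = np.hamming(frame_len)
    return np.stack([signal[i * shift : i * shift + frame_len] * window
                     for i in range(n_frames)])

frames = frame_and_window(np.ones(16000))  # one second of audio
```

One second of 16 kHz audio yields 98 overlapping frames of 400 samples each, each already tapered by the window.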
在完成对音频信号的预处理之后,预处理单元1012即可使用梅尔频率倒谱系数算法提取音频信号的梅尔频率倒谱系数。其中,预处理单元1012提取梅尔频率倒谱系数的过程大致为:利用人耳听觉的非线性特性,将音频信号的频谱转换为基于梅尔频率的非线性频谱,再转换到倒谱域,由此得到梅尔频率倒谱系数。After pre-processing the audio signal, the pre-processing unit 1012 can use the Mel frequency cepstrum coefficient algorithm to extract the Mel frequency cepstrum coefficient of the audio signal. The process of extracting the Mel frequency cepstrum coefficient by the pre-processing unit 1012 is roughly: using the non-linear characteristics of human hearing to convert the frequency spectrum of the audio signal into a non-linear spectrum based on the Mel frequency, and then converting it to the cepstrum domain. This results in the Mel frequency cepstrum coefficient.
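A minimal MFCC computation for a single windowed frame can be sketched as below. This is a textbook-style simplification, not the chip's actual algorithm; the filter count (26), coefficient count (13), and FFT size (512) are common defaults assumed here, and the mel conversion formulas are the standard ones.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sample_rate=16000, n_filters=26, n_coeffs=13, n_fft=512):
    """Minimal MFCC for one windowed frame:
    power spectrum -> triangular mel filter bank -> log -> DCT-II."""
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2 / n_fft
    # Triangular mel filters spaced evenly on the mel scale, 0 Hz..Nyquist.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    log_energy = np.log(fbank @ power + 1e-10)
    # DCT-II of the log filter-bank energies, keeping the first n_coeffs.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), (2 * n + 1) / (2 * n_filters)))
    return dct @ log_energy

# A windowed 1 kHz tone frame (400 samples at 16 kHz) as test input.
coeffs = mfcc(np.hamming(400) * np.sin(2 * np.pi * 1000 * np.arange(400) / 16000))
```

The final DCT moves the log mel spectrum into the cepstral domain, matching the "convert to the cepstrum domain" step in the text.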
在一实施方式中，预处理单元1012还用于在对音频信号进行预处理之前，提取音频信号的声纹特征，判断该声纹特征是否与预设声纹特征匹配，并在该声纹特征与预设声纹特征匹配时，对音频信号进行预处理。In an embodiment, before pre-processing the audio signal, the pre-processing unit 1012 is further configured to extract a voiceprint feature of the audio signal, determine whether the voiceprint feature matches a preset voiceprint feature, and pre-process the audio signal when the voiceprint feature matches the preset voiceprint feature.
需要说明的是，在实际生活中，每个人说话时的声音都有自己的特点，熟悉的人之间，可以只听声音而相互辨别出来。这种声音的特点就是声纹特征，声纹特征主要由两个因素决定，第一个是声腔的尺寸，具体包括咽喉、鼻腔和口腔等，这些器官的形状、尺寸和位置决定了声带张力的大小和声音频率的范围。因此不同的人虽然说同样的话，但是声音的频率分布是不同的，听起来有的低沉有的洪亮。It should be noted that in everyday life, everyone's voice has its own characteristics; people who are familiar with each other can recognize one another by voice alone. These vocal characteristics are voiceprint features, which are determined mainly by two factors. The first is the size of the vocal cavity, including the throat, nasal cavity, and oral cavity; the shape, size, and position of these organs determine the tension of the vocal cords and the range of sound frequencies. Therefore, even when different people say the same thing, the frequency distributions of their voices differ: some sound deep, others resonant.
第二个决定声纹特征的因素是发声器官被操纵的方式，发声器官包括唇、齿、舌、软腭及腭肌肉等，他们之间相互作用就会产生清晰的语音。而他们之间的协作方式是人通过后天与周围人的交流中随机学习到的。人在学习说话的过程中，通过模拟周围不同人的说话方式，就会逐渐形成自己的声纹特征。The second factor determining voiceprint features is the way the articulators are manipulated. The articulators include the lips, teeth, tongue, soft palate, and palatal muscles, and their interaction produces clear speech. The way they cooperate is learned incidentally through interaction with the people around us: in the process of learning to speak, by imitating the speech of different people around them, people gradually form their own voiceprint features.
其中,预处理单元1012在接收到来自微控制单元1011的第一控制信息之后,首先提取音频信号的声纹特征。Wherein, after receiving the first control information from the micro control unit 1011, the preprocessing unit 1012 first extracts the voiceprint characteristics of the audio signal.
在获取到语音信息的声纹特征之后，预处理单元1012进一步将获取到的该声纹特征与预设声纹特征进行比对，以判断该声纹特征是否与预设声纹特征匹配。其中，预设声纹特征可以为机主预先录入的声纹特征，判断获取的音频信号的声纹特征是否与预设声纹特征匹配，也即是判断音频信号的发音者是否为机主。After obtaining the voiceprint feature of the voice information, the pre-processing unit 1012 further compares the obtained voiceprint feature with a preset voiceprint feature to determine whether they match. The preset voiceprint feature may be a voiceprint feature recorded in advance by the device owner; determining whether the voiceprint feature of the acquired audio signal matches the preset voiceprint feature thus amounts to determining whether the speaker of the audio signal is the owner.
在获取到的声纹特征与预设声纹特征匹配时,预处理单元1012确定音频信号的发音者为机主,此时进一步对音频信号进行预处理,并提取出梅尔频率倒谱系数,具体可参照以上相关描述,此处不再赘述。When the acquired voiceprint features match the preset voiceprint features, the pre-processing unit 1012 determines the speaker of the audio signal as the owner, and then further pre-processes the audio signal and extracts the Mel frequency cepstrum coefficient. For details, refer to the related descriptions above, and details are not described herein again.
在一实施方式中，预处理单元1012还用于获取前述声纹特征和预设声纹特征的相似度，判断获取到的相似度是否大于或等于第一预设相似度，并在获取到的相似度大于或等于第一预设相似度时，确定获取到的声纹特征与预设声纹特征匹配。In an embodiment, the pre-processing unit 1012 is further configured to obtain the similarity between the aforementioned voiceprint feature and the preset voiceprint feature, determine whether the obtained similarity is greater than or equal to a first preset similarity, and, when it is, determine that the obtained voiceprint feature matches the preset voiceprint feature.
其中，预处理单元1012在判断获取到的声纹特征是否与预设声纹特征匹配时，可以获取该声纹特征(即从前述音频信号所获取到的声纹特征)与预设声纹特征的相似度，并判断获取到的相似度是否大于或等于第一预设相似度(根据实际需要进行设置，比如，可以设置为95%)。若获取到的相似度大于或等于第一预设相似度，则确定获取到的声纹特征与预设声纹特征匹配；若获取到的相似度小于第一预设相似度，则确定获取到的声纹特征与预设声纹特征不匹配。When determining whether the obtained voiceprint feature matches the preset voiceprint feature, the pre-processing unit 1012 may obtain the similarity between this voiceprint feature (that is, the voiceprint feature extracted from the aforementioned audio signal) and the preset voiceprint feature, and determine whether the obtained similarity is greater than or equal to the first preset similarity (set according to actual needs; for example, it may be set to 95%). If the obtained similarity is greater than or equal to the first preset similarity, it is determined that the obtained voiceprint feature matches the preset voiceprint feature; if the obtained similarity is less than the first preset similarity, it is determined that the obtained voiceprint feature does not match the preset voiceprint feature.
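As an illustration of the threshold check above: the text does not specify how the similarity between two voiceprint features is computed, so cosine similarity is assumed here purely for the sketch, with the threshold playing the role of the first preset similarity (95% in the example).

```python
import numpy as np

def voiceprint_matches(feature, preset_feature, threshold=0.95):
    """Compare a voiceprint feature vector against the enrolled one.
    Cosine similarity is an illustrative choice (the patent does not
    fix a measure); match means similarity >= the first preset
    similarity. Returns (matched, similarity)."""
    a = np.asarray(feature, dtype=float)
    b = np.asarray(preset_feature, dtype=float)
    sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sim >= threshold, sim

match, sim = voiceprint_matches([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

An identical feature vector yields similarity 1.0 and a match; an orthogonal one yields similarity 0.0 and no match.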
此外,在获取到的声纹特征与预设声纹特征不匹配时,预处理单元1012确定当前音频信号的发音者不为机主,发送第三反馈信息至微控制单元1011。In addition, when the acquired voiceprint features do not match the preset voiceprint features, the preprocessing unit 1012 determines that the speaker of the current audio signal is not the owner, and sends third feedback information to the micro control unit 1011.
微控制单元1011在接收到来自预处理单元1012的第三反馈信息之后，删除获取到的音频信号，并继续获取外部的音频信号，直至获取到机主的音频信号时，才对该音频信号进行预处理以及梅尔频率倒谱系数的提取，其中，对于如何进行预处理以及梅尔频率倒谱系数的提取，可参照以上实施例的相关描述，此处不再赘述。After receiving the third feedback information from the pre-processing unit 1012, the micro control unit 1011 deletes the acquired audio signal and continues to acquire external audio signals; only when an audio signal from the owner is acquired are the pre-processing and the extraction of Mel-frequency cepstral coefficients performed on that audio signal. For how the pre-processing and the extraction of Mel-frequency cepstral coefficients are performed, reference may be made to the relevant descriptions of the above embodiments, which are not repeated here.
由此，通过这种基于声纹特征对发音者进行身份认证的方式，仅对机主发出的音频信号进行响应，能够避免执行非机主意愿的操作，可以提升机主的使用体验。In this way, by authenticating the speaker's identity based on voiceprint features and responding only to audio signals from the owner, operations not intended by the owner can be avoided, which improves the owner's experience.
在一实施方式中，预处理单元1012还用于在获取到的相似度小于第一预设相似度且大于或等于第二预设相似度时，获取当前的位置信息，根据该位置信息判断当前是否位于预设位置范围内，并在当前位于预设位置范围内时，确定前述声纹特征与预设声纹特征匹配。In an embodiment, the pre-processing unit 1012 is further configured to, when the obtained similarity is less than the first preset similarity but greater than or equal to a second preset similarity, obtain current location information, determine from the location information whether the device is currently within a preset location range, and, when it is, determine that the aforementioned voiceprint feature matches the preset voiceprint feature.
需要说明的是，由于声纹特征和人体的生理特征密切相关，在日常生活中，如果用户感冒发炎的话，其声音将变得沙哑，声纹特征也将随之发生变化。在这种情况下，即使获取到的音频信号由机主说出，预处理单元1012也将无法识别出。此外，还存在多种导致预处理单元1012无法识别出机主的情况，此处不再赘述。It should be noted that voiceprint features are closely related to the physiological characteristics of the human body. In daily life, if a user catches a cold and their throat becomes inflamed, their voice becomes hoarse and their voiceprint features change accordingly. In that case, even if the acquired audio signal is spoken by the owner, the pre-processing unit 1012 will fail to recognize it. There are also various other situations that can cause the pre-processing unit 1012 to fail to recognize the owner, which are not enumerated here.
为解决可能出现的、无法识别出机主的情况，预处理单元1012在完成对声纹特征相似度的判断之后，若获取到的声纹特征与预设声纹特征的相似度小于第一预设相似度，则进一步判断该相似度是否大于或等于第二预设相似度(该第二预设相似度配置为小于第一预设相似度，具体可由本领域技术人员根据实际需要取合适值，比如，在第一预设相似度被设置为95%时，可以将第二预设相似度设置为75%)。To handle such cases where the owner cannot be recognized, after completing the judgment on the voiceprint feature similarity, if the similarity between the obtained voiceprint feature and the preset voiceprint feature is less than the first preset similarity, the pre-processing unit 1012 further determines whether the similarity is greater than or equal to a second preset similarity (the second preset similarity is configured to be smaller than the first preset similarity, and a suitable value can be selected by those skilled in the art according to actual needs; for example, when the first preset similarity is set to 95%, the second preset similarity may be set to 75%).
在判断结果为是,也即是获取到的声纹特征与预设声纹特征的相似度小于第一预设相似度且大于或等于第二预设相似度时,预处理单元1012进一步获取到当前的位置信息。其中,预处理单元1012可以发送位置获取请求至电子设备的定位模组(可以采用卫星定位技术或者基站定位技术等不同的定位技术来获取到当前的位置信息),指示定位模组返回当前的位置信息。When the judgment result is yes, that is, the similarity between the acquired voiceprint feature and the preset voiceprint feature is less than the first preset similarity and greater than or equal to the second preset similarity, the preprocessing unit 1012 further obtains Current location information. The pre-processing unit 1012 may send a position acquisition request to the positioning module of the electronic device (different positioning technologies such as satellite positioning technology or base station positioning technology may be used to obtain the current position information), and instruct the positioning module to return to the current position. information.
在获取到当前的位置信息之后,预处理单元1012根据该位置信息判断当前是否位于预设位置范围内。其中,预设位置范围可以配置为机主的常用位置范围,比如家里和公司等。After acquiring the current position information, the pre-processing unit 1012 determines whether it is currently within a preset position range according to the position information. Among them, the preset position range can be configured as a common position range of the owner, such as home and company.
在当前位于预设位置范围内时,预处理单元1012确定获取到的声纹特征与预设声纹特征匹配,将音频信号的发音者识别为机主。When it is currently within the preset position range, the preprocessing unit 1012 determines that the acquired voiceprint feature matches the preset voiceprint feature, and recognizes the speaker of the audio signal as the owner.
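The two-threshold decision described above, with the location check rescuing borderline cases, can be sketched as follows. This is a minimal illustration: the threshold values (95%/75%) are the examples given in the text, while the function names and the boolean location flag are assumptions, since the document leaves the concrete similarity measure and positioning API open.

```python
# Hypothetical sketch of the two-threshold voiceprint matching logic.
FIRST_PRESET_SIMILARITY = 0.95
SECOND_PRESET_SIMILARITY = 0.75

def voiceprint_matches(similarity: float, in_preset_location: bool) -> bool:
    """Decide whether an acquired voiceprint matches the preset one."""
    if similarity >= FIRST_PRESET_SIMILARITY:
        return True   # direct match
    if similarity >= SECOND_PRESET_SIMILARITY and in_preset_location:
        return True   # borderline similarity rescued by the location check
    return False      # no match

# e.g. a hoarse owner at home: 0.80 similarity, within the preset range
hoarse_owner_at_home = voiceprint_matches(0.80, True)
```

The design intent is that a cold-afflicted owner whose similarity drops below the first threshold can still unlock the device in a trusted location, while a stranger (similarity below the second threshold) is rejected everywhere.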
In one embodiment, the central processing unit 102 is further configured to, when the confidence of the candidate keyword reaches a preset confidence level, take the candidate keyword as the target keyword of the audio signal, determine the preset operation corresponding to the target keyword as the target operation according to a preset mapping between keywords and preset operations, and perform that target operation.
Specifically, after extracting the recognized candidate keyword and its confidence from the application-specific integrated circuit (ASIC) chip 101 according to the indication information sent by the chip, the central processing unit 102 first checks whether the confidence of the candidate keyword reaches a preset confidence level (a suitable value may be chosen by a person skilled in the art as needed, for example 90%).
If the confidence of the candidate keyword reaches the preset confidence level, the central processing unit 102 takes the candidate keyword as the target keyword of the audio signal.
The central processing unit 102 then determines the preset operation corresponding to the target keyword as the target operation according to the preset mapping between keywords and operations. This mapping can be configured as needed; for example, the keyword "Xiao Ou, Xiao Ou" may be mapped to the preset operation "wake the operating system", so that when the target keyword is "Xiao Ou, Xiao Ou" and the operating system is currently asleep, the central processing unit 102 wakes it up.
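The confidence check and keyword-to-operation lookup performed by the central processing unit 102 can be sketched as below. The wake phrase, the 90% threshold, and the operation name are taken from the examples in the text; the function and dictionary names are illustrative assumptions.

```python
# Illustrative sketch of confidence gating plus keyword dispatch.
PRESET_CONFIDENCE = 0.90
KEYWORD_OPERATIONS = {
    "Xiao Ou, Xiao Ou": "wake_operating_system",
}

def target_operation(candidate_keyword: str, confidence: float):
    """Return the target operation for a recognized keyword, or None."""
    if confidence < PRESET_CONFIDENCE:
        return None  # recognition result not trusted
    # the candidate keyword becomes the target keyword of the audio signal
    return KEYWORD_OPERATIONS.get(candidate_keyword)
```

A candidate keyword below the threshold is discarded rather than dispatched, which is what prevents low-confidence recognitions from triggering operations.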
An embodiment of the present application provides a device control method applied to an electronic device, where the electronic device includes a central processing unit and an ASIC chip, and the power consumption of the ASIC chip is lower than that of the central processing unit. The device control method includes:
the ASIC chip acquires an external audio signal;
the ASIC chip recognizes the audio signal to obtain a recognition result;
the ASIC chip sends indication information that recognition is complete to the central processing unit;
the central processing unit extracts the recognition result from the ASIC chip according to the indication information and performs a target operation corresponding to the recognition result.
In one embodiment, the ASIC chip includes a micro control unit, a pre-processing unit, and an algorithm unit, and the step in which the ASIC chip recognizes the audio signal to obtain a recognition result includes:
the pre-processing unit, under control of the micro control unit, extracts the Mel-frequency cepstral coefficients (MFCCs) of the audio signal using an MFCC algorithm;
the algorithm unit, under control of the micro control unit, performs keyword recognition on the MFCCs using a deep neural network, obtaining a candidate keyword and its confidence as the recognition result.
In one embodiment, performing the target operation corresponding to the recognition result includes:
when the confidence reaches a preset confidence level, the central processing unit takes the candidate keyword as the target keyword of the audio signal, determines the preset operation corresponding to the target keyword as the target operation according to the preset mapping between keywords and preset operations, and performs the target operation.
In one embodiment, the ASIC chip further includes a memory, and the device control method further includes:
the memory stores the audio signal, the candidate keyword, the confidence, and intermediate data produced by the pre-processing unit and the algorithm unit during execution.
In one embodiment, the ASIC chip further includes a cache, and the device control method further includes:
the cache buffers data written to and read from the memory.
In one embodiment, before extracting the MFCCs of the audio signal using the MFCC algorithm, the method further includes:
the pre-processing unit pre-processes the audio signal and, once pre-processing is complete, extracts the MFCCs of the audio signal using the MFCC algorithm.
In one embodiment, before extracting the MFCCs of the audio signal using the MFCC algorithm, the method further includes:
the pre-processing unit extracts the voiceprint feature of the audio signal, determines whether the voiceprint feature matches a preset voiceprint feature, and pre-processes the audio signal when the voiceprint feature matches the preset voiceprint feature.
In one embodiment, determining whether the voiceprint feature matches the preset voiceprint feature includes:
the pre-processing unit obtains the similarity between the voiceprint feature and the preset voiceprint feature, determines whether the similarity is greater than or equal to a first preset similarity, and, if so, determines that the voiceprint feature matches the preset voiceprint feature.
In one embodiment, the device control method provided in the embodiments of the present application further includes:
when the similarity is below the first preset similarity but greater than or equal to a second preset similarity, the pre-processing unit obtains the current location information, determines from it whether the device is currently within a preset location range, and, if so, determines that the voiceprint feature matches the preset voiceprint feature.
In one embodiment, the device control method provided in the embodiments of the present application further includes:
when the voiceprint feature does not match the preset voiceprint feature, the pre-processing unit instructs the micro control unit to delete the audio signal.
Further, an embodiment of the present application provides a device control method executed by the electronic device provided in the embodiments of the present application. The electronic device includes an ASIC chip 101 and a central processing unit 102, where the power consumption of the ASIC chip 101 is lower than that of the central processing unit 102. Referring to FIG. 5, the device control method includes:
101. The ASIC chip 101 acquires an external audio signal.
It should be noted that the ASIC chip 101 in the embodiments of the present application is an application-specific integrated circuit designed for audio recognition; compared with the general-purpose central processing unit 102, it offers higher audio-recognition efficiency and lower power consumption. The ASIC chip 101 and the central processing unit 102 establish a data communication connection over a communication bus.
The ASIC chip 101 can acquire external audio signals in several ways. For example, when no external microphone is connected to the electronic device, the ASIC chip 101 may capture the sound produced by an external speaker through the device's built-in microphone (not shown in FIG. 1) to obtain the external audio signal; when an external microphone is connected, the ASIC chip 101 may capture external sound through that microphone instead.
When the ASIC chip 101 captures external audio through a microphone, if the microphone is analog it yields an analog audio signal, which the ASIC chip 101 must sample to convert into a digital audio signal, for example at a sampling rate of 16 kHz. If the microphone is digital, the ASIC chip 101 receives a digital audio signal directly and no conversion is needed.
102. The ASIC chip 101 performs a recognition operation on the acquired audio signal to obtain a recognition result.
After acquiring the external audio signal, the ASIC chip 101 recognizes it according to a pre-configured recognition mode and obtains a recognition result.
For example, when the recognition mode of the ASIC chip 101 is configured as gender recognition, the chip extracts feature information characterizing gender from the audio signal and, based on the extracted features, identifies the gender of the speaker, yielding a recognition result of male or female.
As another example, when the recognition mode of the ASIC chip 101 is configured as environment-type recognition (subway-car scene, bus scene, office scene, and so on), the chip extracts feature information characterizing the acoustic environment from the audio signal and, based on the extracted features, identifies the current environment, yielding a recognition result describing the current environment type.
103. The ASIC chip 101 sends indication information indicating that the recognition operation is complete to the central processing unit 102.
After completing the recognition operation on the audio signal and obtaining the recognition result, the ASIC chip 101 sends indication information signalling completion to the central processing unit 102. Intuitively, this indication tells the central processing unit 102 that the ASIC chip 101 has finished recognizing the audio signal and the recognition result can now be fetched from it. The indication information may be sent in the form of an interrupt signal.
104. The central processing unit 102 extracts the recognition result from the ASIC chip 101 according to the received indication information and performs the target operation corresponding to the recognition result.
Accordingly, after receiving the indication information from the ASIC chip 101, the central processing unit 102 extracts from the ASIC chip 101, according to that indication information, the recognition result the chip obtained by recognizing the audio signal.
After extracting the recognition result of the audio signal, the central processing unit 102 performs the target operation corresponding to that result.
For example, when the ASIC chip 101 is configured for gender recognition, if the extracted result is "speaker is male" the operating system's theme is switched to a masculine theme mode, and if the result is "speaker is female" it is switched to a feminine theme mode.
As another example, when the ASIC chip 101 is configured for environment-type recognition, if the extracted result is "office scene" the operating system's alert mode is switched to silent mode, and if the result is "bus scene" it is switched to vibrate-plus-ring mode, and so on.
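Mapping recognition results to system actions, as in the two examples above, amounts to a simple lookup. The result strings and mode names below are hypothetical stand-ins; the document does not fix a concrete encoding for recognition results.

```python
# A minimal sketch of dispatching a recognition result to a target operation.
RESULT_ACTIONS = {
    "speaker_male":   "theme:masculine",
    "speaker_female": "theme:feminine",
    "office_scene":   "prompt:silent",
    "bus_scene":      "prompt:vibrate+ring",
}

def perform_target_operation(recognition_result: str) -> str:
    # unknown results leave the current mode unchanged
    return RESULT_ACTIONS.get(recognition_result, "no-op")
```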
As can be seen from the above, in the electronic device of the embodiments of the present application, the low-power ASIC chip 101 first acquires the external audio signal, performs the recognition operation on it, obtains the recognition result, and sends indication information that recognition is complete to the central processing unit 102; the central processing unit 102 then extracts the recognition result from the ASIC chip 101 according to the indication information and performs the target operation corresponding to it. The audio-recognition workload of the central processing unit 102 is thus offloaded to the lower-power ASIC chip 101, with the central processing unit 102 performing the corresponding target operation based on the chip's recognition result. Having the ASIC cooperate with the central processing unit 102 in this way to voice-control the electronic device reduces the power the device consumes in implementing voice control.
In one embodiment, referring to FIG. 2, the ASIC chip 101 includes a micro control unit 1011, a pre-processing unit 1012, and an algorithm unit 1013. Referring to FIG. 6, the step in which the ASIC chip 101 recognizes the acquired audio signal to obtain a recognition result includes:
1021. The pre-processing unit 1012, under control of the micro control unit 1011, extracts the MFCCs of the audio signal using the MFCC algorithm.
1022. The algorithm unit 1013, under control of the micro control unit 1011, performs keyword recognition on the MFCCs using a deep neural network, obtaining a candidate keyword and its confidence.
The micro control unit 1011 first acquires the external audio signal through a microphone. For example, when no external microphone is connected to the electronic device, the micro control unit 1011 may capture external sound through the device's built-in microphone (not shown in FIG. 2) to obtain the external audio signal; when an external microphone is connected, it may capture external sound through that microphone instead.
When the micro control unit 1011 captures external audio through a microphone, if the microphone is analog it yields an analog audio signal, which the micro control unit 1011 must sample to convert into a digital audio signal, for example at a sampling rate of 16 kHz; if the microphone is digital, the micro control unit 1011 receives a digital audio signal directly and no conversion is needed.
After acquiring the external audio signal, the micro control unit 1011 generates first control information and sends it to the pre-processing unit 1012.
After receiving the first control information from the micro control unit 1011, the pre-processing unit 1012 extracts the MFCCs of the audio signal using the MFCC algorithm according to that control information. Once the MFCCs have been extracted, the pre-processing unit 1012 sends first feedback information to the micro control unit 1011.
After receiving the first feedback information from the pre-processing unit 1012, the micro control unit 1011 determines that the pre-processing unit 1012 has now extracted the MFCCs of the audio signal, generates second control information, and sends it to the algorithm unit 1013.
After receiving the second control information from the micro control unit 1011, the algorithm unit 1013 uses its built-in deep neural network to perform keyword recognition on the MFCCs (keyword recognition, that is, detecting whether predefined words occur in the speech corresponding to the audio signal), obtaining a candidate keyword and its confidence. Once keyword recognition is complete and the candidate keyword and its confidence have been obtained, the algorithm unit 1013 sends second feedback information to the micro control unit 1011.
After receiving the second feedback information from the algorithm unit 1013, the micro control unit 1011 determines that the algorithm unit 1013 has completed keyword recognition, and takes the candidate keyword and its confidence identified by the algorithm unit 1013 as the recognition result of this recognition operation on the audio signal.
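The control/feedback handshake above can be summarized as a two-stage pipeline driven by the micro control unit. The sketch below is an assumption-laden abstraction: the stage callables stand in for the MFCC and DNN units, and the control/feedback messages are collapsed into ordinary function calls and returns.

```python
# Sketch of MCU 1011 orchestrating the pre-processing and algorithm units.
def run_recognition(audio, extract_mfcc, spot_keyword):
    """Orchestrate the two-stage recognition the way MCU 1011 does."""
    # first control information -> pre-processing unit; first feedback back
    mfcc = extract_mfcc(audio)
    # second control information -> algorithm unit; second feedback back
    candidate, confidence = spot_keyword(mfcc)
    # the pair becomes the recognition result of this operation
    return {"keyword": candidate, "confidence": confidence}

result = run_recognition(
    audio=[0.0, 0.1, -0.1],
    extract_mfcc=lambda a: [sum(a)],               # placeholder stage
    spot_keyword=lambda m: ("Xiao Ou, Xiao Ou", 0.93),  # placeholder stage
)
```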
In addition, referring to FIG. 3, the ASIC chip 101 further includes a memory 1014, which can store the acquired audio signal, the recognized candidate keyword, the confidence, and intermediate data produced by the pre-processing unit 1012 and the algorithm unit 1013 during execution.
For example, the micro control unit 1011 stores the audio signal captured through the microphone in the memory 1014; the pre-processing unit 1012, under control of the micro control unit 1011, extracts the MFCCs of the audio signal stored in the memory 1014 using the MFCC algorithm and stores the extracted MFCCs back into the memory 1014; the algorithm unit 1013, under control of the micro control unit 1011, performs keyword recognition on the MFCCs stored in the memory 1014 using its built-in deep neural network, obtains the candidate keyword and its confidence, and stores them in the memory 1014.
Referring to FIG. 4, the ASIC chip 101 further includes a cache 1015, which can buffer data written to and read from the memory 1014.
The cache 1015 has less storage space than the memory 1014 but is faster; using the cache 1015 can improve the processing efficiency of the pre-processing unit 1012 and the algorithm unit 1013.
For example, when the pre-processing unit 1012 extracts MFCCs from the audio signal, accessing data directly from the memory 1014 incurs a wait of some duration, whereas the cache 1015 can hold the portion of data the pre-processing unit 1012 has just used or reuses cyclically. When the pre-processing unit 1012 needs that data again, it can fetch it directly from the cache 1015, avoiding repeated memory accesses and reducing the pre-processing unit 1012's wait time, which improves its processing efficiency.
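The benefit described above, where recently used or cyclically reused data skips the slow memory path, is the classic behavior of a small least-recently-used cache. The toy model below illustrates the idea; the capacity, eviction policy, and access pattern are made up for the sketch and are not specified by the document.

```python
# Toy LRU cache illustrating why a cache in front of memory cuts wait time.
from collections import OrderedDict

class TinyCache:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.data = OrderedDict()
        self.hits = self.misses = 0

    def read(self, address, slow_memory):
        if address in self.data:
            self.hits += 1                    # served from the fast cache
            self.data.move_to_end(address)
            return self.data[address]
        self.misses += 1                      # fall back to slow memory
        value = slow_memory[address]
        self.data[address] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)     # evict least recently used
        return value

memory = {addr: addr * 2 for addr in range(16)}
cache = TinyCache()
for addr in [0, 1, 0, 1, 0, 1]:               # cyclic reuse pattern
    cache.read(addr, memory)
```

After the loop, only the first access to each address misses; the four repeated accesses hit the cache, which is the effect the text attributes to the cache 1015.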
In one embodiment, referring to FIG. 7, the step in which the central processing unit 102 performs the target operation corresponding to the recognition result includes:
1041. When the confidence of the candidate keyword reaches the preset confidence level, the central processing unit 102 takes the candidate keyword as the target keyword of the audio signal.
1042. The central processing unit 102 determines the preset operation corresponding to the target keyword as the target operation according to the preset mapping between keywords and preset operations, and performs the target operation.
Specifically, after extracting the recognized candidate keyword and its confidence from the ASIC chip 101 according to the indication information sent by the chip, the central processing unit 102 first checks whether the confidence of the candidate keyword reaches the preset confidence level (a suitable value may be chosen by a person skilled in the art as needed, for example 90%).
If the confidence of the candidate keyword reaches the preset confidence level, the central processing unit 102 takes the candidate keyword as the target keyword of the audio signal.
The central processing unit 102 then determines the preset operation corresponding to the target keyword as the target operation according to the preset mapping between keywords and operations. This mapping can be configured as needed; for example, the keyword "Xiao Ou, Xiao Ou" may be mapped to the preset operation "wake the operating system", so that when the target keyword is "Xiao Ou, Xiao Ou" and the operating system is currently asleep, the central processing unit 102 wakes it up.
In one embodiment, before the step in which the pre-processing unit 1012 extracts the MFCCs of the audio signal using the MFCC algorithm, the method further includes:
(1) the pre-processing unit 1012 pre-processes the audio signal;
(2) after finishing pre-processing the audio signal, the pre-processing unit 1012 extracts its MFCCs using the MFCC algorithm.
Specifically, after receiving the first control information from the micro control unit 1011, the pre-processing unit 1012 first applies pre-processing such as pre-emphasis and windowing to the audio signal.
Pre-emphasis means boosting the energy of the high-frequency part of the audio signal. In the spectrum of an audio signal, the low-frequency part often carries more energy than the high-frequency part: spectral energy falls by about 20 dB per tenfold increase in frequency, and the circuit noise floor of the microphone during capture further raises the low-frequency energy. To give the high-frequency part an amplitude similar to that of the low-frequency part, the high-frequency energy of the captured audio signal needs to be pre-emphasized.
Audio signals are generally non-stationary, so their statistics are not fixed; within a sufficiently short interval, however, the signal can be regarded as stationary, which is the rationale for windowing. A window is described by three parameters: window length (in milliseconds), offset, and shape. Each windowed segment of the audio signal is called a frame; the duration of a frame in milliseconds is the frame length, and the distance between the left boundaries of two adjacent frames is the frame shift. In the embodiments of the present application, a Hamming window, whose edges taper smoothly toward zero, may be used for the windowing.
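The pre-emphasis and windowing steps above can be sketched in a few lines of NumPy. The 0.97 pre-emphasis coefficient and the 25 ms frame / 10 ms shift are common textbook choices, not values fixed by this document; only the 16 kHz rate appears in the text.

```python
# Minimal sketch of pre-emphasis and Hamming-window framing.
import numpy as np

def pre_emphasize(signal, coeff=0.97):
    # boost high frequencies: y[n] = x[n] - coeff * x[n-1]
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])

def frame_and_window(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    frame_len = int(sample_rate * frame_ms / 1000)   # window length
    shift = int(sample_rate * shift_ms / 1000)       # frame shift
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift)
    window = np.hamming(frame_len)                   # edge-tapered shape
    return np.stack([
        signal[i * shift: i * shift + frame_len] * window
        for i in range(n_frames)
    ])

x = pre_emphasize(np.random.randn(16000))            # one second at 16 kHz
frames = frame_and_window(x)                         # short quasi-stationary frames
```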
After pre-processing of the audio signal is complete, the pre-processing unit 1012 can extract its MFCCs using the MFCC algorithm. Roughly, the pre-processing unit 1012 extracts the MFCCs as follows: exploiting the non-linear characteristics of human hearing, the spectrum of the audio signal is converted into a non-linear, Mel-frequency-based spectrum and then transformed into the cepstral domain, yielding the MFCCs.
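The spectrum → Mel spectrum → cepstrum chain just described can be sketched with a simplified triangular Mel filter bank and a DCT. All constants (512-point FFT, 26 filters, 13 coefficients) are conventional illustrative choices, and the implementation is a rough sketch of the standard technique rather than the patent's specific algorithm.

```python
# Rough sketch of MFCC extraction for one pre-processed frame.
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters=26, n_fft=512, sample_rate=16000):
    # triangular filters spaced evenly on the Mel scale
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        fb[m - 1, bins[m - 1]:bins[m]] = np.linspace(
            0.0, 1.0, bins[m] - bins[m - 1], endpoint=False)
        fb[m - 1, bins[m]:bins[m + 1]] = np.linspace(
            1.0, 0.0, bins[m + 1] - bins[m], endpoint=False)
    return fb

def dct2(x):
    # DCT-II, the transform conventionally used for the cepstral step
    n = len(x)
    k = np.arange(n)
    return np.array([np.sum(x * np.cos(np.pi * (k + 0.5) * i / n))
                     for i in range(n)])

def mfcc(frame, n_coeffs=13):
    power = np.abs(np.fft.rfft(frame, 512)) ** 2   # linear power spectrum
    mel_energy = mel_filterbank() @ power          # Mel-warped spectrum
    return dct2(np.log(mel_energy + 1e-10))[:n_coeffs]  # cepstral domain
```

Applied to one 400-sample frame, this yields a 13-dimensional MFCC vector of the kind the algorithm unit 1013 would consume.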
In an embodiment, before the pre-processing unit 1012 pre-processes the audio signal, the method further includes:
(1) the pre-processing unit 1012 extracts a voiceprint feature of the audio signal;
(2) the pre-processing unit 1012 determines whether the extracted voiceprint feature matches a preset voiceprint feature;
(3) when the extracted voiceprint feature matches the preset voiceprint feature, the pre-processing unit 1012 pre-processes the aforementioned audio signal.
It should be noted that, in everyday life, each person's voice has its own characteristics, and people who know each other well can tell one another apart by voice alone. These characteristics constitute the voiceprint feature, which is determined mainly by two factors. The first is the size of the vocal cavities, including the throat, nasal cavity, and oral cavity; the shape, size, and position of these organs determine the tension of the vocal cords and the range of sound frequencies. Consequently, even when different people say the same words, the frequency distributions of their voices differ: some sound deep, others resonant.
The second factor determining the voiceprint feature is the manner in which the articulators are manipulated. The articulators include the lips, teeth, tongue, soft palate, and palatal muscles, and their interaction produces intelligible speech. The way they coordinate is learned incidentally through a person's interactions with the people around them; in the process of learning to speak, by imitating the speech of different people nearby, a person gradually forms his or her own voiceprint feature.
After receiving the first control information from the micro control unit 1011, the pre-processing unit 1012 first extracts the voiceprint feature of the audio signal.
After obtaining the voiceprint feature of the audio signal, the pre-processing unit 1012 further compares it with a preset voiceprint feature to determine whether the two match. The preset voiceprint feature may be a voiceprint feature recorded in advance by the device owner, so determining whether the voiceprint feature of the acquired audio signal matches the preset voiceprint feature amounts to determining whether the speaker of the audio signal is the owner.
When the acquired voiceprint feature matches the preset voiceprint feature, the pre-processing unit 1012 determines that the speaker of the audio signal is the owner, and then further pre-processes the audio signal and extracts the Mel-frequency cepstral coefficients; for details, refer to the related descriptions above, which are not repeated here.
In an embodiment, the step in which the pre-processing unit 1012 determines whether the extracted voiceprint feature matches the preset voiceprint feature includes:
(1) the pre-processing unit 1012 obtains the similarity between the aforementioned voiceprint feature and the preset voiceprint feature;
(2) the pre-processing unit 1012 determines whether the obtained similarity is greater than or equal to a first preset similarity;
(3) when the obtained similarity is greater than or equal to the first preset similarity, the pre-processing unit 1012 determines that the acquired voiceprint feature matches the preset voiceprint feature.
When determining whether the acquired voiceprint feature matches the preset voiceprint feature, the pre-processing unit 1012 may obtain the similarity between that voiceprint feature (i.e., the voiceprint feature extracted from the aforementioned audio signal) and the preset voiceprint feature, and determine whether the obtained similarity is greater than or equal to a first preset similarity (set according to actual needs; for example, it may be set to 95%). If the obtained similarity is greater than or equal to the first preset similarity, the acquired voiceprint feature is determined to match the preset voiceprint feature; if the obtained similarity is less than the first preset similarity, the acquired voiceprint feature is determined not to match the preset voiceprint feature.
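The threshold check above can be sketched as follows. The present application does not specify how the similarity between two voiceprint features is computed, so cosine similarity over feature vectors is assumed here purely for illustration:

```python
import numpy as np

def voiceprint_matches(feature: np.ndarray, preset_feature: np.ndarray,
                       first_threshold: float = 0.95):
    """Return (matched, similarity) for a voiceprint feature vector.

    Cosine similarity is an assumed stand-in for the unspecified
    similarity measure; 0.95 mirrors the 95% example in the text.
    """
    sim = float(np.dot(feature, preset_feature)
                / (np.linalg.norm(feature) * np.linalg.norm(preset_feature)))
    return sim >= first_threshold, sim
```
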
In addition, when the acquired voiceprint feature does not match the preset voiceprint feature, the pre-processing unit 1012 determines that the speaker of the current audio signal is not the owner and sends third feedback information to the micro control unit 1011.
After receiving the third feedback information from the pre-processing unit 1012, the micro control unit 1011 deletes the acquired audio signal and continues to acquire external audio signals; only when the owner's audio signal is obtained is that signal pre-processed and its Mel-frequency cepstral coefficients extracted. For how the pre-processing and the extraction of the Mel-frequency cepstral coefficients are performed, refer to the related descriptions of the foregoing embodiments, which are not repeated here.
Thus, by authenticating the speaker's identity based on the voiceprint feature and responding only to audio signals issued by the owner, operations not intended by the owner can be avoided, improving the owner's user experience.
In an embodiment, after the step in which the pre-processing unit 1012 determines whether the obtained similarity is greater than or equal to the first preset similarity, the method further includes:
(1) when the aforementioned similarity is less than the first preset similarity and greater than or equal to a second preset similarity, the pre-processing unit 1012 obtains current position information;
(2) the pre-processing unit 1012 determines, according to the obtained position information, whether the device is currently within a preset position range;
(3) when the device is currently within the preset position range, the pre-processing unit 1012 determines that the aforementioned voiceprint feature matches the preset voiceprint feature.
It should be noted that voiceprint features are closely related to the physiological characteristics of the human body. In daily life, if the user catches a cold and has an inflamed throat, the voice becomes hoarse and the voiceprint feature changes accordingly. In that case, even if the acquired audio signal is spoken by the owner, the pre-processing unit 1012 will be unable to recognize it. There are various other situations in which the pre-processing unit 1012 may fail to identify the owner, which are not enumerated here.
To handle such cases where the owner cannot be identified, after completing the judgment on the voiceprint similarity, if the similarity between the acquired voiceprint feature and the preset voiceprint feature is less than the first preset similarity, the pre-processing unit 1012 further determines whether the similarity is greater than or equal to a second preset similarity (the second preset similarity is configured to be smaller than the first preset similarity and may be set to a suitable value by those skilled in the art according to actual needs; for example, when the first preset similarity is set to 95%, the second preset similarity may be set to 75%).
When the judgment result is yes, that is, when the similarity between the acquired voiceprint feature and the preset voiceprint feature is less than the first preset similarity and greater than or equal to the second preset similarity, the pre-processing unit 1012 further obtains the current position information. The pre-processing unit 1012 may send a position acquisition request to the positioning module of the electronic device (which may use different positioning technologies, such as satellite positioning or base-station positioning, to obtain the current position information), instructing the positioning module to return the current position information.
After obtaining the current position information, the pre-processing unit 1012 determines, according to that position information, whether the device is currently within a preset position range. The preset position range may be configured as locations the owner frequents, such as home and the workplace.
When the device is currently within the preset position range, the pre-processing unit 1012 determines that the acquired voiceprint feature matches the preset voiceprint feature and recognizes the speaker of the audio signal as the owner.
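The two-tier decision described in this embodiment can be sketched as follows; the 95% and 75% thresholds mirror the examples given in the text, and `in_trusted_location` stands in for the positioning-module check, whose interface the present application does not specify:

```python
def authenticate(similarity: float, in_trusted_location: bool,
                 first_threshold: float = 0.95,
                 second_threshold: float = 0.75) -> bool:
    """Decide whether a voiceprint similarity counts as an owner match."""
    # High-confidence match: accept outright.
    if similarity >= first_threshold:
        return True
    # Borderline match (e.g. the owner's voice is hoarse from a cold):
    # accept only when the device is within a preset position range
    # such as home or the workplace.
    if similarity >= second_threshold and in_trusted_location:
        return True
    return False
```

The location check thus acts as a second authentication factor that loosens the similarity requirement only in places the owner frequents.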
The electronic device and device control method provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the descriptions of the above embodiments are intended only to help understand the method of the present application and its core ideas. Meanwhile, those skilled in the art may, following the ideas of the present application, make changes to the specific implementations and application scope. In summary, the content of this specification should not be construed as limiting the present application.

Claims (20)

  1. An electronic device, wherein the electronic device comprises an application-specific integrated circuit chip and a central processing unit, and the power consumption of the application-specific integrated circuit chip is less than that of the central processing unit, wherein:
    the application-specific integrated circuit chip is configured to acquire an external audio signal, perform a recognition operation on the audio signal to obtain a recognition result, and send, to the central processing unit, indication information indicating that the recognition operation is complete;
    the central processing unit is configured to extract the recognition result from the application-specific integrated circuit chip according to the indication information, and execute a target operation corresponding to the recognition result.
  2. The electronic device according to claim 1, wherein the application-specific integrated circuit chip comprises a micro control unit, a pre-processing unit, and an algorithm unit, wherein:
    the pre-processing unit is configured to extract, under the control of the micro control unit, Mel-frequency cepstral coefficients of the audio signal using a Mel-frequency cepstral coefficient algorithm;
    the algorithm unit is configured to perform, under the control of the micro control unit, keyword recognition on the Mel-frequency cepstral coefficients using a deep neural network algorithm to obtain a candidate keyword and a confidence of the candidate keyword.
  3. The electronic device according to claim 2, wherein the central processing unit is further configured to, when the confidence reaches a preset confidence, take the candidate keyword as a target keyword of the audio signal, determine, according to a preset correspondence between keywords and preset operations, the preset operation corresponding to the target keyword as the target operation, and execute the target operation.
  4. The electronic device according to claim 2, wherein the application-specific integrated circuit chip further comprises a memory configured to store the audio signal, the candidate keyword, the confidence, and intermediate data generated by the pre-processing unit and the algorithm unit during execution.
  5. The electronic device according to claim 4, wherein the application-specific integrated circuit chip further comprises a cache memory configured to cache data stored into the memory and data fetched from the memory.
  6. The electronic device according to claim 2, wherein the pre-processing unit is further configured to pre-process the audio signal and, after the pre-processing of the audio signal is complete, extract the Mel-frequency cepstral coefficients of the audio signal using the Mel-frequency cepstral coefficient algorithm.
  7. The electronic device according to claim 6, wherein the pre-processing unit is further configured to extract a voiceprint feature of the audio signal, determine whether the voiceprint feature matches a preset voiceprint feature, and pre-process the audio signal when the voiceprint feature matches the preset voiceprint feature.
  8. The electronic device according to claim 7, wherein the pre-processing unit is further configured to obtain a similarity between the voiceprint feature and the preset voiceprint feature, determine whether the similarity is greater than or equal to a first preset similarity, and determine that the voiceprint feature matches the preset voiceprint feature when the similarity is greater than or equal to the first preset similarity.
  9. The electronic device according to claim 8, wherein the pre-processing unit is further configured to, when the similarity is less than the first preset similarity and greater than or equal to a second preset similarity, obtain current position information, determine according to the position information whether the device is currently within a preset position range, and determine that the voiceprint feature matches the preset voiceprint feature when the device is currently within the preset position range.
  10. The electronic device according to claim 7, wherein the pre-processing unit is further configured to instruct the micro control unit to delete the audio signal when the voiceprint feature does not match the preset voiceprint feature.
  11. A device control method, applied to an electronic device, wherein the electronic device comprises a central processing unit and an application-specific integrated circuit chip, and the power consumption of the application-specific integrated circuit chip is less than that of the central processing unit, the device control method comprising:
    acquiring, by the application-specific integrated circuit chip, an external audio signal;
    recognizing, by the application-specific integrated circuit chip, the audio signal to obtain a recognition result;
    sending, by the application-specific integrated circuit chip, indication information of recognition completion to the central processing unit;
    extracting, by the central processing unit, the recognition result from the application-specific integrated circuit chip according to the indication information, and executing a target operation corresponding to the recognition result.
  12. The device control method according to claim 11, wherein the application-specific integrated circuit chip comprises a micro control unit, a pre-processing unit, and an algorithm unit, and the recognizing, by the application-specific integrated circuit chip, the audio signal to obtain a recognition result comprises:
    extracting, by the pre-processing unit under the control of the micro control unit, Mel-frequency cepstral coefficients of the audio signal using a Mel-frequency cepstral coefficient algorithm;
    performing, by the algorithm unit under the control of the micro control unit, keyword recognition on the Mel-frequency cepstral coefficients using a deep neural network algorithm to obtain a candidate keyword and a confidence of the candidate keyword as the recognition result.
  13. The device control method according to claim 12, wherein the executing a target operation corresponding to the recognition result comprises:
    when the confidence reaches a preset confidence, taking, by the central processing unit, the candidate keyword as a target keyword of the audio signal, determining, according to a preset correspondence between keywords and preset operations, the preset operation corresponding to the target keyword as the target operation, and executing the target operation.
  14. The device control method according to claim 12, wherein the application-specific integrated circuit chip further comprises a memory, and the device control method further comprises:
    storing, by the memory, the audio signal, the candidate keyword, the confidence, and intermediate data generated by the pre-processing unit and the algorithm unit during execution.
  15. The device control method according to claim 14, wherein the application-specific integrated circuit chip further comprises a cache memory, and the device control method further comprises:
    caching, by the cache memory, data stored into the memory and data fetched from the memory.
  16. The device control method according to claim 12, wherein before the extracting the Mel-frequency cepstral coefficients of the audio signal using the Mel-frequency cepstral coefficient algorithm, the method further comprises:
    pre-processing, by the pre-processing unit, the audio signal, and after the pre-processing of the audio signal is complete, extracting the Mel-frequency cepstral coefficients of the audio signal using the Mel-frequency cepstral coefficient algorithm.
  17. The device control method according to claim 16, wherein before the extracting the Mel-frequency cepstral coefficients of the audio signal using the Mel-frequency cepstral coefficient algorithm, the method further comprises:
    extracting, by the pre-processing unit, a voiceprint feature of the audio signal, determining whether the voiceprint feature matches a preset voiceprint feature, and pre-processing the audio signal when the voiceprint feature matches the preset voiceprint feature.
  18. The device control method according to claim 17, wherein the determining whether the voiceprint feature matches a preset voiceprint feature comprises:
    obtaining, by the pre-processing unit, a similarity between the voiceprint feature and the preset voiceprint feature, determining whether the similarity is greater than or equal to a first preset similarity, and determining that the voiceprint feature matches the preset voiceprint feature when the similarity is greater than or equal to the first preset similarity.
  19. The device control method according to claim 18, further comprising:
    when the similarity is less than the first preset similarity and greater than or equal to a second preset similarity, obtaining, by the pre-processing unit, current position information, determining according to the position information whether the device is currently within a preset position range, and determining that the voiceprint feature matches the preset voiceprint feature when the device is currently within the preset position range.
  20. The device control method according to claim 17, further comprising:
    instructing, by the pre-processing unit, the micro control unit to delete the audio signal when the voiceprint feature does not match the preset voiceprint feature.
PCT/CN2019/085554 2018-06-08 2019-05-05 Electronic device and device control method WO2019233228A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810589643.2 2018-06-08
CN201810589643.2A CN108711429B (en) 2018-06-08 2018-06-08 Electronic device and device control method

Publications (1)

Publication Number Publication Date
WO2019233228A1 true WO2019233228A1 (en) 2019-12-12

Family

ID=63871448

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/085554 WO2019233228A1 (en) 2018-06-08 2019-05-05 Electronic device and device control method

Country Status (2)

Country Link
CN (1) CN108711429B (en)
WO (1) WO2019233228A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108711429B (en) * 2018-06-08 2021-04-02 Oppo广东移动通信有限公司 Electronic device and device control method
CN109636937A (en) * 2018-12-18 2019-04-16 深圳市沃特沃德股份有限公司 Voice Work attendance method, device and terminal device
CN110223687B (en) * 2019-06-03 2021-09-28 Oppo广东移动通信有限公司 Instruction execution method and device, storage medium and electronic equipment
CN110310645A (en) * 2019-07-02 2019-10-08 上海迥灵信息技术有限公司 Sound control method, device and the storage medium of intelligence control system
CN111508475B (en) * 2020-04-16 2022-08-09 五邑大学 Robot awakening voice keyword recognition method and device and storage medium
CN113744117A (en) * 2020-05-29 2021-12-03 Oppo广东移动通信有限公司 Multimedia processing chip, electronic equipment and dynamic image processing method
CN113352987B (en) * 2021-05-31 2022-10-25 亿咖通(湖北)技术有限公司 Method and system for controlling warning tone of vehicle machine
CN115527373B (en) * 2022-01-05 2023-07-14 荣耀终端有限公司 Riding tool identification method and device

Citations (9)

Publication number Priority date Publication date Assignee Title
CN103700368A (en) * 2014-01-13 2014-04-02 联想(北京)有限公司 Speech recognition method, speech recognition device and electronic equipment
CN104143327A (en) * 2013-07-10 2014-11-12 腾讯科技(深圳)有限公司 Acoustic model training method and device
US20140372112A1 (en) * 2013-06-18 2014-12-18 Microsoft Corporation Restructuring deep neural network acoustic models
CN105488227A (en) * 2015-12-29 2016-04-13 惠州Tcl移动通信有限公司 Electronic device and method for processing audio file based on voiceprint features through same
CN106228240A (en) * 2016-07-30 2016-12-14 复旦大学 Degree of depth convolutional neural networks implementation method based on FPGA
CN106560891A (en) * 2015-10-06 2017-04-12 三星电子株式会社 Speech Recognition Apparatus And Method With Acoustic Modelling
CN107735803A (en) * 2015-06-25 2018-02-23 微软技术许可有限责任公司 Bandwidth of memory management for deep learning application
US10089979B2 (en) * 2014-09-16 2018-10-02 Electronics And Telecommunications Research Institute Signal processing algorithm-integrated deep neural network-based speech recognition apparatus and learning method thereof
CN108711429A (en) * 2018-06-08 2018-10-26 Oppo广东移动通信有限公司 Electronic equipment and apparatus control method

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
JP2005181510A (en) * 2003-12-17 2005-07-07 Toshiba Corp Ic voice repeater
CN102905029A (en) * 2012-10-17 2013-01-30 广东欧珀移动通信有限公司 Mobile phone and method for looking for mobile phone through intelligent voice
CN103474071A (en) * 2013-09-16 2013-12-25 重庆邮电大学 Embedded portable voice controller and intelligent housing system with voice recognition
CN105575395A (en) * 2014-10-14 2016-05-11 中兴通讯股份有限公司 Voice wake-up method and apparatus, terminal, and processing method thereof
CN106940998B (en) * 2015-12-31 2021-04-16 阿里巴巴集团控股有限公司 Execution method and device for setting operation
CN106250751B (en) * 2016-07-18 2019-09-17 青岛海信移动通信技术股份有限公司 A kind of mobile device and the method for adjusting sign information detection threshold value

Patent Citations (9)

Publication number Priority date Publication date Assignee Title
US20140372112A1 (en) * 2013-06-18 2014-12-18 Microsoft Corporation Restructuring deep neural network acoustic models
CN104143327A (en) * 2013-07-10 2014-11-12 腾讯科技(深圳)有限公司 Acoustic model training method and device
CN103700368A (en) * 2014-01-13 2014-04-02 联想(北京)有限公司 Speech recognition method, speech recognition device and electronic equipment
US10089979B2 (en) * 2014-09-16 2018-10-02 Electronics And Telecommunications Research Institute Signal processing algorithm-integrated deep neural network-based speech recognition apparatus and learning method thereof
CN107735803A (en) * 2015-06-25 2018-02-23 微软技术许可有限责任公司 Bandwidth of memory management for deep learning application
CN106560891A (en) * 2015-10-06 2017-04-12 三星电子株式会社 Speech Recognition Apparatus And Method With Acoustic Modelling
CN105488227A (en) * 2015-12-29 2016-04-13 惠州Tcl移动通信有限公司 Electronic device and method for processing audio file based on voiceprint features through same
CN106228240A (en) * 2016-07-30 2016-12-14 复旦大学 Degree of depth convolutional neural networks implementation method based on FPGA
CN108711429A (en) * 2018-06-08 2018-10-26 Oppo广东移动通信有限公司 Electronic equipment and apparatus control method

Also Published As

Publication number Publication date
CN108711429B (en) 2021-04-02
CN108711429A (en) 2018-10-26

Similar Documents

Publication Publication Date Title
WO2019233228A1 (en) Electronic device and device control method
US11823679B2 (en) Method and system of audio false keyphrase rejection using speaker recognition
CN107799126B (en) Voice endpoint detection method and device based on supervised machine learning
CN110364143B (en) Voice awakening method and device and intelligent electronic equipment
CN110021307B (en) Audio verification method and device, storage medium and electronic equipment
CN109272991B (en) Voice interaction method, device, equipment and computer-readable storage medium
CN108922525B (en) Voice processing method, device, storage medium and electronic equipment
CN105206271A (en) Intelligent equipment voice wake-up method and system for realizing method
KR20160005050A (en) Adaptive audio frame processing for keyword detection
CN110223687B (en) Instruction execution method and device, storage medium and electronic equipment
CN111145763A (en) GRU-based voice recognition method and system in audio
WO2014173325A1 (en) Gutturophony recognition method and device
US11437022B2 (en) Performing speaker change detection and speaker recognition on a trigger phrase
WO2023088083A1 (en) Speech enhancement method and apparatus
CN114067782A (en) Audio recognition method and device, medium and chip system thereof
CN115206306A (en) Voice interaction method, device, equipment and system
WO2017177629A1 (en) Far-talking voice recognition method and device
WO2022007846A1 (en) Speech enhancement method, device, system, and storage medium
US11290802B1 (en) Voice detection using hearable devices
WO2019041871A1 (en) Voice object recognition method and device
WO2022199405A1 (en) Voice control method and apparatus
CN108337620A (en) A kind of loudspeaker and its control method of voice control
CN208337877U (en) A kind of loudspeaker of voice control
CN114664303A (en) Continuous voice instruction rapid recognition control system
CN112118511A (en) Earphone noise reduction method and device, earphone and computer readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19815256

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19815256

Country of ref document: EP

Kind code of ref document: A1