CN113393865B - Power consumption control, mode configuration and VAD method, apparatus and storage medium - Google Patents

Power consumption control, mode configuration and VAD method, apparatus and storage medium

Info

Publication number
CN113393865B
Authority
CN
China
Prior art keywords
vad
mode
voice chip
voice
currently used
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010176807.6A
Other languages
Chinese (zh)
Other versions
CN113393865A (en)
Inventor
杨智慧
付强
田彪
马骁
吴登峰
袁斌
余磊
左玲云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN202010176807.6A
Priority to PCT/CN2021/080172
Publication of CN113393865A
Application granted
Publication of CN113393865B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/87: Detection of discrete points within a voice signal
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephone Function (AREA)

Abstract

Embodiments of the present application provide a power consumption control method, a mode configuration method, a VAD method, a device, and a storage medium. In these embodiments, a voice chip or device has a VAD function that can detect whether a voice signal is being input; when a voice signal is detected, the voice chip or device switches from a low power consumption mode to a normal working mode, which saves power. Furthermore, the voice chip or device has both a hardware VAD function and a software VAD function, and using the two in combination produces a plurality of VAD modes. Flexibly configuring which VAD mode the voice chip or device uses can improve the accuracy of the voice input detection result to a certain extent, reduce the false triggering probability, and improve the low power consumption performance of the voice chip or device.

Description

Power consumption control, mode configuration and VAD method, apparatus and storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a power consumption control method, a mode configuration method, a VAD method, a device, and a storage medium.
Background
A conventional voice chip typically achieves low power consumption by embedding a hardware VAD (Voice Activity Detection) module in the chip. The hardware VAD module detects whether a voice signal is being input based on signal energy: when no voice signal input is detected, the voice chip stays in a low power consumption mode; when a voice signal input is detected, the voice chip is woken up and starts voice processing.
Because the hardware VAD module detects voice input based only on signal energy, the voice chip is easily falsely triggered in a noisy environment, so its low power consumption performance is poor.
Disclosure of Invention
Aspects of the present disclosure provide a power consumption control method, a mode configuration method, a VAD method, an apparatus, and a storage medium, so as to improve the accuracy of the voice input detection result and reduce the false triggering probability.
An embodiment of the present application provides a power consumption control method applicable to a voice chip or device that has both a hardware VAD function and a software VAD function. The method comprises: collecting a sound signal input into the voice chip or device; detecting whether the sound signal contains a voice signal by using the VAD mode currently used by the voice chip or device; and, if the sound signal contains a voice signal, switching the voice chip or device from a low power consumption mode to a normal working mode. The currently used VAD mode is one of a plurality of VAD modes produced by using the hardware VAD function and the software VAD function in combination.
An embodiment of the present application further provides a VAD mode configuration method applicable to a voice chip or device that has both a hardware VAD function and a software VAD function. The method comprises: receiving a policy configuration instruction; and configuring, according to the policy configuration instruction, the VAD mode usage policy required for the voice chip or device to use a plurality of VAD modes, where the plurality of VAD modes are produced by using the hardware VAD function and the software VAD function in combination.
An embodiment of the present application further provides a voice endpoint detection method applicable to a voice chip or device that has both a hardware VAD function and a software VAD function. The method comprises: collecting a sound signal input into the voice chip or device; and performing VAD processing on the sound signal by using the VAD mode currently used by the voice chip or device, where the currently used VAD mode is one of a plurality of VAD modes produced by using the hardware VAD function and the software VAD function in combination.
An embodiment of the present application further provides a voice chip, comprising a pickup module, a hardware VAD module, a processor and a memory, where the memory stores a VAD program and a power consumption control program. The pickup module is configured to collect a sound signal input into the voice chip. The hardware VAD module is configured to detect, in a hardware manner, whether the sound signal contains a voice signal when the currently used VAD mode indicates that the hardware VAD function of the voice chip is enabled. The processor is configured to execute the VAD program to detect, in a software manner, whether the sound signal contains a voice signal when the currently used VAD mode indicates that the software VAD function of the voice chip is enabled. The processor is further configured to execute the power consumption control program to control the voice chip to switch from a low power consumption mode to a normal working mode when the sound signal is detected, using the currently used VAD mode, to contain a voice signal. The currently used VAD mode is one of a plurality of VAD modes produced by using the hardware VAD function and the software VAD function in combination.
An embodiment of the present application further provides a voice chip, comprising a pickup module, a hardware VAD module, a main processor, a coprocessor and a memory, where the memory stores a VAD program and a power consumption control program. The pickup module is configured to collect a sound signal input into the voice chip. The hardware VAD module is configured to detect, in a hardware manner, whether the sound signal contains a voice signal when the currently used VAD mode indicates that the hardware VAD function of the voice chip is enabled. The coprocessor is configured to execute the VAD program to detect, in a software manner, whether the sound signal contains a voice signal when the currently used VAD mode indicates that the software VAD function of the voice chip is enabled. The coprocessor is further configured to execute the power consumption control program to control the main processor to switch from a low power consumption mode to a normal working mode when the sound signal is detected, using the currently used VAD mode, to contain a voice signal. The currently used VAD mode is one of a plurality of VAD modes produced by using the hardware VAD function and the software VAD function in combination.
An embodiment of the present application further provides an intelligent terminal comprising a voice chip, where the voice chip comprises a pickup module, a hardware VAD module, a processor and a memory, and the memory stores a VAD program and a power consumption control program. The pickup module is configured to collect a sound signal input into the voice chip. The hardware VAD module is configured to detect, in a hardware manner, whether the sound signal contains a voice signal when the currently used VAD mode indicates that the hardware VAD function of the voice chip is enabled. The processor is configured to execute the VAD program to detect, in a software manner, whether the sound signal contains a voice signal when the currently used VAD mode indicates that the software VAD function of the voice chip is enabled. The processor is further configured to execute the power consumption control program to control the voice device to switch from a low power consumption mode to a normal working mode when the sound signal is detected, using the currently used VAD mode, to contain a voice signal. The currently used VAD mode is one of a plurality of VAD modes produced by using the hardware VAD function and the software VAD function in combination.
An embodiment of the present application further provides an intelligent terminal comprising a voice chip, where the voice chip comprises a pickup module, a hardware VAD module, a coprocessor and a memory, and the memory stores a VAD program and a power consumption control program. The pickup module is configured to collect a sound signal input into the voice chip. The hardware VAD module is configured to detect, in a hardware manner, whether the sound signal contains a voice signal when the currently used VAD mode indicates that the hardware VAD function of the voice chip is enabled. The coprocessor is configured to execute the VAD program to detect, in a software manner, whether the sound signal contains a voice signal when the currently used VAD mode indicates that the software VAD function of the voice chip is enabled. The coprocessor is further configured to execute the power consumption control program to control the main processor of the intelligent terminal to switch from a low power consumption mode to a normal working mode when the sound signal is detected, using the currently used VAD mode, to contain a voice signal. The currently used VAD mode is one of a plurality of VAD modes produced by using the hardware VAD function and the software VAD function in combination.
An embodiment of the present application further provides a self-service terminal comprising a voice chip and a main processor, where the voice chip comprises a pickup module, a hardware VAD module, a coprocessor and a memory, and the memory stores a VAD program and a power consumption control program. The pickup module is configured to collect a sound signal input into the voice chip. The hardware VAD module is configured to detect, in a hardware manner, whether the sound signal contains a voice signal when the currently used VAD mode indicates that the hardware VAD function of the voice chip is enabled. The coprocessor is configured to execute the VAD program to detect, in a software manner, whether the sound signal contains a voice signal when the currently used VAD mode indicates that the software VAD function of the voice chip is enabled. The coprocessor is further configured to execute the power consumption control program to control the main processor to switch from a low power consumption mode to a normal working mode when the sound signal is detected, using the currently used VAD mode, to contain a voice signal. The currently used VAD mode is one of a plurality of VAD modes produced by using the hardware VAD function and the software VAD function in combination.
Embodiments of the present application also provide a computer readable storage medium storing a computer program, which, when executed by a processor, causes the processor to implement the steps in the power consumption control method, the VAD mode configuration method, or the voice endpoint detection method provided in the embodiments of the present application.
In the embodiments of the present application, the voice chip or device has a VAD function that can detect whether a voice signal is being input; when a voice signal is detected, the voice chip or device switches from a low power consumption mode to a normal working mode, which saves power. Furthermore, the voice chip or device has both a hardware VAD function and a software VAD function, using the two in combination produces a plurality of VAD modes, and the VAD mode used by the voice chip or device can be flexibly configured; therefore, the accuracy of the voice input detection result can be improved to a certain extent, the false triggering probability can be reduced, and the low power consumption performance of the voice chip or device can be improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a schematic flow chart of a power consumption control method according to an exemplary embodiment of the present application;
fig. 2 is a flowchart illustrating a VAD mode configuration method according to an exemplary embodiment of the present application;
fig. 3 is a flowchart illustrating a voice endpoint detection method according to an exemplary embodiment of the present application;
fig. 4a is a schematic structural diagram of a voice chip according to an exemplary embodiment of the present application;
fig. 4b is a diagram illustrating configuration of a voice chip according to an exemplary embodiment of the present application;
fig. 4c is a schematic structural diagram of another voice chip according to an exemplary embodiment of the present application;
fig. 5 is a schematic structural diagram of yet another voice chip according to an exemplary embodiment of the present application;
fig. 6 is a schematic structural diagram of an intelligent terminal according to an exemplary embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of another intelligent terminal provided in an exemplary embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the prior art, whether a voice signal is input is detected by a hardware VAD module alone, which is easily falsely triggered in a noisy environment, so the low power consumption performance of the voice chip is poor. In view of this technical problem, in the embodiments of the present application the voice chip or device has both a hardware VAD function and a software VAD function, and multiple VAD modes can be produced by using the two in combination. By flexibly configuring the VAD mode used by the voice chip or device, the accuracy of the voice input detection result can be improved to a certain extent, the false triggering probability can be reduced, and the low power consumption performance of the voice chip or device can be improved.
Before the embodiments of the present application are described in detail, the voice chip or device referred to in these embodiments is explained. The embodiments place no limitation on the voice chip or device: any chip or device capable of detecting voice signals and performing other processing (e.g., storing and playing) is applicable. A voice device may be an electronic device that includes a voice chip and may also include other components, such as a communication module (e.g., WiFi or Bluetooth), a display, and a power module. The voice chip comprises at least a processor and a memory.
If the functions of the voice device are simple and the processor in the voice chip is sufficient to implement the basic functions of the device, the voice device may not include any other processor; in this case, the processor in the voice chip implements both the voice-processing functions and the basic functions of the device. For example, suppose the voice device is a smart alarm clock that supports voice broadcast and includes a voice chip. Because the alarm clock functions are simple, the smart alarm clock includes no processor other than the one in the voice chip, which reduces its implementation cost; the processor in the voice chip then implements the voice-processing functions on the one hand and basic functions such as timing on the other.
Conversely, if the voice device has relatively rich functions, or the processing capability of the processor in the voice chip is limited and insufficient to implement the basic functions of the device, the voice device may further include a main processor, and the processor in the voice chip may serve as a coprocessor. For example, suppose the voice device is a smartphone that includes a voice chip. Because the smartphone has relatively powerful functions, it also includes a main processor in addition to the processor in the voice chip: the processor in the voice chip, acting as a coprocessor, is mainly responsible for voice-processing functions, while the main processor is mainly responsible for the basic functions of the smartphone, such as communication, wireless internet access, gaming, video playback, photographing, and online transactions.
In the embodiments of the present application, the voice chip or device may be used in low power consumption scenarios where power consumption must be considered, for example when the voice chip or device, or the device in which the voice chip is located, is powered by a battery (e.g., a dry battery or a storage battery). A battery-powered voice device, or a device in which the voice chip is located, may be any intelligent terminal that includes a voice chip and has voice functions, including but not limited to: various remote controllers, story machines, smart speakers, tablet computers, smart phones, smart robots, smart alarm clocks, smart bracelets, smart switches, unmanned delivery vehicles, self-service express cabinets, self-service terminals, and the like. A self-service terminal may be, for example, a supermarket POS machine, a bank self-service cash machine, or a self-service shopping guide terminal in a shopping mall, an airport, or a similar venue. Of course, the voice chip or device may also be powered by a non-battery power source (e.g., mains electricity); such devices include but are not limited to televisions, air conditioners, water heaters, desktop computers, and the like.
Regardless of the power supply mode, and regardless of the implementation form of the voice chip or voice device, the voice chip or device in the embodiments of the present application has both a hardware VAD function and a software VAD function. The hardware VAD function is a VAD function implemented by a hardware VAD module built into the voice chip or device; optionally, the hardware VAD module may be fixed in hardware on the voice chip or device, and the VAD function it implements may be adjusted by configuring parameters. The software VAD function is a VAD function implemented by a processor in the voice chip or device executing a VAD program.
In the embodiments of the application, the hardware VAD function is used in combination with the software VAD function, and this combination can produce a plurality of VAD modes, including at least the following: a hardware VAD mode, a software VAD mode, a soft-hard combined VAD mode, and a soft-soft combined VAD mode. The hardware VAD mode is a mode in which VAD is performed by the hardware VAD module alone; the software VAD mode is a mode in which VAD is performed by the processor executing a VAD program; the soft-hard combined VAD mode is a mode in which the hardware VAD module first performs VAD once, and the processor then executes the VAD program to perform VAD again; the soft-soft combined VAD mode is a mode in which the processor executes the VAD program to perform VAD multiple times.
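For illustration only, the four modes just listed could be modelled in firmware or driver code roughly as in the following Python sketch; the type and enumerator names are assumptions made for this example and are not defined by the present application.

    from enum import Enum, auto

    class VadMode(Enum):
        """The four VAD modes described above; names are illustrative only."""
        HARDWARE = auto()          # hardware VAD module alone
        SOFTWARE = auto()          # processor executes the VAD program alone
        HW_SW_COMBINED = auto()    # hardware VAD first, then software VAD confirms
        SW_SW_COMBINED = auto()    # the VAD program is executed multiple times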
In the embodiments of the present application, the implementation of the hardware VAD is not limited. For example, the hardware VAD mode may detect whether a received sound signal contains a voice signal based on the energy of the sound signal. Similarly, the implementation of the software VAD is not limited. For example, one implementation of a software VAD includes: dividing the sound signal into frames; extracting features from each frame; training a classifier on a set of frames from known speech and silence regions; and classifying unknown frames as speech or silence to determine whether a voice signal is being input. As another example, another implementation of a software VAD includes: training a neural network VAD (NN-VAD) model in advance on human voice samples, the model being able to detect whether a sound signal contains a voice signal; the sound signal can then be fed into the NN-VAD model, which is used to detect whether a voice signal is being input.
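As a hedged illustration of the framing-and-classification flow just described, the sketch below splits the signal into frames, classifies each frame, and declares speech if enough frames look voiced. The frame length, the energy-based stand-in for the trained classifier, and the 20% decision rule are all assumptions made for this example, not details from the application.

    def frame_signal(samples, frame_len=160):
        """Split the sound signal into fixed-length frames (10 ms at 16 kHz, assumed)."""
        return [samples[i:i + frame_len]
                for i in range(0, len(samples) - frame_len + 1, frame_len)]

    def classify_frame(frame):
        """Stand-in for the trained classifier or NN-VAD model described above;
        a crude energy rule is used here only so the sketch runs."""
        energy = sum(s * s for s in frame) / len(frame)
        return energy > 0.01          # threshold is an arbitrary placeholder

    def software_vad(samples):
        """Treat the signal as containing speech if enough frames are voiced."""
        frames = frame_signal(samples)
        if not frames:
            return False
        voiced = sum(classify_frame(f) for f in frames)
        return voiced / len(frames) >= 0.2   # decision rule is an assumption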
In the embodiments of the present application, the hardware VAD function and the software VAD function are used in combination to produce a plurality of VAD modes, and the VAD mode used by a voice chip or device may be flexibly configured according to the environment in which the voice chip or device is (or will be) located, the application scenario, time information, and/or user preferences.
The power consumption control method based on the VAD mode currently used by the voice chip or device and the method for configuring the VAD mode used by the voice chip or device provided by the embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of a power consumption control method according to an exemplary embodiment of the present disclosure. The method is suitable for a voice chip or equipment, and the voice chip or equipment has a hardware VAD function and a software VAD function at the same time. For the related description of the voice chip or device and the hardware VAD function and the software VAD function, refer to the foregoing embodiments, which are not described herein again. As shown in fig. 1, the method includes:
11. Collect a sound signal input into the voice chip or device.
12. Detect whether the sound signal contains a voice signal by using the VAD mode currently used by the voice chip or device.
13. If the sound signal contains a voice signal, switch the voice chip or device from the low power consumption mode to the normal working mode; the currently used VAD mode is one of a plurality of VAD modes produced by using the hardware VAD function and the software VAD function in combination.
In this embodiment, the voice chip or device includes a sound pickup module, such as a microphone, and can capture a sound signal entering the voice chip or device. Here, a sound signal broadly refers to a sound wave generated by the vibration of an object. The sound signal entering the voice chip or device may include both a voice signal and environmental noise, or may include only environmental noise and no voice signal. Environmental noise includes but is not limited to traffic noise, construction noise, industrial noise, social noise, and so on. The collected sound signal differs with the application scenario of the voice chip or device. For example, in an outdoor scene, the sound signal collected by the voice chip or device may include traffic noise as well as a voice signal from the user of the voice chip or device. As another example, in social scenes such as business venues, sports events, tourist attractions, or entertainment venues, the sound signal collected by the voice chip or device may include noise from surrounding people as well as a voice signal from the user of the voice chip or device.
In this embodiment, the voice chip or device may stay in a low power consumption mode while no voice signal is being input, to save power. When a sound signal is collected, the VAD function of the voice chip or device can be used to detect whether the sound signal contains a voice signal; when a voice signal is detected in the sound signal, the voice chip or device switches from the low power consumption mode to the normal working mode to perform voice processing. In the normal working mode, the voice chip or device may further process the voice signal, for example identify whether it is a specific wake-up word, and re-enter the sleep state if it is not. Optionally, when the sound signal contains no voice signal, the voice chip or device may remain in the low power consumption mode to save power.
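To make the collect-detect-wake flow concrete, the sketch below walks through steps 11-13 against a hypothetical chip object; every method name on chip (capture_sound, detect_voice, enter_normal_mode, and so on) is invented for this illustration and is not an API defined by the present application.

    def power_consumption_control(chip, vad_mode):
        """Sketch of steps 11-13: collect, detect with the current VAD mode,
        and leave the low power consumption mode only when speech is found."""
        sound = chip.capture_sound()              # step 11: pickup module collects the signal
        if chip.detect_voice(sound, vad_mode):    # step 12: currently used VAD mode
            chip.enter_normal_mode()              # step 13: wake from the low power mode
            if not chip.is_wake_word(sound):      # optional follow-up check
                chip.enter_low_power_mode()       # not the wake-up word: sleep again
        else:
            chip.stay_in_low_power_mode()         # keep saving power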
In this embodiment, when the voice chip or device enters the low power consumption mode, the pickup module and the part of the hardware that implements the VAD function may keep operating normally while other hardware modules stop working. The normal working mode is defined relative to the low power consumption mode: in the normal working mode, every hardware module in the voice chip or device can operate normally, and most or all functions can be used normally. Exactly which functions of the voice chip or device can be used in the low power consumption mode and in the normal working mode can be decided flexibly during chip or device design according to the application scenario and requirements, and is not limited here.
In this embodiment, the voice chip or device not only has the hardware VAD function, but also has the software VAD function. In this embodiment, a combination of software VAD functions and hardware VAD functions can be used to obtain multiple VAD modes. Wherein, the software VAD function and the hardware VAD function are combined to obtain at least the following VAD modes: a hardware VAD mode, a software VAD mode, a soft-hard combined VAD mode, and a soft-soft combined VAD mode. For detailed descriptions of the hardware VAD function, the software VAD function and various VAD modes, please refer to the foregoing embodiments, which are not described herein again.
In this embodiment, the VAD mode supported (or selectively used) by the voice chip or device is not limited. The VAD modes supported by the voice chip or device at least include the above listed multiple (i.e. two or more) VAD modes. In this embodiment, flexible configuration of the VAD mode currently used by the voice chip or device is allowed. The VAD mode currently used by the voice chip or device is one of a plurality of VAD modes supported by the voice chip or device. The VAD mode currently used may be different according to different VAD modes supported by the voice chip or the device, and is not limited to this. The following illustrates the various VAD modes supported by the voice chip or device and the currently used VAD mode in four cases:
Case 1: the voice chip or device supports the hardware VAD mode and the software VAD mode; accordingly, the VAD mode currently used by the voice chip or device is one of these two VAD modes, and which one is used can be flexibly configured.
Case 2: the voice chip or device supports the hardware VAD mode and the soft-hard combined VAD mode; accordingly, the VAD mode currently used by the voice chip or device is one of these two VAD modes, and which one is used can be flexibly configured.
Case 3: the voice chip or device supports the software VAD mode and the soft-hard combined VAD mode; accordingly, the VAD mode currently used by the voice chip or device is one of these two VAD modes, and which one is used can be flexibly configured.
Case 4: the voice chip or device supports the hardware VAD mode, the software VAD mode and the soft-hard combined VAD mode; accordingly, the VAD mode currently used by the voice chip or device is one of these three VAD modes, and which one is used can be flexibly configured.
In this embodiment, after the sound signal is collected, whether the sound signal contains a voice signal can be detected by using the VAD mode currently used by the voice chip or device. How this detection is carried out differs with the VAD mode currently used by the voice chip or device, as the following examples illustrate:
In an alternative embodiment A1, when the VAD mode currently used by the voice chip or device is the hardware VAD mode, the sound signal is sent to the hardware VAD module in the voice chip or device, and whether the sound signal contains a voice signal is detected in a hardware manner; if a voice signal is detected in the sound signal in this hardware manner, it can be directly determined that the sound signal contains a voice signal. In embodiment A1, in the low power consumption mode, the pickup module and the hardware VAD module in the voice chip or device are in normal working states, and there is no limitation on whether other modules are.
In an alternative embodiment A2, when the VAD mode currently used by the voice chip or device is the software VAD mode, the sound signal is sent to a processor in the voice chip or device, and whether the sound signal contains a voice signal is detected in a software manner; if a voice signal is detected in the sound signal in this software manner, it can be directly determined that the sound signal contains a voice signal. In embodiment A2, in the low power consumption mode, the pickup module in the voice chip or device is in a normal working state, the processor can at least execute the VAD program, and there is no limitation on whether other modules are in normal working states.
In an alternative embodiment A3, when the VAD mode currently used by the voice chip or device is the soft-hard combined VAD mode, the sound signal is on the one hand sent to the hardware VAD module in the voice chip or device, and whether the sound signal contains a voice signal is detected in a hardware manner; on the other hand, the sound signal is sent to a processor in the voice chip or device, and if a voice signal is detected in the sound signal in the hardware manner, the processor detects again, in a software manner, whether the sound signal contains a voice signal. If a voice signal is detected again in the software manner, it is determined that the sound signal contains a voice signal. If no voice signal is detected in the hardware manner, the subsequent steps need not be executed and the voice chip or device remains in the low power consumption mode; likewise, if no voice signal is detected during the second, software-based check, the voice chip or device remains in the low power consumption mode. In embodiment A3, in the low power consumption mode, the pickup module and the hardware VAD module in the voice chip or device are in normal working states, the processor can at least execute the VAD program, and there is no limitation on whether other modules are in normal working states.
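A compact way to picture embodiments A1-A3 is a dispatch on the currently used VAD mode, as in the sketch below, which reuses the VadMode names from the earlier sketch. The hw_vad and sw_vad callables stand in for the hardware VAD module and the processor-executed VAD program; both, and the function name itself, are assumptions made for this example.

    def detect_voice(sound, mode, hw_vad, sw_vad):
        """Mirror embodiments A1-A3 by dispatching on the currently used VAD mode."""
        if mode is VadMode.HARDWARE:            # A1: hardware result is final
            return hw_vad(sound)
        if mode is VadMode.SOFTWARE:            # A2: software result is final
            return sw_vad(sound)
        if mode is VadMode.HW_SW_COMBINED:      # A3: hardware first, software confirms
            return hw_vad(sound) and sw_vad(sound)
        raise ValueError(f"unsupported VAD mode: {mode}")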
In the embodiments of the present application, the voice chip or device supports multiple VAD modes, and the VAD mode it uses can be flexibly configured as needed, so that a mode better matching the requirements is used; this improves the accuracy of the voice input detection result to a certain extent, reduces the false triggering probability, and improves the low power consumption performance of the voice chip or device. To make it easier to configure the VAD mode used by the voice chip or device, a policy describing when the voice chip or device should use each VAD mode, called a VAD mode usage policy, may be configured in advance. The VAD mode usage policy specifies the conditions that must be met for the voice chip or device to use each VAD mode. In practical applications, the VAD mode currently used by the voice chip or device can then be configured according to this pre-configured VAD mode usage policy.
For implementation of the pre-configured VAD mode usage strategy, see the following VAD mode configuration method embodiments. It should be noted that the VAD mode configuration method embodiments described below can also be implemented separately, without depending on the power consumption control method embodiments described above.
Fig. 2 is a flowchart illustrating a VAD mode configuration method according to an exemplary embodiment of the present application. The method is also suitable for a voice chip or equipment, and the voice chip or equipment has a hardware VAD function and a software VAD function at the same time. For the related description of the voice chip or device and the hardware VAD function and the software VAD function, refer to the foregoing embodiments, which are not described in detail herein. As shown in fig. 2, the method includes:
21. receiving a policy configuration instruction;
22. Configuring, according to the policy configuration instruction, the VAD mode usage policy required for the voice chip or device to use a plurality of VAD modes, where the plurality of VAD modes are produced by using the hardware VAD function and the software VAD function in combination.
In this embodiment, the sender of the policy configuration instruction is not limited; it may be any configuring party that has configuration management authority over the voice chip or device, for example a computer, a virtual machine, an application, a terminal device, and the like. The way the configuring party generates the policy configuration instruction is also not limited: it may be generated through a command window or command line, through an App on a terminal for managing the voice chip or device, or through a Web page for managing the voice chip or device.
In this embodiment, the policy configuration instruction is used to instruct the configuration of the VAD mode usage policy that the voice chip or device needs in order to use the various VAD modes. After the voice chip or device receives the policy configuration instruction, it can configure the VAD mode usage policy for the various VAD modes according to that instruction.
In the embodiment of the present application, the timing of configuring the VAD mode usage policy is not limited. For example, when a voice chip or device is initialized, a policy configuration instruction may be generated and sent to the voice chip or device, and the voice chip or device may configure the VAD mode usage policy according to the policy configuration instruction during the initialization process of the voice chip or device. For another example, when the voice chip or the device is factory configured, a policy configuration instruction may be generated, and the policy configuration instruction is sent to the voice chip or the device, so that the voice chip or the device may configure the VAD mode usage policy according to the policy configuration instruction in the factory configuration process. For another example, a policy configuration instruction may be generated and sent to the voice chip or device during the use of the voice chip or device, so that the voice chip or device may reconfigure the VAD mode use policy according to the policy configuration instruction during the use of the voice chip or device, thereby achieving the purpose of updating the VAD mode use policy.
In any of the above scenarios, this embodiment does not limit the content of the policy configuration instruction. Different contents of the policy configuration instruction lead to different ways of configuring the VAD mode usage policy; accordingly, the way the VAD mode currently used by the voice chip or device is configured may also vary with the pre-configured VAD mode usage policy. The following examples illustrate this:
In an alternative embodiment B1, the user is allowed, in actual use, to directly specify the VAD mode used by the voice chip or device through a VAD configuration instruction. To help the user identify the various VAD modes, a configurator may assign identifiers to the VAD modes through the policy configuration instruction, which may carry the identifiers of the various VAD modes or the way identifiers are allocated to them. Accordingly, during use, if the user wishes to specify the VAD mode to be used by the voice chip or device, the identifier of the specified VAD mode can be carried in a VAD configuration instruction and provided to the voice chip or device. The voice chip or device receives the VAD configuration instruction provided by the user and configures its currently used VAD mode according to the VAD mode identifier carried in the instruction, thereby achieving the purpose of specifying the VAD mode the voice chip or device uses. Further, the voice chip or device may parse the VAD mode identifier from the VAD configuration instruction; the identifier identifies the user-specified VAD mode, which may be the hardware VAD mode, the software VAD mode, or the soft-hard combined VAD mode. The VAD mode currently used by the voice chip or device is then configured to be the specified VAD mode.
In an alternative embodiment B2, the choice of VAD mode is tied to the remaining battery level of the voice chip or device. The remaining battery level range corresponding to each VAD mode can be configured according to the policy configuration instruction. Accordingly, during use, the current remaining battery level of the voice chip or device is monitored, and the VAD mode currently used by the voice chip or device is configured according to that level.
For example, when the remaining battery level is high, the soft-hard combined VAD mode can be used to ensure the accuracy of the voice detection result; when the remaining battery level is moderate, the software VAD mode can be used, reducing the power consumed by VAD while preserving accuracy as far as possible; and when the remaining battery level is low, the hardware VAD mode can be used, giving priority to saving power. The remaining battery level ranges can be divided by setting one or more battery thresholds, and the number of thresholds can be set flexibly according to application requirements.
For example, two battery thresholds may be set, a first threshold and a second threshold, with the first threshold greater than the second; for instance, the first threshold is 90% and the second is 40%, although the values are not limited to these. The first and second thresholds can be carried in the policy configuration instruction and provided to the voice chip or device, which configures the soft-hard combined VAD mode to correspond to remaining levels greater than or equal to the first threshold, the software VAD mode to correspond to remaining levels greater than or equal to the second threshold but less than the first, and the hardware VAD mode to correspond to remaining levels below the second threshold. Accordingly, when the VAD mode currently used by the voice chip or device is configured, the current remaining battery level of the voice chip or device is monitored: if it is greater than or equal to the first threshold, the currently used VAD mode is configured as the soft-hard combined VAD mode; if it is greater than or equal to the second threshold but less than the first threshold, the currently used VAD mode is configured as the software VAD mode; and if it is less than the second threshold, the currently used VAD mode is configured as the hardware VAD mode.
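Embodiment B2 boils down to mapping the monitored battery level onto a mode through the two configured thresholds, roughly as in the sketch below (again reusing VadMode from the earlier sketch); the function name and the use of the example values 90% and 40% as constants are assumptions made for illustration.

    FIRST_POWER_THRESHOLD = 0.90    # example value from the text (90%)
    SECOND_POWER_THRESHOLD = 0.40   # example value from the text (40%)

    def configure_mode_by_battery(remaining_level):
        """Map the monitored remaining battery level to a VAD mode (embodiment B2)."""
        if remaining_level >= FIRST_POWER_THRESHOLD:
            return VadMode.HW_SW_COMBINED   # plenty of power: favour accuracy
        if remaining_level >= SECOND_POWER_THRESHOLD:
            return VadMode.SOFTWARE         # moderate power: balance accuracy and consumption
        return VadMode.HARDWARE             # low power: save energy first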
In an alternative embodiment B3, the choice of VAD mode is tied to the attributes of the user using the voice chip or device. The user attributes corresponding to each VAD mode can be configured according to the policy configuration instruction. Accordingly, during use, the attributes of the user currently using the voice chip or device are monitored, and the VAD mode currently used by the voice chip or device is configured according to those attributes. The user attributes are not limited in this embodiment and may be any attribute information related to the user, such as gender, age, voiceprint characteristics, volume, user level, or the distance from the user to the voice chip.
In an alternative embodiment, users may be divided into different categories based on one or more user attributes. For example, users can be divided into men and women by gender, or into adults and non-adults (including the elderly and children) by age. On this basis, a set user category, such as women or non-adults, can use the software VAD mode or the soft-hard combined VAD mode, while non-set user categories, such as men or adults, use the hardware VAD mode. The set user category can be carried in the policy configuration instruction and provided to the voice chip or device; the voice chip or device parses the set user category from the instruction, configures the software VAD mode or the soft-hard combined VAD mode to correspond to the set user category, and configures the hardware VAD mode to correspond to the non-set user categories. In some scenarios, different VAD modes may also be used for different individual users: for set users (e.g., family member 1 and family member 2) the software VAD mode or the soft-hard combined VAD mode is used, while for non-set users (e.g., family member 3 and family member 4) the hardware VAD mode is used. The identification information of the set users can be carried in the policy configuration instruction and provided to the voice chip or device; the voice chip or device parses this identification information from the instruction, configures the software VAD mode or the soft-hard combined VAD mode to correspond to the set users, and configures the hardware VAD mode to correspond to the non-set users. Accordingly, when the VAD mode currently used by the voice chip or device is configured, the attributes of the user currently using the voice chip or device can be monitored, and whether that user belongs to the set user category or is a set user is judged from the monitored attributes. If so, the VAD mode currently used by the voice chip or device is configured as the software VAD mode or the soft-hard combined VAD mode; otherwise, if the user belongs to a non-set user category or is a non-set user, the currently used VAD mode is configured as the hardware VAD mode. Optionally, voiceprint recognition may be performed on the collected user voice, and whether the user belongs to the set user category or is a set user is then determined from the voiceprint information; alternatively, this may be determined from the account information used when the user logs in.
In another alternative embodiment, users may be divided into different categories according to their speaking volume. For example, different volume thresholds may be set to divide users into categories, and the number of thresholds can be chosen flexibly. Suppose two volume thresholds are set, a first volume threshold and a second volume threshold, with the first smaller than the second. For a user whose volume is below the first threshold, the voice is considered quiet, and the soft-hard combined VAD mode can be used to ensure the accuracy of the voice detection result; for a user whose volume is greater than or equal to the first threshold but less than the second, the voice is relatively loud, and the software VAD mode alone can be used, reducing the power consumed by VAD while preserving accuracy as far as possible; for a user whose volume is greater than or equal to the second threshold, the voice is loud, and the hardware VAD mode can be used, saving further power while still ensuring the accuracy of the voice detection result. The first and second volume thresholds can be carried in the policy configuration instruction and provided to the voice chip or device; the voice chip or device parses the two thresholds from the instruction, configures the soft-hard combined VAD mode to correspond to volumes below the first threshold, the software VAD mode to correspond to volumes greater than or equal to the first threshold but less than the second, and the hardware VAD mode to correspond to volumes greater than or equal to the second threshold. Accordingly, when the VAD mode currently used by the voice chip or device is configured, the volume of the user currently using the voice chip or device is monitored: if the volume is below the first volume threshold, the currently used VAD mode is configured as the soft-hard combined VAD mode; if the volume is greater than or equal to the first threshold but less than the second, it is configured as the software VAD mode; and if the volume is greater than or equal to the second threshold, it is configured as the hardware VAD mode.
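The volume-based policy follows the same condition-to-mode mapping pattern; a minimal sketch, assuming the thresholds arrive from the parsed policy configuration instruction and reusing the VadMode names above, might look as follows (function and parameter names are illustrative).

    def configure_mode_by_volume(volume, first_threshold, second_threshold):
        """Map the monitored user volume to a VAD mode (volume-based variant of B3)."""
        if volume < first_threshold:
            return VadMode.HW_SW_COMBINED   # quiet speaker: favour detection accuracy
        if volume < second_threshold:
            return VadMode.SOFTWARE         # moderately loud speaker
        return VadMode.HARDWARE             # loud speaker: save the most power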
In another alternative embodiment, distance ranges from the user to the voice chip can be set, and different VAD modes are used when the distance from the user to the voice chip or device falls into different ranges. For example, if the user is close to the voice chip or device (less than a first distance threshold), the hardware VAD mode may be used; if the user is somewhat farther away (greater than the first distance threshold but less than a second distance threshold), the software VAD mode may be used; and if the user is far away (greater than the second distance threshold), the soft-hard combined VAD mode may be used. Optionally, the distance from the user to the voice chip or device may be estimated from the amplitude, direction, and other characteristics of the collected user voice.
In an alternative embodiment B4, the choice of VAD mode is tied to the state of upper-layer applications associated with the voice chip or device. The upper-layer application states corresponding to each VAD mode can be configured according to the policy configuration instruction. Accordingly, during use, the state of the upper-layer application associated with the voice chip or device is monitored, and the VAD mode currently used by the voice chip or device is configured according to that state.
In this embodiment, the upper layer application related to the voice chip or the device is not limited, and may be any application that may use a voice function, such as a navigation application, a map application, or an instant messaging application.
Optionally, the state of an upper-layer application may be divided into a running state and a non-running state. Since an upper-layer application in the running state is likely to use the voice chip, the software VAD mode or the soft-hard combined VAD mode may be used so that voice detection is more accurate; conversely, when the upper-layer application is not running, the hardware VAD mode may be used. The correspondence between the running state and the software VAD mode or soft-hard combined VAD mode, and between the non-running state and the hardware VAD mode, can be carried in the policy configuration instruction and provided to the voice chip or device, which then configures the software VAD mode or soft-hard combined VAD mode to correspond to the running state of the upper-layer application and the hardware VAD mode to correspond to its non-running state. Accordingly, when the VAD mode currently used by the voice chip or device is configured according to the state of the associated upper-layer application, that state is monitored: if the upper-layer application is running, the currently used VAD mode is configured as the software VAD mode or the soft-hard combined VAD mode; if it is not running, the currently used VAD mode is configured as the hardware VAD mode.
In an alternative embodiment B5, the usage order and usage duration of the VAD modes can be preset, so that the VAD modes are used in turn in the configured order, each for its configured duration. The usage order and duration of the various VAD modes can be configured according to the policy configuration instruction, which carries them. Accordingly, during use, when the usage duration of the current VAD mode ends, the VAD mode currently used by the voice chip or device is reconfigured to the next VAD mode in the configured order, according to the usage order and durations configured in the VAD mode usage policy.
In the present embodiment, the usage sequence and usage duration of each VAD mode are not limited. For example, the usage sequence of the various VAD modes may be: soft-hard combined VAD mode, software VAD mode, hardware VAD mode; or hardware VAD mode, software VAD mode, soft-hard combined VAD mode. In addition, considering that the power consumption of the hardware VAD mode is lower, which is beneficial to saving power, the usage duration of the hardware VAD mode can be set longer, for example, 10 hours; correspondingly, the software VAD mode and the soft-hard combined VAD mode consume relatively more power, and their usage durations can be set shorter, for example, 2 hours, in order to save power. Of course, the usage durations of the three VAD modes may also be the same.
Further, the usage sequence and/or usage duration of the various VAD modes can be dynamically adjusted according to the environmental information of the voice chip or the device. For example, if the voice chip or the device is in a relatively noisy environment for a long time, the usage duration of the software VAD mode and the soft-hard combined VAD mode can be increased, and the accuracy of the voice detection result can be ensured as much as possible.
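By way of non-limiting illustration, the following C-language sketch shows one possible realization of embodiment B5, in which a preset usage sequence with per-mode usage durations is cycled through by a timer callback; the table contents and helper names are hypothetical and are introduced only for this example.

#include <stddef.h>

typedef enum { VAD_MODE_HARDWARE, VAD_MODE_SOFTWARE, VAD_MODE_SOFT_HARD } vad_mode_t;
extern void chip_set_vad_mode(vad_mode_t mode);          /* assumed chip helper */

/* One entry of the configured usage sequence: a VAD mode and its usage duration. */
struct vad_schedule_entry {
    vad_mode_t mode;
    unsigned   duration_s;      /* usage duration in seconds */
};

/* Example schedule from the policy configuration instruction: hardware VAD for
 * 10 hours, then software VAD and soft-hard combined VAD for 2 hours each,
 * repeated cyclically. The values are illustrative only. */
static struct vad_schedule_entry g_schedule[] = {
    { VAD_MODE_HARDWARE,  10 * 3600 },
    { VAD_MODE_SOFTWARE,   2 * 3600 },
    { VAD_MODE_SOFT_HARD,  2 * 3600 },
};
static size_t g_index = 0;

/* Applies the first entry of the sequence and returns its usage duration. */
unsigned vad_schedule_start(void)
{
    g_index = 0;
    chip_set_vad_mode(g_schedule[0].mode);
    return g_schedule[0].duration_s;
}

/* Called by a timer when the usage duration of the previous VAD mode ends;
 * configures the next adjacent VAD mode and returns its duration so the
 * caller can re-arm the timer. */
unsigned vad_schedule_advance(void)
{
    g_index = (g_index + 1) % (sizeof(g_schedule) / sizeof(g_schedule[0]));
    chip_set_vad_mode(g_schedule[g_index].mode);
    return g_schedule[g_index].duration_s;
}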
In an alternative embodiment B6, the use of VAD mode is combined with environmental information where the voice chip or device is located. Based on the above, the corresponding environment information of the voice chip or the device when using various VAD modes can be configured according to the strategy configuration instruction. Correspondingly, in the application process, the current environment information of the voice chip or the equipment can be monitored; and configuring the VAD mode currently used by the voice chip or the equipment according to the environment information of the voice chip or the equipment currently located.
Optionally, the configuration personnel may obtain in advance environment information in which the voice chip or the device needs to be located, and carry the environment information in which the voice chip or the device needs to be located in the policy configuration instruction, that is, the policy configuration instruction includes the environment information in which the voice chip or the device needs to be located. For example, a configurator may enter environmental information on a Web page or App page and then click a configuration control on the page to issue a policy configuration instruction. Based on the above, the voice chip or the voice equipment can analyze the environmental information where the voice chip or the voice equipment needs to be located from the strategy configuration instruction; and configuring the corresponding environment information of the voice chip or the equipment when various VAD modes are used according to the environment information in which the voice chip or the equipment needs to be positioned.
Optionally, the configuration staff may also instruct the voice chip or the device to acquire the environment information in which the voice chip or the device needs to be located, and based on this, the instruction for acquiring the environment information may be carried in the policy configuration instruction, that is, the policy configuration instruction includes the instruction for acquiring the environment information. For example, a configurator may input an instruction indicating to collect environment information on a Web page or an App page and then click a configuration control on the page to issue a policy configuration instruction. Based on the above, the voice chip or the device can acquire the environmental information in which the voice chip or the device needs to be located according to the instruction for acquiring the environmental information indicated in the policy configuration instruction; and configuring the corresponding environment information of the voice chip or the equipment when various VAD modes are used according to the environment information in which the voice chip or the equipment needs to be positioned.
In the embodiment of the present application, there is no limitation on the environment information where the voice chip or the device needs to be located, for example, the environment information may include but is not limited to: the voice chip or the device is located in at least one of the environment position, the environment type, the environment noise level, the environment noise category, and the time information. The ambient noise level may be classified according to the decibel of the sound, and may include, but is not limited to, the following categories: very noisy, generally noisy, relatively quiet, very quiet, and the like; the environment types may include, but are not limited to: home environments, office environments, entertainment venues, public places, and the like; the ambient noise categories may include, but are not limited to: human voice noise, non-human noise; further, non-human noise can be further classified as: animal noise, construction noise, traffic noise, etc.; the time information may be divided into day and night, and may also be divided into morning, afternoon, evening, and the like. It should be noted that several types of environment information such as the environment location, the environment type, the environment noise level, the environment noise type, and the time information may be used alone or in combination in any manner. For example, one type of environmental information includes: the office environment in the daytime is quiet; another type of environment information includes: entertainment venues at night are very noisy; yet another environment information includes: a home environment during the day; and so on.
Based on the above, the corresponding environment information of the voice chip or device when using various VAD modes can be configured according to at least one of the environment location, the environment type, the environment noise degree, the environment noise category and the time information corresponding to the environment information where the voice chip or device is located. Accordingly, the corresponding environment information of the voice chip or device when using various VAD modes also includes: at least one of an environmental location, an environmental type, an environmental noisiness, an environmental noise category, and time information. Correspondingly, the embodiment of configuring the VAD mode currently used by the voice chip or the device according to the environment information where the voice chip or the device is currently located includes: and configuring the VAD mode currently used by the voice chip or the equipment according to at least one of the environment position, the environment type, the environment noise degree, the environment noise category and the time information corresponding to the environment where the voice chip or the equipment is currently located. The VAD mode currently used by the configured voice chip or device is different according to different environment information. The following is illustrated in three scenarios:
Scene C1: the VAD mode currently used by the voice chip or device is configured according to the ambient noise level corresponding to the environment where the voice chip or device is located. For example, the ambient noise level of the environment where the voice chip or device is located is measured or known in advance; if the ambient noise is greater than the set noise threshold, it indicates that the environment is noisy and the false triggering probability is higher when the hardware VAD mode is used, so the VAD mode currently used by the voice chip or device can be configured to be the software VAD mode or the soft-hard combined VAD mode, which is beneficial to improving the accuracy of the subsequent voice input detection result; if the ambient noise is less than or equal to the set noise threshold, which indicates that the environment is relatively quiet, the VAD mode currently used by the voice chip or device can be configured to be the hardware VAD mode, which is beneficial to the low power consumption performance of the voice chip or device.
Scene C2: and configuring the VAD mode currently used by the voice chip or the equipment according to the time information corresponding to the environment where the voice chip or the equipment is located. For example, the time may be divided into two time periods of day and night, and compared to night, the daytime environment is relatively noisy and complex, so that when the time information is the time period corresponding to the day, that is, when the voice chip or the device is in the daytime period, the VAD mode currently used by the voice chip or the device is configured to be a software VAD mode or a soft-hard combination VAD mode, which is beneficial to improving the accuracy of the subsequent voice input detection result; when the time information is a time period corresponding to night, namely the voice chip or the equipment is in the night time period, the VAD mode currently used by the voice chip or the equipment is configured to be a hardware VAD mode, and therefore the low power consumption performance of the voice chip or the equipment is facilitated.
Scene C3: and configuring the VAD mode currently used by the voice chip or the equipment according to the environmental noise category corresponding to the environment where the voice chip or the equipment is located. For example, the environmental noise may be classified into two categories, namely, human noise and non-human noise, and if the category of the environmental noise includes human noise, in order to better distinguish the human noise from the effective voice signal, the VAD mode currently used by the voice chip or device may be configured to be a software VAD mode or a soft-hard VAD mode, which is beneficial to improving the accuracy of the subsequent voice input detection result; if the environmental noise category does not include human noise, the VAD mode currently used by the voice chip or device can be configured to be a hardware VAD mode, which is beneficial to the low-power performance of the voice chip or device.
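By way of non-limiting illustration, the following C-language sketch combines scenes C1 to C3 into one selection routine; the structure fields, the threshold value and the chip_set_vad_mode() helper are hypothetical names introduced only for this example.

#include <stdbool.h>

typedef enum { VAD_MODE_HARDWARE, VAD_MODE_SOFTWARE, VAD_MODE_SOFT_HARD } vad_mode_t;
extern void chip_set_vad_mode(vad_mode_t mode);            /* assumed chip helper */

/* Hypothetical snapshot of the monitored environment information. */
struct env_info {
    float noise_db;            /* measured ambient noise level                 */
    bool  is_daytime;          /* time information: daytime period or not      */
    bool  has_human_noise;     /* ambient noise category includes human voice  */
};

#define NOISE_THRESHOLD_DB 55.0f   /* illustrative "set noise threshold" */

/* Scenes C1-C3: a noisy environment, a daytime period or the presence of human
 * voice noise all favour the more accurate software or soft-hard combined mode;
 * otherwise the low-power hardware VAD mode is used. */
void configure_vad_by_environment(const struct env_info *env, bool support_combined)
{
    bool need_accurate = (env->noise_db > NOISE_THRESHOLD_DB)    /* scene C1 */
                      ||  env->is_daytime                         /* scene C2 */
                      ||  env->has_human_noise;                   /* scene C3 */

    if (need_accurate) {
        chip_set_vad_mode(support_combined ? VAD_MODE_SOFT_HARD : VAD_MODE_SOFTWARE);
    } else {
        chip_set_vad_mode(VAD_MODE_HARDWARE);
    }
}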
The above three scenarios can be used in combination with the four cases described in the foregoing embodiments; specifically, according to the configuration manners in the above three scenarios, one of the multiple VAD modes supported by the voice chip or device is selected as the currently used VAD mode. For example, in case 1, where the voice chip or device supports the hardware VAD mode and the software VAD mode, then in scene C1, if the ambient noise is greater than the set noise threshold, the VAD mode currently used by the voice chip or device is configured to be the software VAD mode; and if the ambient noise is less than or equal to the set noise threshold, the VAD mode currently used by the voice chip or device is configured to be the hardware VAD mode. For another example, in case 4, where the voice chip or device supports the hardware VAD mode, the software VAD mode and the soft-hard combined VAD mode, then in scene C2, if the time information is the time period corresponding to daytime, during which there is more human activity, the VAD mode currently used by the voice chip or device is configured to be the software VAD mode or the soft-hard combined VAD mode; and if the time information is the time period corresponding to night, during which there is less human activity, the VAD mode currently used by the voice chip or device is configured to be the hardware VAD mode.
It should be noted that, in a low power consumption scenario, the scheme provided by the embodiment of the present application may be used to reduce the false triggering probability of the voice chip or device and implement low power consumption control of the voice chip or device. However, low power consumption control of a voice chip or device is only one application scenario of the technical solution provided in the embodiment of the present application, and is not limited thereto. The technical solution provided in the embodiment of the present application can also be applied to voice endpoint detection, and can also be applied to operation control of various voice devices, as described in detail in the following embodiments.
Fig. 3 is a flowchart illustrating a voice endpoint detection method according to an exemplary embodiment of the present application. The method is suitable for a voice chip or equipment, and the voice chip or equipment has the functions of hardware VAD and software VAD. For the related description of the voice chip or device and the hardware VAD function and the software VAD function, refer to the foregoing embodiments, which are not described herein again. As shown in fig. 3, the method includes:
31. collecting sound signals input into a voice chip or equipment;
32. performing VAD processing on the sound signal by using the VAD mode currently used by the voice chip or device; wherein the currently used VAD mode is one of a plurality of VAD modes generated by the combined use of the hardware VAD function and the software VAD function.
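By way of non-limiting illustration, the following C-language sketch shows how step 32 might dispatch a collected sound frame to the hardware VAD function, the software VAD function, or both, depending on the currently used VAD mode; the primitive names are hypothetical and are introduced only for this example.

#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

typedef enum { VAD_MODE_HARDWARE, VAD_MODE_SOFTWARE, VAD_MODE_SOFT_HARD } vad_mode_t;

/* Assumed chip primitives; the names are illustrative only. */
extern size_t pickup_read_frame(int16_t *buf, size_t max_samples);   /* step 31 */
extern bool   hw_vad_detect(const int16_t *buf, size_t n);           /* hardware VAD */
extern bool   sw_vad_detect(const int16_t *buf, size_t n);           /* software VAD */

/* Step 32: perform VAD processing on the collected sound frame with the
 * currently used VAD mode and report whether it contains a voice signal. */
bool vad_process_frame(vad_mode_t mode, const int16_t *buf, size_t n)
{
    switch (mode) {
    case VAD_MODE_HARDWARE:
        return hw_vad_detect(buf, n);
    case VAD_MODE_SOFTWARE:
        return sw_vad_detect(buf, n);
    case VAD_MODE_SOFT_HARD:
        /* Hardware VAD screens first; software VAD confirms the detection. */
        return hw_vad_detect(buf, n) && sw_vad_detect(buf, n);
    }
    return false;
}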
In the embodiment of the present application, the application scenario of the voice chip or device is not limited, and the voice chip and device may be applied to various scenarios. For example, they can be applied to smart home appliances such as a sweeping robot, an air conditioner and a television, so that the smart home appliances can be controlled by voice. For another example, they can also be applied to vehicle navigation equipment, that is, voice navigation during driving, such as route planning and vehicle speed inquiry. For yet another example, they are also applicable to ticket vending machines, public lighting systems, and the like.
In any application scenario, it is necessary to identify whether a voice signal is input, and the voice signal can be further processed when a voice signal is input. The further processing of the voice signal may be: performing text conversion locally and identifying the operation instruction carried by the voice signal; or sending the voice signal to the cloud, where the cloud identifies the operation instruction carried by the voice signal; and so on. The process of identifying whether a voice signal is input is the voice endpoint detection process. In this embodiment, whether the sound signal contains a voice signal is detected by using the VAD mode currently used by the voice chip or device. For a detailed embodiment of detecting whether the sound signal contains a voice signal by using the VAD mode currently used by the voice chip or device, reference may be made to the foregoing embodiments, and details are not repeated here.
In this embodiment, the method shown in fig. 2 may be used to pre-configure VAD mode usage strategies required by a voice chip or device when using various VAD modes; furthermore, in the actual application process, the VAD mode currently used by the voice chip or device may be flexibly configured according to the VAD mode usage policy and in combination with the environment information where the voice chip or device is located, the related user attribute, the state of the upper layer application, the remaining power, and other information.
For example, in a scene of controlling the smart home devices by using voice, a voice chip is built in the smart home devices such as a sweeping robot, an air conditioner and a television, and the voice chip is configured in advance to use a software VAD mode in a daytime period and use a hardware VAD mode in a night period.
The application of the voice chip in the sweeping robot is as follows:
for the flexibility of operation, the sweeping robot is supplied with power by a storage battery, so that the sweeping robot is not limited by the position of a household power socket and a charging wire. In order to save the battery power, the sweeping robot works in a low power consumption mode under the condition that no voice signal is input. A user sends a cleaning instruction in a voice mode towards the sweeping robot in the daytime, such as 'cleaning living room'; a voice chip arranged in the sweeping robot collects voice signals in the surrounding environment and detects whether the voice signals comprise voice signals or not by utilizing a software VAD mode; when detecting that the sound signal contains a voice signal, the sweeping robot enters a normal working mode from a low power consumption mode; after the floor sweeping robot enters a normal working mode, the voice chip continues to recognize the voice signal and judges whether the voice signal contains a set instruction word, such as cleaning + living room, cleaning + kitchen, or cleaning. In this embodiment, the voice chip recognizes that the voice signal includes an instruction word, that is, cleaning + living room, and reports the recognition result to the processor of the floor sweeping robot, and the processor of the floor sweeping robot controls the floor sweeping robot to perform a cleaning task in the living room according to the instruction word recognized by the voice chip.
If the user faces the sweeping robot at night, a sweeping instruction is sent out in a voice mode, such as 'cleaning living room'; a voice chip arranged in the sweeping robot collects voice signals in the surrounding environment and detects whether the voice signals contain voice signals or not by utilizing a hardware VAD mode; and when the voice signal is detected to contain the voice signal, the sweeping robot enters a normal working mode from a low power consumption mode. For the related operations after the sweeping robot enters the normal operation mode, reference is made to the foregoing description, and details are not repeated here.
The application of the voice chip in the air conditioner is as follows:
An air conditioner is a relatively power-consuming device and is generally installed near a power socket and supplied with power by an AC power source. A voice chip is built into the air conditioner, and a user can perform voice control on the air conditioner. A user gives a cooling instruction by voice towards the air conditioner at night, for example, "turn on the air conditioner, 27 ℃"; the voice chip built into the air conditioner collects the sound signal in the surrounding environment and detects whether the sound signal contains a voice signal by using the hardware VAD mode; when it is detected that the sound signal contains a voice signal, the voice signal is further recognized to determine whether it contains a set instruction word, such as turn-on + temperature value, turn-off, or turn-on + operating mode name. In this embodiment, the voice chip recognizes that the voice signal contains the instruction word turn-on + 27 ℃, and reports the recognition result to the processor of the air conditioner; the processor of the air conditioner, according to the instruction word recognized by the voice chip, controls the refrigeration system of the air conditioner to start working and sets the cooling temperature to 27 ℃.
It should be noted that the execution subjects of the steps of the methods provided in the above embodiments may be the same device, or different devices may be used as the execution subjects of the methods. For example, the execution subjects of steps 11 to 13 may be device a; for another example, the execution subject of steps 11 and 12 may be device a, and the execution subject of step 13 may be device B; and so on.
In addition, in some of the flows described in the above embodiments and the drawings, a plurality of operations are included in a specific order, but it should be clearly understood that the operations may be executed out of the order presented herein or in parallel, and the order of the operations such as 11, 12, etc. is merely used for distinguishing different operations, and the order itself does not represent any execution order. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first", "second", etc. in this document are used for distinguishing different messages, devices, modules, etc., and do not represent a sequential order, nor do they limit the types of "first" and "second".
Fig. 4a is a schematic structural diagram of a voice chip according to an exemplary embodiment of the present application. The voice chip has both a hardware VAD function and a software VAD function. For the related description of the hardware VAD function and the software VAD function, refer to the foregoing embodiments, which are not described herein again. The voice chip of the present embodiment supports multiple VAD modes, and allows the VAD mode used by the voice chip to be flexibly configured according to the application scenario where the voice chip is located or is to be located, the time information, and/or the user preference. As shown in fig. 4a, the voice chip 40 includes: a sound pickup module 41, a hardware VAD module 42, a processor 43, and a memory 44.
In the embodiment of the present application, the application manner of the voice chip is not limited. For example, the voice chip may be applied to a low power consumption or energy saving device, which may include but is not limited to: a battery-powered remote controller, a story machine, a smart speaker, a tablet computer, a smart phone, a smart alarm clock, a smart bracelet, a smart switch, a smart robot, an unmanned delivery vehicle, a self-service express cabinet or a self-service terminal, and the like. For another example, the voice chip may be implemented as a stand-alone voice device.
The sound pickup module 41 is configured to collect the sound signal input into the voice chip 40. The sound pickup module 41 may be a microphone. In the embodiment of the present application, the sound signal input to the voice chip 40 is not limited, and may include, but is not limited to: a voice signal, human noise, environmental noise, and the like.
A hardware VAD module 42 for detecting whether the sound signal includes a voice signal in a hardware manner when the VAD mode currently used indicates that the hardware VAD function of the voice chip 40 is enabled. The currently used VAD mode is one of a plurality of VAD modes generated by using a combination of a hardware VAD function and a software VAD function, and may be, for example, a hardware VAD mode, a software VAD mode, or a soft-hard combined VAD mode.
The memory 44 stores therein a VAD program and a power consumption control program. A processor 43, configured to execute a VAD program to detect whether the sound signal includes a voice signal in a software manner when the currently used VAD mode indicates that the software VAD function of the voice chip is enabled.
The voice chip 40 of the present embodiment supports a low power consumption scheme, that is, when there is no voice signal, the voice chip 40 is in the low power consumption mode, and enters the normal operating mode only when a voice signal appears. Further, the processor 43 is further configured to execute the power consumption control program to: control the voice chip to enter the normal operating mode from the low power consumption mode when it is detected, by using the currently used VAD mode, that the sound signal contains a voice signal. Optionally, the processor 43 is further configured to: control the voice chip to remain in the low power consumption mode when it is detected, by using the currently used VAD mode, that the sound signal does not contain a voice signal.
The signal routing of the sound pickup module 41 varies according to the currently used VAD mode. Optionally, as shown in fig. 4b, when the currently used VAD mode is the soft-hard combined VAD mode, the sound pickup module 41 may send the collected sound signal to the hardware VAD module 42 and the processor 43, respectively. Optionally, as shown in fig. 4b, when the currently used VAD mode is the hardware VAD mode, the sound pickup module 41 may send the collected sound signal to the hardware VAD module 42; in addition, if the processor 43 further processes the voice signal or the sound signal after the voice chip enters the normal operating mode, the sound pickup module 41 may also send the collected sound signal to the processor 43. Optionally, as shown in fig. 4b, when the currently used VAD mode is the software VAD mode, the sound pickup module 41 may send the collected sound signal to the processor 43.
Further optionally, as shown in fig. 4c, the voice chip 40 further includes: a switching module 45; the switching module 45 is connected between the sound pickup module 41, the hardware VAD module 42, and the processor 43. The switching module 45 is specifically configured to: the pickup module 41 is switched on with the hardware VAD module 42 and/or the processor 43, depending on the VAD mode currently in use.
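By way of non-limiting illustration, the following C-language sketch shows how the switching module 45 might route the output of the sound pickup module 41 according to the currently used VAD mode; the register-level helper names are hypothetical and are introduced only for this example.

#include <stdbool.h>

typedef enum { VAD_MODE_HARDWARE, VAD_MODE_SOFTWARE, VAD_MODE_SOFT_HARD } vad_mode_t;

/* Assumed register-level helpers for the switching module; names are illustrative. */
extern void switch_connect_hw_vad(bool on);     /* pickup -> hardware VAD module */
extern void switch_connect_processor(bool on);  /* pickup -> processor           */

/* Routes the output of the sound pickup module according to the currently used
 * VAD mode, matching the connections described for the switching module 45. */
void switch_apply_vad_mode(vad_mode_t mode)
{
    switch_connect_hw_vad(mode == VAD_MODE_HARDWARE || mode == VAD_MODE_SOFT_HARD);
    switch_connect_processor(mode == VAD_MODE_SOFTWARE || mode == VAD_MODE_SOFT_HARD);
}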
In an alternative embodiment, regardless of which VAD mode is currently used, the hardware VAD module 42, upon receiving the sound signal sent by the sound pickup module 41, detects in a hardware manner whether the sound signal contains a voice signal, and reports the detection result to the processor 43.
The processor 43 operates differently depending on the currently used VAD mode. The processor 43 is specifically configured to: when the currently used VAD mode is the soft-hard combined VAD mode, if it is determined, according to the detection result reported by the hardware VAD module 42, that the hardware VAD module has detected a voice signal, detect again in a software manner whether the sound signal contains a voice signal, and control the voice chip 40 to enter the normal operating mode from the low power consumption mode when the voice signal is detected again in the software manner. Alternatively, the processor 43 is specifically configured to: when the currently used VAD mode is the hardware VAD mode, if it is determined, according to the detection result reported by the hardware VAD module 42, that the hardware VAD module has detected a voice signal, control the voice chip 40 to enter the normal operating mode from the low power consumption mode. Alternatively, the processor 43 is specifically configured to: when the currently used VAD mode is the software VAD mode, detect in a software manner whether the sound signal contains a voice signal, and control the voice chip 40 to enter the normal operating mode from the low power consumption mode when it is detected in the software manner that the sound signal contains a voice signal.
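By way of non-limiting illustration, the following C-language sketch summarizes the three behaviors of the processor 43 described above; the primitive names are hypothetical and are introduced only for this example.

#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

typedef enum { VAD_MODE_HARDWARE, VAD_MODE_SOFTWARE, VAD_MODE_SOFT_HARD } vad_mode_t;

/* Assumed primitives; the names are illustrative only. */
extern bool sw_vad_detect(const int16_t *buf, size_t n);  /* software VAD on the processor */
extern void chip_enter_normal_mode(void);                 /* leave the low power mode      */

/* Invoked when the hardware VAD module reports its detection result, or when a
 * sound frame reaches the processor directly in the software VAD mode. */
void processor_on_detection(vad_mode_t mode, bool hw_detected,
                            const int16_t *frame, size_t n)
{
    switch (mode) {
    case VAD_MODE_HARDWARE:
        if (hw_detected)                               /* hardware result is trusted directly */
            chip_enter_normal_mode();
        break;
    case VAD_MODE_SOFT_HARD:
        if (hw_detected && sw_vad_detect(frame, n))    /* re-check in software */
            chip_enter_normal_mode();
        break;
    case VAD_MODE_SOFTWARE:
        if (sw_vad_detect(frame, n))
            chip_enter_normal_mode();
        break;
    }
    /* Otherwise the voice chip stays in the low power consumption mode. */
}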
Further alternatively, the multiple VAD modes supported by voice chip 40 may be preconfigured. For example, voice chip 40 supports a hardware VAD mode and a software VAD mode; or, support hardware VAD mode and soft and hard combined VAD mode; alternatively, a software VAD mode and a soft-hard combined VAD mode; alternatively, hardware VAD, software VAD and soft-hard combined VAD are supported. In addition, the VAD mode currently used by the voice chip 40 can also be flexibly configured. Further, in order to flexibly configure the VAD mode currently used by the voice chip 40, the VAD mode usage strategy required by the voice chip 40 when using various VAD modes may also be configured in advance. Based on this, the processor 43 is further configured to: and configuring the VAD mode currently used by the voice chip according to a preset VAD mode use strategy.
Optionally, when the VAD mode usage policy is configured in advance, the processor 43 is specifically configured to: in the initialization process of the voice chip, according to the strategy configuration instruction, configuring VAD mode use strategies required by the voice chip 40 when using various VAD modes; or, in the factory configuration process of the voice chip, according to the policy configuration instruction, configuring the VAD mode usage policy required by the voice chip 40 when using various VAD modes; or, during the use of the voice chip, the VAD mode usage strategy required by the voice chip 40 when using various VAD modes is reconfigured according to the strategy configuration instruction.
In an optional embodiment, when configuring the VAD mode usage policy according to the policy configuration instruction, the processor 43 is specifically configured to perform at least one of the following operations (a hypothetical sketch of the resulting policy data is given after this list):
configuring corresponding environment information of a voice chip when various VAD modes are used according to a strategy configuration instruction;
configuring the corresponding residual electric quantity range of the voice chip when using various VAD modes according to the strategy configuration instruction;
configuring user attributes corresponding to the voice chip when various VAD modes are used according to the strategy configuration instruction;
configuring the corresponding upper application state of the voice chip when using various VAD modes according to the strategy configuration instruction;
configuring the use sequence and the use duration of various VAD modes according to the strategy configuration instruction;
and according to the strategy configuration instruction, configuring identifications of various VAD modes for a user to specify the VAD mode used by the voice chip through the VAD configuration instruction.
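By way of non-limiting illustration, the following C-language sketch gathers the configurable items listed above into a single hypothetical policy structure that a policy configuration instruction could populate; all field and function names are assumptions made only for this example.

#include <stddef.h>

typedef enum { VAD_MODE_HARDWARE, VAD_MODE_SOFTWARE, VAD_MODE_SOFT_HARD } vad_mode_t;

/* Hypothetical container for the VAD mode usage policy carried by a policy
 * configuration instruction; the field names are illustrative only. */
struct vad_usage_policy {
    /* environment information corresponding to each VAD mode */
    float       noise_threshold_db;
    /* remaining power range corresponding to each VAD mode, in percent */
    unsigned    power_high_pct;     /* >= high: soft-hard combined VAD mode */
    unsigned    power_low_pct;      /* <  low : hardware VAD mode           */
    /* user attribute and upper-layer application state mappings */
    vad_mode_t  mode_for_set_user;
    vad_mode_t  mode_for_app_running;
    /* usage sequence and usage duration of the VAD modes */
    vad_mode_t  sequence[3];
    unsigned    duration_s[3];
    /* identifier directly specified by a VAD configuration instruction */
    vad_mode_t  user_specified;
};

/* Assumed parser that fills the policy from the raw instruction payload. */
extern int vad_policy_parse(const void *payload, size_t len,
                            struct vad_usage_policy *out);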
Further, when configuring the corresponding environment information of the voice chip when using various VAD modes, the processor 43 is specifically configured to: acquiring environmental information in which a voice chip needs to be positioned according to a strategy configuration instruction; and configuring corresponding environment information of the voice chip when the voice chip uses various VAD modes according to the environment information in which the voice chip needs to be positioned.
As shown in fig. 4b, a configurator configures environmental information in which the voice chip needs to be located through an App page or a Web page on the terminal, and clicks a configuration control on the page to send a policy configuration instruction to the voice chip 40. The policy configuration instruction carries environment information in which the voice chip needs to be located. Based on this, the communication module 46 in the voice chip 40 receives the policy configuration instruction, and reports the policy configuration instruction to the processor 43; the processor 43 analyzes the environmental information where the voice chip needs to be located from the policy configuration instruction; and configuring corresponding environment information of the voice chip when the voice chip uses various VAD modes according to the environment information in which the voice chip needs to be positioned.
Or
As shown in fig. 4b, a configurator configures the remaining power ranges corresponding to the VAD modes through an App page or a Web page on the terminal, and clicks a configuration control on the page to send a policy configuration instruction to the voice chip 40. The policy configuration instruction carries remaining power ranges corresponding to various VAD modes. Based on this, the communication module 46 in the voice chip 40 receives the policy configuration instruction, and reports the policy configuration instruction to the processor 43; the processor 43 analyzes the remaining power ranges corresponding to the VAD modes from the policy configuration instruction; and configuring the corresponding remaining power range of the voice chip when various VAD modes are used.
Or
As shown in fig. 4b, a configurator configures the corresponding relationship between each VAD mode and the upper application state through an App page or a Web page on the terminal, and clicks a configuration control on the page to send a policy configuration instruction to the voice chip 40. The policy configuration instruction carries the corresponding relationship between various VAD modes and the upper application state. Based on this, the communication module 46 in the voice chip 40 receives the policy configuration instruction, and reports the policy configuration instruction to the processor 43; processor 43 analyzes the corresponding relationship between each VAD mode and the upper application state from the policy configuration instruction; and configuring the corresponding upper application state of the voice chip when various VAD modes are used.
Further optionally, the processor 43 is specifically configured to perform at least one of the following operations when the VAD mode currently used by the voice chip is configured according to the preconfigured VAD mode usage policy:
configuring a VAD mode currently used by the voice chip according to the current environment information of the voice chip;
configuring a VAD mode currently used by the voice chip according to the current residual electric quantity of the voice chip;
configuring a VAD mode currently used by the voice chip according to the user attribute of the currently used voice chip;
configuring a VAD mode currently used by the voice chip according to the state of an upper application associated with the voice chip;
according to the use sequence and the use duration of a plurality of VAD modes configured in the VAD mode use strategy, when the use duration of the previous VAD mode is ended, the VAD mode currently used by a voice chip is configured;
and configuring the VAD mode currently used by the voice chip according to the VAD mode identifier carried in the VAD configuration instruction.
In the embodiment of the present application, the content of the environment information where the voice chip is located is not limited, for example, the environment information may include but is not limited to: the environment position where the voice chip is located, the environment type, the environment noise degree, the environment noise category, the time information and the like. It should be noted that several types of environment information such as the environment location, the environment type, the environment noise level, the environment noise type, and the time information may be used alone or in combination in any manner. Based on this, the processor 43 is specifically configured to: and configuring the VAD mode currently used by the voice chip according to at least one of the current environment information of the voice chip. The VAD mode currently used by the configured voice chip is different according to different environment information.
Optionally, the processor 43 is specifically configured to: configure the VAD mode currently used by the voice chip according to the ambient noise level of the environment where the voice chip is currently located. For example, if the ambient noise is greater than the set noise threshold, the VAD mode currently used by the voice chip is configured to be the software VAD mode or the soft-hard combined VAD mode; and if the ambient noise is less than or equal to the set noise threshold, the VAD mode currently used by the voice chip is configured to be the hardware VAD mode.
Optionally, the processor 43 is specifically configured to: and configuring the VAD mode currently used by the voice chip according to the time information corresponding to the current environment of the voice chip. For example, the time may be divided into two time periods of day and night, and when the time information is the time period corresponding to the day, that is, when the voice chip is in the day time period, the VAD mode currently used by the voice chip is configured to be a software VAD mode or a soft-hard combined VAD mode; and when the time information is a time period corresponding to night, namely the voice chip is in the night time period, configuring the VAD mode currently used by the voice chip as a hardware VAD mode.
Optionally, the processor 43 is specifically configured to: and configuring the VAD mode currently used by the voice chip according to the environmental noise category corresponding to the current environment where the voice chip is located. For example, the environmental noise can be classified into two categories, namely human noise and non-human noise, and if the category of the environmental noise includes human noise, the VAD mode currently used by the voice chip is configured to be a software VAD mode or a soft-hard combined VAD mode; if the environmental noise category does not include human noise, the VAD mode currently used by the voice chip is configured to be a hardware VAD mode.
Further optionally, when configuring the VAD mode currently used by the voice chip according to the current remaining power of the voice chip, the processor 43 is specifically configured to: if the current remaining power of the voice chip is greater than or equal to a first power threshold, configure the VAD mode currently used by the voice chip to be the soft-hard combined VAD mode; if the current remaining power of the voice chip is greater than or equal to a second power threshold and less than the first power threshold, configure the VAD mode currently used by the voice chip to be the software VAD mode; and if the remaining power of the voice chip is less than the second power threshold, configure the VAD mode currently used by the voice chip to be the hardware VAD mode; wherein the first power threshold is greater than the second power threshold.
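By way of non-limiting illustration, the following C-language sketch shows the two-threshold selection based on the remaining power; the percentage values and the helper name are hypothetical and are introduced only for this example.

typedef enum { VAD_MODE_HARDWARE, VAD_MODE_SOFTWARE, VAD_MODE_SOFT_HARD } vad_mode_t;
extern void chip_set_vad_mode(vad_mode_t mode);   /* assumed chip helper */

/* Selects the VAD mode from the current remaining power, using two thresholds
 * (first power threshold > second power threshold). The percentages are
 * illustrative only. */
void configure_vad_by_power(unsigned remaining_pct)
{
    const unsigned first_threshold  = 50;   /* hypothetical first power threshold  */
    const unsigned second_threshold = 20;   /* hypothetical second power threshold */

    if (remaining_pct >= first_threshold) {
        chip_set_vad_mode(VAD_MODE_SOFT_HARD);   /* ample power: most accurate mode */
    } else if (remaining_pct >= second_threshold) {
        chip_set_vad_mode(VAD_MODE_SOFTWARE);
    } else {
        chip_set_vad_mode(VAD_MODE_HARDWARE);    /* low power: cheapest mode */
    }
}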
Further optionally, when configuring the VAD mode currently used by the voice chip according to the attribute of the user currently using the voice chip, the processor 43 is specifically configured to: if the current user using the voice chip is a set user type or a set user, configuring the VAD mode currently used by the voice chip as a software VAD mode or a soft-hard combined VAD mode; and if the current user using the voice chip is a non-set user type or a non-set user, configuring the VAD mode currently used by the voice chip as a hardware VAD mode.
Further optionally, when configuring the VAD mode currently used by the voice chip according to the attribute of the user currently using the voice chip, the processor 43 is specifically configured to: if the volume of the user using the voice chip is less than a set first volume threshold, configure the VAD mode currently used by the voice chip to be the soft-hard combined VAD mode; if the volume of the user using the voice chip is greater than or equal to the first volume threshold but less than a second volume threshold, configure the VAD mode currently used by the voice chip to be the software VAD mode; and if the volume of the user using the voice chip is greater than the second volume threshold, configure the VAD mode currently used by the voice chip to be the hardware VAD mode; wherein the first volume threshold is less than the second volume threshold.
Further optionally, when configuring the VAD mode currently used by the voice chip according to the state of the upper layer application associated with the voice chip, the processor 43 is specifically configured to: if the upper layer application associated with the voice chip is in a running state, configuring the VAD mode currently used by the voice chip as a software VAD mode or a soft-hard combined VAD mode; and if the upper layer application associated with the voice chip is in a non-running state, configuring the VAD mode currently used by the voice chip as a hardware VAD mode.
Further optionally, processor 43 is further configured to: and dynamically adjusting the use sequence and/or the use duration of each VAD mode according to the environmental information of the voice chip.
Further optionally, when configuring the VAD mode currently used by the voice chip according to the VAD mode identifier carried in a VAD configuration instruction, the processor 43 is specifically configured to: parse the VAD mode identifier from the VAD configuration instruction, where the VAD mode identifier identifies the VAD mode specified by the user, and the specified VAD mode is the hardware VAD mode, the software VAD mode or the soft-hard combined VAD mode; and configure the VAD mode currently used by the voice chip to be the specified VAD mode.
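By way of non-limiting illustration, the following C-language sketch parses a hypothetical textual VAD mode identifier and applies the user-specified VAD mode; the identifier strings and the helper name are assumptions made only for this example.

#include <string.h>

typedef enum { VAD_MODE_HARDWARE, VAD_MODE_SOFTWARE, VAD_MODE_SOFT_HARD } vad_mode_t;
extern void chip_set_vad_mode(vad_mode_t mode);   /* assumed chip helper */

/* Parses a hypothetical text identifier carried in a VAD configuration
 * instruction and applies the user-specified VAD mode. Returns 0 on success. */
int configure_vad_by_identifier(const char *mode_id)
{
    if (strcmp(mode_id, "hw") == 0) {
        chip_set_vad_mode(VAD_MODE_HARDWARE);
    } else if (strcmp(mode_id, "sw") == 0) {
        chip_set_vad_mode(VAD_MODE_SOFTWARE);
    } else if (strcmp(mode_id, "sw_hw") == 0) {
        chip_set_vad_mode(VAD_MODE_SOFT_HARD);
    } else {
        return -1;   /* unknown identifier: leave the current VAD mode unchanged */
    }
    return 0;
}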
Further, as shown in fig. 4c, the voice chip 40 may further include: a communication module 46, an analog-to-digital (A/D) module 47, a digital-to-analog (D/A) module 48, an audio output component (e.g., a speaker) 49, and a power supply component 491. The A/D module 47 is connected between the sound pickup module 41 and the hardware VAD module 42, and is configured to perform analog-to-digital conversion on the sound signal collected by the sound pickup module 41 and send the converted signal to the hardware VAD module 42. It should be noted that the relative positions of the A/D module 47 and the switching module 45 are not limited; for example, as shown in fig. 4c, the A/D module 47 may be located before the switching module 45, or the switching module 45 may be located before the A/D module 47. The D/A module 48 is connected between the audio output component 49 and the processor 43, and is configured to convert the digital signal output by the processor 43 into an analog signal and send the analog signal to the audio output component 49 for output.
In this embodiment, the voice chip has a VAD function, and whether a voice signal is input or not can be detected by using the VAD function, and when a voice signal is detected, the voice chip enters a normal operating mode from a low power consumption mode, so that the power consumption of the voice chip can be saved; furthermore, the voice chip has both hardware VAD function and software VAD function, multiple VAD modes can be generated by combining the hardware VAD function and the software VAD function, the accuracy of the voice input detection result can be improved to a certain extent by flexibly configuring the VAD mode used by the voice chip, the false triggering probability is reduced, and the low power consumption performance of the voice chip is improved.
Fig. 5 is a schematic structural diagram of another voice chip according to an exemplary embodiment of the present application. The voice chip has both a hardware VAD function and a software VAD function. For the related description of the hardware VAD function and the software VAD function, refer to the foregoing embodiments, which are not described herein again. The voice chip of the present embodiment supports multiple VAD modes, and allows the VAD mode used by the voice chip to be flexibly configured according to the application scenario where the voice chip or device is located or is to be located, the time information, and/or the user preference. As shown in fig. 5, the voice chip 50 includes: a sound pickup module 51, a hardware VAD module 52, a main processor 53, a coprocessor 56 and a memory 54. The coprocessor 56 mainly assists the main processor 53 in performing processing tasks that the main processor cannot perform, or cannot perform efficiently and effectively. In the present embodiment, the coprocessor 56 is mainly used to take over some work from the main processor 53 when the main processor 53 is in the low power consumption mode, and is responsible for waking up the main processor 53.
In the embodiment of the present application, an application manner of the voice chip is not limited. For example, the voice chip may be applied to a low power consumption or energy saving device, which may include but is not limited to: a remote controller powered by a battery, a story teller, a smart sound box, a tablet computer, a smart phone, a sweeping robot, and the like. Also for example, the voice chip may be implemented as a standalone voice device, or as a standalone application.
The sound pickup module 51 is configured to collect the sound signal input into the voice chip 50. The sound pickup module 51 may be a microphone. In the embodiment of the present application, the sound signal input to the voice chip 50 is not limited, and may include, but is not limited to: a voice signal, human noise, environmental noise, and the like.
A hardware VAD module 52 for detecting whether the sound signal includes a voice signal in a hardware manner when the currently used VAD mode indicates that the hardware VAD function of the voice chip is enabled. The currently used VAD mode is one of multiple VAD modes generated by combining a hardware VAD function and a software VAD function, and may be, for example, a hardware VAD mode, a software VAD mode, or a soft-hard combined VAD mode.
The memory 54 stores therein a VAD program and a power consumption control program. The co-processor 56 is used for executing the VAD program to detect whether the sound signal contains the voice signal in a software mode when the currently used VAD mode indicates that the software VAD function of the voice chip is enabled.
The voice chip 50 of the present embodiment supports a low power consumption scheme, that is, when there is no voice signal, the voice chip 50 is in the low power consumption mode, in which the main processor 53 does not work, and the main processor 53 enters the normal operating mode only when a voice signal appears. Further, the coprocessor 56 is further configured to execute the power consumption control program to: control the main processor 53 to enter the normal operating mode from the low power consumption mode when it is detected, by using the currently used VAD mode, that the sound signal contains a voice signal. Further, the main processor 53 may wake up other hardware modules or functions in the voice chip 50. Optionally, the coprocessor 56 is further configured to: control the voice chip 50 to remain in the low power consumption mode when it is detected, by using the currently used VAD mode, that the sound signal does not contain a voice signal.
The embodiment shown in fig. 5 differs from the embodiment shown in fig. 4a-4c mainly in that: the voice chip 50 includes a main processor 53 and a coprocessor 56. The function of the coprocessor 56 is similar to that of the processor 43 in the embodiment shown in fig. 4a to 4c, and is not described herein again, which can refer to the embodiment shown in fig. 4a to 4 c. The functions of the other modules except the main processor 53 and the coprocessor 56 are similar to the functions of the corresponding modules in the embodiments shown in fig. 4a to 4c, and are not described herein again, which can be seen in the embodiments shown in fig. 4a to 4 c.
Fig. 6 is a schematic structural diagram of an intelligent terminal according to an exemplary embodiment of the present application. As shown in fig. 6, the intelligent terminal 60 includes a voice chip 65, and the voice chip 65 includes: a pickup module 61, a hardware VAD module 62, a processor 63, and a memory 64.
The memory 64 stores therein a VAD program and a power consumption control program. And the pickup module 61 is used for collecting sound signals input into the voice chip. A hardware VAD module 62 for detecting whether the sound signal includes a voice signal in a hardware manner when the VAD mode currently used indicates that the hardware VAD function of the voice chip is enabled. A processor 63, configured to execute a VAD program to detect whether the sound signal includes a voice signal in a software manner when the currently used VAD mode indicates that the software VAD function of the voice chip is enabled. Further, the processor 63 is further configured to execute a power consumption control program for: and controlling the intelligent terminal to enter a normal working mode from a low power consumption mode under the condition that the voice signal is detected to be contained in the voice signal by utilizing the VAD mode currently used. The currently used VAD mode is one of a plurality of VAD modes generated by using a combination of a hardware VAD function and a software VAD function, and may be, for example, a hardware VAD mode, a software VAD mode, or a soft-hard combined VAD mode.
The intelligent terminal shown in fig. 6 differs from the voice chip shown in fig. 4a-4c in product form. The intelligent terminal shown in fig. 6 is in a device form, and may be, for example, a battery-powered electronic device such as a remote controller, a story machine, a smart speaker, a tablet computer, a smart phone, a smart alarm clock, a smart bracelet, a smart switch, a smart robot, an unmanned delivery vehicle, a self-service express cabinet or a self-service terminal, or a device powered by a power source other than a battery, such as an air conditioner, a refrigerator, or a television. For a detailed description of each module in the intelligent terminal shown in fig. 6, reference may be made to the descriptions of the corresponding modules in the embodiments shown in fig. 4a to 4c. Further, as shown in fig. 6, the intelligent terminal further includes: a communication component 66, a power supply component 68, and a display 67, where the components within the dashed box are optional rather than mandatory components.
Fig. 7 is a schematic structural diagram of another intelligent terminal provided in an exemplary embodiment of the present application. As shown in fig. 7, the intelligent terminal 70 includes: a voice chip 75 and a main processor 73; the voice chip 75 includes: a pickup module 71, a hardware VAD module 72, a coprocessor 76 and a memory 74; the memory 74 stores therein a VAD program and a power consumption control program.
And the sound pickup module 71 is used for collecting sound signals input into the voice chip. A hardware VAD module 72 for detecting whether the sound signal includes a voice signal in a hardware manner when the VAD mode currently used indicates that the hardware VAD function of the voice chip is enabled. The co-processor 76 is used for executing the VAD program to detect whether the sound signal includes a voice signal in a software manner when the currently used VAD mode indicates that the software VAD function of the voice chip is enabled. Further, the coprocessor 76 is also configured to execute a power consumption control program for: when it is detected that the voice signal is included in the voice signal by the currently used VAD mode, the main processor 73 is controlled to enter the normal operation mode from the low power consumption mode. The currently used VAD mode is one of a plurality of VAD modes generated by using a combination of a hardware VAD function and a software VAD function, and may be, for example, a hardware VAD mode, a software VAD mode, or a soft-hard combined VAD mode.
The intelligent terminal shown in fig. 7 differs from the voice chip shown in fig. 5 in product form. The intelligent terminal shown in fig. 7 is in a device form, and may be, for example, a battery-powered electronic device such as a remote controller, a story machine, a smart speaker, a tablet computer, a smart phone, a smart alarm clock, a smart bracelet, a smart switch, a smart robot, an unmanned delivery vehicle, a self-service express cabinet or a self-service terminal, or a device powered by a power source other than a battery, such as an air conditioner, a refrigerator, or a television. For a detailed description of each module in the intelligent terminal shown in fig. 7, reference may be made to fig. 5 and the descriptions of the corresponding modules in the embodiments shown in fig. 4a to 4c. Further, as shown in fig. 7, the intelligent terminal further includes: a communication component 77, a power supply component 79, and a display 78, where the components within the dashed box are optional rather than mandatory components.
Besides the above intelligent terminal, an embodiment of the present application further provides a self-service terminal, including: a voice chip and a main processor; the voice chip comprises a sound pickup module, a hardware VAD module, a coprocessor and a memory; the memory stores a VAD program and a power consumption control program; the sound pickup module is used for collecting the sound signal input into the voice chip; the hardware VAD module is used for detecting whether the sound signal contains a voice signal in a hardware manner when the currently used VAD mode indicates that the hardware VAD function of the voice chip is enabled; the coprocessor is used for executing the VAD program to detect whether the sound signal contains a voice signal in a software manner when the currently used VAD mode indicates that the software VAD function of the voice chip is enabled; the coprocessor is further configured to execute the power consumption control program to: control the main processor to enter the normal operating mode from the low power consumption mode when it is detected, by using the currently used VAD mode, that the sound signal contains a voice signal; wherein the currently used VAD mode is one of a plurality of VAD modes generated by using the hardware VAD function and the software VAD function in combination.
The self-service terminal provided in this embodiment differs from the voice chip shown in fig. 5 in product form. The self-service terminal provided by this embodiment can be, for example, a supermarket POS machine, a bank self-service cash machine, or a self-service shopping guide terminal deployed in scenarios such as shopping malls and airports. For a detailed description of each module in the self-service terminal, reference may be made to fig. 5 and the descriptions of the corresponding modules in the embodiments shown in fig. 4a to 4c.
Embodiments of the present application provide a computer-readable storage medium storing a computer program, which, when executed by a processor, causes the processor to implement the steps in the method embodiment shown in fig. 1.
Embodiments of the present application further provide a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, causes the processor to implement the steps in the method embodiment shown in fig. 2.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program, which, when executed by a processor, causes the processor to implement the steps in the method embodiment shown in fig. 3.
The memory in the above embodiments is used for storing computer programs and may be configured to store other various data to support operations on the chip or device to which it belongs. Examples of such data include instructions, messages, pictures, audio, etc. for any application or method operating on the chip or device to which the memory belongs.
The memories of the above embodiments may be implemented by any type of volatile or non-volatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The communication component in the above embodiments is configured to facilitate wired or wireless communication between the device in which the communication component is located and other devices. The device where the communication component is located can access a wireless network based on a communication standard, such as WiFi, or a 2G, 3G, 4G/LTE or 5G mobile communication network, or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
The power supply component in the above embodiments provides power to the various components of the device in which it is located. The power supply component may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device in which it is located.
The audio output component in the above embodiments may be configured to output audio signals. For example, the audio output component includes a speaker (loudspeaker) for outputting audio signals.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (36)

1. A power consumption control method is suitable for a voice chip or equipment, and is characterized in that the voice chip or equipment has a hardware VAD function and a software VAD function; the method comprises the following steps:
collecting sound signals input into a voice chip or equipment;
detecting whether the sound signal contains a voice signal or not by utilizing a voice chip or a VAD mode currently used by equipment;
if the sound signal comprises a voice signal, the voice chip or the equipment enters a normal working mode from a low power consumption mode;
wherein the currently used VAD mode is one of a plurality of VAD modes resulting from a combined use of the hardware VAD function and the software VAD function, the plurality of VAD modes comprising: a hardware VAD mode, a software VAD mode, a soft-hard combined VAD mode and a soft-soft combined VAD mode; the voice chip or apparatus supports at least two of the plurality of VAD modes;
before utilizing the VAD mode currently used by the voice chip or the equipment, the method further comprises the following steps: according to a preset VAD mode use strategy, configuring a VAD mode currently used by a voice chip or equipment;
the method for configuring the VAD mode currently used by the voice chip or the equipment according to the preset VAD mode use strategy comprises the following steps: and configuring the VAD mode currently used by the voice chip or the equipment according to at least one of the current environment information, the current residual capacity, the user attribute, the state of the associated upper application, the use sequence and the use duration of a plurality of VAD modes configured in the VAD mode use strategy and the VAD mode identifier carried in the VAD configuration instruction.
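Purely as an illustrative sketch of the configuration step recited above, and not as a definitive implementation, the routine below picks a currently used VAD mode from a snapshot of the factors the claim lists; the structure, field names and threshold values are invented for this example.

```c
#include <stdbool.h>

typedef enum {
    VAD_MODE_HARDWARE, VAD_MODE_SOFTWARE, VAD_MODE_HARD_SOFT, VAD_MODE_SOFT_SOFT
} vad_mode_t;

/* Hypothetical snapshot of the factors named in the configuration step. */
typedef struct {
    bool       has_mode_id;       /* a VAD configuration instruction carried a mode identifier */
    vad_mode_t mode_id;           /* the identifier it carried                                  */
    int        ambient_noise_db;  /* current environment information                            */
    int        remaining_pct;     /* current remaining power, percent                           */
    bool       upper_app_running; /* state of the associated upper layer application            */
} vad_policy_input_t;

/* Select the VAD mode to use next; thresholds are placeholders, not values from the patent. */
vad_mode_t configure_vad_mode(const vad_policy_input_t *in)
{
    if (in->has_mode_id)              /* a user-specified identifier takes priority        */
        return in->mode_id;
    if (in->remaining_pct < 20)       /* low remaining power: cheapest detector            */
        return VAD_MODE_HARDWARE;
    if (in->ambient_noise_db > 60 || in->upper_app_running)
        return VAD_MODE_HARD_SOFT;    /* noisy or active: combine hardware and software VAD */
    return VAD_MODE_SOFTWARE;
}
```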
2. The method of claim 1, further comprising:
if the sound signal does not contain a voice signal, the voice chip or the equipment is kept in a low power consumption mode.
3. The method of claim 1, wherein, when the currently used VAD mode is a soft-hard combined VAD mode, detecting whether the sound signal contains a voice signal by using the currently used VAD mode comprises:
sending the sound signal into a hardware VAD module in the voice chip or the equipment, and detecting whether the sound signal contains a voice signal in a hardware mode;
if the sound signal is detected to contain the voice signal in a hardware mode, the sound signal is sent to a processor in the voice chip or the equipment, and whether the sound signal contains the voice signal is detected again in a software mode;
and if it is detected again in a software manner that the sound signal contains a voice signal, determining that the sound signal contains the voice signal.
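As a minimal sketch of this two-stage check, assuming hypothetical hw_vad_detect and sw_vad_detect helpers, the software detector runs only after the hardware VAD module reports that the sound signal contains speech:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed detector primitives; not part of the claims. */
extern bool hw_vad_detect(const int16_t *pcm, size_t samples);  /* hardware VAD module  */
extern bool sw_vad_detect(const int16_t *pcm, size_t samples);  /* software VAD program */

/* Soft-hard combined mode: confirm a hardware hit with the software detector. */
bool contains_voice_soft_hard(const int16_t *pcm, size_t samples)
{
    if (!hw_vad_detect(pcm, samples))
        return false;                   /* hardware stage saw no speech: stop early */
    return sw_vad_detect(pcm, samples); /* re-check in software before deciding     */
}
```

Running the cheap hardware stage first keeps the processor idle for most frames, while the software re-check suppresses false triggers.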
4. The method of claim 1, wherein pre-configuring a VAD mode usage policy comprises:
when a voice chip or equipment is initialized, a VAD mode use strategy is configured according to a strategy configuration instruction;
or
And when the voice chip or equipment is configured at the factory, configuring the VAD mode use strategy according to the strategy configuration instruction.
5. The method of claim 4, wherein configuring the VAD mode usage policy according to the policy configuration instructions comprises at least one of:
according to the strategy configuration instruction, configuring corresponding environment information of a voice chip or equipment when various VAD modes are used;
configuring a corresponding residual electric quantity range of a voice chip or equipment when various VAD modes are used according to a strategy configuration instruction;
configuring user attributes corresponding to voice chips or equipment when various VAD modes are used according to the strategy configuration instruction;
configuring the corresponding upper application state of the voice chip or the equipment when using various VAD modes according to the strategy configuration instruction;
configuring the use sequence and the use duration of various VAD modes according to the strategy configuration instruction;
and configuring the identifications of various VAD modes according to the strategy configuration instruction so that the VAD modes used by the voice chip or equipment are specified by the user through the VAD configuration instruction.
6. The method of claim 5, wherein configuring the environment information corresponding to the voice chip or the device in the various VAD modes according to the policy configuration instruction comprises:
acquiring environmental information in which a voice chip or equipment needs to be located according to a strategy configuration instruction;
and configuring corresponding environment information of the voice chip or the equipment when each VAD mode is used according to the environment information in which the voice chip or the equipment needs to be positioned.
7. The method of claim 6, wherein the environment information corresponding to the voice chip or device when using various VAD modes comprises: at least one of an environmental location, an environmental type, an environmental noisiness, an environmental noise category, and time information.
8. The method of claim 1, wherein configuring the VAD mode currently used by the voice chip or the device according to the environmental information currently located by the voice chip or the device comprises:
and configuring the VAD mode currently used by the voice chip or the equipment according to at least one of the environment position, the environment type, the environment noise degree, the environment noise category and the time information corresponding to the environment where the voice chip or the equipment is currently located.
9. The method of claim 8, wherein configuring the VAD mode currently used by the voice chip or the device according to the ambient noise level corresponding to the current environment where the voice chip or the device is located comprises:
if the ambient noise is larger than the set noise threshold value, configuring the VAD mode currently used by the voice chip or the equipment to be a software VAD mode or a soft-hard combined VAD mode;
and if the ambient noise is less than or equal to the set noise threshold, configuring the VAD mode currently used by the voice chip or the equipment as a hardware VAD mode.
10. The method of claim 8, wherein configuring the VAD mode currently used by the voice chip or the device according to the time information corresponding to the environment where the voice chip or the device is currently located comprises:
if the time information is a time period corresponding to the daytime, configuring the VAD mode currently used by the voice chip or the equipment as a software VAD mode or a soft-hard combined VAD mode;
and if the time information is a time period corresponding to night, configuring the VAD mode currently used by the voice chip or the equipment as a hardware VAD mode.
11. The method of claim 8, wherein configuring the VAD mode currently used by the voice chip or the device according to the environmental noise category corresponding to the environment where the voice chip or the device is currently located comprises:
if the environmental noise category comprises human noise, configuring the VAD mode currently used by the voice chip or the equipment as a software VAD mode or a soft-hard combined VAD mode;
and if the environmental noise category does not comprise human noise, configuring the VAD mode currently used by the voice chip or the equipment as a hardware VAD mode.
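For illustration only, the sketch below folds the three environment-based rules of claims 9 to 11 (noise level, time of day, noise category) into one selection routine; the threshold value and the flags are hypothetical placeholders rather than values taken from the claims.

```c
#include <stdbool.h>

typedef enum {
    VAD_MODE_HARDWARE, VAD_MODE_SOFTWARE, VAD_MODE_HARD_SOFT, VAD_MODE_SOFT_SOFT
} vad_mode_t;

/* Hypothetical snapshot of the environment where the chip or device is located. */
typedef struct {
    int  ambient_noise_db;  /* measured ambient noise level                 */
    bool is_daytime;        /* time information for the current environment */
    bool has_human_noise;   /* whether the noise category includes voices   */
} env_info_t;

#define NOISE_THRESHOLD_DB 60   /* placeholder for the "set noise threshold" */

vad_mode_t select_mode_by_environment(const env_info_t *env)
{
    /* Conditions that raise the risk of false triggering favour the software or
       soft-hard combined mode; otherwise the cheaper hardware mode suffices. */
    if (env->ambient_noise_db > NOISE_THRESHOLD_DB ||
        env->is_daytime ||
        env->has_human_noise)
        return VAD_MODE_HARD_SOFT;   /* a software VAD mode would also satisfy the rules */
    return VAD_MODE_HARDWARE;
}
```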
12. The method of claim 1, wherein configuring the VAD mode currently used by the voice chip or the device according to the current remaining power of the voice chip or the device comprises:
if the current residual electric quantity of the voice chip or the equipment is larger than or equal to the first electric quantity threshold value, configuring the VAD mode currently used by the voice chip or the equipment as a soft-hard combined VAD mode;
if the current residual electric quantity of the voice chip or the equipment is greater than or equal to the second electric quantity threshold value and smaller than the first electric quantity threshold value, configuring a VAD mode currently used by the voice chip or the equipment as a software VAD mode;
if the residual electric quantity of the voice chip or the equipment is smaller than the second electric quantity threshold value, configuring the VAD mode currently used by the voice chip or the equipment as a hardware VAD mode;
wherein the first electric quantity threshold is greater than the second electric quantity threshold.
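The remaining-power rule above reduces to a pair of ordered thresholds; in the sketch below the two percentage values are placeholders, and only their ordering (the first greater than the second) follows the claim.

```c
typedef enum {
    VAD_MODE_HARDWARE, VAD_MODE_SOFTWARE, VAD_MODE_HARD_SOFT, VAD_MODE_SOFT_SOFT
} vad_mode_t;

#define FIRST_POWER_THRESHOLD_PCT  50   /* placeholder value                 */
#define SECOND_POWER_THRESHOLD_PCT 20   /* must be below the first threshold */

vad_mode_t select_mode_by_remaining_power(int remaining_pct)
{
    if (remaining_pct >= FIRST_POWER_THRESHOLD_PCT)
        return VAD_MODE_HARD_SOFT;   /* ample power: most robust combined mode */
    if (remaining_pct >= SECOND_POWER_THRESHOLD_PCT)
        return VAD_MODE_SOFTWARE;    /* medium power: software VAD mode        */
    return VAD_MODE_HARDWARE;        /* low power: lowest-power hardware mode  */
}
```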
13. The method of claim 1, wherein configuring the VAD mode currently used by the voice chip or the device according to the user attribute of the currently used voice chip or device comprises:
if the current user using the voice chip or the equipment is a set user type or a set user, configuring the VAD mode currently used by the voice chip or the equipment as a software VAD mode or a soft-hard combined VAD mode;
and if the current user using the voice chip or the equipment is a non-set user type or a non-set user, configuring the VAD mode currently used by the voice chip or the equipment to be a hardware VAD mode.
14. The method of claim 1, wherein configuring the VAD mode currently used by the voice chip or the device according to the user attribute of the currently used voice chip or device comprises:
if the volume of a user using the voice chip or the equipment is smaller than a set first volume threshold, configuring the VAD mode currently used by the voice chip or the equipment as a soft-hard combined VAD mode;
if the volume of a user using the voice chip or the equipment is larger than or equal to the first volume threshold but smaller than the second volume threshold, configuring the VAD mode currently used by the voice chip or the equipment as a software VAD mode;
if the volume of the user using the voice chip or the equipment is larger than a second volume threshold value, configuring the VAD mode currently used by the voice chip or the equipment as a hardware VAD mode;
wherein the first volume threshold is less than the second volume threshold.
15. The method of claim 1, wherein configuring a VAD mode currently used by the voice chip or device based on a state of an upper layer application associated with the voice chip or device comprises:
if the upper layer application associated with the voice chip or the equipment is in a running state, configuring the VAD mode currently used by the voice chip or the equipment as a software VAD mode or a soft-hard combined VAD mode;
and if the upper layer application associated with the voice chip or the equipment is in a non-operation state, configuring the VAD mode currently used by the voice chip or the equipment as a hardware VAD mode.
16. The method of claim 1, further comprising:
and dynamically adjusting the use sequence and/or the use duration of various VAD modes according to the environment information of the voice chip or the equipment.
17. The method of claim 1, wherein configuring the VAD mode currently used by the voice chip or the device according to the VAD mode identifier carried in the VAD configuration instruction comprises:
analyzing a VAD mode identifier from the VAD configuration instruction, wherein the VAD mode identifier is used for identifying a VAD mode designated by a user, and the designated VAD mode is a hardware VAD mode, a software VAD mode or a soft-hard combined VAD mode;
and configuring the VAD mode currently used by the voice chip or the equipment to be the designated VAD mode.
18. A VAD mode configuration method, applicable to a voice chip or equipment, wherein the voice chip or equipment is provided with a hardware VAD function and a software VAD function, and the method comprises the following steps:
receiving a policy configuration instruction;
configuring, according to the strategy configuration instruction, a VAD mode use strategy required by the voice chip or equipment to use various VAD modes;
wherein the plurality of VAD modes result from the combined use of the hardware VAD function and the software VAD function; the plurality of VAD modes comprises: a hardware VAD mode, a software VAD mode, a soft-hard combined VAD mode and a soft-soft combined VAD mode; the voice chip or apparatus supports at least two of the plurality of VAD modes;
wherein configuring, according to the strategy configuration instruction, the VAD mode use strategy required by the voice chip or equipment to use various VAD modes comprises: configuring, according to the strategy configuration instruction, at least one of the environment information, the residual electric quantity range, the user attribute and the upper application state corresponding to the voice chip or equipment when using the various VAD modes, the use sequence and use duration of the various VAD modes, and the identifications of the various VAD modes, so that a user can, through a VAD configuration instruction, configure the VAD mode currently used by the voice chip or equipment according to at least one of the environment information, the current residual electric quantity, the user attribute, the state of the associated upper application, the use sequence and use duration of the various VAD modes configured in the VAD mode use strategy, and the VAD mode identification carried in the VAD configuration instruction.
19. The method of claim 18, wherein configuring a VAD mode usage policy required by the voice chip or device to use a plurality of VAD modes according to the policy configuration instruction comprises:
when a voice chip or equipment is initialized, according to a strategy configuration instruction, VAD mode use strategies required by the voice chip or the equipment for using various VAD modes are configured;
or
When the voice chip or equipment is configured at the factory, the VAD mode use strategy required by the voice chip or equipment to use various VAD modes is configured according to the strategy configuration instruction.
20. The method of claim 18, wherein configuring the environment information corresponding to the voice chip or the device when using various VAD modes according to the policy configuration instruction comprises:
acquiring environmental information in which a voice chip or equipment needs to be located according to a strategy configuration instruction;
and configuring the corresponding environment information of the voice chip or the equipment when various VAD modes are used according to the environment information in which the voice chip or the equipment needs to be positioned.
21. The method of claim 20, wherein obtaining the environmental information where the voice chip or the device needs to be located according to the policy configuration instruction comprises:
analyzing the environmental information of the voice chip or the equipment which needs to be positioned from the strategy configuration instruction;
or
And acquiring the environmental information of the voice chip or the equipment according to the strategy configuration instruction.
22. The method of claim 21, wherein receiving policy configuration instructions comprises:
and receiving a strategy configuration instruction sent by a configuration terminal, wherein the strategy configuration instruction is generated by the configuration terminal according to the environment information in which the voice chip or the equipment input by a configuration personnel on the configuration interface needs to be positioned.
23. A voice endpoint detection method is suitable for a voice chip or equipment, and the voice chip or equipment is provided with a hardware VAD function and a software VAD function, and the method comprises the following steps:
collecting sound signals input into a voice chip or equipment;
performing VAD processing on the sound signal by utilizing a VAD mode currently used by a voice chip or equipment;
wherein the currently used VAD mode is one of a plurality of VAD modes generated by the combined use of a hardware VAD function and a software VAD function; the plurality of VAD modes includes: a hardware VAD mode, a software VAD mode, a soft-hard combined VAD mode and a soft-soft combined VAD mode; the voice chip or apparatus supports at least two of the plurality of VAD modes;
before using the VAD mode currently used by the voice chip or device, the method further comprises the following steps: according to a preset VAD mode use strategy, configuring a VAD mode currently used by a voice chip or equipment;
the method for configuring the VAD mode currently used by the voice chip or the equipment according to the preset VAD mode use strategy comprises the following steps: and configuring the VAD mode currently used by the voice chip or the equipment according to at least one of the environmental information, the current residual capacity, the user attribute, the state of the associated upper application, the use sequence and the use duration of the VAD modes configured in the VAD mode use strategy and the VAD mode identifier carried in the VAD configuration instruction.
24. A speech chip, comprising: the system comprises a pickup module, a hardware VAD module, a processor and a memory; the memory stores VAD programs and power consumption control programs;
the pickup module is used for collecting the sound signals input into the voice chip;
the hardware VAD module is used for detecting whether the sound signal comprises a voice signal in a hardware mode when the VAD mode currently used indicates that the hardware VAD function of the voice chip is enabled;
the processor to execute the VAD program to detect in software whether the sound signal includes a voice signal when a currently used VAD mode indicates that a software VAD function of the voice chip is enabled;
the processor is further configured to execute the power consumption control program to: in a case where it is detected, by utilizing the currently used VAD mode, that the sound signal contains a voice signal, control the voice chip to enter a normal working mode from a low power consumption mode;
wherein the currently used VAD mode is one of a plurality of VAD modes resulting from the combined use of the hardware VAD function and the software VAD function, the plurality of VAD modes comprising: a hardware VAD mode, a software VAD mode, a soft-hard combined VAD mode and a soft-soft combined VAD mode; the voice chip supports at least two VAD modes in the plurality of VAD modes;
the processor is further configured to: according to a preset VAD mode use strategy, configuring the VAD mode currently used by a voice chip; when the processor configures the VAD mode currently used by the voice chip, the processor is specifically configured to: and configuring the VAD mode currently used by the voice chip according to at least one of the current environment information of the voice chip, the current residual capacity, the user attribute, the state of the associated upper layer application, the use sequence and the use duration of a plurality of VAD modes configured in the VAD mode use strategy and the VAD mode identifier carried in the VAD configuration instruction.
25. The speech chip of claim 24, wherein the processor is further configured to:
and control the voice chip to remain in the low power consumption mode when it is detected, by using the currently used VAD mode, that the sound signal does not contain a voice signal.
26. The speech chip of claim 24,
the pickup module is specifically configured to: when the currently used VAD mode is a soft-hard combined VAD mode, send the collected sound signal to both the hardware VAD module and the processor; or, when the currently used VAD mode is a hardware VAD mode, send the collected sound signal to the hardware VAD module; or, when the currently used VAD mode is a software VAD mode, send the collected sound signal to the processor.
27. The speech chip of claim 26, further comprising: a switching module; the switching module is connected among the pickup module, the hardware VAD module and the processor;
the switching module is used for switching the pickup module to be communicated with the hardware VAD module and/or the processor according to the VAD mode used currently.
28. The speech chip of claim 26, wherein the processor is specifically configured to:
when the currently used VAD mode is a soft-hard combined VAD mode, if it is determined, according to the detection result reported by the hardware VAD module, that the hardware VAD module has detected that the sound signal contains a voice signal, detecting again in a software manner whether the sound signal contains a voice signal, and when it is detected again in the software manner that the sound signal contains a voice signal, controlling the voice chip to enter a normal working mode from a low power consumption mode; or,
when the currently used VAD mode is the hardware VAD mode, if it is determined, according to the detection result reported by the hardware VAD module, that the hardware VAD module has detected that the sound signal contains a voice signal, controlling the voice chip to enter a normal working mode from a low power consumption mode; or,
when the currently used VAD mode is a software VAD mode, detecting in a software manner whether the sound signal contains a voice signal, and when it is detected in the software manner that the sound signal contains a voice signal, controlling the voice chip to enter a normal working mode from a low power consumption mode.
29. The voice chip of claim 24, wherein the processor is specifically configured to perform at least one of the following operations when the VAD mode usage policy is preconfigured:
configuring corresponding environment information of a voice chip or equipment when various VAD modes are used according to a strategy configuration instruction;
configuring a corresponding residual electric quantity range of a voice chip or equipment when various VAD modes are used according to a strategy configuration instruction;
configuring user attributes corresponding to voice chips or equipment when various VAD modes are used according to the strategy configuration instruction;
configuring the corresponding upper application state of the voice chip or the equipment when using various VAD modes according to the strategy configuration instruction;
configuring the use sequence and the use duration of various VAD modes according to the strategy configuration instruction;
and configuring the identifications of the VAD modes in the VAD configuration instructions according to the strategy configuration instructions.
30. A speech chip, comprising: the device comprises a pickup module, a hardware VAD module, a main processor, a coprocessor and a memory; the memory stores VAD program and power consumption control program;
the pickup module is used for collecting the sound signals input into the voice chip;
the hardware VAD module is used for detecting whether the sound signal comprises a voice signal in a hardware mode when the VAD mode currently used indicates that the hardware VAD function of the voice chip is enabled;
the co-processor is used for executing the VAD program to detect whether the sound signal comprises a voice signal in a software mode when the VAD mode currently used indicates that the software VAD function of the voice chip is enabled;
the coprocessor is further configured to execute the power consumption control program to: in a case where it is detected, by utilizing the currently used VAD mode, that the sound signal contains a voice signal, control the main processor to enter a normal working mode from a low power consumption mode;
wherein the currently used VAD mode is one of a plurality of VAD modes resulting from the combined use of the hardware VAD function and the software VAD function, the plurality of VAD modes comprising: a hardware VAD mode, a software VAD mode, a soft-hard combined VAD mode and a soft-soft combined VAD mode; the voice chip supports at least two of the plurality of VAD modes;
the coprocessor is further configured to: before the currently used VAD mode of the voice chip is utilized, configure the VAD mode currently used by the voice chip according to a preconfigured VAD mode use strategy; when configuring the VAD mode currently used by the voice chip according to the preconfigured VAD mode use strategy, the coprocessor is specifically configured to: configure the VAD mode currently used by the voice chip according to at least one of the current environment information of the voice chip, the current residual capacity, the user attribute, the state of the associated upper layer application, the use sequence and the use duration of a plurality of VAD modes configured in the VAD mode use strategy, and the VAD mode identifier carried in the VAD configuration instruction.
31. The intelligent terminal is characterized by comprising a voice chip, wherein the voice chip comprises a pickup module, a hardware VAD module, a processor and a memory; the memory stores VAD program and power consumption control program;
the pickup module is used for collecting the sound signal input into the voice chip;
the hardware VAD module is used for detecting whether the sound signal comprises a voice signal in a hardware mode when the VAD mode currently used indicates that the hardware VAD function of the voice chip is enabled;
the processor is used for executing the VAD program to detect whether the sound signal comprises a voice signal in a software mode when the VAD mode currently used indicates that the software VAD function of the voice chip is enabled;
the processor is further configured to execute the power consumption control program to: in a case where it is detected, by utilizing the currently used VAD mode, that the sound signal contains a voice signal, control the voice equipment to enter a normal working mode from a low power consumption mode;
wherein the currently used VAD mode is one of a plurality of VAD modes resulting from the combined use of the hardware VAD function and the software VAD function, the plurality of VAD modes comprising: a hardware VAD mode, a software VAD mode, a soft-hard combined VAD mode and a soft-soft combined VAD mode; the voice chip supports at least two of the plurality of VAD modes;
the processor is further configured to: before the VAD mode used currently is utilized, the VAD mode used currently by a voice chip is configured according to a preset VAD mode use strategy; the processor is specifically configured to, when configuring the VAD mode currently used by the voice chip according to a preconfigured VAD mode usage policy: and configuring the VAD mode currently used by the voice chip according to at least one of the current environment information of the voice chip, the current residual capacity, the user attribute, the state of the associated upper layer application, the use sequence and the use duration of a plurality of VAD modes configured in the VAD mode use strategy and the VAD mode identifier carried in the VAD configuration instruction.
32. The intelligent terminal according to claim 31, wherein the intelligent terminal is an intelligent alarm clock, an intelligent bracelet, an intelligent switch, an intelligent speaker, an intelligent sound box, a smart phone, an intelligent robot, an unmanned delivery vehicle, a self-service express cabinet or a self-service terminal.
33. The intelligent terminal is characterized by comprising a voice chip and a main processor, wherein the voice chip comprises a pickup module, a hardware VAD module, a coprocessor and a memory; the memory stores VAD program and power consumption control program;
the pickup module is used for collecting the sound signals input into the voice chip;
the hardware VAD module is used for detecting whether the sound signal comprises a voice signal in a hardware mode when the VAD mode currently used indicates that the hardware VAD function of the voice chip is enabled;
the co-processor is used for executing the VAD program to detect whether the sound signal comprises a voice signal in a software mode when the VAD mode currently used indicates that the software VAD function of the voice chip is enabled;
the coprocessor is further configured to execute the power consumption control program to: control the main processor to enter a normal working mode from a low power consumption mode in a case where it is detected, by utilizing the currently used VAD mode, that the sound signal contains a voice signal;
wherein the currently used VAD mode is one of a plurality of VAD modes resulting from the combined use of the hardware VAD function and the software VAD function, the plurality of VAD modes comprising: a hardware VAD mode, a software VAD mode, a soft-hard combined VAD mode and a soft-soft combined VAD mode; the voice chip supports at least two of the plurality of VAD modes;
the coprocessor is further configured to: before the VAD mode currently used by the voice chip is utilized, the VAD mode currently used by the voice chip is configured according to a preset VAD mode use strategy; the coprocessor is specifically configured to, when configuring the VAD mode currently used by the voice chip according to a preconfigured VAD mode use policy: and configuring the VAD mode currently used by the voice chip according to at least one of the current environment information, the current residual capacity, the user attribute, the state of the associated upper application, the use sequence and the use duration of the VAD modes configured in the VAD mode use strategy and the VAD mode identifier carried in the VAD configuration instruction.
34. The intelligent terminal of claim 33, wherein the intelligent terminal is an intelligent alarm clock, an intelligent bracelet, an intelligent switch, an intelligent speaker, a smart phone, an intelligent robot, an unmanned delivery vehicle, a self-service express cabinet or a self-service terminal.
35. A self-service terminal is characterized by comprising a voice chip and a main processor; the voice chip comprises a pickup module, a hardware VAD module, a coprocessor and a memory; the memory stores VAD program and power consumption control program;
the pickup module is used for collecting the sound signals input into the voice chip;
the hardware VAD module is used for detecting whether the sound signal comprises a voice signal in a hardware mode when the VAD mode currently used indicates that the hardware VAD function of the voice chip is enabled;
the co-processor for executing the VAD program to detect in software whether the sound signal includes a voice signal when a currently used VAD mode indicates that a software VAD function of the voice chip is enabled;
the coprocessor is further configured to execute the power consumption control program to: control the main processor to enter a normal working mode from a low power consumption mode in a case where it is detected, by utilizing the currently used VAD mode, that the sound signal contains a voice signal;
wherein the currently used VAD mode is one of a plurality of VAD modes resulting from the combined use of the hardware VAD function and the software VAD function, the plurality of VAD modes comprising: a hardware VAD mode, a software VAD mode, a soft-hard combined VAD mode and a soft-soft combined VAD mode; the voice chip supports at least two of the plurality of VAD modes;
the coprocessor is further configured to: before the VAD mode currently used by the voice chip is utilized, the VAD mode currently used by the voice chip is configured according to a preset VAD mode use strategy; the coprocessor is specifically configured to, when configuring the VAD mode currently used by the voice chip according to a preconfigured VAD mode use policy: and configuring the VAD mode currently used by the voice chip according to at least one of the current environment information of the voice chip, the current residual capacity, the user attribute, the state of the associated upper layer application, the use sequence and the use duration of a plurality of VAD modes configured in the VAD mode use strategy and the VAD mode identifier carried in the VAD configuration instruction.
36. A computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1-22.
CN202010176807.6A 2020-03-13 2020-03-13 Power consumption control, mode configuration and VAD method, apparatus and storage medium Active CN113393865B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010176807.6A CN113393865B (en) 2020-03-13 2020-03-13 Power consumption control, mode configuration and VAD method, apparatus and storage medium
PCT/CN2021/080172 WO2021180162A1 (en) 2020-03-13 2021-03-11 Power consumption control method and device, mode configuration method and device, vad method and device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010176807.6A CN113393865B (en) 2020-03-13 2020-03-13 Power consumption control, mode configuration and VAD method, apparatus and storage medium

Publications (2)

Publication Number Publication Date
CN113393865A CN113393865A (en) 2021-09-14
CN113393865B true CN113393865B (en) 2022-06-03

Family

ID=77616161

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010176807.6A Active CN113393865B (en) 2020-03-13 2020-03-13 Power consumption control, mode configuration and VAD method, apparatus and storage medium

Country Status (2)

Country Link
CN (1) CN113393865B (en)
WO (1) WO2021180162A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114531441B (en) * 2022-01-11 2024-03-12 南京博联智能科技有限公司 Method and system for converting form of multifunctional intelligent panel based on dynamic configuration
CN114512127B (en) * 2022-01-29 2023-12-26 深圳市九天睿芯科技有限公司 Voice control method, device, equipment, medium and intelligent voice acquisition system

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102770909A (en) * 2010-02-24 2012-11-07 高通股份有限公司 Voice activity detection based on plural voice activity detectors
CN103065629A (en) * 2012-11-20 2013-04-24 广东工业大学 Speech recognition system of humanoid robot
WO2013188007A1 (en) * 2012-06-15 2013-12-19 Spansion Llc Power-efficient voice activation
CN105224074A (en) * 2015-08-31 2016-01-06 联想(北京)有限公司 A kind of control method and electronic equipment
CN106992015A (en) * 2015-12-22 2017-07-28 恩智浦有限公司 Voice-activation system
CN108665889A (en) * 2018-04-20 2018-10-16 百度在线网络技术(北京)有限公司 The Method of Speech Endpoint Detection, device, equipment and storage medium
CN110660413A (en) * 2018-06-28 2020-01-07 新唐科技股份有限公司 Voice activity detection system
CN110858488A (en) * 2018-08-24 2020-03-03 阿里巴巴集团控股有限公司 Voice activity detection method, device, equipment and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5725028B2 (en) * 2010-08-10 2015-05-27 日本電気株式会社 Speech segment determination device, speech segment determination method, and speech segment determination program
US20140122078A1 (en) * 2012-11-01 2014-05-01 3iLogic-Designs Private Limited Low Power Mechanism for Keyword Based Hands-Free Wake Up in Always ON-Domain
US10748529B1 (en) * 2013-03-15 2020-08-18 Apple Inc. Voice activated device for use with a voice-based digital assistant
US9854526B2 (en) * 2015-01-28 2017-12-26 Qualcomm Incorporated Sensor activated power reduction in voice activated mobile platform
CN106531165A (en) * 2016-12-15 2017-03-22 北京塞宾科技有限公司 Portable smart home voice control system and control method adopting same
CN108986822A (en) * 2018-08-31 2018-12-11 出门问问信息科技有限公司 Audio recognition method, device, electronic equipment and non-transient computer storage medium
CN109473123B (en) * 2018-12-05 2022-05-31 百度在线网络技术(北京)有限公司 Voice activity detection method and device

Also Published As

Publication number Publication date
CN113393865A (en) 2021-09-14
WO2021180162A1 (en) 2021-09-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant