CN111341317B - Method, device, electronic equipment and medium for evaluating wake-up audio data


Info

Publication number
CN111341317B
Authority
CN
China
Prior art keywords
audio data
wake
matching
evaluation index
similarity
Prior art date
Legal status
Active
Application number
CN202010101559.9A
Other languages
Chinese (zh)
Other versions
CN111341317A (en)
Inventor
欧双
Current Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202010101559.9A
Publication of CN111341317A
Application granted
Publication of CN111341317B
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 - Reducing energy consumption in communication networks
    • Y02D 30/70 - Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephone Function (AREA)

Abstract

The application discloses a method, an apparatus, an electronic device, and a medium for evaluating wake-up audio data. After wake-up audio data for waking up a target device is acquired, an evaluation index characterizing the success rate with which the wake-up audio data wakes up the target device is generated based on the wake-up audio data and a preset matching strategy, and the evaluation index is then displayed. With this technical scheme, after audio data customized by a user for waking up a smart device is received, evaluation information indicating whether the audio data is suitable as a wake-up voice can be obtained from the degree to which the audio data matches the preset strategy, so that the user can decide, according to this information, whether to re-record a wake-up voice with a higher wake-up success rate. This avoids the low wake-up accuracy that user-defined wake-up voices tend to cause in the related art.

Description

Method, device, electronic equipment and medium for evaluating wake-up audio data
Technical Field
The present application relates to data processing technologies, and in particular to a method, an apparatus, an electronic device, and a medium for evaluating wake-up audio data.
Background
With the development of communication technology and society, smart devices have evolved rapidly and are used by more and more users.
In the related art, a user can place a smart device in a given space and wake it by voice to meet various needs. Taking a smartphone as an example, when the user wants to use it, the user speaks the corresponding wake-up word to wake the phone and then performs functions such as dialing a call. Similarly, for a smart speaker, the user speaks the corresponding wake-up word to wake the speaker and then, for example, plays music. The wake-up word can be customized: the user records a specific piece of his or her own voice, which is then used as the wake-up word of the smart device.
However, waking up the smart device with such a user-defined wake-up word may suffer from poor wake-up accuracy, which degrades the user experience.
Disclosure of Invention
The embodiments of the application provide a method, an apparatus, an electronic device, and a medium for evaluating wake-up audio data, which can solve the problem of low wake-up accuracy caused by user-defined wake-up voices for smart devices in the related art.
According to an aspect of the embodiment of the present application, there is provided a method for evaluating wake-up audio data, including:
acquiring wake-up audio data, wherein the wake-up audio data is used for waking up target equipment;
generating an evaluation index corresponding to the wake-up audio data based on the wake-up audio data and a preset matching strategy, wherein the evaluation index is used for representing the success rate of the wake-up audio data for waking up the target equipment;
and displaying the evaluation index.
According to another aspect of the embodiment of the present application, there is provided an apparatus for evaluating wake-up audio data, including:
an acquisition module configured to acquire wake-up audio data, wherein the wake-up audio data is used for waking up a target device;
the generation module is configured to generate an evaluation index corresponding to the wake-up audio data based on the wake-up audio data and a preset matching strategy, wherein the evaluation index is used for representing the success rate of the wake-up audio data for waking up the target device;
and a display module configured to display the evaluation index.
According to still another aspect of an embodiment of the present application, there is provided an electronic apparatus including: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the operations of any of the above methods of evaluating wake-up audio data.
According to still another aspect of the embodiments of the present application, there is provided a computer-readable storage medium storing computer-readable instructions that, when executed, perform the operations of any of the above-described wake-up audio data evaluation methods.
In the present application, after wake-up audio data for waking up a target device is acquired, an evaluation index characterizing the success rate with which the wake-up audio data wakes up the target device can be generated based on the wake-up audio data and a preset matching strategy, and the evaluation index is then displayed. With this technical scheme, after audio data customized by a user for waking up a smart device is received, evaluation information indicating whether the audio data is suitable as a wake-up voice can be obtained from the degree to which the audio data matches the preset strategy, so that the user can decide, according to this information, whether to re-record a wake-up voice with a higher wake-up success rate. This avoids the low wake-up accuracy that user-defined wake-up voices tend to cause in the related art.
The technical scheme of the application is described in further detail below through the drawings and embodiments.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description, serve to explain the principles of the application.
The application may be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic diagram of a communication system architecture according to the present application;
fig. 2 is a schematic flow chart of a method for evaluating wake-up audio data according to the present application;
fig. 3 is a schematic flow chart of another embodiment of the method for evaluating wake-up audio data according to the present application;
FIGS. 4a-4d are schematic diagrams of scene images according to the present application;
fig. 5 is a schematic structural diagram of an evaluation device for wake-up audio data according to the present application;
fig. 6 is a schematic structural diagram of an electronic device according to the present application.
Detailed Description
Various exemplary embodiments of the present application will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present application unless it is specifically stated otherwise.
Meanwhile, it should be understood that, for convenience of description, the sizes of the various parts shown in the drawings are not drawn to actual scale.
The following description of at least one exemplary embodiment is merely exemplary in nature and is in no way intended to limit the application, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail, but should be considered part of the specification where appropriate.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
In addition, the technical solutions of the embodiments of the present application may be combined with each other, provided that a person skilled in the art can implement the combination; where a combination of technical solutions is contradictory or cannot be implemented, the combination should be considered not to exist and falls outside the scope of protection claimed by the present application.
It should be noted that all directional indicators (such as up, down, left, right, front, and rear) in the embodiments of the present application are merely used to explain the relative positional relationship, movement conditions, etc. between the components in a specific posture (as shown in the drawings); if the specific posture changes, the directional indicators change correspondingly.
A method for evaluating wake-up audio data according to an exemplary embodiment of the present application is described below with reference to figs. 1 to 4d. It should be noted that the following application scenarios are shown only to facilitate understanding of the spirit and principles of the present application, and embodiments of the present application are not limited in this respect. Rather, embodiments of the application may be applied to any applicable scenario.
For example, the embodiments of the present application may be applied to a wireless communication system. It should be noted that the wireless communication systems mentioned in the embodiments of the present application include, but are not limited to, 5G mobile communication systems and the three application scenarios of next-generation mobile communication systems: enhanced Mobile Broadband (eMBB), Ultra-Reliable and Low-Latency Communication (URLLC), and massive Machine-Type Communication (mMTC).
In the embodiments of the present application, a terminal (terminal device) includes, but is not limited to, a Mobile Station (MS), a mobile terminal, a mobile phone, a handset, portable equipment, and the like. The terminal may communicate with one or more core networks via a Radio Access Network (RAN). For example, the terminal may be a mobile phone (or "cellular" phone) or a computer with a wireless communication function, and may also be a portable, pocket-sized, hand-held, computer-built-in, or vehicle-mounted mobile device.
Fig. 1 is a schematic diagram of a communication system architecture according to the present application.
Referring to fig. 1, a communication system 01 includes a network device 101 and a terminal 102, where the network device 101 is deployed in NSA mode. When the communication system 01 includes a core network, the network device 101 may also be connected to the core network. The network device 101 may further communicate with an Internet Protocol (IP) network 200, such as the Internet, a private IP network, or another data network. The network device provides services for terminals within its coverage area; for example, referring to fig. 1, the network device 101 provides wireless access to one or more terminals within its coverage area. In addition, network devices can communicate with each other.
The network device 101 may be a device for communicating with terminals, such as a relay station, an access point, or an in-vehicle device. In a Device-to-Device (D2D) communication system, the network device may also be a terminal acting as a base station. The terminals may include various handheld devices, in-vehicle devices, wearable devices, computing devices, or other processing devices connected to a wireless modem, as well as various forms of User Equipment (UE), Mobile Stations (MS), and other devices with wireless communication capability.
In the related art, when a user self-defines a wake-up voice for a smart device, the user records a specific piece of his or her own voice and then uses it directly as the wake-up word of the smart device. However, waking up the smart device with such a user-defined setting may suffer from poor wake-up accuracy, thereby degrading the user experience.
To address the above problems, the present application provides a method, an apparatus, an electronic device, and a medium for evaluating wake-up audio data.
Fig. 2 schematically shows a flow chart of a method for evaluating wake-up audio data according to an embodiment of the present application. As shown in fig. 2, the method includes:
s101, wake-up audio data is acquired and used for waking up target equipment.
In the present application, the device that acquires the wake-up audio data is not specifically limited; it may be, for example, a smart device or a server. The smart device may be a PC (Personal Computer), a smartphone, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, or another portable terminal device with a display function.
Further, the target device is not specifically limited in the present application either. In one embodiment, it may be a smart device operated by intelligent voice. For example, a smart speaker is an upgraded speaker product with which household consumers can request songs and set alarm clocks by voice. As another example, through a voice assistant installed on a smartphone, the user can play audio, shop online, check the weather forecast, and control smart-home devices, for instance opening curtains, setting the refrigerator temperature, or pre-heating the water heater.
Furthermore, the voice assistant of a smartphone is an intelligent mobile application that helps the user complete various tasks and solve everyday problems through intelligent dialogue and instant question answering.
When the user needs to use the target device, the dormant device can be awakened by voice so that it completes the desired function. For example, when a mobile phone is in the dormant or screen-locked state, it can enter the waiting-for-instruction state directly upon detecting the user's designated wake-up sound (the set voice instruction, i.e., the wake-up audio data), thereby starting the first step of voice interaction.
In addition, the wake-up audio data itself is not particularly limited: it can be a speech segment set freely by the user, for example one or more words. In one embodiment, the wake-up audio data may correspond to one or more languages, such as Chinese, English, or Japanese.
S102, based on the wake-up audio data and a preset matching strategy, generating an evaluation index corresponding to the wake-up audio data, wherein the evaluation index is used for representing the success rate of waking up the target device by the wake-up audio data.
Furthermore, after the wake-up audio data is obtained, the evaluation index corresponding to it can be generated for the user by means of the preset matching strategy. It will be appreciated that the result reflected by the evaluation index can assist the user in determining whether the wake-up audio data is suitable as the wake-up voice for waking up the target device.
The present application is not limited to a specific matching strategy, that is, the evaluation index for the wake-up audio data may be generated according to any standard or condition.
S103, displaying the evaluation index.
Furthermore, the manner of displaying the evaluation index is not particularly limited; it can be displayed, for example, on the display screen of a mobile phone or on the display screen of a smart speaker. A minimal end-to-end sketch of the three steps is given below.
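To make steps S101-S103 concrete, here is a minimal Python sketch. The `match_against_database` stub, the `on_user_recording` entry point, and the score thresholds are all illustrative assumptions; the patent deliberately leaves the matching strategy open.

```python
def match_against_database(wake_audio: bytes) -> float:
    # Stub standing in for the preset matching strategy of S102;
    # a real implementation would compare against a matching database.
    return 0.9

def evaluate_wake_audio(wake_audio: bytes) -> str:
    """S102: generate an evaluation index characterizing wake-up success rate."""
    score = match_against_database(wake_audio)
    if score >= 0.8:
        return "suitable as wake-up voice"
    if score >= 0.5:
        return "usable, but consider re-recording"
    return "unsuitable as wake-up voice"

def on_user_recording(wake_audio: bytes) -> None:
    index = evaluate_wake_audio(wake_audio)  # S101 feeds into S102
    print(index)                             # S103: display the index
```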
In the present application, after wake-up audio data for waking up a target device is acquired, an evaluation index characterizing the success rate with which the wake-up audio data wakes up the target device can be generated based on the wake-up audio data and a preset matching strategy, and the evaluation index is then displayed. With this technical scheme, after audio data customized by a user for waking up a smart device is received, evaluation information indicating whether the audio data is suitable as a wake-up voice can be obtained from the degree to which the audio data matches the preset strategy, so that the user can decide, according to this information, whether to re-record a wake-up voice with a higher wake-up success rate. This avoids the low wake-up accuracy that user-defined wake-up voices tend to cause in the related art.
Alternatively, in one possible embodiment of the present application, the evaluation index of S102 may be generated as follows:
Similarity matching is performed between the wake-up audio data and each piece of audio data in a preset matching database to generate a corresponding similarity matching value, where the audio data includes voice data and noise data.
Further, the audio data stored in the matching database is not specifically limited; for example, it may include one or more pieces of voice data and one or more pieces of noise data. In one embodiment, if the wake-up audio data successfully matches one or more pieces of noise data in the matching database, this indicates that the recording suffered too much noise interference, and a lower evaluation index can be generated for the audio data.
The noise data stored in the matching database is not particularly limited either. Examples include additive acoustic noise (background environment sound recorded by the microphone together with the voice); acoustic reverberation noise (superposition effects caused by multipath reflection); convolutional channel-effect noise (a non-flat or band-limited response remaining when the channel impulse response is not effectively modeled during channel equalization); nonlinear distortion noise (e.g., improper gain when the signal is input); as well as additive wideband electronic noise, electrical interference noise, and distortion caused by an insufficient wind-resistant frequency response. A hedged sketch of matching against such noise entries is given below.
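As an illustration of how matching against stored noise data could lower the index, a sketch follows. The normalized-correlation similarity proxy and the 0.5 threshold are assumptions for illustration, not values from the patent.

```python
import numpy as np

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Crude similarity proxy: mean elementwise product of zero-mean,
    # unit-variance versions of the two signals, truncated to equal length.
    a = (a - a.mean()) / (a.std() + 1e-8)
    b = (b - b.mean()) / (b.std() + 1e-8)
    n = min(len(a), len(b))
    return float(np.dot(a[:n], b[:n]) / n)

def noise_hits(wake_audio: np.ndarray, noise_db: list, threshold: float = 0.5) -> int:
    # Each successful match against a stored noise profile (background sound,
    # reverberation, channel distortion, ...) argues for a lower index.
    return sum(1 for noise in noise_db if similarity(wake_audio, noise) > threshold)
```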
Further, the similarity matching between the wake-up audio data and each piece of audio data in the preset matching database can be performed in any one or more of the following ways:
the first way is:
and respectively performing similarity matching on the awakening audio data and the audio data, and determining the quantity of first audio data corresponding to the awakening audio data, wherein the first audio data is the audio data with the pronunciation similarity of the awakening audio data exceeding a first threshold value.
First, the voice data in the matching database is not specifically limited. For example, it may include speech data appearing in movie scenes, speech data appearing in songs, or dialogue speech data from daily life scenes.
It can be appreciated that, after similarity matching is performed between the wake-up audio data and each piece of audio data, the corresponding similarity matching value can be generated from the number of first audio data. In one embodiment, the more entries whose pronunciation similarity to the wake-up audio data exceeds the first threshold (i.e., the larger the number of first audio data), the more entries in the matching database sound similar or identical to the wake-up audio data; that is, the wake-up audio data is likely to occur in many pronunciation contexts. To avoid repeated false wake-ups, the application accordingly lowers the evaluation index of wake-up audio data for which the number of first audio data is too large.
For example, take wake-up audio data of "me". Since audio data for "me" may appear in many pronunciation scenes covered by the matching database (movie scenes, song scenes, everyday-dialogue scenes, work-dialogue scenes, and so on), the number of database entries whose pronunciation is similar or identical to it (the number of first audio data, with pronunciation similarity exceeding the first threshold) is large. To avoid the device being falsely awakened whenever it detects the wake-up voice "me" while the user does not intend to wake the target device, the application generates the matching result with the evaluation index "unsuitable as wake-up voice data".
Conversely, take wake-up audio data of "peach classmate". Audio data for "peach classmate" appears only rarely across the pronunciation scenes covered by the matching database (movie scenes, song scenes, everyday-dialogue scenes, work-dialogue scenes, and so on), so the number of database entries whose pronunciation is similar or identical to it (the number of first audio data) is small. Accordingly, to reduce false wake-ups, the application generates the matching result with the evaluation index "suitable as wake-up voice data".
It should be noted that the first threshold is not particularly limited; it may be, for example, 50% or 80%. A sketch of this counting step follows.
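A sketch of the first way, assuming both utterances have already been transcribed to phoneme strings (that front end is out of scope here). The standard-library edit ratio is a stand-in for whatever pronunciation-similarity measure an implementation actually uses; `count_first_audio` and `FIRST_THRESHOLD` are illustrative names.

```python
from difflib import SequenceMatcher

FIRST_THRESHOLD = 0.8  # the description suggests values such as 50% or 80%

def pronunciation_similarity(a: str, b: str) -> float:
    # Edit-based ratio over phoneme strings as a pronunciation proxy.
    return SequenceMatcher(None, a, b).ratio()

def count_first_audio(wake_phonemes: str, db_phonemes: list) -> int:
    """Number of database entries whose pronunciation similarity to the
    wake-up audio exceeds the first threshold (the "first audio data")."""
    return sum(1 for p in db_phonemes
               if pronunciation_similarity(wake_phonemes, p) > FIRST_THRESHOLD)

# A common word like "me" yields a large count (many similar-sounding
# entries), while "peach classmate" yields a small one.
```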
The second way is:
and respectively performing similarity matching on the awakening audio data and the audio data, and determining the quantity of second audio data corresponding to the awakening audio data, wherein the second audio data is the audio data with semantic similarity to the awakening audio data exceeding a second threshold value.
It can be appreciated that, after similarity matching is performed between the wake-up audio data and each piece of audio data, the corresponding similarity matching value can also be generated from the number of second audio data. In one embodiment, the more entries whose semantic similarity to the wake-up audio data exceeds the second threshold (i.e., the larger the number of second audio data), the more entries in the matching database are semantically similar or identical to the wake-up audio data; that is, the wake-up audio data may carry several meanings at once and is therefore likely to occur in many pronunciation contexts. To avoid repeated false wake-ups, the application accordingly lowers the evaluation index of wake-up audio data for which the number of second audio data is too large.
For example, take wake-up audio data of "bundle". The audio data "bundle" has several meanings (it may denote wrapped clothing, a mental burden, or, as a performing-arts term, a punchline that makes the audience laugh), so the number of database entries semantically similar or identical to it (the number of second audio data, with semantic similarity exceeding the second threshold) is large. When the wake-up audio data carries many meanings, it is likely to occur in many pronunciation contexts. To avoid the device being falsely awakened whenever it detects "bundle" while the user does not intend to wake the target device, the application generates the matching result with the evaluation index "unsuitable as wake-up voice data".
Similarly, take wake-up audio data of "accounting". The audio data "accounting" also has several meanings (it may mean calculating; preparing final accounts reflecting the business activities and results of an enterprise or the budget execution of an administrative institution; or settling scores with someone after a loss or failure), so the number of database entries semantically similar or identical to it (the number of second audio data, with semantic similarity exceeding the second threshold) is large. Again, when the wake-up audio data carries many meanings, it is likely to occur in many pronunciation contexts. To avoid the device being falsely awakened when it detects "accounting" while the user does not intend to wake the target device, the application generates the matching result with the evaluation index "unsuitable as wake-up voice data".
It should be noted that the second threshold is not specifically limited either; it may be, for example, 50% or 80%. A companion sketch follows.
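A companion sketch for the second way. Representing the transcribed audio as text embeddings and using cosine similarity is an assumption for illustration; the patent does not fix a semantic-similarity measure, and `count_second_audio` is a hypothetical helper.

```python
import numpy as np

SECOND_THRESHOLD = 0.8  # the description suggests values such as 50% or 80%

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def count_second_audio(wake_vec: np.ndarray, db_vecs: list) -> int:
    """Number of database entries whose semantic similarity to the wake-up
    audio exceeds the second threshold (the "second audio data")."""
    return sum(1 for v in db_vecs if cosine(wake_vec, v) > SECOND_THRESHOLD)
```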
An evaluation index is then generated based on the magnitude relation between the similarity matching result and a preset number.
It can be appreciated that the similarity matching result in the present application is the number of first audio data and/or the number of second audio data. In one embodiment, the corresponding evaluation index is generated according to how these counts compare with the preset number.
It should be noted that the preset number is not specifically limited; it may be, for example, 5 or 10. A sketch combining the counts into an index follows.
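Putting the two counts together, a minimal sketch of turning the similarity matching result into an evaluation index; the preset number of 5 follows the example values above, and the two-level index wording is illustrative.

```python
PRESET_NUMBER = 5  # the description suggests values such as 5 or 10

def evaluation_index(n_first: int, n_second: int) -> str:
    # Many pronunciation-alike or semantics-alike entries imply a high
    # false-wake-up risk, so the index is lowered accordingly.
    if n_first > PRESET_NUMBER or n_second > PRESET_NUMBER:
        return "unsuitable as wake-up voice data"
    return "suitable as wake-up voice data"
```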
Alternatively, in another possible embodiment of the present application, the evaluation index of S102 may be generated as follows:
The wake-up audio data is parsed to obtain audio feature parameters corresponding to it, where the audio feature parameters reflect the pronunciation information of the wake-up audio data.
An evaluation index is then generated based on the audio feature parameters and the matching strategy.
Furthermore, the evaluation index can be generated from the audio feature parameters and the matching strategy in either of the following two ways:
The first way is:
determining a pronunciation duration and a pronunciation speed corresponding to the wake-up audio data based on the audio feature parameters;
and generating an evaluation index based on the matching relation between the pronunciation duration and a first preset condition and the matching relation between the pronunciation speed and a second preset condition.
Further, whether the recorded data is suitable as wake-up audio data can be determined from its pronunciation duration and pronunciation speed. It will be appreciated that when the wake-up audio data is pronounced for too long or too short a time, its evaluation index as a wake-up word is affected; similarly, when it is pronounced too fast or too slowly, the evaluation index is also affected. A sketch of these checks follows.
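A sketch of this first way of using the audio feature parameters. The acceptable ranges standing in for the first and second preset conditions are illustrative assumptions; the patent states only that too long/short or too fast/slow pronunciation lowers the index.

```python
def duration_and_rate_ok(samples: list, sample_rate: int, n_syllables: int) -> bool:
    duration = len(samples) / sample_rate   # pronunciation duration in seconds
    rate = n_syllables / duration           # pronunciation speed, syllables/s
    ok_duration = 0.5 <= duration <= 2.5    # first preset condition (assumed)
    ok_rate = 2.0 <= rate <= 6.0            # second preset condition (assumed)
    return ok_duration and ok_rate
```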
The second way is:
determining language information and intonation information corresponding to the wake-up audio data based on the audio characteristic parameters;
and generating an evaluation index based on the matching relation between the language information and the third preset condition and the matching relation between the intonation information and the fourth preset condition.
Further, whether the recorded data is suitable as wake-up audio data can also be determined from its language information and intonation information. It will be appreciated that, for language information, specific languages can be fixed as a factor affecting the evaluation index; similarly, when the intonation of the wake-up audio data is too high or too low, its evaluation index as a wake-up word is also affected. A sketch follows.
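And a sketch of the second way. The supported-language set standing in for the third preset condition and the pitch band standing in for the fourth are assumptions for illustration only.

```python
def language_and_intonation_ok(language: str, f0_track: list) -> bool:
    ok_language = language in {"zh", "en", "ja"}   # third preset condition (assumed)
    mean_f0 = sum(f0_track) / len(f0_track)        # mean pitch in Hz
    ok_intonation = 80.0 <= mean_f0 <= 400.0       # fourth preset condition (assumed)
    return ok_language and ok_intonation
```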
Optionally, in one possible embodiment of the present application, after S101 (acquiring the wake-up audio data), the method for evaluating wake-up audio data further includes the following steps, as shown in fig. 3:
s201, wake-up audio data is acquired.
S202, acquiring a wake-up biometric of the target user by using a camera acquisition unit.
S203, generating an evaluation index corresponding to the wake-up audio data based on the wake-up audio data, the wake-up biometric, and the matching strategy.
Furthermore, the wake-up biometric is not specifically limited; it can be one or more of the target user's facial feature information, iris feature information, and fingerprint feature information, so that the evaluation index corresponding to the wake-up audio data is generated jointly from the wake-up audio data and the wake-up biometric.
For example, after the wake-up audio data is acquired, the camera of the terminal and/or a fingerprint sensor or other acquisition device can be used to collect the target user's biometric information (i.e., at least one of facial feature information, iris feature information, and fingerprint feature information). Once collected, the biometric information is mapped to the corresponding wake-up audio data, and the two are used together as the wake-up instruction for the target device, as in the sketch below.
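A minimal sketch of the mapping step in S202-S203, assuming a simple dictionary record; how the terminal actually stores the pairing is not specified by the patent.

```python
def build_wake_instruction(wake_audio: bytes, biometric: bytes) -> dict:
    # Map the captured biometric (facial, iris, or fingerprint feature
    # information) to the wake-up audio so both are used together as the
    # wake-up instruction for the target device.
    return {"audio": wake_audio, "biometric": biometric}
```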
Taking the wake-up biometric as facial feature information as an example: after the terminal captures a facial image of the target user with its camera device, the feature information of the facial image can be extracted with a neural network model. It should be noted that the preset neural network model is not limited; in one possible implementation, feature recognition may be performed on the facial image with a convolutional neural network model.
Among them, Convolutional Neural Networks (CNN) are a class of feedforward neural networks that involve convolutional computation and have a deep structure, and they are among the representative algorithms of deep learning. Convolutional neural networks have representation-learning capability and can classify input information in a shift-invariant manner according to their hierarchical structure. Thanks to the strong feature-characterization capability of CNNs on images, they have achieved remarkable results in image classification, object detection, semantic segmentation, and related fields. A toy stand-in for such a model is sketched below.
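As a concrete stand-in for the unspecified convolutional model, a toy PyTorch feature extractor; the `TinyFaceNet` architecture, the 128-dimensional embedding, and the 112x112 input size are all assumptions, since the patent only requires that some CNN extract facial feature information.

```python
import torch
import torch.nn as nn

class TinyFaceNet(nn.Module):
    """Toy convolutional extractor producing a facial feature embedding."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # global average pool to 1x1
        )
        self.fc = nn.Linear(32, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(self.features(x).flatten(1))

face = torch.rand(1, 3, 112, 112)  # a captured RGB face crop (assumed size)
embedding = TinyFaceNet()(face)    # facial feature vector, shape (1, 128)
```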
Furthermore, after detecting the wake-up audio data and the facial features uploaded by the user, the application can also generate a corresponding evaluation index according to the definition of the facial features: the higher the definition of the facial features, the higher the corresponding evaluation index. One possible sharpness measure is sketched below.
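A common sharpness proxy is the variance of the Laplacian; using it here (via OpenCV) and the scaling constants are assumptions, since the patent states only that higher facial-feature definition yields a higher index.

```python
import cv2

def face_clarity(gray_face) -> float:
    # Variance of the Laplacian: larger values indicate a sharper image.
    return float(cv2.Laplacian(gray_face, cv2.CV_64F).var())

def clarity_adjusted_index(base_index: float, gray_face) -> float:
    # Sharper face capture raises the index (illustrative scaling).
    sharp = min(face_clarity(gray_face) / 100.0, 1.0)
    return base_index * (0.5 + 0.5 * sharp)
```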
S204, displaying the evaluation index.
In the present application, taking figs. 4a to 4d, in which the target device is a mobile phone, as an example: after the phone obtains the wake-up audio data for waking up the target device (as shown in fig. 4a), an evaluation index corresponding to the wake-up audio data and characterizing the success rate of waking up the target device can be generated based on the wake-up audio data and the preset matching strategy (as shown in fig. 4b). The evaluation index can then be displayed by the display unit of the phone (as shown in figs. 4c and 4d).
In the present application, after wake-up audio data for waking up a target device is acquired, an evaluation index characterizing the success rate with which the wake-up audio data wakes up the target device can be generated based on the wake-up audio data and a preset matching strategy, and the evaluation index is then displayed. With this technical scheme, after audio data customized by a user for waking up a smart device is received, evaluation information indicating whether the audio data is suitable as a wake-up voice can be obtained from the degree to which the audio data matches the preset strategy, so that the user can decide, according to this information, whether to re-record a wake-up voice with a higher wake-up success rate. This avoids the low wake-up accuracy that user-defined wake-up voices tend to cause in the related art.
In another embodiment of the present application, as shown in fig. 5, the present application further provides an apparatus for evaluating wake-up audio data. The apparatus includes an acquisition module 301, a generation module 302, and a display module 303, where:
an acquisition module 301 configured to acquire wake-up audio data, where the wake-up audio data is used to wake up a target device;
the generating module 302 is configured to generate an evaluation index corresponding to the wake-up audio data based on the wake-up audio data and a preset matching policy, where the evaluation index is used to characterize a success rate of the wake-up audio data to wake up the target device;
a display module 303 configured to display the evaluation index.
In the present application, after wake-up audio data for waking up a target device is acquired, an evaluation index characterizing the success rate with which the wake-up audio data wakes up the target device can be generated based on the wake-up audio data and a preset matching strategy, and the evaluation index is then displayed. With this technical scheme, after audio data customized by a user for waking up a smart device is received, evaluation information indicating whether the audio data is suitable as a wake-up voice can be obtained from the degree to which the audio data matches the preset strategy, so that the user can decide, according to this information, whether to re-record a wake-up voice with a higher wake-up success rate. This avoids the low wake-up accuracy that user-defined wake-up voices tend to cause in the related art.
In another embodiment of the present application, the generating module 302 further includes:
the generating module 302 is configured to perform similarity matching on the wake-up audio data and each audio data in a preset matching database, where the audio data includes voice data and noise data;
the generating module 302 is configured to generate the evaluation index based on the magnitude relation between the similarity matching result and the preset number.
In another embodiment of the present application, the generating module 302 further includes:
the generating module 302 is configured to perform similarity matching on the wake-up audio data and the audio data respectively, and determine the number of first audio data corresponding to the wake-up audio data, where the first audio data is audio data with a pronunciation similarity with the wake-up audio data exceeding a first threshold;
and/or,
the generating module 302 is configured to perform similarity matching on the wake-up audio data and the audio data respectively, and determine the number of second audio data corresponding to the wake-up audio data, where the second audio data is audio data with semantic similarity with the wake-up audio data exceeding a second threshold.
In another embodiment of the present application, the generating module 302 further includes:
the generating module 302 is configured to parse the wake-up audio data to obtain an audio feature parameter corresponding to the wake-up audio data, where the audio feature parameter is used to reflect pronunciation information of the wake-up audio data;
a generation module 302 is configured to generate the evaluation index based on the audio feature parameters and the matching policy.
In another embodiment of the present application, the generating module 302 further includes:
a generating module 302, configured to determine a pronunciation duration and a pronunciation speed corresponding to the wake-up audio data based on the audio feature parameter;
the generating module 302 is configured to generate the evaluation index based on the matching relationship between the pronunciation duration and the first preset condition and the matching relationship between the pronunciation speed and the second preset condition.
In another embodiment of the present application, the generating module 302 further includes:
a generating module 302, configured to determine, based on the audio feature parameter, language information and intonation information corresponding to the wake-up audio data;
the generating module 302 is configured to generate the evaluation index based on the matching relationship between the language information and the third preset condition and the matching relationship between the intonation information and the fourth preset condition.
In another embodiment of the present application, the generating module 302 further includes:
a generation module 302 configured to acquire a wake-up biometric of the target user using the camera acquisition unit, the wake-up biometric corresponding to at least one of facial features and gesture features;
the generating module 302 is configured to generate an evaluation index corresponding to the wake-up audio data based on the wake-up audio data, the wake-up biometric feature and the matching policy.
Fig. 6 is a block diagram of a logic structure of an electronic device, according to an example embodiment. For example, electronic device 400 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.
Referring to fig. 6, electronic device 400 may include one or more of the following components: a processor 401 and a memory 402.
The processor 401 may include one or more processing cores, such as a 4-core or 8-core processor. The processor 401 may be implemented in at least one of the following hardware forms: DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 401 may also include a main processor and a coprocessor: the main processor processes data in the awake state and is also called a CPU (Central Processing Unit); the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 401 may integrate a GPU (Graphics Processing Unit) responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 401 may also include an AI (Artificial Intelligence) processor for handling machine-learning-related computing operations.
The memory 402 may include one or more computer-readable storage media, which may be non-transitory. The memory 402 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in the memory 402 stores at least one instruction that is executed by the processor 401 to implement the method for evaluating wake-up audio data provided by the method embodiments of the present application.
In some embodiments, the electronic device 400 may further optionally include: a peripheral interface 403 and at least one peripheral. The processor 401, memory 402, and peripheral interface 403 may be connected by a bus or signal line. The individual peripheral devices may be connected to the peripheral device interface 403 via buses, signal lines or a circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 404, a touch display 405, a camera 406, audio circuitry 407, a positioning component 408, and a power supply 409.
Peripheral interface 403 may be used to connect at least one Input/Output (I/O) related peripheral to processor 401 and memory 402. In some embodiments, processor 401, memory 402, and peripheral interface 403 are integrated on the same chip or circuit board; in some other embodiments, either or both of the processor 401, memory 402, and peripheral interface 403 may be implemented on separate chips or circuit boards, which is not limited in this embodiment.
The Radio Frequency circuit 404 is configured to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 404 communicates with a communication network and other communication devices via electromagnetic signals, converting an electrical signal into an electromagnetic signal for transmission or converting a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 404 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 404 may communicate with other terminals via at least one wireless communication protocol, including but not limited to metropolitan area networks, various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 404 may also include NFC (Near Field Communication) related circuitry, which is not limited in the present application.
The display screen 405 is used to display a UI (User Interface), which may include graphics, text, icons, video, and any combination thereof. When the display screen 405 is a touch display screen, it can also collect touch signals on or above its surface; such a touch signal may be input to the processor 401 as a control signal for processing, and the display screen 405 may then also provide virtual buttons and/or a virtual keyboard (also called soft buttons and/or a soft keyboard). In some embodiments, there may be one display screen 405 forming the front panel of the electronic device 400; in other embodiments, there may be at least two display screens 405 disposed on different surfaces of the electronic device 400 or in a folded design; in still other embodiments, the display screen 405 may be a flexible screen disposed on a curved or folded surface of the electronic device 400. The display screen 405 may even be arranged in an irregular, non-rectangular pattern, i.e., an irregularly-shaped screen, and may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).
The camera assembly 406 is used to capture images or video. Optionally, the camera assembly 406 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera on its rear surface. In some embodiments, there are at least two rear cameras, each being one of a main camera, a depth camera, a wide-angle camera, and a telephoto camera, so that the main camera can be fused with the depth camera for a background-blurring function, or with the wide-angle camera for panoramic and Virtual Reality (VR) shooting or other fused shooting functions. In some embodiments, the camera assembly 406 may also include a flash, which can be a single-color-temperature or dual-color-temperature flash; a dual-color-temperature flash combines a warm-light flash and a cold-light flash and can be used for light compensation at different color temperatures.
The audio circuit 407 may include a microphone and a speaker. The microphone is used for collecting sound waves of users and environments, converting the sound waves into electric signals, and inputting the electric signals to the processor 401 for processing, or inputting the electric signals to the radio frequency circuit 404 for realizing voice communication. For purposes of stereo acquisition or noise reduction, the microphone may be multiple and separately disposed at different locations of the electronic device 400. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is used to convert electrical signals from the processor 401 or the radio frequency circuit 404 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some embodiments, audio circuit 407 may also include a headphone jack.
The positioning component 408 is used to determine the current geographic location of the electronic device 400 for navigation or LBS (Location Based Service). The positioning component 408 may be based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
The power supply 409 is used to power the various components in the electronic device 400. The power supply 409 may be an alternating current, a direct current, a disposable battery, or a rechargeable battery. When power supply 409 comprises a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, the electronic device 400 further includes one or more sensors 410. The one or more sensors 410 include, but are not limited to: acceleration sensor 411, gyroscope sensor 412, pressure sensor 413, fingerprint sensor 414, optical sensor 415, and proximity sensor 416.
The acceleration sensor 411 may detect the magnitudes of accelerations on three coordinate axes of the coordinate system established with the electronic device 400. For example, the acceleration sensor 411 may be used to detect components of gravitational acceleration on three coordinate axes. The processor 401 may control the touch display screen 405 to display a user interface in a lateral view or a longitudinal view according to the gravitational acceleration signal acquired by the acceleration sensor 411. The acceleration sensor 411 may also be used for the acquisition of motion data of a game or a user.
The gyro sensor 412 may detect a body direction and a rotation angle of the electronic device 400, and the gyro sensor 412 may collect a 3D motion of the user on the electronic device 400 in cooperation with the acceleration sensor 411. The processor 401 may implement the following functions according to the data collected by the gyro sensor 412: motion sensing (e.g., changing UI according to a tilting operation by a user), image stabilization at shooting, game control, and inertial navigation.
The pressure sensor 413 may be disposed at a side frame of the electronic device 400 and/or at an underlying layer of the touch screen 405. When the pressure sensor 413 is disposed on a side frame of the electronic device 400, a grip signal of the user on the electronic device 400 may be detected, and the processor 401 performs a left-right hand recognition or a shortcut operation according to the grip signal collected by the pressure sensor 413. When the pressure sensor 413 is disposed at the lower layer of the touch display screen 405, the processor 401 controls the operability control on the UI interface according to the pressure operation of the user on the touch display screen 405. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The fingerprint sensor 414 is used to collect the user's fingerprint, and the processor 401 identifies the user from the fingerprint collected by the fingerprint sensor 414 (or the fingerprint sensor 414 performs the identification itself). Upon recognizing the user's identity as trusted, the processor 401 authorizes the user to perform related sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, and changing settings. The fingerprint sensor 414 may be provided on the front, back, or side of the electronic device 400; when a physical key or vendor logo is provided on the electronic device 400, the fingerprint sensor 414 may be integrated with it.
The optical sensor 415 is used to collect the ambient light intensity. In one embodiment, the processor 401 may control the display brightness of the touch display screen 405 according to the ambient light intensity collected by the optical sensor 415. Specifically, when the intensity of the ambient light is high, the display brightness of the touch display screen 405 is turned up; when the ambient light intensity is low, the display brightness of the touch display screen 405 is turned down. In another embodiment, the processor 401 may also dynamically adjust the shooting parameters of the camera assembly 406 according to the ambient light intensity collected by the optical sensor 415.
A proximity sensor 416, also referred to as a distance sensor, is typically provided on the front panel of the electronic device 400. The proximity sensor 416 is used to collect distance between the user and the front of the electronic device 400. In one embodiment, when the proximity sensor 416 detects a gradual decrease in the distance between the user and the front of the electronic device 400, the processor 401 controls the touch display 405 to switch from the bright screen state to the off screen state; when the proximity sensor 416 detects that the distance between the user and the front surface of the electronic device 400 gradually increases, the processor 401 controls the touch display screen 405 to switch from the off-screen state to the on-screen state.
Those skilled in the art will appreciate that the structure shown in fig. 6 is not limiting of the electronic device 400 and may include more or fewer components than shown, or may combine certain components, or may employ a different arrangement of components.
In an exemplary embodiment, there is also provided a non-transitory computer-readable storage medium, such as the memory 402, including instructions executable by the processor 401 of the electronic device 400 to perform the method for evaluating wake-up audio data described above, the method including: acquiring wake-up audio data, where the wake-up audio data is used for waking up a target device; generating an evaluation index corresponding to the wake-up audio data based on the wake-up audio data and a preset matching strategy, where the evaluation index characterizes the success rate of the wake-up audio data for waking up the target device; and displaying the evaluation index. Optionally, the above instructions may also be executed by the processor 401 of the electronic device 400 to perform the other steps involved in the above-described exemplary embodiments. For example, the non-transitory computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, there is also provided an application/computer program product comprising one or more instructions executable by the processor 401 of the electronic device 400 to perform the method of evaluating wake-up audio data described above, the method comprising: acquiring wake-up audio data, wherein the wake-up audio data is used for waking up a target device; generating an evaluation index corresponding to the wake-up audio data based on the wake-up audio data and a preset matching strategy, wherein the evaluation index is used for representing the success rate of the wake-up audio data in waking up the target device; and displaying the evaluation index. Optionally, the above instructions may also be executed by the processor 401 of the electronic device 400 to perform the other steps involved in the above-described exemplary embodiments. Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the application being indicated by the following claims.
It is to be understood that the application is not limited to the precise arrangements and instrumentalities described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the application is limited only by the appended claims.

Claims (9)

1. A method of evaluating wake-up audio data, comprising:
acquiring wake-up audio data, wherein the wake-up audio data is used for waking up a target device;
based on the wake-up audio data and a preset matching strategy, performing similarity matching between the wake-up audio data and each piece of audio data in a preset matching database, wherein the audio data comprises voice data and noise data;
generating an evaluation index corresponding to the wake-up audio data based on the magnitude relation between the similarity matching result and a preset number, wherein the evaluation index is used for representing the success rate of the wake-up audio data in waking up the target device;
displaying the evaluation index;
wherein the performing similarity matching between the wake-up audio data and each piece of audio data in the preset matching database comprises:
performing similarity matching between the wake-up audio data and each piece of audio data, and determining the number of second audio data corresponding to the wake-up audio data, wherein the second audio data is audio data whose semantic similarity with the wake-up audio data exceeds a second threshold, the number of second audio data is used as the similarity matching result, and the semantic similarity is used for representing the degree to which the semantics of the audio data in the matching database are similar or identical to the semantics of the wake-up audio data.
2. The method of claim 1, wherein the performing similarity matching between the wake-up audio data and each piece of audio data in the preset matching database further comprises:
performing similarity matching between the wake-up audio data and each piece of audio data, and determining the number of first audio data corresponding to the wake-up audio data, wherein the first audio data is audio data whose pronunciation similarity with the wake-up audio data exceeds a first threshold.
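For illustration only, a minimal sketch of the counting logic of claims 1 and 2; the similarity callbacks, threshold values, preset number, and the rule combining the two counts are assumptions, since the claims only require comparing similarities against thresholds and comparing the resulting count with a preset number.

    FIRST_THRESHOLD = 0.8    # pronunciation-similarity threshold (assumed value)
    SECOND_THRESHOLD = 0.8   # semantic-similarity threshold (assumed value)
    PRESET_NUMBER = 3        # tolerated number of confusable entries (assumed)

    def evaluation_index(wake_audio, matching_database,
                         pronunciation_sim, semantic_sim):
        """Count confusable database entries and compare against a preset number."""
        first = sum(1 for a in matching_database
                    if pronunciation_sim(wake_audio, a) > FIRST_THRESHOLD)
        second = sum(1 for a in matching_database
                     if semantic_sim(wake_audio, a) > SECOND_THRESHOLD)
        # many similar voice or noise entries suggest false or missed wake-ups,
        # so the index is high only while the counts stay within the preset number
        return 1.0 if first + second <= PRESET_NUMBER else 0.0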
3. The method of claim 1, wherein the generating the evaluation index corresponding to the wake-up audio data comprises:
analyzing the wake-up audio data to obtain audio characteristic parameters corresponding to the wake-up audio data, wherein the audio characteristic parameters are used for reflecting pronunciation information of the wake-up audio data;
and generating the evaluation index based on the audio characteristic parameters and the matching strategy.
4. The method of claim 3, wherein the generating the evaluation index based on the audio characteristic parameters and the matching strategy comprises:
determining a pronunciation duration and a pronunciation speed corresponding to the wake-up audio data based on the audio characteristic parameters;
and generating the evaluation index based on the matching relation between the pronunciation duration and a first preset condition and the matching relation between the pronunciation speed and a second preset condition.
5. The method of claim 3 or 4, wherein the generating the evaluation index based on the audio characteristic parameters and the matching strategy comprises:
determining language information and intonation information corresponding to the wake-up audio data based on the audio characteristic parameters;
and generating the evaluation index based on the matching relation between the language information and a third preset condition and the matching relation between the intonation information and a fourth preset condition.
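For illustration only, a minimal sketch of the feature checks of claims 4 and 5; every concrete range and the scoring rule are assumptions, since the claims leave the preset conditions unspecified.

    def index_from_features(duration_s, speed_sps, language, intonation):
        """Score the fraction of preset conditions the audio features satisfy."""
        checks = [
            1.0 <= duration_s <= 3.0,          # first preset condition (assumed)
            2.0 <= speed_sps <= 5.0,           # second preset condition (assumed)
            language in {"zh", "en"},          # third preset condition (assumed)
            intonation in {"rising", "flat"},  # fourth preset condition (assumed)
        ]
        return sum(checks) / len(checks)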
6. The method of claim 1, further comprising, after the acquiring of the wake-up audio data:
acquiring a wake-up biometric feature of a target user by using a camera acquisition unit, wherein the wake-up biometric feature corresponds to at least one of a facial feature and a gesture feature;
and generating the evaluation index corresponding to the wake-up audio data based on the wake-up audio data, the wake-up biometric feature, and the matching strategy.
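For illustration only, a minimal sketch of claim 6's combination of audio and biometric inputs; the linear weighting is an assumption, since the claim only requires that both inputs feed the matching strategy.

    def combined_index(audio_index, biometric_match, weight=0.7):
        """Blend the audio-only index with a facial or gesture match score."""
        # both scores are assumed to lie in [0, 1]; the 0.7 weight is illustrative
        return weight * audio_index + (1 - weight) * biometric_match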
7. An apparatus for evaluating wake-up audio data, comprising:
an acquisition module, configured to acquire wake-up audio data, wherein the wake-up audio data is used for waking up a target device;
a generation module, configured to perform similarity matching between the wake-up audio data and each piece of audio data in a preset matching database based on the wake-up audio data and a preset matching strategy, wherein the audio data comprises voice data and noise data;
wherein the generation module is further configured to generate an evaluation index corresponding to the wake-up audio data based on the magnitude relation between the similarity matching result and a preset number, wherein the evaluation index is used for representing the success rate of the wake-up audio data in waking up the target device;
a presentation module, configured to display the evaluation index;
wherein the generation module is configured to perform similarity matching between the wake-up audio data and each piece of audio data, and to determine the number of second audio data corresponding to the wake-up audio data, wherein the second audio data is audio data whose semantic similarity with the wake-up audio data exceeds a second threshold, the number of second audio data is used as the similarity matching result, and the semantic similarity is used for representing the degree to which the semantics of the audio data in the matching database are similar or identical to the semantics of the wake-up audio data.
8. An electronic device, comprising: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the operations of the method of evaluating wake-up audio data as claimed in any one of claims 1-6.
9. A computer-readable storage medium storing computer-readable instructions that, when executed, perform the operations of the method of evaluating wake-up audio data as claimed in any one of claims 1-6.
CN202010101559.9A 2020-02-19 2020-02-19 Method, device, electronic equipment and medium for evaluating wake-up audio data Active CN111341317B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010101559.9A CN111341317B (en) 2020-02-19 2020-02-19 Method, device, electronic equipment and medium for evaluating wake-up audio data

Publications (2)

Publication Number Publication Date
CN111341317A CN111341317A (en) 2020-06-26
CN111341317B true CN111341317B (en) 2023-09-01

Family

ID=71185470

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010101559.9A Active CN111341317B (en) 2020-02-19 2020-02-19 Method, device, electronic equipment and medium for evaluating wake-up audio data

Country Status (1)

Country Link
CN (1) CN111341317B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112489653B (en) * 2020-11-16 2024-04-26 北京小米松果电子有限公司 Speech recognition method, device and storage medium
CN112863545B (en) * 2021-01-13 2023-10-03 抖音视界有限公司 Performance test method, device, electronic equipment and computer readable storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001013992A (en) * 1999-07-02 2001-01-19 Nec Corp Voice understanding device
JP2008262120A (en) * 2007-04-13 2008-10-30 Nippon Hoso Kyokai <Nhk> Utterance evaluation device and program
CN104584119A (en) * 2012-07-03 2015-04-29 谷歌公司 Determining hotword suitability
US9275637B1 (en) * 2012-11-06 2016-03-01 Amazon Technologies, Inc. Wake word evaluation
CN106611597A (en) * 2016-12-02 2017-05-03 百度在线网络技术(北京)有限公司 Voice wakeup method and voice wakeup device based on artificial intelligence
CN106847273A (en) * 2016-12-23 2017-06-13 北京云知声信息技术有限公司 The wake-up selected ci poem selection method and device of speech recognition
CN108181992A (en) * 2018-01-22 2018-06-19 北京百度网讯科技有限公司 Voice awakening method, device, equipment and computer-readable medium based on gesture
CN108536668A (en) * 2018-02-26 2018-09-14 科大讯飞股份有限公司 Wake up word appraisal procedure and device, storage medium, electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8719039B1 (en) * 2013-12-05 2014-05-06 Google Inc. Promoting voice actions to hotwords

Also Published As

Publication number Publication date
CN111341317A (en) 2020-06-26

Similar Documents

Publication Publication Date Title
CN111445901B (en) Audio data acquisition method and device, electronic equipment and storage medium
CN110956971B (en) Audio processing method, device, terminal and storage medium
CN110341627B (en) Method and device for controlling behavior in vehicle
CN111739517B (en) Speech recognition method, device, computer equipment and medium
WO2020211607A1 (en) Video generation method, apparatus, electronic device, and medium
CN110933468A (en) Playing method, playing device, electronic equipment and medium
CN111681655A (en) Voice control method and device, electronic equipment and storage medium
CN111613213B (en) Audio classification method, device, equipment and storage medium
CN110675473B (en) Method, device, electronic equipment and medium for generating GIF dynamic diagram
CN111341317B (en) Method, device, electronic equipment and medium for evaluating wake-up audio data
CN111862972B (en) Voice interaction service method, device, equipment and storage medium
CN112860046B (en) Method, device, electronic equipment and medium for selecting operation mode
CN111428079B (en) Text content processing method, device, computer equipment and storage medium
CN110853124B (en) Method, device, electronic equipment and medium for generating GIF dynamic diagram
CN114547429A (en) Data recommendation method and device, server and storage medium
CN108831423B (en) Method, device, terminal and storage medium for extracting main melody tracks from audio data
CN113744736B (en) Command word recognition method and device, electronic equipment and storage medium
CN113408809B (en) Design scheme evaluation method and device for automobile and computer storage medium
CN112365088B (en) Method, device and equipment for determining travel key points and readable storage medium
CN111028846B (en) Method and device for registration of wake-up-free words
CN112560472B (en) Method and device for identifying sensitive information
CN111325083B (en) Method and device for recording attendance information
CN110989963B (en) Wake-up word recommendation method and device and storage medium
CN114489559B (en) Audio playing method, audio playing processing method and device
CN113362836B (en) Vocoder training method, terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant