CN111341317A - Method and device for evaluating wake-up audio data, electronic device, and medium - Google Patents

Method and device for evaluating wake-up audio data, electronic device, and medium

Info

Publication number
CN111341317A
CN111341317A (application CN202010101559.9A)
Authority
CN
China
Prior art keywords
audio data
awakening
wake
evaluation index
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010101559.9A
Other languages
Chinese (zh)
Other versions
CN111341317B (en
Inventor
欧双
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202010101559.9A priority Critical patent/CN111341317B/en
Publication of CN111341317A publication Critical patent/CN111341317A/en
Application granted granted Critical
Publication of CN111341317B publication Critical patent/CN111341317B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks

Abstract

The application discloses a method and apparatus for evaluating wake-up audio data, an electronic device, and a medium. After wake-up audio data for waking a target device is acquired, an evaluation index representing the success rate with which the wake-up audio data wakes the target device can be generated based on the wake-up audio data and a preset matching policy, and the evaluation index is then presented. With this technical scheme, after user-customized audio data intended to wake a smart device is received, evaluation information on whether the audio data is suitable as a wake-up voice is obtained according to a preset policy and the degree of matching against that policy, so that the user can decide, based on this information, whether to re-record a wake-up voice with a higher wake-up success rate. This avoids the low wake-up accuracy that user-defined wake-up voices tend to cause in the related art.

Description

Method and device for evaluating wake-up audio data, electronic device, and medium
Technical Field
The present application relates to data processing technology, and in particular to a method and apparatus for evaluating wake-up audio data, an electronic device, and a medium.
Background
With the development of communications technology, smart devices have advanced rapidly and are used by an ever-growing number of people.
In the related art, a user can place a smart device somewhere in a room and have it serve various needs through voice wake-up. Taking a smartphone as an example, when the user wants to use it, the user speaks the corresponding wake-up word to wake the phone and then uses functions such as placing a call. Likewise, for a smart speaker, the user speaks the corresponding wake-up word to wake the speaker and then uses functions such as playing music. The wake-up word can be customized: the user records a specific utterance, which the smart device then uses as its wake-up word.
However, waking a smart device with such a user-defined wake-up word may suffer from poor wake-up accuracy, which degrades the user experience.
Disclosure of Invention
Embodiments of the present application provide a method and apparatus for evaluating wake-up audio data, an electronic device, and a medium, which can solve the problem in the related art that user-defined wake-up voices for smart devices tend to yield low wake-up accuracy.
According to one aspect of the embodiments of the present application, a method for evaluating wake-up audio data is provided, including:
acquiring wake-up audio data, the wake-up audio data being used to wake a target device;
generating an evaluation index corresponding to the wake-up audio data based on the wake-up audio data and a preset matching policy, the evaluation index representing the success rate with which the wake-up audio data wakes the target device; and
presenting the evaluation index.
According to another aspect of the embodiments of the present application, an apparatus for evaluating wake-up audio data is provided, including:
an acquisition module configured to acquire wake-up audio data, the wake-up audio data being used to wake a target device;
a generating module configured to generate an evaluation index corresponding to the wake-up audio data based on the wake-up audio data and a preset matching policy, the evaluation index representing the success rate with which the wake-up audio data wakes the target device; and
a presentation module configured to present the evaluation index.
According to another aspect of the embodiments of the present application, an electronic device is provided, including a processor and a memory, the memory storing a computer program adapted to be loaded by the processor to perform the operations of any of the above methods for evaluating wake-up audio data.
According to a further aspect of the embodiments of the present application, a computer-readable storage medium is provided for storing computer-readable instructions which, when executed, perform the operations of any of the above methods for evaluating wake-up audio data.
In the present application, after wake-up audio data for waking a target device is acquired, an evaluation index representing the success rate with which the wake-up audio data wakes the target device can be generated based on the wake-up audio data and a preset matching policy, and the evaluation index is then presented. With this scheme, after user-customized audio data intended to wake a smart device is received, evaluation information on whether the audio data is suitable as a wake-up voice is obtained according to a preset policy and the degree of matching against it, so that the user can decide whether to re-record a wake-up voice with a higher wake-up success rate. This avoids the low wake-up accuracy that user-defined wake-up voices tend to cause in the related art.
The technical solution of the present application is further described in detail by the accompanying drawings and examples.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description, serve to explain the principles of the application.
The present application may be more clearly understood from the following detailed description with reference to the accompanying drawings, in which:
FIG. 1 is a schematic diagram of a communication system architecture according to the present application;
FIG. 2 is a schematic diagram of a method for evaluating wake-up audio data according to the present application;
FIG. 3 is a schematic diagram of a method for evaluating wake-up audio data according to the present application;
FIGS. 4a-4d are schematic diagrams of a first scene image according to the present application;
FIG. 5 is a schematic structural diagram of an apparatus for evaluating wake-up audio data according to the present application;
FIG. 6 is a schematic structural diagram of an electronic device according to the present application.
Detailed Description
Various exemplary embodiments of the present application will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present application unless specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the application, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
In addition, technical solutions of the various embodiments of the present application may be combined with each other, provided the combination can be realized by a person skilled in the art; where the combined technical solutions are contradictory or cannot be realized, the combination should be considered absent and outside the protection scope of the present application.
It should be noted that all directional indicators in the embodiments of the present application (such as up, down, left, right, front, and rear) are used only to explain the relative positions and motion of components in a specific posture (as shown in the drawings); if that posture changes, the directional indicators change accordingly.
A method for evaluating wake-up audio data according to an exemplary embodiment of the present application is described below with reference to FIGS. 1-4. It should be noted that the following application scenarios are shown merely to facilitate understanding of the spirit and principles of the present application, and the embodiments are not limited in this respect; rather, embodiments of the present application may be applied to any applicable scenario.
For example, the embodiments of the present application can be applied to a wireless communication system. The wireless communication systems mentioned in the embodiments include, but are not limited to, 5G mobile communication systems and next-generation mobile communication systems supporting Enhanced Mobile Broadband (eMBB), Ultra-Reliable Low-Latency Communication (URLLC), and massive Machine-Type Communication (mMTC).
In the embodiments of the present application, a terminal (terminal device) includes, but is not limited to, a mobile station (MS), a mobile terminal, a mobile phone, a handset, portable equipment, and the like. The terminal may communicate with one or more core networks through a radio access network (RAN). For example, the terminal may be a mobile phone (or "cellular" phone) or a computer with wireless communication capability; the terminal may also be a portable, pocket-sized, handheld, computer-embedded, or vehicle-mounted mobile device.
Fig. 1 is a schematic diagram of a communication system architecture provided in the present application.
Referring to fig. 1, a communication system 01 includes a network device 101 and a terminal 102, where the network device 101 is deployed in NSA mode. When the communication system 01 includes a core network, the network device 101 may also be connected to it. The network device 101 may further communicate with an Internet Protocol (IP) network 200, such as the Internet, a private IP network, or another data network. The network device provides services for terminals within its coverage area; for example, referring to fig. 1, network device 101 provides wireless access to one or more terminals within its coverage. Network devices can also communicate with each other.
The network device 101 may be any device for communicating with a terminal, such as a relay station, an access point, or a vehicle-mounted device. In a device-to-device (D2D) communication system, the network device may also be a terminal acting as a base station. A terminal may include various handheld devices, vehicle-mounted devices, wearable devices, computing devices, or other processing devices connected to a wireless modem with wireless communication capability, as well as various forms of user equipment (UE), mobile stations (MS), and the like.
In the related art, when a user customizes a wake-up voice for a smart device, the user typically records a specific utterance and then uses that utterance directly as the device's wake-up word. However, waking a smart device in this way may suffer from poor wake-up accuracy, which degrades the user experience.
To address this problem, the present application provides a method and apparatus for evaluating wake-up audio data, an electronic device, and a medium.
Fig. 2 schematically shows a flowchart of a method for evaluating wake-up audio data according to an embodiment of the present application. As shown in fig. 2, the method includes the following steps.
S101: acquire wake-up audio data, the wake-up audio data being used to wake a target device.
It should be noted that the device acquiring the wake-up audio data is not specifically limited in the present application; it may be, for example, a smart device or a server. The smart device may be a PC (personal computer), a smartphone, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, or a portable terminal device with a display function such as a laptop computer.
Further, the target device is not specifically limited in the present application. In one embodiment, it may be a smart device operated by intelligent voice. For example, it may be a smart speaker, an upgraded form of loudspeaker with which household users can request songs and set alarm clocks by voice. Or, taking a smartphone as an example, a user can use a voice assistant installed on the phone to play audio, shop online, check the weather forecast, and control smart home devices, for example opening curtains, setting a refrigerator's temperature, or preheating a water heater in advance.
The voice assistant of a smartphone is an intelligent application that helps the user complete various tasks and handle everyday needs through interactive conversation and instant question answering.
When a user needs to use the target device, the sleeping device can be awakened by a wake-up voice so that it performs the desired function once awake. For example, when a mobile phone is in a sleeping or screen-locked state, it can enter a command-waiting state directly upon detecting the user's designated wake-up sound (a preset voice instruction, i.e., the wake-up audio data), which starts the first step of voice interaction.
In addition, the wake-up audio data is not specifically limited in the present application. It can be a voice segment set according to the user's own preference, for example one or more words. In one embodiment, the wake-up audio data may correspond to one or more languages, for example Chinese, English, or Japanese.
S102: generate an evaluation index corresponding to the wake-up audio data based on the wake-up audio data and a preset matching policy, the evaluation index representing the success rate with which the wake-up audio data wakes the target device.
After the wake-up audio data is acquired, the preset matching policy can be used to generate the corresponding evaluation index for the user. It will be appreciated that the result reflected by the evaluation index can help the user determine whether the wake-up audio data is suitable as a wake-up voice for the target device.
The matching policy is not specifically limited in the present application; that is, the evaluation index of the wake-up audio data may be generated according to any standard or condition.
S103: present the evaluation index.
The manner of presenting the evaluation index is likewise not specifically limited; for example, it may be shown on the display screen of a mobile phone or of a smart speaker.
In the present application, after wake-up audio data for waking a target device is acquired, an evaluation index representing the success rate with which the wake-up audio data wakes the target device can be generated based on the wake-up audio data and a preset matching policy, and the evaluation index is then presented. In this way, after user-customized audio data intended to wake a smart device is received, evaluation information on whether the audio data is suitable as a wake-up voice is obtained according to a preset policy and the degree of matching against it, so that the user can decide whether to re-record a wake-up voice with a higher wake-up success rate. This avoids the low wake-up accuracy that user-defined wake-up voices tend to cause in the related art.
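The acquire/evaluate/present structure of S101-S103 can be organized as a small pipeline. The sketch below is purely illustrative: the class, method names, and the trivial length-based policy are assumptions for demonstration, not anything specified by the patent.

```python
# Hypothetical sketch of the S101-S103 flow; names are illustrative only.
class WakeAudioEvaluator:
    def __init__(self, matching_policy):
        # matching_policy: a callable mapping audio data to a score in [0, 1].
        self.matching_policy = matching_policy
        self.audio = b""

    def acquire(self, audio_bytes):
        """S101: acquire the wake-up audio data (here, raw bytes)."""
        self.audio = audio_bytes
        return self.audio

    def evaluate(self):
        """S102: generate an evaluation index via the preset matching policy."""
        return self.matching_policy(self.audio)

    def present(self, index):
        """S103: present the evaluation index (here, a formatted string)."""
        return f"wake-up suitability: {index:.0%}"


# Usage with a trivial stand-in policy that favors longer recordings.
policy = lambda audio: min(len(audio) / 100.0, 1.0)
evaluator = WakeAudioEvaluator(policy)
evaluator.acquire(b"\x00" * 80)
print(evaluator.present(evaluator.evaluate()))  # wake-up suitability: 80%
```

Any of the concrete matching policies described below could be dropped in as `matching_policy` without changing the surrounding flow.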
Optionally, in one possible implementation of the present application, in S102 (generating the evaluation index corresponding to the wake-up audio data), the evaluation index may be generated as follows:
performing similarity matching between the wake-up audio data and each piece of audio data in a preset matching database to generate corresponding similarity matching values, the audio data including voice data and noise data.
The audio data stored in the matching database is not specifically limited in the present application; it may include, for example, one or more pieces of voice data and one or more pieces of noise data. In one embodiment, if the wake-up audio data successfully matches one or more pieces of noise data in the matching database, this indicates that the recording of the wake-up audio data contains excessive noise interference, and a lower evaluation index may be generated for that audio data.
The noise data stored in the matching database is likewise not specifically limited. It may be, for example: additive acoustic noise, i.e., the background environmental sound picked up by the microphone while recording speech; acoustic reverberation noise, the additive effect caused by multipath reflections; convolutional channel-effect noise, i.e., an uneven or bandwidth-limited response that cannot be effectively modeled when performing channel equalization to remove the channel impulse response; nonlinear distortion noise, such as that caused by improper gain at the signal input; or additive wideband electronic noise, electrical interference noise, distortion caused by insufficient microphone frequency response, and the like.
Further, in the embodiments of the present application, similarity matching between the wake-up audio data and each piece of audio data in the preset matching database may be performed in either or both of the following ways.
First way:
perform similarity matching between the wake-up audio data and each piece of audio data, and determine the amount of first audio data corresponding to the wake-up audio data, the first audio data being audio data whose pronunciation similarity to the wake-up audio data exceeds a first threshold.
The voice data in the matching database is not specifically limited in the present application. It may include, for example, voice data occurring in movie scenes, in songs, or in everyday conversation.
It can be understood that after similarity matching, the corresponding similarity matching value can be generated according to the amount of first audio data. In one embodiment, the more pieces of audio data whose pronunciation similarity to the wake-up audio data exceeds the first threshold (the larger the amount of first audio data), the more audio data in the matching database sounds similar or identical to the wake-up audio data; that is, the wake-up audio data is likely to occur in many speech contexts. Therefore, to avoid false wake-ups, the evaluation index can be lowered accordingly when the amount of first audio data is too large.
For example, take the wake-up audio data "I". This audio is likely to appear in many speech contexts in the matching database (e.g., movie scenes, songs, everyday conversation, workplace conversation), so the amount of audio data whose pronunciation is similar or identical to it (pronunciation similarity above the first threshold) is high. To avoid the device being falsely awakened when the target device detects "I" in a situation where the user does not intend to wake it, the present application can generate a matching result whose evaluation index is "unsuitable as wake-up voice data".
As another example, take the wake-up audio data "classmate Peach". This audio rarely appears in the various speech contexts of the matching database (e.g., movie scenes, songs, everyday conversation, workplace conversation), so the amount of audio data whose pronunciation is similar or identical to it (pronunciation similarity above the first threshold) is small. Accordingly, to reduce false wake-ups, a matching result whose evaluation index is "suitable as wake-up voice data" can be generated.
It should be noted that the first threshold is not specifically limited in the present application; it may be, for example, 50% or 80%.
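The first way amounts to counting database entries above a pronunciation-similarity threshold. The patent does not specify a similarity measure, so the sketch below uses `difflib` string similarity over pinyin-style transcriptions purely as a stand-in; the transcriptions, database, and threshold value are all illustrative assumptions.

```python
# Sketch of the first matching mode: count the "first audio data", i.e. the
# database entries whose pronunciation similarity to the wake-up word exceeds
# a first threshold. difflib over pinyin-like strings is a stand-in measure.
from difflib import SequenceMatcher

def count_first_audio_data(wake_word, database, first_threshold=0.7):
    """Number of entries more pronunciation-similar than the first threshold."""
    return sum(
        1 for entry in database
        if SequenceMatcher(None, wake_word, entry).ratio() > first_threshold
    )

# A common phrase ("wo de", 'my') resembles many entries; a rare one
# ("tao zi tong xue", 'classmate Peach') resembles few.
db = ["wo de", "wo di", "wo de ya", "tao zi tong xue"]
print(count_first_audio_data("wo de", db))            # 3
print(count_first_audio_data("tao zi tong xue", db))  # 1
```

A large count would then map to the "unsuitable as wake-up voice data" verdict described above.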
Second way:
perform similarity matching between the wake-up audio data and each piece of audio data, and determine the amount of second audio data corresponding to the wake-up audio data, the second audio data being audio data whose semantic similarity to the wake-up audio data exceeds a second threshold.
It can be understood that after similarity matching, the corresponding similarity matching value can be generated according to the amount of second audio data. In one embodiment, the more pieces of audio data whose semantic similarity to the wake-up audio data exceeds the second threshold (the larger the amount of second audio data), the more audio data in the matching database is semantically similar or identical to the wake-up audio data; that is, the wake-up audio data may carry several meanings at once and is therefore likely to occur in many speech contexts. To avoid false wake-ups, the evaluation index can be lowered accordingly when the amount of second audio data is too large.
For example, take the wake-up audio data "bundle" (a word that can mean a cloth-wrapped parcel, a mental burden, or, as a term of art in comedy, a punchline). Because the word has several meanings, the amount of audio data in the matching database that is semantically similar or identical to it (semantic similarity above the second threshold) is high; that is, the wake-up audio data may appear in many speech contexts. To avoid the device being falsely awakened when the target device detects "bundle" in a situation where the user does not intend to wake it, the present application can generate a matching result whose evaluation index is "unsuitable as wake-up voice data".
Similarly, take the wake-up audio data "settle accounts" (a word that can mean bookkeeping calculation, reflecting a business's activities and results or an institution's budget, or getting even with someone after a loss or dispute). The amount of semantically similar or identical audio data in the matching database (semantic similarity above the second threshold) is again high, so the wake-up audio data may appear in many speech contexts, and a matching result whose evaluation index is "unsuitable as wake-up voice data" can be generated.
It should also be noted that the second threshold is not specifically limited in the present application; it may be, for example, 50% or 80%.
An evaluation index is then generated based on the magnitude relation between the similarity matching result and a preset amount.
It is to be understood that the similarity matching result in the present application is obtained from the amount of first audio data and/or the amount of second audio data. In one embodiment, the corresponding evaluation index may be generated according to the magnitude relation between these amounts and the preset amount.
It should also be noted that the preset amount is not specifically limited in the present application; it may be, for example, 5 or 10.
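The comparison against the preset amount can be sketched as below. The patent only states that the counts are compared with a preset amount; the two-way verdict and the default of 5 are illustrative choices, not the patented rule.

```python
# Sketch of turning the similarity-match counts into an evaluation index.
def evaluation_index(first_count, second_count, preset_amount=5):
    """Verdict from the amounts of pronunciation-similar ('first') and
    semantically similar ('second') database entries."""
    if first_count > preset_amount or second_count > preset_amount:
        return "unsuitable as wake-up voice data"
    return "suitable as wake-up voice data"

print(evaluation_index(12, 3))  # unsuitable as wake-up voice data
print(evaluation_index(2, 1))   # suitable as wake-up voice data
```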
Optionally, in one possible implementation of the present application, in S102 (generating the evaluation index corresponding to the wake-up audio data), the evaluation index may also be generated as follows:
analyze the wake-up audio data to obtain audio feature parameters corresponding to the wake-up audio data, the audio feature parameters reflecting the pronunciation information of the wake-up audio data; and
generate the evaluation index based on the audio feature parameters and the matching policy.
Further, in generating the evaluation index from the audio feature parameters and the matching policy, the index can be obtained in the following two ways.
First way:
determine the pronunciation duration and speaking rate corresponding to the wake-up audio data based on the audio feature parameters; and
generate the evaluation index based on how the pronunciation duration matches a first preset condition and how the speaking rate matches a second preset condition.
Further, whether audio recorded by the user is suitable as wake-up audio data may be judged from its pronunciation duration and speaking rate. It can be understood that when the pronunciation of the wake-up audio data is too long or too short, its evaluation index as a wake-up word is affected; similarly, when the speaking rate is too fast or too slow, the evaluation index is also affected.
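The duration/rate check above can be sketched as two range tests. The concrete ranges and the equal-weight combination below are assumptions for illustration; the patent names the two preset conditions without specifying them.

```python
# Sketch of the duration/rate mode: pronunciation duration is matched against
# a first preset condition and speaking rate against a second.
def duration_rate_index(duration_s, syllables):
    rate = syllables / duration_s            # syllables per second
    duration_ok = 0.5 <= duration_s <= 2.0   # first preset condition (assumed)
    rate_ok = 2.0 <= rate <= 6.0             # second preset condition (assumed)
    # Each satisfied condition contributes half of the evaluation index.
    return 0.5 * duration_ok + 0.5 * rate_ok

print(duration_rate_index(1.0, 4))  # 1.0 (in-range duration and rate)
print(duration_rate_index(0.2, 4))  # 0.0 (too short, spoken too fast)
```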
Second way:
determine the language information and intonation information corresponding to the wake-up audio data based on the audio feature parameters; and
generate the evaluation index based on how the language information matches a third preset condition and how the intonation information matches a fourth preset condition.
Further, whether audio recorded by the user is suitable as wake-up audio data may be judged from its language and intonation. It can be understood that, for language information, a fixed set of languages may be selected as a factor affecting the evaluation index; similarly, when the intonation of the wake-up audio data is too high or too low, its evaluation index as a wake-up word is also affected.
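The language/intonation check can be sketched the same way: membership in an allowed language set and a moderate pitch band. The allowed set, pitch bounds, and equal weighting are illustrative assumptions, not values from the patent.

```python
# Sketch of the language/intonation mode: the language must satisfy a third
# preset condition and the mean pitch a fourth.
def language_intonation_index(language, mean_pitch_hz):
    language_ok = language in {"zh", "en"}       # third preset condition (assumed)
    pitch_ok = 80.0 <= mean_pitch_hz <= 400.0    # fourth preset condition (assumed)
    return 0.5 * language_ok + 0.5 * pitch_ok

print(language_intonation_index("zh", 220.0))  # 1.0
print(language_intonation_index("fr", 950.0))  # 0.0
```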
Optionally, in a possible embodiment of the present application, after S101 (acquiring the wake-up audio data), as shown in fig. 3, the method for evaluating wake-up audio data further includes:
S201, acquiring wake-up audio data.
S202, acquiring a wake-up biometric feature of the target user by using a camera acquisition unit.
S203, generating an evaluation index corresponding to the wake-up audio data based on the wake-up audio data, the wake-up biometric feature, and the matching policy.
Further, the wake-up biometric feature is not specifically limited in the present application, and may be, for example, one or more of facial feature information, iris feature information, and fingerprint feature information of the target user, so that the evaluation index corresponding to the wake-up audio data can be generated jointly from the wake-up audio data and the wake-up biometric feature.
For example, after the wake-up audio data is obtained, a camera of the terminal and/or a fingerprint sensor or other acquisition device may be used to collect biometric information of the target user (i.e., at least one of facial feature information, iris feature information, and fingerprint feature information). After the corresponding information is collected, the biometric information is mapped to the corresponding wake-up audio data, and the two together serve as the wake-up instruction for waking up the target device.
Taking facial feature information as the wake-up biometric feature as an example: after the camera acquisition device captures a facial image of the target user, the terminal can extract feature information from the facial image using a neural network model. It should be noted that the preset neural network model is not specifically limited in the present application; in one possible implementation, feature recognition may be performed on the facial image using a convolutional neural network model.
A Convolutional Neural Network (CNN) is a class of feedforward neural network that contains convolution computations and has a deep structure, and is one of the representative algorithms of deep learning. A convolutional neural network has representation learning capability and can perform translation-invariant classification of input information according to its hierarchical structure. Owing to its powerful feature characterization capability for images, the CNN has achieved remarkable results in fields such as image classification, object detection, and semantic segmentation.
Further, in the present application, after the wake-up audio data uploaded by the user and the facial features are detected, the corresponding evaluation index can be generated according to the clarity of the facial features. It can be understood that the higher the clarity of the facial features, the higher the corresponding evaluation index.
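One way to fold facial-feature clarity into the index can be sketched as follows. The patent only states that clearer facial features yield a higher index; the linear weighting, the weights, and the assumption that clarity is normalized to [0, 1] are all illustrative choices, not details from the application.

```python
# Illustrative sketch: combine an audio-derived evaluation index with a
# facial-feature clarity factor. The 0.7/0.3 weighting is an assumption.

def combined_index(audio_index: float, face_clarity: float) -> float:
    """face_clarity is assumed normalized to [0, 1]; higher clarity -> higher index."""
    face_clarity = min(max(face_clarity, 0.0), 1.0)  # clamp to valid range
    return 0.7 * audio_index + 0.3 * (100.0 * face_clarity)
```

A real system might estimate clarity from, e.g., the variance of an image sharpness measure over the detected face region before mapping it into [0, 1].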
And S204, displaying the evaluation index.
Taking figs. 4a to 4d as an example, with the target device being a mobile phone: after the mobile phone obtains wake-up audio data for waking up the target device (fig. 4a), it may generate, based on the wake-up audio data and a preset matching policy, an evaluation index representing the success rate with which the wake-up audio data wakes up the target device (fig. 4b). Further, the evaluation index can be displayed on the display unit of the mobile phone (figs. 4c and 4d).
In the present application, after the wake-up audio data for waking up the target device is obtained, an evaluation index representing the success rate with which the wake-up audio data wakes up the target device can be generated based on the wake-up audio data and a preset matching policy, and the evaluation index is then displayed. By applying this technical scheme, after the user-customized audio data intended to wake up the smart device is received, evaluation information indicating whether the audio data is suitable as a wake-up voice is obtained according to its degree of match with the preset policy, so that the user can decide from this evaluation information whether a wake-up voice with a higher wake-up success rate needs to be re-recorded. This avoids the problem in the related art that user-defined wake-up voices tend to result in low wake-up accuracy.
In another embodiment of the present application, as shown in fig. 5, the present application further provides an apparatus for evaluating wake-up audio data. The device comprises an acquisition module 301, a generation module 302 and a display module 303, wherein:
an obtaining module 301 configured to obtain wake-up audio data, where the wake-up audio data is used to wake up a target device;
a generating module 302, configured to generate an evaluation index corresponding to the wake-up audio data based on the wake-up audio data and a preset matching policy, where the evaluation index is used to represent a success rate of the wake-up audio data waking up the target device;
a presentation module 303 configured to present the evaluation index.
In the present application, after the wake-up audio data for waking up the target device is obtained, an evaluation index representing the success rate with which the wake-up audio data wakes up the target device can be generated based on the wake-up audio data and a preset matching policy, and the evaluation index is then displayed. By applying this technical scheme, after the user-customized audio data intended to wake up the smart device is received, evaluation information indicating whether the audio data is suitable as a wake-up voice is obtained according to its degree of match with the preset policy, so that the user can decide from this evaluation information whether a wake-up voice with a higher wake-up success rate needs to be re-recorded. This avoids the problem in the related art that user-defined wake-up voices tend to result in low wake-up accuracy.
In another embodiment of the present application, the generating module 302 further includes:
a generating module 302, configured to perform similarity matching on the wake-up audio data and each audio data in a preset matching database, where the audio data includes voice data and noise data;
a generating module 302 configured to generate the evaluation index based on a magnitude relationship between the similarity matching result and a preset number.
In another embodiment of the present application, the generating module 302 further includes:
a generating module 302, configured to perform similarity matching between the wake-up audio data and each audio data, and determine the number of first audio data corresponding to the wake-up audio data, where the first audio data is audio data whose pronunciation similarity to the wake-up audio data exceeds a first threshold;
and/or,
a generating module 302, configured to perform similarity matching between the wake-up audio data and each audio data, and determine the number of second audio data corresponding to the wake-up audio data, where the second audio data is audio data whose semantic similarity to the wake-up audio data exceeds a second threshold.
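The similarity-matching path of the generating module can be sketched as follows. This is a hedged illustration: the similarity function below is a trivial string-ratio stand-in for the real acoustic (pronunciation) and semantic models, and the threshold, preset number, and penalty scheme are assumptions.

```python
# Hedged sketch of the similarity-matching path: count database entries whose
# similarity to the candidate wake-up phrase exceeds a threshold, then map the
# count's magnitude relationship with a preset number to an evaluation index
# (fewer near-collisions in the database -> higher index).

from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Trivial stand-in for a pronunciation/semantic similarity model."""
    return SequenceMatcher(None, a, b).ratio()

def evaluation_index(candidate: str, database: list[str],
                     threshold: float = 0.8, preset_number: int = 3) -> int:
    collisions = sum(1 for entry in database
                     if similarity(candidate, entry) > threshold)
    # Many near matches mean the phrase is easy to confuse, so the index drops
    # once the collision count exceeds the preset number.
    if collisions <= preset_number:
        return 100
    return max(100 - 20 * (collisions - preset_number), 0)
```

In the scheme described above, the database would hold both voice data and noise data, and pronunciation and semantic matching would each use their own threshold.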
In another embodiment of the present application, the generating module 302 further includes:
a generating module 302, configured to analyze the wake-up audio data to obtain an audio characteristic parameter corresponding to the wake-up audio data, where the audio characteristic parameter is used to reflect pronunciation information of the wake-up audio data;
a generating module 302 configured to generate the evaluation index based on the audio feature parameter and the matching policy.
In another embodiment of the present application, the generating module 302 further includes:
a generating module 302 configured to determine a pronunciation duration and a pronunciation speed corresponding to the wake-up audio data based on the audio characteristic parameters;
the generating module 302 is configured to generate the evaluation index based on a matching relationship between the pronunciation duration and a first preset condition and a matching relationship between the pronunciation speed and a second preset condition.
In another embodiment of the present application, the generating module 302 further includes:
a generating module 302 configured to determine language information and intonation information corresponding to the wake-up audio data based on the audio characteristic parameter;
a generating module 302, configured to generate the evaluation index based on a matching relationship between the language information and a third preset condition and a matching relationship between the intonation information and a fourth preset condition.
In another embodiment of the present application, the generating module 302 further includes:
a generating module 302 configured to acquire a wake-up biometric feature of a target user by using a camera capturing unit, wherein the wake-up biometric feature corresponds to at least one of a facial feature and a gesture feature;
a generating module 302 configured to generate an evaluation index corresponding to the wake audio data based on the wake audio data, the wake biometric, and the matching policy.
FIG. 6 is a block diagram illustrating a logical structure of an electronic device in accordance with an exemplary embodiment. For example, the electronic device 400 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 6, electronic device 400 may include one or more of the following components: a processor 401 and a memory 402.
Processor 401 may include one or more processing cores, such as a 4-core processor, an 8-core processor, or the like. The processor 401 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 401 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 401 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed by the display screen. In some embodiments, the processor 401 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 402 may include one or more computer-readable storage media, which may be non-transitory. Memory 402 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in the memory 402 is configured to store at least one instruction for execution by the processor 401 to implement the method for evaluating wake-up audio data provided by the method embodiments of the present application.
In some embodiments, the electronic device 400 may further optionally include: a peripheral interface 403 and at least one peripheral. The processor 401, memory 402 and peripheral interface 403 may be connected by bus or signal lines. Each peripheral may be connected to the peripheral interface 403 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 404, touch screen display 405, camera 406, audio circuitry 407, positioning components 408, and power supply 409.
The peripheral interface 403 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 401 and the memory 402. In some embodiments, processor 401, memory 402, and peripheral interface 403 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 401, the memory 402 and the peripheral interface 403 may be implemented on a separate chip or circuit board, which is not limited by this embodiment.
The radio frequency circuit 404 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 404 communicates with communication networks and other communication devices via electromagnetic signals. The rf circuit 404 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 404 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 404 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the rf circuit 404 may further include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display screen 405 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 405 is a touch display screen, the display screen 405 also has the ability to capture touch signals on or over the surface of the display screen 405. The touch signal may be input to the processor 401 as a control signal for processing. At this point, the display screen 405 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display screen 405 may be one, providing the front panel of the electronic device 400; in other embodiments, the display screen 405 may be at least two, respectively disposed on different surfaces of the electronic device 400 or in a folded design; in still other embodiments, the display screen 405 may be a flexible display screen disposed on a curved surface or a folded surface of the electronic device 400. Even further, the display screen 405 may be arranged in a non-rectangular irregular pattern, i.e. a shaped screen. The Display screen 405 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), and other materials.
The camera assembly 406 is used to capture images or video. Optionally, camera assembly 406 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 406 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
The audio circuit 407 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 401 for processing, or inputting the electric signals to the radio frequency circuit 404 for realizing voice communication. For stereo capture or noise reduction purposes, the microphones may be multiple and disposed at different locations of the electronic device 400. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 401 or the radio frequency circuit 404 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, audio circuitry 407 may also include a headphone jack.
The positioning component 408 is used to locate the current geographic location of the electronic device 400 to implement navigation or LBS (Location Based Service). The positioning component 408 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
The power supply 409 is used to supply power to the various components in the electronic device 400. The power source 409 may be alternating current, direct current, disposable or rechargeable. When power source 409 comprises a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, the electronic device 400 also includes one or more sensors 410. The one or more sensors 410 include, but are not limited to: acceleration sensor 411, gyro sensor 412, pressure sensor 413, fingerprint sensor 414, optical sensor 415, and proximity sensor 416.
The acceleration sensor 411 may detect the magnitude of acceleration in three coordinate axes of a coordinate system established with the electronic apparatus 400. For example, the acceleration sensor 411 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 401 may control the touch display screen 405 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 411. The acceleration sensor 411 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 412 may detect a body direction and a rotation angle of the electronic device 400, and the gyro sensor 412 may cooperate with the acceleration sensor 411 to acquire a 3D motion of the user on the electronic device 400. From the data collected by the gyro sensor 412, the processor 401 may implement the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
The pressure sensors 413 may be disposed on a side bezel of the electronic device 400 and/or on a lower layer of the touch display screen 405. When the pressure sensor 413 is arranged on the side frame of the electronic device 400, a holding signal of the user to the electronic device 400 can be detected, and the processor 401 performs left-right hand identification or shortcut operation according to the holding signal collected by the pressure sensor 413. When the pressure sensor 413 is disposed at the lower layer of the touch display screen 405, the processor 401 controls the operability control on the UI interface according to the pressure operation of the user on the touch display screen 405. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 414 is used for collecting a fingerprint of the user, and the processor 401 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 414, or the fingerprint sensor 414 identifies the identity of the user according to the collected fingerprint. Upon identifying that the user's identity is a trusted identity, the processor 401 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, collating items, changing settings, and the like. The fingerprint sensor 414 may be disposed on the front, back, or side of the electronic device 400. When a physical button or vendor Logo is provided on the electronic device 400, the fingerprint sensor 414 may be integrated with the physical button or vendor Logo.
The optical sensor 415 is used to collect the ambient light intensity. In one embodiment, the processor 401 may control the display brightness of the touch display screen 405 based on the ambient light intensity collected by the optical sensor 415. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 405 is increased; when the ambient light intensity is low, the display brightness of the touch display screen 405 is turned down. In another embodiment, the processor 401 may also dynamically adjust the shooting parameters of the camera assembly 406 according to the ambient light intensity collected by the optical sensor 415.
Proximity sensor 416, also known as a distance sensor, is typically disposed on the front panel of electronic device 400. The proximity sensor 416 is used to capture the distance between the user and the front of the electronic device 400. In one embodiment, when the proximity sensor 416 detects that the distance between the user and the front surface of the electronic device 400 gradually decreases, the processor 401 controls the touch display screen 405 to switch from the bright-screen state to the off-screen state; when the proximity sensor 416 detects that the distance between the user and the front of the electronic device 400 gradually increases, the processor 401 controls the touch display screen 405 to switch from the off-screen state to the bright-screen state.
Those skilled in the art will appreciate that the configuration shown in fig. 6 does not constitute a limitation of the electronic device 400, and may include more or fewer components than those shown, or combine certain components, or employ a different arrangement of components.
In an exemplary embodiment, there is also provided a non-transitory computer-readable storage medium, such as the memory 402, comprising instructions executable by the processor 401 of the electronic device 400 to perform the above-described method for evaluating wake-up audio data, the method comprising: acquiring wake-up audio data, wherein the wake-up audio data is used for waking up a target device; generating an evaluation index corresponding to the wake-up audio data based on the wake-up audio data and a preset matching policy, wherein the evaluation index is used for representing the success rate with which the wake-up audio data wakes up the target device; and displaying the evaluation index. Optionally, the instructions may also be executable by the processor 401 of the electronic device 400 to perform other steps involved in the exemplary embodiments described above. For example, the non-transitory computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, there is also provided an application/computer program product comprising one or more instructions executable by the processor 401 of the electronic device 400 to perform the above-described method for evaluating wake-up audio data, the method comprising: acquiring wake-up audio data, wherein the wake-up audio data is used for waking up a target device; generating an evaluation index corresponding to the wake-up audio data based on the wake-up audio data and a preset matching policy, wherein the evaluation index is used for representing the success rate with which the wake-up audio data wakes up the target device; and displaying the evaluation index. Optionally, the instructions may also be executable by the processor 401 of the electronic device 400 to perform other steps involved in the exemplary embodiments described above. Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (10)

1. A method for evaluating wake-up audio data, comprising:
acquiring awakening audio data, wherein the awakening audio data is used for awakening target equipment;
generating an evaluation index corresponding to the awakening audio data based on the awakening audio data and a preset matching strategy, wherein the evaluation index is used for representing the success rate of awakening the target device by the awakening audio data;
and displaying the evaluation index.
2. The method of claim 1, wherein the generating the rating index corresponding to the wake audio data comprises:
respectively carrying out similarity matching on the awakening audio data and each audio data in a preset matching database, wherein the audio data comprises voice data and noise data;
and generating the evaluation index based on the magnitude relationship between the similarity matching result and a preset number.
3. The method of claim 2, wherein the similarity matching of the wake-up audio data and each audio data in a preset matching database comprises:
similarity matching is carried out on the awakening audio data and the audio data respectively, and the number of first audio data corresponding to the awakening audio data is determined, wherein the first audio data is audio data of which the pronunciation similarity with the awakening audio data exceeds a first threshold value;
and/or,
and respectively carrying out similarity matching on the awakening audio data and the audio data, and determining the quantity of second audio data corresponding to the awakening audio data, wherein the semantic similarity of the second audio data and the awakening audio data exceeds a second threshold value.
4. The method of claim 1, wherein the generating the rating index corresponding to the wake audio data comprises:
analyzing the awakening audio data to obtain audio characteristic parameters corresponding to the awakening audio data, wherein the audio characteristic parameters are used for reflecting pronunciation information of the awakening audio data;
and generating the evaluation index based on the audio characteristic parameters and the matching strategy.
5. The method of claim 4, wherein the generating the rating index based on the audio feature parameters and the matching policy comprises:
determining pronunciation duration and pronunciation speed corresponding to the awakening audio data based on the audio characteristic parameters;
and generating the evaluation index based on the matching relationship between the pronunciation duration and a first preset condition and the matching relationship between the pronunciation speed and a second preset condition.
6. The method of claim 4 or 5, wherein the generating the rating index based on the audio feature parameters and the matching policy comprises:
determining language information and intonation information corresponding to the awakening audio data based on the audio characteristic parameters;
and generating the evaluation index based on the matching relationship between the language information and a third preset condition and the matching relationship between the tone information and a fourth preset condition.
7. The method of claim 1, after the obtaining wake-up audio data, further comprising:
acquiring a wake-up biological characteristic of a target user by using a camera shooting acquisition unit, wherein the wake-up biological characteristic corresponds to at least one of a facial characteristic and a gesture characteristic;
and generating an evaluation index corresponding to the awakening audio data based on the awakening audio data, the awakening biological characteristics and the matching strategy.
8. An apparatus for evaluating wake-up audio data, comprising:
an acquisition module configured to acquire wake-up audio data, the wake-up audio data being used to wake up a target device;
the generating module is configured to generate an evaluation index corresponding to the awakening audio data based on the awakening audio data and a preset matching strategy, wherein the evaluation index is used for representing the success rate of awakening the target device by the awakening audio data;
a presentation module configured to present the evaluation index.
9. An electronic device, comprising: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the operations of the method of assessing wake up audio data according to any of claims 1 to 7.
10. A computer-readable storage medium storing computer-readable instructions that, when executed, perform the operations of the method for evaluating wake audio data according to any one of claims 1 to 7.
CN202010101559.9A 2020-02-19 2020-02-19 Method, device, electronic equipment and medium for evaluating wake-up audio data Active CN111341317B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010101559.9A CN111341317B (en) 2020-02-19 2020-02-19 Method, device, electronic equipment and medium for evaluating wake-up audio data


Publications (2)

Publication Number Publication Date
CN111341317A true CN111341317A (en) 2020-06-26
CN111341317B CN111341317B (en) 2023-09-01

Family

ID=71185470

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010101559.9A Active CN111341317B (en) 2020-02-19 2020-02-19 Method, device, electronic equipment and medium for evaluating wake-up audio data

Country Status (1)

Country Link
CN (1) CN111341317B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001013992A (en) * 1999-07-02 2001-01-19 Nec Corp Voice understanding device
JP2008262120A (en) * 2007-04-13 2008-10-30 Nippon Hoso Kyokai <Nhk> Utterance evaluation device and program
CN104584119A (en) * 2012-07-03 2015-04-29 谷歌公司 Determining hotword suitability
US9275637B1 (en) * 2012-11-06 2016-03-01 Amazon Technologies, Inc. Wake word evaluation
US20170186430A1 (en) * 2013-12-05 2017-06-29 Google Inc. Promoting voice actions to hotwords
CN106611597A (en) * 2016-12-02 2017-05-03 百度在线网络技术(北京)有限公司 Voice wakeup method and voice wakeup device based on artificial intelligence
US20180158449A1 (en) * 2016-12-02 2018-06-07 Baidu Online Network Technology (Beijing) Co., Ltd. Method and device for waking up via speech based on artificial intelligence
CN106847273A (en) * 2016-12-23 2017-06-13 北京云知声信息技术有限公司 The wake-up selected ci poem selection method and device of speech recognition
CN108181992A (en) * 2018-01-22 2018-06-19 北京百度网讯科技有限公司 Voice awakening method, device, equipment and computer-readable medium based on gesture
CN108536668A (en) * 2018-02-26 2018-09-14 科大讯飞股份有限公司 Wake up word appraisal procedure and device, storage medium, electronic equipment

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112489653A (en) * 2020-11-16 2021-03-12 北京小米松果电子有限公司 Speech recognition method, device and storage medium
CN112489653B (en) * 2020-11-16 2024-04-26 北京小米松果电子有限公司 Speech recognition method, device and storage medium
CN112863545A (en) * 2021-01-13 2021-05-28 北京字节跳动网络技术有限公司 Performance test method and device, electronic equipment and computer readable storage medium
CN112863545B (en) * 2021-01-13 2023-10-03 抖音视界有限公司 Performance test method, device, electronic equipment and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN110795236B (en) Method, device, electronic equipment and medium for adjusting capacity of server
WO2020211607A1 (en) Video generation method, apparatus, electronic device, and medium
CN110572711A (en) Video cover generation method and device, computer equipment and storage medium
CN110109608B (en) Text display method, text display device, text display terminal and storage medium
CN110933468A (en) Playing method, playing device, electronic equipment and medium
CN105635452A (en) Mobile terminal and contact person identification method thereof
CN111445901A (en) Audio data acquisition method and device, electronic equipment and storage medium
CN111370025A (en) Audio recognition method and device and computer storage medium
CN111613213B (en) Audio classification method, device, equipment and storage medium
CN111062248A (en) Image detection method, device, electronic equipment and medium
CN110798327B (en) Message processing method, device and storage medium
CN111681655A (en) Voice control method and device, electronic equipment and storage medium
CN112860046B (en) Method, device, electronic equipment and medium for selecting operation mode
CN111862972A (en) Voice interaction service method, device, equipment and storage medium
CN110795660B (en) Data analysis method, data analysis device, electronic device, and medium
CN110853124B (en) Method, device, electronic equipment and medium for generating GIF dynamic diagram
CN111341317B (en) Method, device, electronic equipment and medium for evaluating wake-up audio data
CN111327819A (en) Method, device, electronic equipment and medium for selecting image
CN110933454A (en) Method, device, equipment and storage medium for processing live broadcast budding gift
CN111028846B (en) Method and device for registration of wake-up-free words
CN114333821A (en) Elevator control method, device, electronic equipment, storage medium and product
CN110336881B (en) Method and device for executing service processing request
CN113744736A (en) Command word recognition method and device, electronic equipment and storage medium
CN111898488A (en) Video image identification method and device, terminal and storage medium
CN112749583A (en) Face image grouping method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant