CN116450080A - Interactive capability determining method and device, storage medium and electronic device - Google Patents


Info

Publication number
CN116450080A
CN116450080A (application CN202310335567.3A)
Authority
CN
China
Prior art keywords
audio
time point
response data
voice interaction
wake
Prior art date
Legal status
Pending
Application number
CN202310335567.3A
Other languages
Chinese (zh)
Inventor
侯玉坤
Current Assignee
Qingdao Haier Technology Co Ltd
Haier Smart Home Co Ltd
Haier Uplus Intelligent Technology Beijing Co Ltd
Original Assignee
Qingdao Haier Technology Co Ltd
Haier Smart Home Co Ltd
Haier Uplus Intelligent Technology Beijing Co Ltd
Priority date
Filing date
Publication date
Application filed by Qingdao Haier Technology Co Ltd, Haier Smart Home Co Ltd, Haier Uplus Intelligent Technology Beijing Co Ltd filed Critical Qingdao Haier Technology Co Ltd
Priority to CN202310335567.3A priority Critical patent/CN116450080A/en
Publication of CN116450080A publication Critical patent/CN116450080A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 - Sound input; Sound output
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/70 - Reducing energy consumption in communication networks in wireless communication networks

Abstract

The application discloses a method and device for determining interaction capability, a storage medium, and an electronic device, relating to the technical field of smart homes. The method comprises: starting a recording task for the interactive voice between a target object and a voice interaction device at a first time point, wherein the first time point is the time point at which the target object finishes sending wake-up audio to the voice interaction device; ending the recording task at a second time point to obtain audio response data of the voice interaction device between the first time point and the second time point, wherein the time difference between the second time point and the first time point is a preset time threshold, and the audio response data comprises: a first response result of the voice interaction device to the wake-up audio, a response time of the voice interaction device to the wake-up audio, and a second response result of the voice interaction device to a voice instruction sent by the target object; and determining the interaction capability of the voice interaction device according to the audio response data.

Description

Interactive capability determining method and device, storage medium and electronic device
Technical Field
The application relates to the technical field of smart homes, and in particular to a method and device for determining interaction capability, a storage medium, and an electronic device.
Background
With the development of artificial intelligence technology, artificial intelligence devices of all kinds have entered millions of households. Artificial intelligence, once an abstract concept remote from the average person, is now embodied in smart home products such as speakers, glasses, and robot vacuums, and has become an indispensable part of daily life.
As a typical artificial intelligence product, the smart speaker has gradually evolved from a novelty into the entry point for interaction between users and smart home scenarios, so users' requirements and expectations for smart speakers keep rising, above all with respect to wake-up and interaction. The wake-up success rate is the probability that the speaker wakes up successfully after the user issues a wake-up instruction; the wake-up delay is the time interval from when the user issues a wake-up instruction to when the speaker wakes up and responds; the interaction delay is the time interval from when the user issues an interaction instruction to when the speaker responds to it.
Testing these interaction performance indicators requires a large number of repeated trials. In the current approach, the wake-up success rate is measured by manually judging whether each wake-up succeeded, and the wake-up delay is measured by manually playing example corpora, uploading logs through the terminal, and analyzing the interaction response time according to key-point information provided by the SDK. Manually judging wake-up success and analyzing terminal logs is a complex process, and because it depends on manual operation it is time-consuming and inefficient.
No effective solution has yet been proposed for the problems of long test time and low test efficiency caused by manually measuring and analyzing the interaction performance indicators of smart speakers.
Disclosure of Invention
The embodiments of the invention provide a method and device for determining interaction capability, a storage medium, and an electronic device, so as to at least solve the problems of long test time and low test efficiency caused by manually measuring and analyzing the interaction performance indicators of smart speakers.
According to an embodiment of the present invention, there is provided a method for determining interaction capability, including: starting a recording task for the interactive voice between a target object and a voice interaction device at a first time point, wherein the first time point is the time point at which the target object finishes sending wake-up audio to the voice interaction device; ending the recording task at a second time point to obtain audio response data of the voice interaction device between the first time point and the second time point, wherein the time difference between the second time point and the first time point is a preset time threshold, and the audio response data includes: a first response result of the voice interaction device to the wake-up audio, a response time of the voice interaction device to the wake-up audio, and a second response result of the voice interaction device to the voice instruction sent by the target object; and determining the interaction capability of the voice interaction device according to the audio response data.
In one exemplary embodiment, determining the interaction capability of the voice interaction device from the audio response data includes: in the case that the interaction capability includes the wake-up delay and the interaction delay, performing noise reduction on the audio response data and segmenting the noise-reduced audio response data to obtain a plurality of audio fragments; determining the audio fragments whose duration is shorter than a preset duration as noise fragments, and deleting the noise fragments from the plurality of audio fragments to obtain a plurality of voice interaction fragments; and determining the interaction capability of the voice interaction device according to the start times and end times of the voice interaction fragments within the audio response data.
In an exemplary embodiment, performing noise reduction on the audio response data and segmenting the noise-reduced audio response data to obtain a plurality of audio fragments includes: removing noise data from the audio response data according to a preset sound threshold to obtain the noise-reduced audio response data, wherein the preset sound threshold is the maximum decibel value of the noise in the preset test scenario; and segmenting the noise-reduced audio response data at silence segments to obtain the plurality of audio fragments, wherein a silence segment is audio in the noise-reduced audio response data whose silence duration is longer than a preset time threshold, and the plurality of audio fragments do not contain the silence segments.
In an exemplary embodiment, the audio response data further includes the voice instruction issued by the target object, and determining the interaction capability of the voice interaction device according to the start times and end times of the plurality of voice interaction fragments includes: determining a third time point and a fourth time point corresponding to the voice instruction, wherein the third time point is the start time of the voice instruction and the fourth time point is the end time of the voice instruction; determining the difference between the start time of a first voice interaction fragment and the first time point as the wake-up delay, and determining the difference between the start time of a second voice interaction fragment and the fourth time point as the interaction delay, wherein the first voice interaction fragment is the voice interaction fragment lying between the first time point and the third time point, and the second voice interaction fragment is the voice interaction fragment lying between the fourth time point and the second time point.
In one exemplary embodiment, determining the interactive capability of the voice interaction device from the audio response data comprises: and under the condition that the interaction capability comprises a wake-up success rate, circularly executing a starting step and an ending step to obtain a plurality of audio response data, wherein the starting step comprises the following steps: starting a recording task of interactive voice between the target object and the voice interaction equipment at the first time point; the ending step comprises the following steps: ending the recording task at a second time point to obtain audio response data of the voice interaction equipment between the first time point and the second time point; and determining the wake-up success rate of the voice interaction equipment according to the wake-up states corresponding to the plurality of audio response data.
In an exemplary embodiment, determining a wake-up success rate of the voice interaction device according to wake-up states corresponding to the plurality of audio response data includes: under the condition that the first voice interaction fragment exists, determining that the wake-up state corresponding to the audio response data is wake-up success; determining the number of first audio response data with the wake-up state being successful in wake-up in the plurality of audio response data; and determining the wake-up success rate according to the quantity of the plurality of first audio response data.
In an exemplary embodiment, before starting the recording task of the interactive voice between the target object and the voice interaction device at the first point in time, the method further comprises: acquiring a plurality of historical audio response data of the voice interaction device, wherein the wake-up state corresponding to the historical audio response data is wake-up success; and determining an average value of the audio time durations of the plurality of historical audio response data as the preset time threshold.
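The averaging in this embodiment can be sketched as follows; the helper name and the example durations are hypothetical, not values from the patent:

```python
def preset_time_threshold(historical_durations_s):
    """Mean audio duration, in seconds, over historical recordings whose
    wake-up state was success; used as the recording-window length."""
    if not historical_durations_s:
        raise ValueError("need at least one successful historical recording")
    return sum(historical_durations_s) / len(historical_durations_s)

# e.g. three earlier successful test recordings of 9.8 s, 10.4 s and 10.1 s
window = preset_time_threshold([9.8, 10.4, 10.1])
print(window)
```

The window is thus long enough, on average, to capture a full successful wake-and-respond exchange without recording indefinitely.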
According to another embodiment of the present invention, there is also provided an apparatus for determining interaction capability, including: the starting module is used for starting a recording task of interactive voice between a target object and voice interaction equipment at a first time point, wherein the first time point is an ending time point when the target object sends out wake-up audio to the voice interaction equipment; the ending module is configured to end the recording task at a second time point, and obtain audio response data of the voice interaction device between the first time point and the second time point, where a time difference value between the second time point and the first time point is a preset time threshold, and the audio response data includes: a first response result of the voice interaction device to the wake-up audio, a response time of the voice interaction device to the wake-up audio, and a second response result of the voice interaction device to the voice instruction sent by the target object; and the determining module is used for determining the interaction capability of the voice interaction equipment according to the audio response data.
According to a further aspect of embodiments of the present invention, there is also provided a computer readable storage medium having a computer program stored therein, wherein the computer program is arranged to perform the above-described method of determining interactive capability when run.
According to still another aspect of the embodiments of the present invention, there is further provided an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the above-mentioned method for determining the interactive capability through the computer program.
In the embodiments of the application, a recording task for the interactive voice between a target object and a voice interaction device is started at the time point at which the target object finishes sending wake-up audio to the voice interaction device (namely, a first time point); the recording task is ended at a second time point to obtain the audio response data between the first time point and the second time point, wherein the time difference between the second time point and the first time point is a preset time threshold, and the audio response data includes: a first response result of the voice interaction device to the wake-up audio, a response time of the voice interaction device to the wake-up audio, and a second response result of the voice interaction device to a voice instruction sent by the target object; the interaction capability of the voice interaction device is then determined according to the obtained audio response data. This scheme solves the problems in the related art of long test time and low test efficiency caused by manually measuring and analyzing the interaction performance indicators of smart speakers, and achieves the technical effect of automatically testing and analyzing these indicators, thereby improving test efficiency.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic diagram of a hardware environment of an alternative method of determining interactive capability according to an embodiment of the present application;
FIG. 2 is a flow chart of an alternative method of determining interactive capability according to an embodiment of the invention;
FIG. 3 is a schematic diagram of an alternative intelligent speaker test flow according to an embodiment of the present invention;
FIG. 4 is a flow chart of an alternative method of determining interactive capability according to an embodiment of the invention;
fig. 5 is a block diagram of an alternative interaction capability determination apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the solution of the present application better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by one of ordinary skill in the art based on these embodiments without inventive effort shall fall within the scope of protection of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to one aspect of the embodiments of the present application, a method for determining interaction capability is provided. The method is widely applicable to whole-house intelligent control scenarios such as the smart home (Smart Home), smart home device ecosystems, and intelligent house (Intelligence House) ecosystems. Optionally, in the present embodiment, the method may be applied in a hardware environment consisting of the terminal device 102 and the server 104 shown in fig. 1. As shown in fig. 1, the server 104 is connected to the terminal device 102 through a network and may be used to provide services (such as application services) for the terminal or for a client installed on the terminal. A database may be set up on the server, or independently of it, to provide data storage services for the server 104, and cloud computing and/or edge computing services may likewise be configured on or independently of the server to provide data computing services for the server 104.
The network may include, but is not limited to, at least one of: a wired network or a wireless network. The wired network may include, but is not limited to, at least one of: a wide area network, a metropolitan area network, or a local area network; the wireless network may include, but is not limited to, at least one of: WIFI (Wireless Fidelity) or Bluetooth. The terminal device 102 may be, but is not limited to, a PC, a mobile phone, a tablet computer, a smart air conditioner, a smart range hood, a refrigerator, a smart oven, a smart stove, a smart washing machine, a smart water heater, a smart washing device, a smart dishwasher, a smart projection device, a smart television, a smart clothes hanger, a smart curtain, a smart audio-video device, a smart socket, a smart speaker, a smart fresh-air system, a smart kitchen and bathroom device, a smart bathroom device, a smart sweeping robot, a smart window-cleaning robot, a smart mopping robot, a smart air purifier, a smart steam oven, a smart microwave oven, a smart kitchen appliance, a smart purifier, a smart water dispenser, a smart door lock, and the like.
In this embodiment, a method for determining an interactive capability is provided and applied to a computer terminal, and fig. 2 is a flowchart of an alternative method for determining an interactive capability according to an embodiment of the present invention, where the flowchart includes the following steps:
Step S202, starting a recording task of interactive voice between a target object and voice interaction equipment at a first time point, wherein the first time point is an ending time point when the target object sends out wake-up audio to the voice interaction equipment;
it should be noted that the target object may be an audio playing device that automatically plays the wake-up corpus (corresponding to the wake-up audio), or may be a user, which is not limited in this application. Wherein the target object (audio playing device/user) can also automatically play the interaction corpus.
Step S204, ending the recording task at a second time point, and obtaining audio response data of the voice interaction device between the first time point and the second time point, where a time difference between the second time point and the first time point is a preset time threshold, where the audio response data includes: a first response result of the voice interaction device to the wake-up audio, a response time of the voice interaction device to the wake-up audio, and a second response result of the voice interaction device to the voice instruction sent by the target object;
step S206, determining the interactive capability of the voice interactive equipment according to the audio response data.
In the embodiments of the application, a recording task for the interactive voice between a target object and a voice interaction device is started at the time point at which the target object finishes sending wake-up audio to the voice interaction device (namely, a first time point); the recording task is ended at a second time point to obtain the audio response data between the first time point and the second time point, wherein the time difference between the second time point and the first time point is a preset time threshold, and the audio response data includes: a first response result of the voice interaction device to the wake-up audio, a response time of the voice interaction device to the wake-up audio, and a second response result of the voice interaction device to a voice instruction sent by the target object; the interaction capability of the voice interaction device is then determined according to the obtained audio response data. This scheme solves the problems in the related art of long test time and low test efficiency caused by manually measuring and analyzing the interaction performance indicators of smart speakers, and achieves the technical effect of automatically testing and analyzing these indicators, thereby improving test efficiency.
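A minimal stand-in for steps S202 to S206 is sketched below; the class, its method names, and the 0.05 s window are illustrative assumptions, not an actual device API, and real audio capture would replace the sleep:

```python
import time

class RecordingTask:
    """Hypothetical recording task bracketing the response window."""
    def start(self):
        # first time point: the wake-up audio has just finished playing
        self.t1 = time.monotonic()

    def stop(self):
        # second time point: first time point plus the preset time threshold
        self.t2 = time.monotonic()
        return self.t2 - self.t1

def record_response_window(preset_threshold_s):
    """Record device responses for a fixed window after the wake audio ends."""
    task = RecordingTask()
    task.start()
    time.sleep(preset_threshold_s)  # in a real test, audio capture runs here
    return task.stop()

elapsed = record_response_window(0.05)
```

Because the window always starts at the end of the wake-up audio and has a fixed length, timestamps inside the recording can be measured relative to the first time point.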
Optionally, step S206, determining the interaction capability of the voice interaction device according to the audio response data, includes: in the case that the interaction capability includes the wake-up delay and the interaction delay, performing noise reduction on the audio response data and segmenting the noise-reduced audio response data to obtain a plurality of audio fragments; determining the audio fragments whose duration is shorter than a preset duration as noise fragments and deleting them from the plurality of audio fragments to obtain a plurality of voice interaction fragments; and determining the interaction capability of the voice interaction device according to the start times and end times of the voice interaction fragments within the audio response data.
If the interaction capabilities to be tested include the wake-up delay and the interaction delay, both results can be obtained from a single test run. Noise reduction is first performed on the audio response data to prevent noise from interfering with the test result; the noise-reduced audio response data is then segmented into a plurality of audio fragments; fragments whose duration is shorter than the preset duration are marked as noise fragments and deleted; and finally the interaction capability of the voice interaction device is determined from the start and end times of the remaining voice interaction fragments within the audio response data.
It can be understood that noise reduction removes the noise constantly present in the environment, but unexpected noise, such as an object falling to the floor, is unavoidable during testing, so noise spikes may remain in the noise-reduced audio response data. Such spikes are short and are not speech fragments produced by the interaction; therefore, audio fragments whose duration is shorter than the preset duration are identified as noise fragments and deleted.
Optionally, the noise reduction and segmentation steps, namely performing noise reduction on the audio response data and segmenting the noise-reduced audio response data to obtain a plurality of audio fragments, include: removing noise data from the audio response data according to a preset sound threshold to obtain the noise-reduced audio response data, wherein the preset sound threshold is the maximum decibel value of the noise in the preset test scenario; and segmenting the noise-reduced audio response data at silence segments to obtain the plurality of audio fragments, wherein a silence segment is audio in the noise-reduced audio response data whose silence duration is longer than a preset time threshold, and the plurality of audio fragments do not contain the silence segments.
It can be understood that the noise reduction step removes the background noise of the test environment. The noise in a performance test is constantly present in the environment, its decibel level is stable within a certain range, and it is quieter than the interactive audio, so noise reduction can be implemented by setting a preset sound threshold equal to the maximum decibel value of the noise in the preset test scenario. The noise-reduced audio response data is then split at silence segments to obtain a plurality of audio fragments, where a silence segment is audio whose silence lasts longer than a preset time threshold and can be understood as a pause in the interaction; because the audio response data is split by removing the silence segments, the resulting audio fragments contain no silence.
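A minimal sketch of this noise gate plus silence-based segmentation, operating on per-frame loudness values in decibels; the function name, parameters, and frame representation are assumptions for illustration, not the patent's implementation:

```python
def segment_audio(frames_db, frame_s, noise_db, min_silence_s, min_segment_s):
    """Split a recording into voice-interaction fragments.

    frames_db: per-frame loudness in dB; frame_s: frame length in seconds;
    noise_db: preset sound threshold (max noise level of the test scene);
    silence runs longer than min_silence_s split the audio, and fragments
    shorter than min_segment_s are dropped as noise spikes.
    Returns a list of (start_s, end_s) fragments.
    """
    voiced = [db > noise_db for db in frames_db]        # noise gate
    raw, start = [], None
    for i, v in enumerate(voiced):                      # runs of voiced frames
        if v and start is None:
            start = i
        elif not v and start is not None:
            raw.append((start, i))
            start = None
    if start is not None:
        raw.append((start, len(voiced)))
    merged = []                                         # bridge short pauses
    for s, e in raw:
        if merged and (s - merged[-1][1]) * frame_s <= min_silence_s:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return [(s * frame_s, e * frame_s) for s, e in merged
            if (e - s) * frame_s >= min_segment_s]      # drop noise spikes

# 0.1 s frames: reply, short pause, reply, long silence, noise spike, reply
frames = [60] * 5 + [30] * 2 + [60] * 3 + [30] * 6 + [65] + [30] * 4 + [60] * 5
print(segment_audio(frames, 0.1, 40, 0.3, 0.3))
```

With a 40 dB gate, the 0.2 s pause is bridged into the first fragment, the single 65 dB frame is discarded as a noise spike, and two voice-interaction fragments remain.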
Optionally, the audio response data further includes a voice command sent by the target object, and determining the interactive capability of the voice interaction device according to the starting time and the ending time of the plurality of voice interaction fragments in the audio response data includes: determining a third time point and a fourth time point corresponding to the voice instruction, wherein the third time point is a starting time point of the voice instruction, and the fourth time point is an ending time point of the voice instruction; determining a difference between a starting time point of a first voice interaction segment and the first time point as the wake-up time delay, and determining a difference between a starting time point of a second voice interaction segment and the fourth time point as the interaction time delay, wherein the first voice interaction segment is a voice interaction segment between the first time point and the third time point in the voice interaction segments, and the second voice interaction segment is a voice interaction segment between the fourth time point and the second time point in the voice interaction segments.
It can be understood that the wake-up delay is the time interval from when the user issues a wake-up instruction (corresponding to the wake-up audio) to when the speaker wakes up and responds, and the interaction delay is the time interval from when the user issues an interaction instruction (corresponding to the voice instruction) to when the speaker responds to it. It is therefore necessary to determine when the wake-up audio and the interactive audio are issued; the wake-up phase is separated from the interaction phase by the voice instruction issued by the target object, after which the interaction phase begins. Accordingly, the third time point at which the voice instruction starts and the fourth time point at which it ends must be determined; the voice interaction fragments are divided by the third and fourth time points into a first voice interaction fragment of the wake-up phase and a second voice interaction fragment of the interaction phase. Finally, the wake-up delay is determined as the difference between the start time of the first voice interaction fragment and the first time point, and the interaction delay as the difference between the start time of the second voice interaction fragment and the fourth time point.
It should be noted that in an actual test the target object may issue several voice instructions. In that case the first of the voice instructions serves as the dividing point between the wake-up phase and the interaction phase, and several interaction delays must be determined. The interaction phase can therefore be divided into several sub-interaction phases according to the start and end time points of the voice instructions, and the interaction delays determined respectively from the start time of the voice interaction fragment of each sub-interaction phase and the start time of that sub-interaction phase.
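The delay bookkeeping above, including the multi-command case, can be sketched as follows; the helper name and the example timestamps are hypothetical:

```python
def compute_delays(t1, command_spans, fragments):
    """t1: first time point (end of the wake-up audio; 0.0 if the recording
    starts there); command_spans: (start, end) of each voice instruction the
    target object plays; fragments: (start, end) voice-interaction fragments,
    sorted by start time.  Returns (wake_delay, [interaction_delay, ...])."""
    t3 = command_spans[0][0]  # first command divides wake and interaction phases
    first_reply = next(f for f in fragments if t1 <= f[0] < t3)
    wake_delay = first_reply[0] - t1
    interaction_delays = []
    for _, t4 in command_spans:  # one sub-interaction phase per command
        # naive: take the first fragment starting after this command ends
        reply = next(f for f in fragments if f[0] >= t4)
        interaction_delays.append(reply[0] - t4)
    return wake_delay, interaction_delays

# device replied 0.4 s after wake-up, then 0.4 s and 0.8 s after two commands
wake, inter = compute_delays(0.0, [(1.5, 2.6), (5.0, 6.4)],
                             [(0.4, 1.2), (3.0, 4.5), (7.2, 8.0)])
```

A production version would also need to confirm that the reply fragment falls before the next command starts, so a missed response is not mistaken for a very long delay.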
Optionally, the above step S206, determining the interactive capability of the voice interaction device according to the audio response data, may be implemented by the following scheme: in a case where the interaction capability includes a wake-up success rate, cyclically executing a starting step and an ending step to obtain a plurality of audio response data, wherein the starting step includes: starting a recording task of the interactive voice between the target object and the voice interaction device at the first time point; the ending step includes: ending the recording task at a second time point to obtain the audio response data of the voice interaction device between the first time point and the second time point; and determining the wake-up success rate of the voice interaction device according to the wake-up states corresponding to the plurality of audio response data.
If the interactive capability includes a wake-up success rate, a starting step and an ending step may be executed cyclically to obtain a plurality of audio response data, where the starting step is: starting a recording task of the interactive voice between the target object and the voice interaction device at a first time point; and the ending step is: ending the recording task at a second time point to obtain the audio response data of the voice interaction device between the first time point and the second time point. Finally, the wake-up success rate of the voice interaction device is determined according to the wake-up states corresponding to the plurality of audio response data.
It can be understood that in an actual test, a single test result is not conclusive; multiple tests are required to reflect the actual interaction capability of the voice interaction device, so the above starting and ending steps are executed cyclically. Multiple tests are performed, the audio during testing is recorded and stored, and the audio is analyzed to obtain the analysis result of the interaction capability.
Optionally, determining the wake-up success rate of the voice interaction device according to the wake-up states corresponding to the plurality of audio response data includes: in a case where the first voice interaction fragment exists, determining that the wake-up state corresponding to the audio response data is wake-up success; determining the number of first audio response data, among the plurality of audio response data, whose wake-up state is wake-up success; and determining the wake-up success rate according to the number of the first audio response data and the number of the plurality of audio response data.
It can be understood that if the first voice interaction fragment exists, the wake-up response voice exists, which means the wake-up succeeded; the wake-up state of the test corresponding to that audio response data is therefore determined to be wake-up success. Finally, the wake-up success rate is calculated as the ratio of the number of first audio response data to the total number of audio response data.
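The ratio just described can be expressed in a few lines; the function name and the boolean-per-recording representation are illustrative assumptions.

```python
# Hedged sketch: a recording counts as a successful wake-up when a
# first voice-interaction fragment (the wake-up reply) was found in it.
def wake_success_rate(responses):
    """responses: list of bools, True if the first fragment exists."""
    if not responses:
        return 0.0
    return sum(responses) / len(responses)

print(wake_success_rate([True, True, False, True]))  # → 0.75
```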
Optionally, before the above step S202, starting the recording task of the interactive voice between the target object and the voice interaction device at the first time point, the method further includes: acquiring a plurality of historical audio response data of the voice interaction device, wherein the wake-up states corresponding to the historical audio response data are all wake-up success; and determining the average value of the audio durations of the plurality of historical audio response data as the preset time threshold.
To ensure the efficiency of the test, a suitable ending time needs to be determined for the recording task, so that the test can be completed without recording useless audio. Therefore, a plurality of historical audio response data of the voice interaction device are first acquired, where the wake-up states corresponding to the historical audio response data are all wake-up success; that is, the average audio duration of the successfully-woken test cases is used as the preset time threshold.
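Deriving the preset time threshold as described above amounts to a simple mean over past runs; this sketch assumes the durations (in seconds) of successfully-woken historical recordings are already available as a list.

```python
# Hedged sketch: the recording length for future tests is the mean
# duration of historical recordings whose wake-up state was "success".
def preset_time_threshold(historical_durations):
    return sum(historical_durations) / len(historical_durations)

print(preset_time_threshold([6.0, 7.0, 8.0]))  # → 7.0
```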
Optionally, before the above step S202, starting the recording task of the interactive voice between the target object and the voice interaction device at the first time point, the method further includes: performing audio cutting on the wake-up audio to remove the tail sound of the wake-up audio.
To make the measurement of the wake-up time delay more accurate, the wake-up audio can be preprocessed first to remove the tail sound in the wake-up audio.
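One plausible way to trim the tail sound is to drop everything after the last frame whose amplitude exceeds a silence threshold; the frame size and threshold values below are assumptions for illustration, not parameters from the embodiment.

```python
# Hedged sketch: cut trailing near-silence from a mono sample list so
# the wake-up audio ends exactly where the voiced portion ends.
def trim_tail(samples, rate, threshold=0.02, frame_ms=20):
    frame = max(1, int(rate * frame_ms / 1000))
    last_voiced = 0
    for start in range(0, len(samples), frame):
        chunk = samples[start:start + frame]
        if max(abs(x) for x in chunk) > threshold:
            last_voiced = start + len(chunk)
    return samples[:last_voiced]

# 100 voiced samples followed by a quiet tail; the tail is removed.
sig = [0.5] * 100 + [0.001] * 100
print(len(trim_tail(sig, 1000)))  # → 100
```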
The embodiment of the application also provides a schematic diagram of a testing flow of the intelligent sound box, as shown in fig. 3, specifically comprising the following steps:
step 1: establishing a corpus play library and a voice processing file library;
the example corpus is stored in the response path in advance, and the program automatically calls the example corpus to play in the test process without manually clicking the example corpus to play. And then, under the simulated actual use scene, realizing a test flow of wake-up interaction cycle.
Step 2: starting a test task and playing wake-up corpus (equivalent to the wake-up audio);
step 3: starting the recording task at the moment the wake-up corpus finishes playing;
step 4: starting to play the interactive corpus (corresponding to the voice command) at a fixed time point;
step 5: ending the recording task when the set recording duration (corresponding to the preset time threshold) arrives;
Step 6: judging whether the ending condition is met; if not, returning to step 2; if yes, executing step 7;
step 7: multiple pieces of recorded audio (corresponding to the multiple pieces of audio response data) are acquired and analyzed to determine the interactive capabilities of the voice device.
After each cycle ends, the program automatically analyzes the voice recorded during the wake-up and interaction process, determines whether the intelligent sound box was successfully woken up, as well as the wake-up time delay and interaction time delay, through voice noise reduction, segmentation, voice recognition and the like, and finally records the result and the audio path before the next cycle.
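The cyclic flow of steps 2 through 7 can be sketched as a small control loop; the `play`, `record`, and `analyze` callables stand in for the device- and platform-specific pieces and are assumptions, not the embodiment's API.

```python
# Hedged sketch of the looped test flow: per round, play the wake-up
# corpus, play the interaction corpus, record for a fixed duration,
# analyze, and accumulate results across rounds.
def run_test_loop(rounds, play, record, analyze):
    results = []
    for _ in range(rounds):
        play("wake")                     # step 2: play wake-up corpus
        play("command")                  # step 4: play interaction corpus
        audio = record()                 # steps 3/5: timed recording task
        results.append(analyze(audio))   # per-cycle analysis
    return results                       # step 7: aggregate results

# Dummy wiring: every round "records" audio that analyzes as a success.
log = []
out = run_test_loop(3, log.append, lambda: "audio", lambda a: a == "audio")
print(out)  # → [True, True, True]
```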
Based on the above process, the embodiment of the present application further provides an optional process for determining the interactive capability, as shown in fig. 4, which specifically includes the following steps:
step 1: the corpus playing and voice processing file paths are set in advance, and noise thresholds (corresponding to the preset sound threshold) and mute parameters (corresponding to the preset time threshold) of different stages are set;
step 2: randomly playing the wake-up corpus and recording a sound box recovery process; randomly playing the interaction corpus and recording the interaction process of the sound box;
step 3: noise reduction is carried out on the record;
step 4: cutting the recording of the wake-up process according to the noise threshold and the mute parameter, calculating the number of voice fragments, judging from that number whether the wake-up succeeded, and if so, recording the time interval between the corresponding fragments, namely the wake-up time delay;
Step 5: storing the test result into an Excel file, and continuing to play the corpus to perform the next test.
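The threshold-plus-silence cutting in step 4 can be sketched as follows: samples at or below the noise threshold count as silence, and a silent run longer than the mute parameter splits the recording. The function name and parameter values are illustrative assumptions.

```python
# Hedged sketch: split a mono recording into voiced segments using a
# noise (amplitude) threshold and a minimum-silence duration.
def segment(samples, rate, noise_threshold, min_silence_s):
    min_silence = int(min_silence_s * rate)  # silence run, in samples
    segments, start, silent_run = [], None, 0
    for i, x in enumerate(samples):
        if abs(x) > noise_threshold:
            if start is None:
                start = i       # a new voiced segment begins
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silence:
                # close the segment at the last voiced sample
                segments.append((start, i - silent_run + 1))
                start, silent_run = None, 0
    if start is not None:
        segments.append((start, len(samples)))
    return segments

# Two bursts of speech separated by 0.5 s of silence at 10 Hz.
sig = [0.5] * 5 + [0.0] * 5 + [0.5] * 5
print(segment(sig, 10, 0.1, 0.3))  # → [(0, 5), (10, 15)]
```

Counting the returned segments then gives the wake-up judgment described in step 4.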
Through the above steps, corpus playing and analysis are fully automated, repeated cyclic tests can be performed without manual monitoring, and the test process is recorded, which facilitates analysis, allows results to be looked up retrospectively, and makes it easy to judge whether the program runs correctly and to debug.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method of the various embodiments of the present invention.
In this embodiment, a device for determining an interaction capability is further provided, and the device for determining an interaction capability is used to implement the foregoing embodiments and preferred embodiments, which are not described in detail. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
FIG. 5 is a block diagram of an alternative interaction capability determination apparatus according to an embodiment of the present invention; as shown in fig. 5, includes:
the starting module 52 is configured to start a recording task of an interactive voice between a target object and a voice interaction device at a first time point, where the first time point is an ending time point when the target object sends a wake-up audio to the voice interaction device;
it should be noted that the target object may be an audio playing device that automatically plays the wake-up corpus (corresponding to the wake-up audio), or may be a user, which is not limited in this application.
An ending module 54, configured to end the recording task at a second time point, and obtain audio response data of the voice interaction device between the first time point and the second time point, where a time difference between the second time point and the first time point is a preset time threshold, where the audio response data includes: a first response result of the voice interaction device to the wake-up audio, a response time of the voice interaction device to the wake-up audio, and a second response result of the voice interaction device to the voice instruction sent by the target object;
A determining module 56 is configured to determine the interactive capability of the voice interaction device according to the audio response data.
By the device, the recording task of the interactive voice between the target object and the voice interaction device is started at the ending time point (namely the first time point) at which the target object sends the wake-up audio to the voice interaction device; the recording task is ended at a second time point to obtain audio response data between the first time point and the second time point, wherein the time difference between the second time point and the first time point is a preset time threshold, and the audio response data includes: a first response result of the voice interaction device to the wake-up audio, a response time of the voice interaction device to the wake-up audio, and a second response result of the voice interaction device to a voice instruction sent by the target object; and the interactive capability of the voice interaction device is determined according to the obtained audio response data. Through this scheme, the problems in the related art of long time consumption and low testing efficiency caused by manually measuring and analyzing the interaction performance indicators of the intelligent sound box are solved, thereby automatically testing and analyzing the interaction performance indicators of the intelligent sound box and achieving the technical effect of improving testing efficiency.
Optionally, the determining module 56 is further configured to, in a case where the interaction capability includes a wake-up time delay and an interaction time delay, perform noise reduction processing on the audio response data, and perform audio segmentation on the noise-reduced audio response data to obtain a plurality of audio segments of the audio response data; determining an audio fragment with the audio time length smaller than a preset time length in the plurality of audio fragments as a noise fragment, and deleting the noise fragment from the plurality of audio fragments to obtain a plurality of voice interaction fragments; and determining the interactive capability of the voice interaction equipment according to the starting time and the ending time of the voice interaction fragments in the audio response data.
If the interaction capability to be tested includes the wake-up time delay and the interaction time delay, both can be obtained from a single test. Noise reduction is performed on the audio response data to avoid interference from noise in the test result; the noise-reduced audio response data is then segmented into a plurality of audio fragments, the audio fragments whose duration is less than the preset duration are determined as noise fragments and deleted, and finally the interaction capability of the voice interaction device is determined according to the starting and ending times of the plurality of voice interaction fragments in the audio response data.
It can be understood that the noise reduction process removes the noise constantly present in the environment, but unexpected noise, such as an object falling to the ground, is inevitable during the test, so noise points may remain in the noise-reduced audio response data. These noise points are short in duration and are not voice fragments produced by the interaction; therefore, the audio fragments whose duration is less than the preset duration are determined as noise fragments and deleted.
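The short-fragment filtering just described reduces to one pass over the segment list; the (start, end)-in-seconds representation and the 0.3 s cutoff below are assumed example values, not figures from the embodiment.

```python
# Hedged sketch: fragments shorter than a preset duration are treated
# as transient noise (e.g. an object dropping) and removed.
def drop_noise_fragments(segments, min_duration=0.3):
    """segments: list of (start_s, end_s) tuples in seconds."""
    return [(s, e) for (s, e) in segments if (e - s) >= min_duration]

print(drop_noise_fragments([(0.0, 0.1), (0.5, 1.4), (2.0, 2.05)]))
# → [(0.5, 1.4)]
```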
Optionally, the determining module 56 is further configured to remove noise data in the audio response data according to a preset sound threshold, to obtain the noise-reduced audio response data, where the preset sound threshold is a maximum decibel value of noise in a preset test scene; and carrying out audio segmentation on the noise-reduced audio response data according to the mute segments to obtain the plurality of audio segments, wherein the mute segments are audio with the mute time length larger than a preset time threshold in the noise-reduced audio response data, and the plurality of audio segments do not contain the mute segments.
It can be understood that the noise reduction process removes the background noise of the test environment: the noise present during a performance test exists constantly, and its decibel value is stable within a certain range and lower than that of the interactive audio, so noise reduction can be realized by setting a preset sound threshold, which is the maximum decibel value of the noise in the preset test scene. The noise-reduced audio response data is then segmented into a plurality of audio fragments according to the mute segments, where a mute segment is audio whose mute duration is greater than the preset time threshold and can be understood as a pause during the interaction; the audio response data is segmented by removing the mute segments, so the resulting audio fragments contain no mute segments.
Optionally, the determining module 56 is further configured to determine a third time point and a fourth time point corresponding to the voice command, where the third time point is a start time point of the voice command, and the fourth time point is an end time point of the voice command; determining a difference between a starting time point of a first voice interaction segment and the first time point as the wake-up time delay, and determining a difference between a starting time point of a second voice interaction segment and the fourth time point as the interaction time delay, wherein the first voice interaction segment is a voice interaction segment between the first time point and the third time point in the voice interaction segments, and the second voice interaction segment is a voice interaction segment between the fourth time point and the second time point in the voice interaction segments.
It can be understood that the wake-up time delay refers to the time interval from when the user issues the wake-up instruction (corresponding to the wake-up audio) to when the sound box is successfully woken up and replies, and the interaction time delay refers to the time interval from when the user issues the interaction instruction (corresponding to the voice instruction) to when the sound box successfully responds to the interaction. It is therefore necessary to determine when the wake-up audio and the interaction audio are emitted; the wake-up process is separated from the interaction process by the voice instruction issued by the target object, with the interaction process following that instruction. Accordingly, a third time point at which the voice instruction starts and a fourth time point at which the voice instruction ends need to be determined; the voice interaction fragments are divided by the third time point and the fourth time point into a first voice interaction fragment of the wake-up stage and a second voice interaction fragment of the interaction stage. Finally, the wake-up time delay is determined as the time difference between the starting time point of the first voice interaction fragment and the first time point, and the interaction time delay is determined as the difference between the starting time point of the second voice interaction fragment and the fourth time point.
It should be noted that in an actual test, the target object may issue multiple voice instructions. In that case, the first of the multiple voice instructions is taken as the dividing point between the wake-up stage and the interaction stage, and multiple interaction time delays need to be determined separately: the interaction stage is divided into multiple sub-interaction stages according to the starting and ending time points of the multiple voice instructions, and the multiple interaction time delays are determined from the starting time points of the voice interaction fragments of the sub-interaction stages and the starting time points of the corresponding sub-interaction stages.
Optionally, the determining module 56 is further configured to, in a case where the interaction capability includes a wake-up success rate, circularly execute a start step and an end step to obtain a plurality of audio response data, where the start step includes: starting a recording task of interactive voice between the target object and the voice interaction equipment at the first time point; the ending step comprises the following steps: ending the recording task at a second time point to obtain audio response data of the voice interaction equipment between the first time point and the second time point; and determining the wake-up success rate of the voice interaction equipment according to the wake-up states corresponding to the plurality of audio response data.
If the interactive capability includes a wake-up success rate, a starting step and an ending step may be executed cyclically to obtain a plurality of audio response data, where the starting step is: starting a recording task of the interactive voice between the target object and the voice interaction device at a first time point; and the ending step is: ending the recording task at a second time point to obtain the audio response data of the voice interaction device between the first time point and the second time point. Finally, the wake-up success rate of the voice interaction device is determined according to the wake-up states corresponding to the plurality of audio response data.
It can be understood that in an actual test, a single test result is not conclusive; multiple tests are required to reflect the actual interaction capability of the voice interaction device, so the above starting and ending steps are executed cyclically. Multiple tests are performed, the audio during testing is recorded and stored, and the audio is analyzed to obtain the analysis result of the interaction capability.
Optionally, the determining module 56 is further configured to determine that the wake-up state corresponding to the audio response data is wake-up success if the first voice interaction fragment exists; determine the number of first audio response data, among the plurality of audio response data, whose wake-up state is wake-up success; and determine the wake-up success rate according to the number of the first audio response data and the number of the plurality of audio response data.
It can be understood that if the first voice interaction fragment exists, the wake-up response voice exists, which means the wake-up succeeded; the wake-up state of the test corresponding to that audio response data is therefore determined to be wake-up success. Finally, the wake-up success rate is calculated as the ratio of the number of first audio response data to the total number of audio response data.
Optionally, the starting module 52 is further configured to, before starting the recording task of the interactive voice between the target object and the voice interaction device at the first time point, acquire a plurality of historical audio response data of the voice interaction device, where the wake-up states corresponding to the plurality of historical audio response data are all wake-up success; and determine the average value of the audio durations of the plurality of historical audio response data as the preset time threshold.
To ensure the efficiency of the test, a suitable ending time needs to be determined for the recording task, so that the test can be completed without recording useless audio. Therefore, a plurality of historical audio response data of the voice interaction device are first acquired, where the wake-up states corresponding to the historical audio response data are all wake-up success; that is, the average audio duration of the successfully-woken test cases is used as the preset time threshold.
An embodiment of the present invention also provides a storage medium including a stored program, wherein the program executes the method of any one of the above.
Alternatively, in the present embodiment, the above-described storage medium may be configured to store program code for performing the steps of:
s1, starting a recording task of interactive voice between a target object and voice interaction equipment at a first time point, wherein the first time point is an ending time point when the target object sends out wake-up audio to the voice interaction equipment;
s2, ending the recording task at a second time point to obtain audio response data of the voice interaction device between the first time point and the second time point, wherein a time difference value between the second time point and the first time point is a preset time threshold value, and the audio response data comprises: a first response result of the voice interaction device to the wake-up audio, a response time of the voice interaction device to the wake-up audio, and a second response result of the voice interaction device to the voice instruction sent by the target object;
and S3, determining the interaction capability of the voice interaction equipment according to the audio response data.
An embodiment of the invention also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, where the transmission device is connected to the processor, and the input/output device is connected to the processor.
Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program:
s1, starting a recording task of interactive voice between a target object and voice interaction equipment at a first time point, wherein the first time point is an ending time point when the target object sends out wake-up audio to the voice interaction equipment;
s2, ending the recording task at a second time point to obtain audio response data of the voice interaction device between the first time point and the second time point, wherein a time difference value between the second time point and the first time point is a preset time threshold value, and the audio response data comprises: a first response result of the voice interaction device to the wake-up audio, a response time of the voice interaction device to the wake-up audio, and a second response result of the voice interaction device to the voice instruction sent by the target object;
And S3, determining the interaction capability of the voice interaction equipment according to the audio response data.
Alternatively, in the present embodiment, the storage medium may include, but is not limited to: a U-disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Alternatively, specific examples in this embodiment may refer to examples described in the foregoing embodiments and optional implementations, and this embodiment is not described herein.
It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented in a general purpose computing device, they may be concentrated on a single computing device, or distributed across a network of computing devices, they may alternatively be implemented in program code executable by computing devices, so that they may be stored in a memory device for execution by computing devices, and in some cases, the steps shown or described may be performed in a different order than that shown or described, or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps within them may be fabricated into a single integrated circuit module for implementation. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for determining interactive capability, comprising:
starting a recording task of interactive voice between a target object and voice interaction equipment at a first time point, wherein the first time point is an ending time point when the target object sends wake-up audio to the voice interaction equipment;
ending the recording task at a second time point to obtain audio response data of the voice interaction device between the first time point and the second time point, wherein a time difference value between the second time point and the first time point is a preset time threshold value, and the audio response data comprises: a first response result of the voice interaction device to the wake-up audio, a response time of the voice interaction device to the wake-up audio, and a second response result of the voice interaction device to the voice instruction sent by the target object;
And determining the interaction capability of the voice interaction equipment according to the audio response data.
2. The method of determining interactive capability according to claim 1, wherein determining interactive capability of the voice interaction device from the audio response data comprises:
under the condition that the interaction capability comprises wake-up time delay and interaction time delay, carrying out noise reduction processing on the audio response data, and carrying out audio segmentation on the noise-reduced audio response data to obtain a plurality of audio fragments of the audio response data;
determining an audio fragment with the audio time length smaller than a preset time length in the plurality of audio fragments as a noise fragment, and deleting the noise fragment from the plurality of audio fragments to obtain a plurality of voice interaction fragments; and determining the interactive capability of the voice interaction equipment according to the starting time and the ending time of the voice interaction fragments in the audio response data.
3. The method for determining interactive capability according to claim 2, wherein performing noise reduction processing on the audio response data and performing audio segmentation on the noise-reduced audio response data to obtain a plurality of audio segments of the audio response data comprises:
Removing noise data in the audio response data according to a preset sound threshold value to obtain the noise-reduced audio response data, wherein the preset sound threshold value is the maximum decibel value of noise in a preset test scene;
and carrying out audio segmentation on the noise-reduced audio response data according to the mute segments to obtain the plurality of audio segments, wherein the mute segments are audio with the mute time length larger than a preset time threshold in the noise-reduced audio response data, and the plurality of audio segments do not contain the mute segments.
4. The method for determining the interactive capability according to claim 2, wherein the audio response data further comprises a voice command issued by the target object, and wherein determining the interactive capability of the voice interaction device according to the start time and the end time of the plurality of voice interaction fragments in the audio response data comprises:
determining a third time point and a fourth time point corresponding to the voice instruction, wherein the third time point is a starting time point of the voice instruction, and the fourth time point is an ending time point of the voice instruction;
determining a difference between a starting time point of a first voice interaction segment and the first time point as the wake-up time delay, and determining a difference between a starting time point of a second voice interaction segment and the fourth time point as the interaction time delay, wherein the first voice interaction segment is a voice interaction segment between the first time point and the third time point in the voice interaction segments, and the second voice interaction segment is a voice interaction segment between the fourth time point and the second time point in the voice interaction segments.
5. The method according to any one of claims 1 to 4, wherein determining the interactive capability of the voice interaction device according to the audio response data comprises:
in the case that the interactive capability comprises a wake-up success rate, cyclically executing a starting step and an ending step to obtain a plurality of audio response data, wherein the starting step comprises: starting a recording task for the interactive voice between the target object and the voice interaction device at the first time point; and the ending step comprises: ending the recording task at a second time point to obtain audio response data of the voice interaction device between the first time point and the second time point;
and determining the wake-up success rate of the voice interaction device according to the wake-up states corresponding to the plurality of audio response data.
6. The method for determining interactive capability according to claim 5, wherein determining the wake-up success rate of the voice interaction device according to the wake-up states corresponding to the plurality of audio response data comprises:
in the case that the first voice interaction segment exists, determining that the wake-up state corresponding to the audio response data is a successful wake-up;
determining, among the plurality of audio response data, the number of first audio response data whose wake-up state is a successful wake-up;
and determining the wake-up success rate according to the number of the first audio response data.
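Claims 5 and 6 together describe a repeated test loop whose per-run outcome is a boolean (did the first voice interaction segment exist), with the success rate as the fraction of successful runs. A minimal sketch under that reading; the name and input shape are assumptions for illustration:

```python
def wake_up_success_rate(runs):
    """`runs` is one boolean per recorded audio response: True when a
    first voice interaction segment was found, i.e. a successful
    wake-up. Returns the fraction of successful runs."""
    if not runs:
        return 0.0
    return sum(runs) / len(runs)
```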
7. The method for determining interactive capability according to claim 1, wherein before starting the recording task for the interactive voice between the target object and the voice interaction device at the first time point, the method further comprises: acquiring a plurality of historical audio response data of the voice interaction device, wherein the wake-up state corresponding to each of the historical audio response data is a successful wake-up;
and determining the average of the audio durations of the plurality of historical audio response data as the preset time threshold.
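Claim 7 sizes the recording window from past successful wake-ups: the preset time threshold is the mean audio duration over historical responses whose wake-up state was a successful wake-up. A one-line sketch with illustrative naming:

```python
def preset_time_threshold(historical_durations):
    """Mean duration (e.g. in seconds) of historical audio response
    data from successful wake-ups; used as the recording window."""
    return sum(historical_durations) / len(historical_durations)
```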
8. An apparatus for determining interactive capability, comprising:
a starting module configured to start a recording task for interactive voice between a target object and a voice interaction device at a first time point, wherein the first time point is the end time point at which the target object issues wake-up audio to the voice interaction device;
an ending module configured to end the recording task at a second time point to obtain audio response data of the voice interaction device between the first time point and the second time point, wherein the time difference between the second time point and the first time point is a preset time threshold, and the audio response data comprises: a first response result of the voice interaction device to the wake-up audio, a response time of the voice interaction device to the wake-up audio, and a second response result of the voice interaction device to the voice instruction issued by the target object;
and a determining module configured to determine the interactive capability of the voice interaction device according to the audio response data.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored program, wherein the program, when run, performs the method of any one of claims 1 to 7.
10. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method according to any of the claims 1 to 7 by means of the computer program.
CN202310335567.3A 2023-03-30 2023-03-30 Interactive capability determining method and device, storage medium and electronic device Pending CN116450080A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310335567.3A CN116450080A (en) 2023-03-30 2023-03-30 Interactive capability determining method and device, storage medium and electronic device


Publications (1)

Publication Number Publication Date
CN116450080A true CN116450080A (en) 2023-07-18

Family

ID=87128099

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310335567.3A Pending CN116450080A (en) 2023-03-30 2023-03-30 Interactive capability determining method and device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN116450080A (en)

Similar Documents

Publication Publication Date Title
CN116450080A (en) Interactive capability determining method and device, storage medium and electronic device
CN116383172A (en) Data backtracking method, storage medium and electronic device
CN115327934A (en) Intelligent household scene recommendation method and system, storage medium and electronic device
CN117316179A (en) Device state determining method and device based on digital twin and storage medium
CN115309062A (en) Device control method, device, storage medium, and electronic apparatus
CN114023304A (en) Control method of intelligent equipment, intelligent household equipment, nonvolatile storage medium and processor
CN117573522A (en) Method and device for determining starting duration, storage medium and electronic device
CN117672188A (en) Audio stream processing method and device, storage medium and electronic device
CN115134241B (en) Scene optimization method and device, storage medium and electronic equipment
CN117573320A (en) Task node execution method and device, storage medium and electronic device
CN115001885B (en) Equipment control method and device, storage medium and electronic device
CN114866802B (en) Video stream sending method and device, storage medium and electronic device
CN116382766A (en) Page packaging method and device, storage medium and electronic device
CN116246624A (en) Voice control method and device of intelligent equipment, storage medium and electronic device
CN116501698A (en) Timing task processing method and device, storage medium and electronic device
CN115171699A (en) Wake-up parameter adjusting method and device, storage medium and electronic device
CN115373958A (en) Method and device for determining abnormal information, storage medium and electronic device
CN117672189A (en) Training method and device for voice enhancement model, storage medium and electronic device
CN117749549A (en) Equipment operation method and device, storage medium and electronic device
CN116302877A (en) Message queue performance test method and device
CN117914788A (en) Message current limiting processing method and device for intelligent household equipment
CN116541012A (en) Basic parameter modification method and device, storage medium and electronic device
CN116545791A (en) Parameter adjustment method and device, storage medium and electronic device
CN116389436A (en) Broadcasting method and device of processing result, storage medium and electronic device
CN117542355A (en) Distributed voice awakening method and device, storage medium and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination