CN117116297A - Test method, test system, test processing device, test electronic device and test storage medium - Google Patents

Info

Publication number: CN117116297A
Authority: CN (China)
Legal status: Pending (an assumption, not a legal conclusion)
Application number: CN202311110143.3A
Other languages: Chinese (zh)
Inventors: 章福瑜, 李坚涛
Current assignee: Apollo Intelligent Connectivity Beijing Technology Co Ltd
Original assignee: Apollo Intelligent Connectivity Beijing Technology Co Ltd
Prior art keywords: waveform, audio, test, key point, response
Filing: application CN202311110143.3A filed by Apollo Intelligent Connectivity Beijing Technology Co Ltd; priority to CN202311110143.3A; published as CN117116297A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26: Speech to text systems
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals


Abstract

The disclosure provides a test method, a test system, a processing device, an electronic device, and a storage medium, and relates to the technical field of data processing, in particular to the fields of voice assistants, intelligent voice, and artificial intelligence. A specific implementation scheme is as follows: in response to audio signal monitoring of test audio, display a first waveform in a target interface, the first waveform characterizing the audio signal variation of the test audio; in response to audio signal monitoring of response audio, display a second waveform in the target interface, the second waveform characterizing the audio signal variation of the response audio; perform key point recognition on the first waveform and the second waveform to obtain a first recognition result; and determine the end-to-end voice response delay based on the first recognition result. With this scheme, the end-to-end voice response delay can be calculated accurately.

Description

Test method, test system, test processing device, test electronic device and test storage medium
Technical Field
The present disclosure relates to the technical field of data processing, in particular to the fields of voice assistants, intelligent voice, and artificial intelligence, and more particularly to a test method, a test system, a processing device, an electronic device, and a storage medium.
Background
End-to-end voice response delay is an important indicator of the performance of a voice interaction device and directly affects the user experience. A voice interaction program is installed on the voice interaction device, and the end-to-end voice response delay characterizes the response delay between the voice interaction device receiving test audio and starting to play the response audio corresponding to that test audio.
Disclosure of Invention
The present disclosure provides a test method, system, processing device, electronic device, and storage medium.
According to an aspect of the present disclosure, there is provided a test method including:
displaying a first waveform in a target interface in response to audio signal monitoring of test audio, wherein the first waveform characterizes the audio signal variation of the test audio;
displaying a second waveform in the target interface in response to audio signal monitoring of response audio, wherein the second waveform characterizes the audio signal variation of the response audio, and the second waveform and the first waveform are waveforms on the same time axis;
performing key point recognition on the first waveform and the second waveform to obtain a first recognition result; and
determining the end-to-end voice response delay based on the first recognition result.
According to another aspect of the present disclosure, there is provided a test system comprising:
a test device, a playing device, and a processing device with an oscillometric function;
the test device is a device on which a voice interaction program is installed; a first probe of the processing device is connected to a first predetermined output port of a control chip of the playing device, and a second probe of the processing device is connected to a second predetermined output port of a control chip of the test device;
the playing device is configured to emit test audio by invoking the first predetermined output port;
the test device is configured to perform voice interaction on the test audio and to play the response audio of the test audio by invoking the second predetermined output port;
the processing device is configured to display a first waveform in a target interface in response to audio signal monitoring of the test audio, display a second waveform in the target interface in response to audio signal monitoring of the response audio, and determine an end-to-end voice response delay based on key points in the first waveform and key points in the second waveform.
According to another aspect of the present disclosure, there is provided a processing apparatus including:
an oscillometric display module, configured to display a first waveform in a target interface in response to audio signal monitoring of test audio emitted by a playing device, and to display a second waveform in the target interface in response to audio signal monitoring of response audio emitted by a test device; wherein the first waveform characterizes the audio signal variation of the test audio, the response audio is the result of a voice interaction program in the test device performing voice interaction on the test audio, the second waveform characterizes the audio signal variation of the response audio, and the second waveform and the first waveform are waveforms on the same time axis; and
a data processing module, configured to perform key point recognition on the first waveform and the second waveform to obtain a first recognition result, and to determine an end-to-end voice response delay based on the first recognition result.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any one of the above-described test methods.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the test method according to any one of the above.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a test method according to any of the above.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of a voice interaction device processing audio in accordance with the present disclosure;
FIG. 2 is a flow chart of a test method according to the present disclosure;
FIG. 3 is another flow chart of a test method according to the present disclosure;
FIG. 4 is a schematic diagram of a test system according to the present disclosure;
FIG. 5 is a schematic diagram of the connection of various devices according to a specific example of a test system of the present disclosure;
FIG. 6 is a schematic diagram of a target interface of an oscilloscope of the present disclosure;
FIG. 7 is a schematic diagram of a test apparatus according to the present disclosure;
fig. 8 is a block diagram of an electronic device for implementing a test method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the prior art, a manually annotated audio file is fed into the voice interaction device for a simulated test: after the position where the speech ends is manually annotated in the audio file, the file is passed directly to the voice interaction program of the voice interaction device for speech recognition. A log timestamp ("dotting") is recorded when the annotated position is recognized, and another when the voice interaction program generates the response audio corresponding to the audio in the file; the time difference between the two log points is then taken as the end-to-end voice response delay of the voice interaction device.
However, this approach ignores the audio uplink delay, i.e., the time from when the audio is received by an audio input device (e.g., a microphone) until it becomes available to the voice interaction program, as well as the downlink delay of the TTS (Text To Speech) broadcast, i.e., the time that elapses between generation of the response audio and its playback through the headphone jack or built-in speaker. The method counts only the program processing time of the voice interaction program and therefore deviates from the end-to-end voice response delay the user actually experiences.
Note that the audio uplink delay is also called the audio input delay, and the downlink delay of the TTS broadcast is also called the audio output delay. The end-to-end voice response delay that the user actually experiences, i.e., the audio loopback delay, is the sum of the audio input delay, the program processing time, and the audio output delay, as shown in fig. 1.
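As a minimal illustration of the decomposition above (the function name and the millisecond values are made up for this sketch):

```python
def loopback_delay(audio_input_delay_ms, program_processing_ms, audio_output_delay_ms):
    """Audio loopback delay as the sum of its three components (fig. 1)."""
    return audio_input_delay_ms + program_processing_ms + audio_output_delay_ms

# A log-dotting test that measures only the middle term would report 800 ms,
# while the user actually experiences the full loopback delay:
print(loopback_delay(120, 800, 80))  # -> 1000
```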
Based on the foregoing, in order to accurately calculate an end-to-end voice response delay, embodiments of the present disclosure provide a test method, a system, a processing device, an electronic device, and a storage medium.
Next, a test method provided by the embodiments of the present disclosure will be first described.
The test method provided by the embodiments of the present disclosure can be applied to a processing device with an oscillometric function, such as an oscilloscope. Note that the embodiments of the present disclosure do not limit the processing device: any electronic device with both an oscillometric function and data processing capability may serve as the processing device. In practice, the processing device may include an oscillometric display module for displaying waveforms and a data processing module with data processing capability.
The test method provided by the embodiment of the disclosure may include the following steps:
displaying a first waveform in a target interface in response to audio signal monitoring of test audio, wherein the first waveform characterizes the audio signal variation of the test audio;
displaying a second waveform in the target interface in response to audio signal monitoring of response audio, wherein the second waveform characterizes the audio signal variation of the response audio, and the second waveform and the first waveform are waveforms on the same time axis;
performing key point recognition on the first waveform and the second waveform to obtain a first recognition result; and
determining the end-to-end voice response delay based on the first recognition result.
According to the test method provided by the embodiments of the present disclosure, a first waveform is displayed in the target interface in response to audio signal monitoring of the test audio, and a second waveform is displayed in the target interface in response to audio signal monitoring of the response audio. The first waveform characterizes the audio signal variation of the test audio, and the second waveform characterizes the audio signal variation of the response audio. Because the first and second waveforms share the same time axis and both reflect audio signal changes, key point recognition can be performed on them to obtain a first recognition result, and the end-to-end voice response delay can then be determined from that result. Unlike the prior art, the end-to-end voice response delay calculated by this scheme is the audio loopback delay the user actually experiences. The scheme therefore allows the end-to-end voice response delay to be calculated accurately.
A test method provided by an embodiment of the present disclosure is described below with reference to the accompanying drawings.
As shown in fig. 2, the test method provided by the embodiment of the disclosure may include the following steps:
s201, in response to audio signal monitoring for test audio, displaying a first waveform in a target interface; wherein the first waveform is used to characterize an audio signal variation of the test audio;
in this embodiment, the processing device with the oscillometric function may monitor the audio signal of the test audio and display a first waveform representing a change in the audio signal of the test audio in the target interface. In practical applications, the target interface is a waveform display interface of the processing device.
For example, the test audio may be played by a playing device, which may be a terminal device such as a mobile phone or a computer. The output port corresponding to the audio processor in the control chip of the playing device emits the test audio and may serve as the first predetermined output port; the processing device can then be connected to this port through a voltage probe to collect the electrical signal it outputs. An implementation of how the processing device monitors the audio signal of the test audio is described in the system embodiments below and is not repeated here.
It can be understood that, since the electrical signal output by the first predetermined output port has a voltage value variation when the test audio is played and does not have a voltage value variation when the test audio is not played, the processing device can convert the electrical signal obtained by monitoring into a waveform for displaying, and therefore, a first waveform representing the audio signal variation of the test audio can be displayed in the target interface.
S202, responding to audio signal monitoring of response audio, and displaying a second waveform in the target interface; the second waveform is used for representing the audio signal change of the response audio, and the second waveform and the first waveform are waveforms under the same time axis;
in this embodiment, the processing device with the oscillometric function may monitor the audio signal of the response audio and display a second waveform representing the change of the audio signal of the response audio in the target interface. In practical applications, the second waveform may be a waveform located in a different display area along the same time axis as the first waveform. An exemplary target interface is shown in fig. 6, in which the uppermost waveform is a first waveform, the lowermost waveform is a second waveform, the horizontal axes of the first waveform and the second waveform are time axes, and in which the first waveform and the second waveform are waveforms on the same time axis.
For example, the response audio may be played by a test device, which may be a voice interaction device with a voice interaction program installed, such as a smart speaker, an in-vehicle terminal, or a smartphone. The voice interaction program can be any voice assistant with a voice interaction function. After the test device collects the test audio played by the playing device, the voice interaction program in the test device responds to it, generating response audio and broadcasting it to complete the voice interaction on the test audio. The output port corresponding to the audio processor in the control chip of the test device emits the response audio and may serve as the second predetermined output port; the processing device can then be connected to this port through a voltage probe to collect the electrical signal it outputs. An implementation of how the processing device monitors the audio signal of the response audio is described in the system embodiments below and is not repeated here.
It can be understood that, since the electrical signal output by the second predetermined output port has a voltage value change when the response audio is played and does not have a voltage value change when the response audio is not played, the processing device can convert the electrical signal obtained by monitoring into a waveform for display, and therefore, a second waveform representing the audio signal change of the response audio can be displayed in the target interface.
S203, performing key point identification on the first waveform and the second waveform to obtain a first identification result;
in this embodiment, in order to determine the end-to-end voice response delay, the first waveform and the second waveform may be subjected to key point recognition, so as to determine the key point representing that the test audio is played completely from the first waveform, and determine the key point representing that the answer audio starts to be played from the second waveform, as the first recognition result.
Of course, provided the end-to-end voice response delay can still be determined, key point recognition on the first and second waveforms is not limited to the key point indicating that the test audio has finished playing and the key point indicating that the response audio has started playing.
S204, based on the first recognition result, determining the end-to-end voice response delay.
In this embodiment, after the first recognition result is obtained, since the first waveform and the second waveform are waveforms on the same time axis, the end-to-end voice response delay can be determined from the difference, on that time axis, between the key points contained in the first recognition result. For example, if in the first recognition result the time corresponding to the key point indicating that the test audio has finished playing is 20 s, and the time corresponding to the key point indicating that the response audio has started playing is 22 s, then the end-to-end voice response delay is 2 s.
Optionally, in an implementation manner, performing the key point recognition on the first waveform and the second waveform to obtain a first recognition result may include:
identifying a first keypoint in the first waveform and a second keypoint in the second waveform; the first key point is the position point of the last high-level signal corresponding to the first waveform, and the second key point is the position point of the first high-level signal corresponding to the second waveform.
It can be understood that, because the electrical signal output by the output port corresponding to the playing device has a voltage value change when the test audio is played, the position point of the last high-level signal corresponding to the first waveform can be regarded as the position point where the voltage value change ends, i.e. the position point when the test audio is played, and therefore, the position point of the last high-level signal corresponding to the first waveform can be regarded as the first key point. Similarly, the position point of the first high-level signal corresponding to the second waveform can be regarded as the position point where the voltage value starts to change, i.e. the position point where the response audio starts to play, so that the position point of the first high-level signal corresponding to the second waveform can be regarded as the second key point.
The first key point and the second key point may be identified by performing image analysis on the first and second waveforms with a pre-trained neural network model. Alternatively, other identification approaches are possible: for example, sample the first and second waveforms at a preset time interval to obtain position points; among the position points of the first waveform whose voltage value exceeds a preset threshold, take the one latest on the time axis as the first key point; and among the position points of the second waveform whose voltage value exceeds the threshold, take the one earliest on the time axis as the second key point.
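The threshold-based sampling approach just described can be sketched as follows (a minimal illustration; the function name, sample arrays, and threshold value are assumptions, and each index stands for one sampled position point on the shared time axis):

```python
def find_keypoints(first_waveform, second_waveform, threshold):
    """Return (first_keypoint, second_keypoint) as sample indices.

    first_keypoint: the LAST position point of the first waveform whose
    voltage exceeds the threshold (test audio finished playing).
    second_keypoint: the FIRST position point of the second waveform whose
    voltage exceeds the threshold (response audio started playing).
    Either value is None if no sample crosses the threshold.
    """
    first_kp = None
    for i, v in enumerate(first_waveform):
        if abs(v) > threshold:
            first_kp = i  # keeps being overwritten, so the last crossing wins
    second_kp = next(
        (i for i, v in enumerate(second_waveform) if abs(v) > threshold), None
    )
    return first_kp, second_kp

test_wave = [0.0, 0.5, 0.6, 0.0, 0.4, 0.0, 0.0]    # first waveform samples (V)
reply_wave = [0.0, 0.0, 0.3, 0.5, 0.2]             # second waveform samples (V)
print(find_keypoints(test_wave, reply_wave, 0.2))  # -> (4, 2)
```

Since both channels are sampled on the same time axis, multiplying the index difference by the sampling interval yields the first time difference used in steps A1-A2 below.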
Accordingly, in the present implementation, determining the end-to-end voice response delay of the test device based on the first recognition result may include steps A1-A2:
A1, determining the difference between the time on the time axis corresponding to the first key point and the time on the time axis corresponding to the second key point, to obtain a first time difference;
It should be noted that a specific implementation of determining the difference between the times on the time axis corresponding to the first key point and the second key point may be as follows: perform image analysis on the first and second waveforms in the target interface, and compute the first time difference from the difference of the horizontal image coordinates of the two key points together with a preset conversion between image coordinate differences and time differences. Alternatively, after the first and second key points are determined, determine the time point corresponding to each from the correspondence between position points and time points on the time axis, and take the difference as the first time difference. Either approach is reasonable.
A2, calculating the end-to-end voice response delay based on the first time difference.
In this implementation, since the first and second waveforms are waveforms on the same time axis, the difference between the time corresponding to the first key point and the time corresponding to the second key point can be computed as the first time difference. Because the first key point indicates that the test audio has finished playing and the second key point indicates that the response audio has started playing, the first time difference can be used directly as the end-to-end voice response delay. In other implementations, the first time difference may instead be corrected with a predetermined correction coefficient to obtain the end-to-end voice response delay. The predetermined correction coefficient may be determined from the influence of different test environment temperatures, or of different battery levels of the test device, on the response speed of the test device, and may differ between different types of test devices.
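Steps A1-A2 amount to a subtraction followed by an optional correction; a minimal sketch (the function name and coefficient value are assumptions, since the exact form of the correction is left open):

```python
def end_to_end_delay(t_first_keypoint_s, t_second_keypoint_s, correction=1.0):
    """A1: first time difference between the two key points on the shared
    time axis. A2: apply an optional predetermined correction coefficient."""
    first_time_difference = t_second_keypoint_s - t_first_keypoint_s  # A1
    return first_time_difference * correction                         # A2

# The S204 example: test audio ends at 20 s, response audio starts at 22 s.
print(end_to_end_delay(20.0, 22.0))  # -> 2.0
```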
According to the test method provided by the embodiments of the present disclosure, a first waveform is displayed in the target interface in response to audio signal monitoring of the test audio, and a second waveform is displayed in the target interface in response to audio signal monitoring of the response audio. The first waveform characterizes the audio signal variation of the test audio, and the second waveform characterizes the audio signal variation of the response audio. Because the first and second waveforms share the same time axis and both reflect audio signal changes, key point recognition can be performed on them to obtain a first recognition result, and the end-to-end voice response delay can then be determined from that result. Unlike the prior art, the end-to-end voice response delay calculated by this scheme is the audio loopback delay the user actually experiences. The scheme therefore allows the end-to-end voice response delay to be calculated accurately.
Optionally, in another embodiment of the present disclosure, on the basis of the embodiment shown in fig. 2, as shown in fig. 3, the method may further include steps S301 to S303:
S301, in response to monitoring of the voice interaction process for the test audio, displaying a third waveform in the target interface; wherein the third waveform characterizes the signal changes occurring during the voice interaction process, namely when the text recognition result of the test audio is generated and when the response audio is generated, and the third waveform, the first waveform, and the second waveform are waveforms on the same time axis;
In this embodiment, the processing device with the oscillometric function may further monitor a voice interaction process of the test audio, and display a third waveform representing a signal change in the voice interaction process in the target interface. The third waveform may be a waveform of a different presentation area located at the same time axis as the first waveform and the second waveform. An exemplary target interface is shown in fig. 6, in which the uppermost waveform is a first waveform, the lowermost waveform is a second waveform, the middle waveform is a third waveform, and the horizontal axes of the first, second and third waveforms are time axes, and in which the first, second and third waveforms are waveforms in the same time axis.
It is worth mentioning that the voice interaction process for the test audio may include: the test device performs text recognition on the test audio to generate a text recognition result, generates response audio corresponding to the test audio according to that result, and plays the response audio through a headphone jack or built-in speaker. For example, as shown in fig. 1, the test device may collect the test audio through its audio input device and pass it to the voice interaction program, which then produces the corresponding response audio. The processing device may additionally be connected, through a voltage probe, to a predetermined test port of the test device to collect the electrical signal that port outputs. In practice, the predetermined test port may be any free GPIO (General Purpose Input/Output) port of the control chip of the test device. A tester can configure a debugging function on the test device in advance so that the device pulls the port state of the predetermined test port to a high level in response to generating the text recognition result of the test audio, and again in response to generating the response audio; after each adjustment to the high level state, the port state is restored to the low level state.
It can be understood that, after the debugging function of the test device is set, the level is pulled high both when the test device generates the text recognition result of the test audio and when it generates the response audio corresponding to the test audio, so that the processing device can collect the high-level signals output by the predetermined test port. Thus, the processing device may display the third waveform in the target interface in response to monitoring of the voice interaction process for the test audio.
S302, performing key point identification on the third waveform to obtain a second identification result;
the third waveform is used for representing the signal change in the voice interaction process, and the signal change is generated when the text recognition result of the test audio is generated in the voice interaction process and the response audio is generated, so that the position point where the signal change occurs in the third waveform can be recognized as a key point, and a second recognition result is obtained.
S303, determining the voice input delay and/or the voice output delay based on the second recognition result and the first recognition result.
It can be understood that the first recognition result includes the key point representing that the test audio has finished playing and the key point representing that the response audio has started playing, while the second recognition result includes the key point representing that the text recognition result of the test audio has been generated and the key point representing that the response audio has been generated. The voice input delay is the time from the collection of the test audio to the generation of the text recognition result of the test audio for the voice interaction program to use, and the voice output delay is the time elapsed between the generation of the response audio and its playback through the headphone jack or the built-in speaker. Thus, the voice input delay and/or the voice output delay can be determined according to the times corresponding to the key points in the second recognition result and the first recognition result.
Optionally, in an implementation manner, performing the key point recognition on the third waveform to obtain the second recognition result may include:
identifying a third keypoint and a fourth keypoint in the third waveform; the third key point is a position point of a first high-level signal corresponding to the third waveform, and the fourth key point is a position point of a last high-level signal corresponding to the third waveform.
In practical applications, a position point of the third waveform whose corresponding voltage value exceeds a preset threshold may be identified as a key point, and the position point of the first high-level signal and the position point of the last high-level signal are obtained as the third key point and the fourth key point, respectively. In the voice interaction process, the response audio corresponding to the test audio is generated after the text recognition result of the test audio is generated; therefore, the third key point represents that the text recognition result of the test audio has been generated, and the fourth key point represents that the response audio corresponding to the test audio has been generated.
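The extraction of the third and fourth key points can be sketched as follows; the helper name, threshold, and sample values are illustrative assumptions:

```python
def high_level_keypoints(samples, threshold=1.5):
    """Return the (first, last) indices whose voltage exceeds
    `threshold`: the third and fourth key points of the third
    waveform. Returns None when no high-level signal is present."""
    highs = [i for i, v in enumerate(samples) if v > threshold]
    return (highs[0], highs[-1]) if highs else None

# Third waveform: first pulse = recognition done, last pulse = response generated.
third_wave = [0.0, 3.3, 3.3, 0.0, 0.0, 3.3, 0.0]
keypoints = high_level_keypoints(third_wave)  # -> (1, 5)
```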
Accordingly, in the present implementation, the determination method of the voice input delay may include steps B1-B2:
B1, under the condition that the first identification result comprises a first key point, determining a difference value between the time on a time axis corresponding to the third key point and the time on the time axis corresponding to the first key point to obtain a second time difference;
it should be noted that, the manner of determining the second time difference may be similar to the manner of determining the first time difference in the above step A1, which is not described herein.
B2, determining the voice input delay based on the second time difference.
Since the first waveform and the third waveform are waveforms in the same time axis, when the first recognition result includes the first key point, the difference between the time on the time axis corresponding to the third key point and the time on the time axis corresponding to the first key point can be calculated as the second time difference. Because the third key point represents that the text recognition result of the test audio has been generated, the first key point represents that the test audio has finished playing, and the voice input delay is the time from the collection of the test audio to the generation of the text recognition result of the test audio, the second time difference can be used as the voice input delay.
In other implementations, the second time difference may be corrected using a predetermined correction coefficient to obtain the voice input delay. The predetermined correction coefficient may be determined based on the influence of different test environment temperatures or different battery levels of the test device on its voice input speed.
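Applying such a correction coefficient could look like the sketch below. Note that the linear coefficient model is a made-up illustration: the disclosure only states that the coefficient is predetermined from factors such as ambient temperature and battery level, not how it is computed.

```python
def corrected_input_delay(second_time_diff, temperature_c=25.0, battery_pct=100.0):
    """Correct the measured second time difference with a predetermined
    coefficient. The linear model below (1.0 at 25 degrees C and a full
    battery) is purely illustrative."""
    coefficient = (1.0
                   + 0.001 * (25.0 - temperature_c)   # colder -> slower input
                   + 0.0005 * (100.0 - battery_pct))  # lower battery -> slower input
    return coefficient * second_time_diff
```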
Accordingly, in the present implementation, the determination method of the voice output delay may include steps C1-C2:
C1, when the first recognition result includes the second key point, determining a difference between the time on the time axis corresponding to the second key point and the time on the time axis corresponding to the fourth key point to obtain a third time difference;
it should be noted that, the manner of determining the third time difference may be similar to the manner of determining the first time difference in the above step A1, which is not described herein.
C2, determining a speech output delay based on the third time difference.
Since the second waveform and the third waveform are waveforms in the same time axis, when the first recognition result includes the second key point, the difference between the time on the time axis corresponding to the second key point and the time on the time axis corresponding to the fourth key point can be calculated as the third time difference. Because the fourth key point represents that the response audio corresponding to the test audio has been generated, the second key point represents that the response audio has started playing, and the voice output delay is the time elapsed between the generation of the response audio and its playback through the earphone jack or the built-in speaker, the third time difference can be used as the voice output delay.
In other implementations, the third time difference may be corrected using a predetermined correction coefficient to obtain the voice output delay. The predetermined correction coefficient may be determined based on the influence of different test environment temperatures or different battery levels of the test device on its voice output speed.
In addition, it should be noted that, based on the relationship among the first, second, third and fourth key points, other elapsed times may also be calculated, for example the time from the test audio finishing playing to the generation of the corresponding response audio, or the time from the generation of the text recognition result to the start of playing the response audio. Therefore, in practical applications, the time consumed by different stages of the voice interaction process can be counted according to service requirements.
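The per-stage timing arithmetic described above can be collected in one helper; the function name, stage labels, and key-point times below are illustrative assumptions:

```python
def stage_delays(t1, t2, t3, t4):
    """Per-stage durations of the voice interaction, given key-point
    times on the shared time axis:
      t1: test audio finished playing   (first key point)
      t2: text recognition completed    (third key point)
      t3: response audio generated      (fourth key point)
      t4: response playback started     (second key point)"""
    return {
        "voice_input_delay": t2 - t1,   # second time difference
        "generation_time": t3 - t2,     # recognition result -> response ready
        "voice_output_delay": t4 - t3,  # third time difference
        "end_to_end_delay": t4 - t1,    # first time difference
    }

# Illustrative times in seconds, not measured data.
delays = stage_delays(1.0, 1.5, 2.25, 2.75)
```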
Therefore, according to the scheme, on the basis of accurately counting the end-to-end voice response delay, the voice input delay and/or the voice output delay can be counted, so that the time consumption of each step of voice interaction of the test equipment can be obtained.
Fig. 4 is a schematic structural diagram of a test system according to an embodiment of the disclosure, as shown in fig. 4, the test system may include: test device 410, play device 420, and processing device 430 with oscillometric functionality;
The test device 410 is a device with a voice interaction program, a first probe of the processing device 430 is connected to a first predetermined output port of a control chip of the playing device 420, and a second probe of the processing device 430 is connected to a second predetermined output port of the control chip of the test device 410;
the playing device 420 is configured to send out test audio by calling the first predetermined output port;
the test device 410 is configured to perform voice interaction with respect to the test audio, and play a response audio with respect to the test audio by calling the second predetermined output port;
the processing device 430, responsive to the audio signal monitoring for the test audio, displays a first waveform in the target interface; displaying a second waveform in the target interface in response to the audio signal monitoring for reply audio; an end-to-end voice response delay is determined based on the keypoints in the first waveform and the keypoints in the second waveform.
In this embodiment, the first predetermined output port is the output port corresponding to the audio processor in the control chip of the playing device, and the second predetermined output port is the output port corresponding to the audio processor in the control chip of the test device. Audio can be output by calling the first predetermined output port or the second predetermined output port. For example, the playing device may call the first predetermined output port to output test audio with the content "today's weather" to its speaker for playback. The test device then collects the test audio through its microphone and passes it to the voice interaction program, which performs response processing on the test audio to obtain response audio with the content "today's weather is clear". The test device then calls the second predetermined output port to output the response audio to its speaker for playback.
The control chip may be, for example, an SoC (system-on-a-chip). An SoC is an integrated circuit that integrates multiple functional modules and components onto one chip; a complete SoC typically includes a CPU (central processing unit), a GPU (graphics processing unit), a memory controller, input/output interfaces, a communication module, an audio processor, and the like. In this case, the first predetermined output port and the second predetermined output port may be the output ports corresponding to the output pins of the audio processor among the pins of the SoC. It should be noted that the control chip may also be another type of system chip that integrates at least an audio processor, with the first and second predetermined output ports being the output ports corresponding to the output pins of the audio processor in the control chip.
It can be understood that, by connecting a probe of the processing device with an output pin, connection of the probe with an output port corresponding to the output pin is achieved, so that the processing device is connected with the output port through the voltage probe, and thus, an electric signal output by the output port can be collected. Thus, by connecting the first probe of the processing device with the first predetermined output port and the second probe with the second predetermined output port, the processing device can monitor the port state of the first predetermined output port through the first probe to obtain a first electrical signal, and monitor the port state of the second predetermined output port through the second probe to obtain a second electrical signal.
It can be understood that when the first predetermined output port is called to play the test audio, that is, when the first predetermined output port outputs the test audio to the speaker, the voltage at the first predetermined output port changes; when the test audio is not being played, the voltage does not change. The processing device obtains the first waveform and the second waveform by converting the collected first and second electrical signals into visualized waveform patterns, and displays them in the target interface. It should be noted that, for the functions implemented by each device in the test system, reference may be made to the related descriptions of the test method embodiments above, which are not repeated here.
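Under the assumption that the processing device samples both electrical signals on a shared time axis, the end-to-end delay determination can be sketched as activity detection over the two waveforms. The function name, activity threshold, and sample data are illustrative assumptions:

```python
def end_to_end_delay(first_wave, second_wave, sample_rate, threshold=0.1):
    """End-to-end voice response delay: elapsed time between the last
    active sample of the first waveform (test audio finished playing)
    and the first active sample of the second waveform (response
    playback began). Both waveforms share the same time axis."""
    last_active = max(i for i, v in enumerate(first_wave) if abs(v) > threshold)
    first_active = min(i for i, v in enumerate(second_wave) if abs(v) > threshold)
    return (first_active - last_active) / sample_rate

# Illustrative audio-signal samples at 10 samples per second.
first = [0.5, -0.6, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0]
second = [0.0, 0.0, 0.0, 0.0, 0.0, 0.5, -0.7, 0.2]
delay = end_to_end_delay(first, second, sample_rate=10)  # (5 - 2) / 10 = 0.3
```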
Optionally, in another test system embodiment provided by the present disclosure, on the basis of the embodiment shown in fig. 4, a third probe of the processing device is connected to a predetermined test port of the control chip of the test device;
the test device is further configured to, in the process of performing voice interaction on the test audio, set the port state of the predetermined test port to the high-level state in response to generating the text recognition result of the test audio and in response to generating the response audio, and to restore the port state of the predetermined test port after each high-level state;
The processing device is further configured to:
displaying a third waveform in the target interface in response to monitoring of the voice interaction process for the test audio; a speech input delay and/or a speech output delay is determined based on the keypoints in the third waveform and the keypoints in the first waveform and the second waveform.
In this embodiment, the third probe of the processing device is connected to the predetermined test port, so that the processing device can monitor the port state of the predetermined test port through the third probe to obtain a third electrical signal, which it converts into a visualized waveform pattern to obtain the third waveform. The predetermined test port may be any spare GPIO (general-purpose input/output) port of the control chip. A tester can set the debugging function of the test device in advance, so that the test device sets the port state of the predetermined test port to the high-level state in response to generating the text recognition result of the test audio, and again in response to generating the response audio corresponding to the test audio; after each adjustment to the high-level state, the port state is restored to the low-level state. It can be understood that, after the debugging function of the test device is set, the level is pulled high both when the test device generates the text recognition result of the test audio and when it generates the response audio corresponding to the test audio.
For example, in practical applications, the operation of setting the port state of the predetermined test port to the high-level state may be performed by triggering a predetermined program instruction. For example, if the predetermined test port is the port corresponding to the GPIO434 pin, the program instruction may be: adb shell "echo 1 > /sys/class/gpio/gpio434/value".
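The debug hook can be sketched in Python as writes to the same sysfs GPIO value file that the documented adb command targets. The helper below is illustrative: on a real device the path would be /sys/class/gpio/gpio434/value and the pin would need to be exported beforehand, whereas the demonstration here uses a temporary directory standing in for sysfs.

```python
import os
import tempfile

def pulse_gpio(gpio_root, pin):
    """Write "1" then "0" to a GPIO value file, mimicking
    `echo 1 > /sys/class/gpio/gpio434/value` followed by the
    restore to the low-level state."""
    value_path = os.path.join(gpio_root, f"gpio{pin}", "value")
    with open(value_path, "w") as f:
        f.write("1")  # pull the level high: a key point appears on the third waveform
    with open(value_path, "w") as f:
        f.write("0")  # restore the low-level state

# Demonstrate against a temporary directory standing in for /sys/class/gpio.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "gpio434"))
pulse_gpio(root, 434)
with open(os.path.join(root, "gpio434", "value")) as f:
    final_state = f.read()  # -> "0"
```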
It should be noted that, regarding the functions implemented by each device included in the test system, reference may be made to the related descriptions of the embodiments of the test method described above, which are not repeated herein.
For a better understanding of the present solution, the test system provided by the embodiments of the present disclosure will be described below with reference to a specific example.
In order to solve the problem that it is difficult to measure the time consumed by each step of the voice interaction process, this example provides a test system including a playing device that plays the test audio, a test device with a voice interaction program installed, and an oscilloscope (corresponding to the processing device with the oscillometric function above). The relevant settings need to be made before testing:
(1) As shown in fig. 5, use oscilloscope probe A (corresponding to the first probe above) to connect to an output pin (corresponding to the first predetermined output port above) of the SoC chip of the playing device;
(2) Use oscilloscope probe B (corresponding to the second probe above) to connect to an output pin (corresponding to the second predetermined output port above) of the SoC chip of the test device;
(3) Use oscilloscope probe C (corresponding to the third probe above) to connect to the GPIO434 pin of the test device (corresponding to the predetermined test port above; adjustable according to the GPIO pins actually available);
(4) Configure the test device so that, when it generates the text recognition result of the test audio and when it generates the response audio corresponding to the test audio, it triggers a predetermined program instruction to pull the GPIO434 level high. The predetermined program instruction is: adb shell "echo 1 > /sys/class/gpio/gpio434/value".
The test steps are as follows:
(1) Use the playing device to play the test audio, whose content is "XX, open the window", where "XX" is the wake-up word of the voice interaction program in the test device.
(2) The oscilloscope probe A monitors the electrical signal at the output pin of the SoC chip of the playing device (corresponding to the first electrical signal above), and a waveform (corresponding to the first waveform above) is presented on the display interface of the oscilloscope (corresponding to the target interface above). As shown in fig. 6, the uppermost waveform in the figure is the first waveform; the time corresponding to the position point of its last high-level signal (corresponding to the first key point above), which indicates that "open the window" has finished playing, is denoted as T1.
(3) The microphone of the test device receives the test audio; after the voice interaction program is successfully awakened, text recognition is performed, and response audio corresponding to the test audio is generated based on the text recognition result.
(4) When the voice interaction program receives the text recognition result obtained by the recognition engine recognizing the test audio, and when it generates the response audio corresponding to the test audio, the predetermined program instruction is triggered to pull the level of the GPIO434 pin high; after being pulled high, the level is restored to the low-level state. The oscilloscope probe C monitors the electrical signal at the GPIO434 pin of the test device, and a waveform (corresponding to the third waveform above) is shown on the oscilloscope. As shown in fig. 6, the waveform in the middle position in the figure is the third waveform; the time corresponding to the position point of its first high-level signal (corresponding to the third key point above) is denoted as T2, and the time corresponding to the position point of its last high-level signal (corresponding to the fourth key point above) is denoted as T3.
(5) The test device starts to play the response audio corresponding to the test audio. The oscilloscope probe B monitors the electrical signal at the output pin of the SoC chip of the test device (corresponding to the second electrical signal above), and a waveform (corresponding to the second waveform above) is displayed on the oscilloscope. As shown in fig. 6, the lowest waveform in the figure is the second waveform, and the time corresponding to the position point of its first high-level signal (corresponding to the second key point above) is denoted as T4.
(6) Pause the oscilloscope and analyze the time consumed by each stage.
The time from finishing saying "open the window" to obtaining the text recognition result is T2-T1 (corresponding to the voice input delay above); the time from finishing saying "open the window" to the response audio starting to play is T4-T1 (corresponding to the end-to-end voice response delay above); and the time from generating the response audio to it starting to play is T4-T3.
Therefore, this scheme can measure the time from when the user finishes speaking to when the text recognition result is generated, and the time from when the user finishes speaking to when the device responds, both of which are otherwise difficult to obtain. Analyzing the time consumed by each step of the test device's voice response helps to improve the voice response speed subsequently, thereby improving the user's product experience.
Corresponding to the foregoing test method embodiments, an embodiment of the present disclosure further provides a processing device which, as shown in fig. 7, may include:
an oscillometric display module 710, configured to display a first waveform in a target interface in response to audio signal monitoring for test audio emitted by a playing device, and display a second waveform in the target interface in response to audio signal monitoring for response audio emitted by a test device; the first waveform represents the audio signal change of the test audio, the response audio is the response result of the voice interaction program in the test device performing voice interaction on the test audio, the second waveform represents the audio signal change of the response audio, and the second waveform and the first waveform are waveforms in the same time axis;
The data processing module 720 is configured to identify key points of the first waveform and the second waveform, so as to obtain a first identification result; and determining an end-to-end voice response delay based on the first recognition result.
Optionally, the data processing module performs key point recognition on the first waveform and the second waveform to obtain a first recognition result, including:
identifying a first keypoint in the first waveform and a second keypoint in the second waveform;
the first key point is the position point of the last high-level signal corresponding to the first waveform, and the second key point is the position point of the first high-level signal corresponding to the second waveform.
Optionally, the data processing module determines an end-to-end voice response delay based on the first recognition result, including:
determining a difference value between the time on the time axis corresponding to the first key point and the time on the time axis corresponding to the second key point to obtain a first time difference;
based on the first time difference, an end-to-end voice response delay is calculated.
Optionally, the oscillometric display module is further configured to display a third waveform in the target interface in response to monitoring of the test device's voice interaction process for the test audio; the third waveform represents the signal change in the voice interaction process, the signal changing when the text recognition result of the test audio is generated and when the response audio is generated, and the third waveform, the first waveform and the second waveform are waveforms in the same time axis;
The data processing module is further configured to:
performing key point identification on the third waveform to obtain a second identification result;
a speech input delay and/or a speech output delay is determined based on the second recognition result and the first recognition result.
Optionally, the data processing module performs key point recognition on the third waveform to obtain a second recognition result, including:
identifying a third keypoint and a fourth keypoint in the third waveform;
the third key point is a position point of a first high-level signal corresponding to the third waveform, and the fourth key point is a position point of a last high-level signal corresponding to the third waveform.
Optionally, the determining manner of the voice input delay by the data processing module includes:
under the condition that the first recognition result comprises the first key point, determining a difference value between the time on the time axis corresponding to the third key point and the time on the time axis corresponding to the first key point to obtain a second time difference;
based on the second time difference, a voice input delay is determined.
Optionally, the determining manner of the voice output delay by the data processing module includes:
under the condition that the first recognition result includes the second key point, determining a difference between the time on the time axis corresponding to the second key point and the time on the time axis corresponding to the fourth key point to obtain a third time difference;
based on the third time difference, a speech output delay is determined.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of the user's personal information all comply with the provisions of relevant laws and regulations, and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
An electronic device provided by the present disclosure may include:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of any one of the test methods described above.
The present disclosure provides a computer readable storage medium having a computer program stored therein, which when executed by a processor, implements the steps of any of the test methods described above.
In yet another embodiment provided by the present disclosure, there is also provided a computer program product containing instructions that, when run on a computer, cause the computer to perform the steps of any of the test methods of the above embodiments.
Fig. 8 illustrates a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the apparatus 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the various methods and processes described above, such as test methods. For example, in some embodiments, the test method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 800 via ROM 802 and/or communication unit 809. When a computer program is loaded into RAM 803 and executed by computing unit 801, one or more steps of the test method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the test method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above can be implemented in digital electronic circuitry, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that the various flows shown above may have their steps reordered, added, or deleted. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (19)

1. A method of testing, comprising:
displaying a first waveform in a target interface in response to audio signal monitoring for test audio; wherein the first waveform is used to characterize an audio signal variation of the test audio;
displaying a second waveform in the target interface in response to the audio signal monitoring for response audio; wherein the second waveform is used to characterize the audio signal change of the response audio, and the second waveform and the first waveform are waveforms on the same time axis;
performing key point recognition on the first waveform and the second waveform to obtain a first recognition result;
and determining the end-to-end voice response delay based on the first recognition result.
2. The method of claim 1, wherein the performing key point recognition on the first waveform and the second waveform to obtain the first recognition result comprises:
identifying a first key point in the first waveform and a second key point in the second waveform;
wherein the first key point is a position point of the last high-level signal corresponding to the first waveform, and the second key point is a position point of the first high-level signal corresponding to the second waveform.
3. The method of claim 2, wherein the determining an end-to-end voice response delay of the test device based on the first recognition result comprises:
determining a difference value between the time on the time axis corresponding to the first key point and the time on the time axis corresponding to the second key point to obtain a first time difference;
based on the first time difference, an end-to-end voice response delay is calculated.
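To illustrate the timing arithmetic of claims 1-3, the following is a minimal Python sketch, not the patented implementation: the sample rate, the synthetic rectangular traces, and the 0.5 amplitude threshold are all invented for illustration, and "high-level signal" detection is reduced to a simple threshold on two digitized waveforms sharing one time axis.

```python
def key_point_times(samples, sample_rate, threshold=0.5):
    """Return (first, last) timestamps (in seconds) of samples above threshold."""
    highs = [i for i, s in enumerate(samples) if s > threshold]
    if not highs:
        raise ValueError("no high-level signal in waveform")
    return highs[0] / sample_rate, highs[-1] / sample_rate

# Hypothetical digitized captures on one shared time axis (1 kHz, 3 s).
sample_rate = 1000.0
n = 3000
test_audio = [1.0 if i / sample_rate < 0.5 else 0.0 for i in range(n)]    # first waveform: test audio ends at 0.5 s
reply_audio = [1.0 if i / sample_rate >= 1.2 else 0.0 for i in range(n)]  # second waveform: response starts at 1.2 s

_, t1 = key_point_times(test_audio, sample_rate)   # first key point: last high level of the first waveform
t2, _ = key_point_times(reply_audio, sample_rate)  # second key point: first high level of the second waveform
end_to_end_delay = t2 - t1                         # first time difference
print(f"end-to-end voice response delay: {end_to_end_delay:.3f} s")
```

Because both waveforms were captured on the same time axis, the subtraction needs no clock alignment; the accuracy of the measured delay is bounded by the capture's sampling interval.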
4. A method according to any one of claims 1-3, further comprising:
displaying a third waveform in the target interface in response to monitoring of the voice interaction process for the test audio; wherein the third waveform is used to characterize the signal change, during the voice interaction process, between generating the text recognition result of the test audio and generating the response audio, and the third waveform, the first waveform and the second waveform are waveforms on the same time axis;
performing key point recognition on the third waveform to obtain a second recognition result;
and determining a voice input delay and/or a voice output delay based on the second recognition result and the first recognition result.
5. The method of claim 4, wherein the performing key point recognition on the third waveform to obtain the second recognition result comprises:
identifying a third keypoint and a fourth keypoint in the third waveform;
the third key point is a position point of a first high-level signal corresponding to the third waveform, and the fourth key point is a position point of a last high-level signal corresponding to the third waveform.
6. The method of claim 5, wherein the manner of determining the voice input delay comprises:
under the condition that the first recognition result comprises a first key point, determining a difference value between the time on the time axis corresponding to the third key point and the time on the time axis corresponding to the first key point to obtain a second time difference;
based on the second time difference, a voice input delay is determined.
7. The method of claim 5, wherein the manner of determining the voice output delay comprises:
under the condition that the first recognition result comprises the second key point, determining a difference value between the time on the time axis corresponding to the second key point and the time on the time axis corresponding to the fourth key point to obtain a third time difference;
based on the third time difference, the voice output delay is determined.
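With the third waveform of claims 4-7 added, the same key point extraction splits the end-to-end delay into input and output components. Again a hedged sketch under assumptions: the port-level trace below and all timings are invented, and the device is assumed to drive the monitored line high from "text recognition result generated" until "response audio generated".

```python
def key_point_times(samples, sample_rate, threshold=0.5):
    """Return (first, last) timestamps (in seconds) of samples above threshold."""
    highs = [i for i, s in enumerate(samples) if s > threshold]
    return highs[0] / sample_rate, highs[-1] / sample_rate

# Illustrative traces on one shared time axis (1 kHz, 3 s):
sample_rate = 1000.0
n = 3000
test_audio = [1.0 if i / sample_rate < 0.5 else 0.0 for i in range(n)]          # first waveform
port_level = [1.0 if 0.8 <= i / sample_rate < 1.0 else 0.0 for i in range(n)]   # third waveform
reply_audio = [1.0 if i / sample_rate >= 1.2 else 0.0 for i in range(n)]        # second waveform

_, t1 = key_point_times(test_audio, sample_rate)   # first key point
t2, _ = key_point_times(reply_audio, sample_rate)  # second key point
t3, t4 = key_point_times(port_level, sample_rate)  # third and fourth key points

voice_input_delay = t3 - t1    # second time difference: end of test audio -> text recognized
voice_output_delay = t2 - t4   # third time difference: response generated -> response audio out
```

Note that `voice_input_delay + (t4 - t3) + voice_output_delay` sums to the end-to-end delay `t2 - t1`, which offers a sanity check when the three waveforms share one time axis.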
8. A test system, comprising:
test equipment, playing equipment and processing equipment with oscillography function;
the test equipment is equipment provided with a voice interaction program, a first probe of the processing equipment is connected with a first preset output port of a control chip of the playing equipment, and a second probe of the processing equipment is connected with a second preset output port of the control chip of the test equipment;
the playing device is used for sending out test audio by calling the first preset output port;
the test equipment is used for carrying out voice interaction on the test audio and playing the response audio of the test audio by calling the second preset output port;
the processing device is configured to: display a first waveform in a target interface in response to the audio signal monitoring for the test audio; display a second waveform in the target interface in response to the audio signal monitoring for the response audio; and determine an end-to-end voice response delay based on the key points in the first waveform and the key points in the second waveform.
9. The system of claim 8, wherein the third probe of the processing device is connected to a predetermined test port of the control chip of the test device;
the test equipment is further configured to, in the process of performing voice interaction on the test audio, set the port state of the preset test port to a high level in response to generating the text recognition result of the test audio, and restore the port state of the preset test port after the response audio is generated;
the processing device is further configured to:
displaying a third waveform in the target interface in response to monitoring of the voice interaction process for the test audio; a speech input delay and/or a speech output delay is determined based on the keypoints in the third waveform and the keypoints in the first waveform and the second waveform.
10. A processing apparatus, comprising:
the oscillography display module is configured to display a first waveform in a target interface in response to the audio signal monitoring of the test audio emitted by the playing device, and to display a second waveform in the target interface in response to the audio signal monitoring of the response audio emitted by the test device; wherein the first waveform is used to characterize the audio signal change of the test audio; the response audio is a response result of the voice interaction program in the test device performing voice interaction on the test audio; the second waveform is used to characterize the audio signal change of the response audio; and the second waveform and the first waveform are waveforms on the same time axis;
the data processing module is configured to perform key point recognition on the first waveform and the second waveform to obtain a first recognition result, and to determine an end-to-end voice response delay based on the first recognition result.
11. The processing device of claim 10, wherein the data processing module performing key point recognition on the first waveform and the second waveform to obtain the first recognition result comprises:
identifying a first keypoint in the first waveform and a second keypoint in the second waveform;
the first key point is the position point of the last high-level signal corresponding to the first waveform, and the second key point is the position point of the first high-level signal corresponding to the second waveform.
12. The processing device of claim 11, wherein the data processing module to determine an end-to-end voice response delay based on the first recognition result comprises:
determining a difference value between the time on the time axis corresponding to the first key point and the time on the time axis corresponding to the second key point to obtain a first time difference;
based on the first time difference, an end-to-end voice response delay is calculated.
13. The processing device of any of claims 10-12, wherein the oscillography display module is further configured to display a third waveform in the target interface in response to monitoring of the test device for the voice interaction process of the test audio; wherein the third waveform is used to characterize the signal change, during the voice interaction process, between generating the text recognition result of the test audio and generating the response audio, and the third waveform, the first waveform and the second waveform are waveforms on the same time axis;
the data processing module is further configured to:
performing key point recognition on the third waveform to obtain a second recognition result;
determining a voice input delay and/or a voice output delay based on the second recognition result and the first recognition result.
14. The processing device of claim 13, wherein the data processing module performing key point recognition on the third waveform to obtain the second recognition result comprises:
identifying a third keypoint and a fourth keypoint in the third waveform;
the third key point is a position point of a first high-level signal corresponding to the third waveform, and the fourth key point is a position point of a last high-level signal corresponding to the third waveform.
15. The processing device of claim 14, wherein the manner in which the data processing module determines the voice input delay comprises:
under the condition that the first recognition result comprises the first key point, determining a difference value between the time on the time axis corresponding to the third key point and the time on the time axis corresponding to the first key point to obtain a second time difference;
based on the second time difference, a voice input delay is determined.
16. The processing device of claim 14, wherein the manner in which the data processing module determines the voice output delay comprises:
under the condition that the first recognition result comprises the second key point, determining a difference value between the time on the time axis corresponding to the second key point and the time on the time axis corresponding to the fourth key point to obtain a third time difference;
based on the third time difference, the voice output delay is determined.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
18. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-7.
19. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-7.
CN202311110143.3A 2023-08-30 2023-08-30 Test method, test system, test processing device, test electronic device and test storage medium Pending CN117116297A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311110143.3A CN117116297A (en) 2023-08-30 2023-08-30 Test method, test system, test processing device, test electronic device and test storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311110143.3A CN117116297A (en) 2023-08-30 2023-08-30 Test method, test system, test processing device, test electronic device and test storage medium

Publications (1)

Publication Number Publication Date
CN117116297A true CN117116297A (en) 2023-11-24

Family

ID=88799776

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311110143.3A Pending CN117116297A (en) 2023-08-30 2023-08-30 Test method, test system, test processing device, test electronic device and test storage medium

Country Status (1)

Country Link
CN (1) CN117116297A (en)

Similar Documents

Publication Publication Date Title
EP3865996A2 (en) Method and apparatus for testing response speed of on-board equipment, device and storage medium
CN113470618A (en) Wake-up test method and device, electronic equipment and readable storage medium
CN112712821A (en) Voice test method and device based on simulation, electronic equipment and storage medium
CN112669867A (en) Debugging method and device of noise elimination algorithm and electronic equipment
CN113380229B (en) Voice response speed determining method, related device and computer program product
CN114185782A (en) Interactive testing method and device for virtual object and electronic equipment
WO2022142048A1 (en) Wake-up index monitoring method and apparatus, and electronic device
CN108900959B (en) Method, device, equipment and computer readable medium for testing voice interaction equipment
CN113157240A (en) Voice processing method, device, equipment, storage medium and computer program product
CN112509567B (en) Method, apparatus, device, storage medium and program product for processing voice data
CN117116297A (en) Test method, test system, test processing device, test electronic device and test storage medium
US20230005490A1 (en) Packet loss recovery method for audio data packet, electronic device and storage medium
CN115481594B (en) Scoreboard implementation method, scoreboard, electronic equipment and storage medium
CN115985294A (en) Sound event detection method, device, equipment and storage medium
CN112786047B (en) Voice processing method, device, equipment, storage medium and intelligent sound box
CN111354365B (en) Pure voice data sampling rate identification method, device and system
CN114333017A (en) Dynamic pickup method and device, electronic equipment and storage medium
CN113593553B (en) Voice recognition method, voice recognition apparatus, voice management server, and storage medium
US20230178100A1 (en) Tail point detection method, electronic device, and non-transitory computer-readable storage medium
CN113593619B (en) Method, apparatus, device and medium for recording audio
US20230130399A1 (en) Wakeup Indicator Monitoring Method, Apparatus and Electronic Device
CN113674755B (en) Voice processing method, device, electronic equipment and medium
CN113643696B (en) Voice processing method, device, equipment, storage medium and program
CN114360535B (en) Voice conversation generation method and device, electronic equipment and storage medium
CN117421235A (en) Method and device for generating unit test cases, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination