WO2021232710A1 - Test method and apparatus for a full-duplex voice interaction system - Google Patents

Test method and apparatus for a full-duplex voice interaction system

Info

Publication number
WO2021232710A1
WO2021232710A1 (PCT/CN2020/129352, CN2020129352W)
Authority
WO
WIPO (PCT)
Prior art keywords
corpus
log
audio
voice interaction
full
Prior art date
Application number
PCT/CN2020/129352
Other languages
English (en)
French (fr)
Inventor
石韡斯
樊帅
宋洪博
Original Assignee
思必驰科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 思必驰科技股份有限公司 filed Critical 思必驰科技股份有限公司
Priority to JP2022569085A priority Critical patent/JP2023526285A/ja
Priority to EP20936925.5A priority patent/EP4156175A4/en
Publication of WO2021232710A1 publication Critical patent/WO2021232710A1/zh
Priority to US17/990,149 priority patent/US20230077478A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/01Assessment or evaluation of speech recognition systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Preventing errors by testing or debugging software
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L5/00Arrangements affording multiple use of the transmission path
    • H04L5/14Two-way operation using the same type of signal, i.e. duplex
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M3/00Automatic or semi-automatic exchanges
    • H04M3/22Arrangements for supervision, monitoring or testing
    • H04M3/24Arrangements for supervision, monitoring or testing with provision for checking the normal operation
    • H04M3/241Arrangements for supervision, monitoring or testing with provision for checking the normal operation for stored program controlled exchanges
    • H04M3/242Software testing
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L5/00Arrangements affording multiple use of the transmission path
    • H04L5/0001Arrangements for dividing the transmission path
    • H04L5/0014Three-dimensional division
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M3/00Automatic or semi-automatic exchanges
    • H04M3/42Systems providing special services or facilities to subscribers
    • H04M3/487Arrangements for providing information services, e.g. recorded voice services or time announcements
    • H04M3/493Interactive information services, e.g. directory enquiries ; Arrangements therefor, e.g. interactive voice response [IVR] systems or voice portals
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M3/00Automatic or semi-automatic exchanges
    • H04M3/42Systems providing special services or facilities to subscribers
    • H04M3/50Centralised arrangements for answering calls; Centralised arrangements for recording messages for absent or busy subscribers ; Centralised arrangements for recording messages
    • H04M3/527Centralised call answering arrangements not requiring operator intervention
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present invention relates to the field of intelligent voice, and in particular to a testing method and device for a full-duplex voice interactive system.
  • each module on the voice interaction link can be tested independently. For example, through the wake-up/signal processing module, test the wake-up rate, wake-up time, power consumption, etc.; through the speech recognition module, test the sentence error rate and word error rate; through the semantic understanding module, test the accuracy, recall, and parsing accuracy; and score the speech synthesis module based on the subjective evaluation of multiple listeners.
  • Most existing voice interaction systems use half-duplex interaction. In half-duplex interaction, each module has an absolutely orderly dependency relationship, and the entire system can complete the interaction by calling each module serially. In this case, independent testing by module can meet the requirements.
  • because modules are tested independently, the test indicators differ between them, and there is a lack of testing methods and evaluation indicators for the voice interaction system as a whole.
  • in a complex system such as a full-duplex system, where multiple modules make fused decisions, the per-module indicators can no longer meet the needs of evaluation. For example, in a half-duplex dialogue, the half-duplex voice interaction system responds every time the user speaks a sentence, whereas the full-duplex voice interaction system responds only to valid requests.
  • an embodiment of the present invention provides a test method for a full-duplex voice interaction system, including:
  • the first log records the valid/invalid attributes identified for each corpus audio and the corresponding corpus text
  • the second log records the decision result for each corpus audio, and the decision includes response and discard;
  • the false response rate is determined based on the number of false responses and the total number of corpus audio played.
  • an embodiment of the present invention provides a test device for a full-duplex voice interaction system, including:
  • the corpus determination program module is used to mix the valid corpus related to the test scene with the invalid corpus not related to the test scene to determine the scene mixed corpus;
  • the test program module is used to play each corpus audio in the scene mixed corpus to the voice interaction device to be tested equipped with the full-duplex voice interaction system;
  • the log acquisition program module is used to acquire the work log of the voice interaction device to be tested, the work log includes at least a first log and a second log, wherein,
  • the first log records the valid/invalid attributes identified for each corpus audio and the corresponding corpus text
  • the second log records the decision result for each corpus audio, and the decision includes response and discard;
  • Rejection rate determination program module configured to determine the rejection rate based on the number of invalid corpus audio in the first log and the number of discarded results in the second log;
  • the false response rate determination program module is used to obtain the number of false responses by counting the number of log entries in the second log for which no response decision result was expected but a response decision result was actually received, and to determine the false response rate based on the number of false responses and the total number of corpus audio played.
  • an electronic device, which includes: at least one processor, and a memory communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the steps of the test method for a full-duplex voice interaction system according to any embodiment of the present invention.
  • an embodiment of the present invention provides a storage medium on which a computer program is stored, wherein, when the program is executed by a processor, the steps of the test method for a full-duplex voice interaction system in any embodiment of the present invention are implemented.
  • the beneficial effect of the embodiments of the present invention is to realize an end-to-end test of the full-duplex voice interaction system and accurately obtain indicators for the characteristics unique to full-duplex interaction.
  • the features of the full-duplex interactive system are fully covered, the online interaction effect is improved, and the necessary data reproduction and indicator support are provided.
  • automated testing reduces labor costs and improves testing efficiency; it shortens the optimization cycle of the voice interaction system and reduces the cost of trial and error.
  • previously, the effect of the interaction could only be evaluated after product sales, by collecting user feedback to obtain the interaction success rate and other indicators; now the estimated value for each scenario can be tested before the product is sold.
  • FIG. 1 is a flowchart of a test method for a full-duplex voice interaction system according to an embodiment of the present invention
  • FIG. 2 is a schematic diagram of parallel modules of a test method for a full-duplex voice interaction system provided by an embodiment of the present invention
  • FIG. 3 is a schematic structural diagram of a test device for a full-duplex voice interaction system according to an embodiment of the present invention
  • FIG. 4 is a schematic structural diagram of an embodiment of the electronic device of the present invention.
  • the embodiment of the present invention provides a test method for a full-duplex voice interaction system.
  • the method is used for testing equipment.
  • the testing equipment may be electronic equipment such as a computer, which is not limited by the present invention.
  • Fig. 1 is a flowchart of a test method for a full-duplex voice interaction system provided by an embodiment of the present invention, which includes the following steps:
  • S11: Mix the valid corpus set related to the test scenario with the invalid corpus set unrelated to the test scenario, and determine the scene mixed corpus;
  • S12: Play each corpus audio in the scene mixed corpus to the voice interaction device to be tested equipped with the full-duplex voice interaction system;
  • S13: Obtain the work log of the voice interaction device to be tested, the work log including at least a first log and a second log, wherein
  • the first log records the valid/invalid attribute identified for each corpus audio and the corresponding corpus text,
  • the second log records the decision result for each corpus audio, the decision including response and discard;
  • S14: Determine the rejection rate based on the number of invalid corpus audio in the first log and the number of discard results in the second log;
  • S15: Obtain the number of false responses by counting the log entries in the second log for which no response decision result was expected but a response decision result was actually received, and determine the false response rate based on the number of false responses and the total number of corpus audio played.
  • before testing, the following preparations are required:
  • Valid corpus: covers at least one specific scene, such as smart home, TV, car, or education, and covers the vocabulary, business sentence patterns, and high-frequency fields of the scene.
  • Invalid corpus: a set of irrelevant corpus outside the specified scene.
  • Background noise audio: select the corresponding background noise according to the scene, such as fan/air conditioner, stove/kitchenware, music/TV program, car engine/horn, thunderstorm/gale, noisy store, etc.
  • Computer: used to read the above corpus audio and control playback, and to read the log of the device to be tested and statistically analyze the test results. All devices to be tested are connected to the computer.
  • Two audio playback devices: such as a speaker/artificial mouth, used to play the audio.
  • Equipment to be tested: connected to the computer so that the computer can read the device log in real time.
  • Corpus: a corpus is a text containing one or more sentences, together with the corresponding audio recording of each sentence.
  • System decision result: the decision result for an operation instruction that the system outputs from the input audio/information data.
  • System interaction result: the result of responding to the current state that the system outputs from the input audio/information and other data, including: the recognition result (text) of the response, the text to be broadcast, the action description of the device, etc.
  • the modules are no longer in a serial relationship; instead, data from multiple modules is processed in parallel, as shown in Figure 2. Therefore, simply multiplying the "success rate" of each module cannot yield the success rate of the entire system.
  • in step S11, obtain a certain number of valid corpus covering the specified scene prepared in the above steps, for example, "I want to listen to Andy Lau's song" in the smart home scene, and a certain number of invalid corpus irrelevant to the specified scene, such as "Sibiz Dialogue Workshop provides speech recognition capabilities".
  • the valid corpus set and the invalid corpus set are combined and randomly sorted to determine the scene mixed corpus.
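The mixing step of S11 can be sketched as follows. This is a minimal illustration in Python; the corpus entries, the `expected` label, and the fixed seed are assumptions for the example, not part of the patent's method:

```python
import random

def build_scene_mixed_corpus(valid_corpus, invalid_corpus, seed=42):
    """Combine valid and invalid corpus entries, tag each with its
    expected attribute, and shuffle to obtain the scene mixed corpus."""
    mixed = [{"text": t, "expected": "valid"} for t in valid_corpus]
    mixed += [{"text": t, "expected": "invalid"} for t in invalid_corpus]
    random.Random(seed).shuffle(mixed)  # reproducible random ordering
    return mixed

valid = ["I want to listen to Andy Lau's song", "Turn on the living room light"]
invalid = ["Sibiz Dialogue Workshop provides speech recognition capabilities"]
corpus = build_scene_mixed_corpus(valid, invalid)
```

Keeping the expected valid/invalid label on each entry lets the later steps compare the device's decisions against the ground truth.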
  • each corpus audio in the scene mixed corpus is played one by one to the voice interaction device to be tested equipped with the full-duplex voice interaction system through a prepared playback device, such as a speaker.
  • the method further includes: while playing each corpus audio in the scene mixed corpus to the voice interaction device to be tested equipped with the full-duplex voice interaction system, simultaneously playing a preset background noise to the device;
  • the voice interaction device to be tested is then tested based on the corpus audio with background noise.
  • step S13 since the device under test is connected to the computer, the computer can read the log of the device in real time, and the corresponding log information can be obtained by obtaining the work log of the voice interaction device under test.
  • "Received interaction result" // contains: the recognition result (text) of the response; the text to be broadcast; the device-side action instruction.
  • logs 2, 3, 4, 6, 7, and 9 (that is, the entries with serial numbers 2, 3, 4, 6, 7, and 9 in the log example above).
  • dividing the log by entry number is easy to operate. For example, log 1: "2020-05-11 09:00:00 start playing valid corpus 1", where the corpus text is "I want to listen to Andy Lau's song";
  • such entries, together with those whose corpus text is "Sibiz Dialogue Workshop, providing speech recognition capabilities", are divided into the first log, which records the valid/invalid attribute identified for each corpus audio and the corresponding corpus text.
  • a decision result is also received for each corpus audio.
  • the "decision result" entries in log 5 and log 10 are divided into the second log.
  • the decision result includes response, interruption, and discard.
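The log-splitting step above can be sketched roughly as follows. The log line format here is hypothetical, loosely modeled on the example entries in the text; a real work log would need its own parser:

```python
import re

# Hypothetical work-log entries modeled on the example in the text.
WORK_LOG = [
    '1. 2020-05-11 09:00:00 play corpus, text="I want to listen to Andy Lau\'s song", attr=valid',
    '5. 2020-05-11 09:00:02 decision result=[respond]',
    '6. 2020-05-11 09:00:03 play corpus, text="Sibiz Dialogue Workshop, providing speech recognition capabilities", attr=invalid',
    '10. 2020-05-11 09:00:05 decision result=[discard]',
]

def split_work_log(lines):
    """Divide the work log: entries carrying a recognized valid/invalid
    attribute go to the first log, decision results to the second."""
    first_log, second_log = [], []
    for line in lines:
        m = re.search(r'text="(.*)", attr=(valid|invalid)', line)
        if m:
            first_log.append({"text": m.group(1), "attr": m.group(2)})
        m = re.search(r'decision result=\[(respond|interrupt|discard)\]', line)
        if m:
            second_log.append(m.group(1))
    return first_log, second_log

first_log, second_log = split_work_log(WORK_LOG)
```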
  • in step S14, taking the log of an invalid corpus as an example, the finally obtained interaction decision is [discard], that is, no interactive response is made to the input sentence. Invalid corpus should be discarded, but in actual use, speech that should not be responded to may be mistakenly responded to.
  • Rejection rate = the number of all [discard] logs / the number of invalid corpus audio played.
  • in step S15, not all decision results meet expectations. For example, for log 5 of an invalid corpus there should be no [interrupt] decision result; likewise, if log 10 of an invalid corpus yields an interaction result instead of being discarded, the expectation is not met.
  • False response rate = the number of false responses / the total number of corpus audio played.
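Under the two formulas above, the metrics can be computed as in this minimal sketch. The per-corpus `records` are hypothetical, joining each entry's expected attribute (from the first log) with its decision result (from the second log):

```python
def rejection_rate(records):
    """Rejection rate = number of [discard] decision logs /
    number of invalid corpus audio played (formula as in the text)."""
    invalid_played = sum(1 for r in records if r["expected"] == "invalid")
    discarded = sum(1 for r in records if r["decision"] == "discard")
    return discarded / invalid_played

def false_response_rate(records):
    """False response rate = entries where no response decision was
    expected but one was received / total corpus audio played."""
    false_responses = sum(
        1 for r in records
        if r["expected"] == "invalid" and r["decision"] == "respond"
    )
    return false_responses / len(records)

# Hypothetical joined records: first-log attribute + second-log decision.
records = [
    {"expected": "valid", "decision": "respond"},
    {"expected": "invalid", "decision": "discard"},
    {"expected": "invalid", "decision": "respond"},  # a false response
    {"expected": "invalid", "decision": "discard"},
]
```

With this data, two of the three invalid corpus audio are discarded, and one of the four played corpus audio receives an unexpected response.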
  • an end-to-end test of the full-duplex voice interaction system is realized, and indicators for the characteristics unique to full-duplex interaction can be accurately obtained.
  • the features of the full-duplex interactive system are fully covered, the online interaction effect is improved, and the necessary data reproduction and indicator support are provided.
  • automated testing reduces labor costs and improves testing efficiency; it shortens the optimization cycle of the voice interaction system and reduces the cost of trial and error.
  • previously, the effect of the interaction could only be evaluated after product sales, by collecting user feedback to obtain the interaction success rate and other indicators; now the estimated value for each scenario can be tested before the product is sold.
  • the work log includes at least a third log
  • the third log records interaction results for each corpus audio, where the interaction results include: the text to be broadcast of the corpus audio, and the device action instruction of the corpus audio;
  • the interaction success rate is determined based on the number of successful interactions and the total number of corpus audio played.
  • the log 10 of "received interaction result" is determined as the third log, which is used to record the interaction result for each corpus audio, specifically including {recognition result, text to be broadcast, action instruction}.
  • the corresponding expected broadcast text and the expected device action instruction can be determined.
  • the expected broadcast text of "I want to listen to Andy Lau's song" is: "Playing 'Forget Love' for you",
  • and the action instruction is to call the music skill to play "Forget Love".
  • the interaction success rate does not need to consider the recognition result, because even if the recognition result is incorrect, the correct interaction may still be obtained. For example, if the audio is mistakenly recognized as "Idol listens to Andy Lau's song", it is still very likely that the actual broadcast text is "Playing 'Forget Love' for you" and the actual action instruction is to call the music skill to play "Forget Love".
  • Interaction success rate = the number of corpus audio whose interaction met expectations / the total number of corpus audio played. This yields the interaction success rate of the full-duplex voice interaction system.
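The success criterion above (match on broadcast text and device action, ignoring the recognition result) can be sketched as follows; the record fields and the action-instruction strings are hypothetical:

```python
def interaction_success_rate(results):
    """Interaction success rate = interactions matching the expected
    broadcast text AND device action / total corpus audio played.
    The recognition result is deliberately ignored (see text above)."""
    ok = sum(
        1 for r in results
        if r["broadcast"] == r["expected_broadcast"]
        and r["action"] == r["expected_action"]
    )
    return ok / len(results)

# Hypothetical third-log interaction results paired with expectations.
results = [
    {"expected_broadcast": "Playing 'Forget Love' for you",
     "broadcast": "Playing 'Forget Love' for you",
     "expected_action": "music.play:Forget Love",
     "action": "music.play:Forget Love"},
    {"expected_broadcast": "The light is on",
     "broadcast": "Sorry, I didn't catch that",
     "expected_action": "light.on",
     "action": "none"},
]
```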
  • the third log records an interaction result for each corpus audio, and the interaction result further includes: a recognition result of the corpus audio;
  • the recognition accuracy rate and the recognition sentence accuracy rate are determined.
  • the "recognition sentence accuracy rate" and "recognition word accuracy rate" can then be obtained; the calculation is the same as in the half-duplex case, except that half-duplex compares log 1 with log 9, while full-duplex compares log 1 with log 10.
  • for example, the recognition result of the whole sentence is "how to write the cloud character of 'clouds'",
  • while the full-duplex response content may only be "the cloud of 'clouds'".
  • The difference between full-duplex and half-duplex lies not in playback but in interaction.
  • in half-duplex, the interaction is one-to-one: each request produces one response.
  • in full-duplex mode, the interaction is many-to-many.
  • one request may have multiple responses, and multiple requests may have only one response; this affects the word and sentence accuracy figures.
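For reference, the sentence and word accuracy rates mentioned above are conventionally computed from edit distance between the reference text (first log) and the recognized text. A sketch, with hypothetical reference/hypothesis pairs:

```python
def word_errors(ref, hyp):
    """Minimum edit distance (substitutions + insertions + deletions)
    between reference and hypothesis word sequences."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[len(r)][len(h)]

def accuracy_rates(pairs):
    """Sentence accuracy = exactly matching sentences / total sentences;
    word accuracy = 1 - total word errors / total reference words."""
    sent_ok = sum(1 for ref, hyp in pairs if ref == hyp)
    errs = sum(word_errors(ref, hyp) for ref, hyp in pairs)
    words = sum(len(ref.split()) for ref, _ in pairs)
    return sent_ok / len(pairs), 1 - errs / words

pairs = [
    ("i want to listen to a song", "i want to listen to a song"),
    ("turn on the light", "turn off the light"),
]
sentence_acc, word_acc = accuracy_rates(pairs)
```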
  • the work log includes at least a fourth log, a fifth log, and a sixth log;
  • the fourth log records a first time stamp for the end of playback of each corpus audio
  • the fifth log records a second time stamp for determining the recognition result after the audio of each corpus ends playing
  • the sixth log records a third time stamp that determines the moment of the interaction result for each corpus audio
  • the recognition response time of the voice interaction device to be tested is determined based on the time difference between the second time stamp and the first time stamp.
  • the interaction response time of the voice interaction device to be tested is determined based on the time difference between the third time stamp and the second time stamp.
  • the timestamp "2020-05-11 09:00:03" in log 8 is determined to be the fourth log; the timestamp "2020-05-11 09:00:04" in log 9 is determined to be the fifth log; the timestamp "2020-05-11 09:00:05" with the interaction result in log 10 is determined to be the sixth log.
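From the three timestamps, the two response times follow by subtraction; a sketch using the example timestamps above (the timestamp format is taken from the log example, everything else is illustrative):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def response_times(t_play_end, t_recognition, t_interaction):
    """Recognition response time = second timestamp - first timestamp;
    interaction response time = third timestamp - second timestamp."""
    t1 = datetime.strptime(t_play_end, FMT)
    t2 = datetime.strptime(t_recognition, FMT)
    t3 = datetime.strptime(t_interaction, FMT)
    return (t2 - t1).total_seconds(), (t3 - t2).total_seconds()

rec_rt, inter_rt = response_times(
    "2020-05-11 09:00:03",  # fourth log: playback of the corpus ended
    "2020-05-11 09:00:04",  # fifth log: recognition result determined
    "2020-05-11 09:00:05",  # sixth log: interaction result determined
)
```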
  • Figure 3 is a schematic structural diagram of a test device for a full-duplex voice interaction system provided by an embodiment of the present invention. The device can execute the test method of any of the above embodiments and is configured in the test equipment.
  • a test device for a full-duplex voice interaction system includes: a corpus determining program module 11, a test program module 12, a log acquisition program module 13, a rejection rate determining program module 14, and a false response rate determining program module 15.
  • the corpus determining program module 11 is used to mix the valid corpus related to the test scene with the invalid corpus not related to the test scene to determine the scene mixed corpus; the test program module 12 is used to play each corpus audio in the scene mixed corpus to the voice interaction device to be tested equipped with the full-duplex voice interaction system; the log acquisition program module 13 is used to obtain the work log of the voice interaction device to be tested, the work log including at least a first log and a second log,
  • wherein the first log records the valid/invalid attribute identified for each corpus audio and the corresponding corpus text, and the second log records the decision result for each corpus audio, the decision including response and discard;
  • the rejection rate determination program module 14 is used to determine the rejection rate based on the number of invalid corpus audio in the first log and the number of discarded results in the second log;
  • the false response rate determination program module 15 is used to obtain the number of false responses by counting the number of log entries in the second log for which no response decision result was expected but a response decision result was actually received, and to determine the false response rate based on the number of false responses and the total number of corpus audio played.
  • the embodiment of the present invention also provides a non-volatile computer storage medium.
  • the computer storage medium stores computer-executable instructions that can execute the test method for a full-duplex voice interaction system in any of the above method embodiments;
  • the non-volatile computer storage medium of the present invention stores computer executable instructions, and the computer executable instructions are set as:
  • the first log records the valid/invalid attributes identified for each corpus audio and the corresponding corpus text
  • the second log records the decision result for each corpus audio, and the decision includes response and discard;
  • the false response rate is determined based on the number of false responses and the total number of corpus audio played.
  • the non-volatile computer-readable storage medium can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the methods in the embodiments of the present invention.
  • One or more program instructions are stored in a non-volatile computer-readable storage medium, and when executed by a processor, execute the test method for a full-duplex voice interaction system in any of the foregoing method embodiments.
  • the non-volatile computer-readable storage medium may include a storage program area and a storage data area.
  • the storage program area may store an operating system and an application program required by at least one function; the storage data area may store data created according to the use of the device, etc.
  • the non-volatile computer-readable storage medium may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage devices.
  • the non-volatile computer-readable storage medium may optionally include memories remotely provided with respect to the processor, and these remote memories may be connected to the device through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
  • An embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute:
  • the first log records the valid/invalid attributes identified for each corpus audio and the corresponding corpus text
  • the second log records the decision result for each corpus audio, and the decision includes response and discard;
  • the false response rate is determined based on the number of false responses and the total number of corpus audio played.
  • the work log includes at least a third log
  • the third log records interaction results for each corpus audio, where the interaction results include: the text to be broadcast of the corpus audio, and the device action instruction of the corpus audio;
  • the interaction success rate is determined based on the number of successful interactions and the total number of corpus audio played.
  • the third log records an interaction result for each corpus audio, and the interaction result further includes: a recognition result of the corpus audio;
  • the recognition accuracy rate and the recognition sentence accuracy rate are determined.
  • the work log includes at least a fourth log and a fifth log
  • the fourth log records a first time stamp for the end of playback of each corpus audio
  • the fifth log records a second time stamp for determining the recognition result after the audio of each corpus ends playing
  • the recognition response time of the voice interaction device to be tested is determined based on the time difference between the second time stamp and the first time stamp.
  • the work log includes at least a sixth log
  • the sixth log records a third time stamp that determines the moment of the interaction result for each corpus audio
  • the interaction response time of the voice interaction device to be tested is determined based on the time difference between the third time stamp and the second time stamp.
  • the processor is further configured to: while playing each corpus audio in the scene mixed corpus to the voice interaction device to be tested equipped with the full-duplex voice interaction system, simultaneously play a preset background noise to the voice interaction device to be tested;
  • the voice interaction device to be tested is then tested based on the corpus audio with background noise.
  • Fig. 4 is a schematic diagram of the hardware structure of an electronic device for performing the test method for a full-duplex voice interaction system according to another embodiment of the present invention. As shown in Fig. 4, the device includes:
  • One or more processors 410 and a memory 420 are taken as an example in FIG. 4.
  • the device for performing the test method may further include: an input device 430 and an output device 440.
  • the processor 410, the memory 420, the input device 430, and the output device 440 may be connected by a bus or in other ways. In FIG. 4, the connection by a bus is taken as an example.
  • the memory 420 can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the test method for a full-duplex voice interaction system in the embodiments of the present invention.
  • the processor 410 executes various functional applications and data processing of the server by running the non-volatile software programs, instructions, and modules stored in the memory 420, that is, implements the test method in the foregoing method embodiments.
  • the memory 420 may include a storage program area and a storage data area.
  • the storage program area may store an operating system and an application program required by at least one function; the storage data area may store data created according to the use of the test device, etc.
  • the memory 420 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage devices.
  • the memory 420 may optionally include memories remotely provided with respect to the processor 410, and these remote memories may be connected to the test device via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
  • the input device 430 may receive input numeric or character information, and generate signals related to user settings and function control of the test device.
  • the output device 440 may include a display device such as a display screen.
  • the one or more modules are stored in the memory 420, and when executed by the one or more processors 410, perform the test method for a full-duplex voice interaction system in any of the foregoing method embodiments.
  • test equipment of the embodiment of the present invention exists in various forms, including but not limited to:
  • Mobile communication equipment: this type of equipment is characterized by mobile communication functions, and its main goal is to provide voice and data communication. Such terminals include: smart phones, multimedia phones, feature phones, and low-end phones.
  • Ultra-mobile personal computer equipment: this type of equipment belongs to the category of personal computers, has computing and processing functions, and generally also has mobile Internet access. Such terminals include: PDA, MID, and UMPC devices, such as tablet computers.
  • Portable entertainment equipment: this type of equipment can display and play multimedia content. Such devices include: audio and video players, handheld game consoles, e-book readers, as well as smart toys and portable car navigation devices.
  • the device embodiments described above are merely illustrative.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place, or distributed across multiple network units.
  • Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions of the embodiments. Those of ordinary skill in the art can understand and implement it without creative work.
  • each implementation manner can be implemented by means of software plus a necessary general hardware platform, and of course can also be implemented by hardware.
  • the above technical solution, essentially or in the part that contributes to the existing technology, can be embodied in the form of a software product; the computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in each embodiment or in certain parts of an embodiment.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Debugging And Monitoring (AREA)
  • Telephonic Communication Services (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

Provided are a test method, a test apparatus, an electronic device, and a storage medium for a full-duplex voice interaction system, the method comprising: mixing a valid corpus set related to a test scenario with an invalid corpus set unrelated to the test scenario to determine a scenario-mixed corpus set (S11); playing each corpus audio in the scenario-mixed corpus set to a voice interaction device under test equipped with the full-duplex voice interaction system (S12); acquiring a work log of the voice interaction device under test, the work log comprising at least a first log and a second log (S13); and counting the entries in the second log for which a response decision result was received although none should have been output, to obtain a number of false responses, and determining a false response rate based on the number of false responses and the total number of corpus audios played (S15). The method achieves end-to-end testing of a full-duplex voice interaction system and accurately derives metrics specific to full-duplex characteristics.

Description

Test method and apparatus for a full-duplex voice interaction system. Technical field
The present invention relates to the field of intelligent speech, and in particular to a test method and apparatus for a full-duplex voice interaction system.
Background
To test the performance of a voice interaction system, each module on the voice interaction chain can be tested independently. For example, the wake-up/signal-processing module is tested for wake-up rate, wake-up time, and power consumption; the speech recognition module for sentence error rate and character error rate; the semantic understanding module for accuracy, recall, and parsing accuracy; and the speech synthesis module is scored by the subjective evaluation of multiple listeners. The vast majority of existing voice interaction systems use half-duplex interaction, in which the modules have a strictly ordered dependency and the whole system completes an interaction by calling the modules serially. In that case, independent per-module testing is sufficient.
In the course of implementing the present invention, the inventors found at least the following problems in the related art:
Because the modules are tested independently with differing metrics, there is no test method or evaluation metric for the voice interaction system as a whole. In a complex system such as a full-duplex system, where several modules make fused decisions, per-module metrics can no longer satisfy the evaluation needs. For example, in a half-duplex dialogue the system responds to every utterance the user makes, whereas a full-duplex voice interaction system responds only to valid requests.
Summary
Embodiments of the present invention aim to at least solve the prior-art problem that, lacking a test method and evaluation metrics for the overall voice interaction system, a full-duplex voice interaction system cannot be tested reasonably and effectively.
In a first aspect, an embodiment of the present invention provides a test method for a full-duplex voice interaction system, comprising:
mixing a valid corpus set related to a test scenario with an invalid corpus set unrelated to the test scenario to determine a scenario-mixed corpus set;
playing each corpus audio in the scenario-mixed corpus set to a voice interaction device under test equipped with the full-duplex voice interaction system;
acquiring a work log of the voice interaction device under test, the work log comprising at least a first log and a second log, wherein
-the first log records the valid/invalid attribute recognized for each corpus audio and the corresponding corpus text,
-the second log records a decision result for each corpus audio, the decision comprising respond and discard;
determining a rejection rate based on the number of invalid corpus audios in the first log and the number of discard results in the second log;
counting the entries in the second log for which a response decision result was received although none should have been output, to obtain a number of false responses, and determining a false response rate based on the number of false responses and the total number of corpus audios played.
In a second aspect, an embodiment of the present invention provides a test apparatus for a full-duplex voice interaction system, comprising:
a corpus set determination program module, configured to mix a valid corpus set related to a test scenario with an invalid corpus set unrelated to the test scenario to determine a scenario-mixed corpus set;
a test program module, configured to play each corpus audio in the scenario-mixed corpus set to a voice interaction device under test equipped with the full-duplex voice interaction system;
a log acquisition program module, configured to acquire a work log of the voice interaction device under test, the work log comprising at least a first log and a second log, wherein
-the first log records the valid/invalid attribute recognized for each corpus audio and the corresponding corpus text,
-the second log records a decision result for each corpus audio, the decision comprising respond and discard;
a rejection rate determination program module, configured to determine a rejection rate based on the number of invalid corpus audios in the first log and the number of discard results in the second log;
a false response rate determination program module, configured to count the entries in the second log for which a response decision result was received although none should have been output, to obtain a number of false responses, and to determine a false response rate based on the number of false responses and the total number of corpus audios played.
In a third aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the steps of the test method for a full-duplex voice interaction system of any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention provides a storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the test method for a full-duplex voice interaction system of any embodiment of the present invention.
The beneficial effects of the embodiments of the present invention are: end-to-end testing of a full-duplex voice interaction system is achieved, and metrics specific to full-duplex characteristics are derived accurately. The test fully covers the characteristics of the full-duplex interaction system, improves the online interaction effect, and provides the necessary data reproduction and metric support, while automated testing reduces labor cost and improves test efficiency. It shortens the optimization cycle of the voice interaction system and reduces trial-and-error cost. In the past, before the interaction success rate could be tested, metrics such as the interaction success rate were obtained only after the product was sold and user feedback collected; now, estimates for each scenario can be obtained through testing before the product is sold.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a test method for a full-duplex voice interaction system according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of parallel modules in a test method for a full-duplex voice interaction system according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a test apparatus for a full-duplex voice interaction system according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of an embodiment of an electronic device of the present invention.
Detailed description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
An embodiment of the present invention provides a test method for a full-duplex voice interaction system. The method runs on a test device, which may be an electronic device such as a computer; the present invention does not limit this.
Fig. 1 is a flowchart of a test method for a full-duplex voice interaction system according to an embodiment of the present invention, comprising the following steps:
S11: mixing a valid corpus set related to a test scenario with an invalid corpus set unrelated to the test scenario to determine a scenario-mixed corpus set;
S12: playing each corpus audio in the scenario-mixed corpus set to a voice interaction device under test equipped with the full-duplex voice interaction system;
S13: acquiring a work log of the voice interaction device under test, the work log comprising at least a first log and a second log, wherein
-the first log records the valid/invalid attribute recognized for each corpus audio and the corresponding corpus text,
-the second log records a decision result for each corpus audio, the decision comprising respond and discard;
S14: determining a rejection rate based on the number of invalid corpus audios in the first log and the number of discard results in the second log;
S15: counting the entries in the second log for which a response decision result was received although none should have been output, to obtain a number of false responses, and determining a false response rate based on the number of false responses and the total number of corpus audios played.
In this embodiment, the following test preparation is required:
1. Valid corpus: the corpus should cover at least one specific scenario, e.g. smart home, TV, in-car, or education, including that scenario's vocabulary, business sentence patterns, and high-frequency domains.
2. Invalid corpus: a set of unrelated corpus material outside the specified scenario.
3. Background noise audio: background noise chosen to match the scenario, e.g. fan/air conditioner, stove/kitchenware, music/TV program, car engine/horn, thunderstorm/strong wind, or a noisy shopping mall.
4. Test equipment:
Computer: reads the above corpus audio and controls playback; reads the log of the device under test and derives the test results by statistical analysis. All devices under test are connected to this computer.
Two audio playback devices: e.g. loudspeakers or artificial mouths, used to play the audio.
Device under test: connected to the computer, with its log readable by the computer in real time.
Here, "corpus" means one or more sentences of text together with the recorded audio of each sentence. "System decision result" means the operation-instruction decision the system outputs from input data such as audio/information. "System interaction result" means the result the system outputs in response to the current state, comprising: the recognition result (text) of the response, the text to be broadcast, and the description of the device-side action.
In a full-duplex voice interaction system the modules are no longer in a serial relationship; the data of several modules are computed in parallel, as shown in Fig. 2. Simply multiplying the "success rates" of the individual modules therefore cannot yield the success rate of the whole system.
For step S11, obtain a number of valid corpora covering the specified scenario prepared in the above steps, e.g. "我想听刘德华的歌" ("I want to hear a song by Andy Lau") in the smart-home scenario, and a number of invalid corpora unrelated to the scenario, e.g. "思必驰对话工场,提供语音识别能力" ("AISpeech Dialogue Factory provides speech recognition capability"). The valid corpus set and the invalid corpus set are then merged and randomly ordered to form the scenario-mixed corpus set.
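The merging and shuffling of step S11 can be sketched as follows. This is a minimal illustration, not the patent's implementation; the function and field names (`build_mixed_corpus`, `valid`) are hypothetical, and each entry keeps its valid/invalid tag so the expected decision is known later when the logs are scored.

```python
import random

def build_mixed_corpus(valid_corpora, invalid_corpora, seed=42):
    """Merge scenario-relevant (valid) and scenario-irrelevant (invalid)
    corpus texts, tag each with its expected validity, and shuffle the
    combined playlist into a random order."""
    mixed = [{"text": t, "valid": True} for t in valid_corpora]
    mixed += [{"text": t, "valid": False} for t in invalid_corpora]
    random.Random(seed).shuffle(mixed)  # fixed seed keeps runs reproducible
    return mixed

playlist = build_mixed_corpus(
    ["我想听刘德华的歌"],
    ["思必驰对话工场,提供语音识别能力"],
)
```

A fixed random seed is used here so that a failed run can be replayed with the same corpus order, matching the method's goal of data reproduction.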
For step S12, each corpus audio in the scenario-mixed corpus set is played, one by one, through a prepared playback device such as a loudspeaker to the voice interaction device under test equipped with the full-duplex voice interaction system.
As an implementation, the method further comprises:
while playing each corpus audio in the scenario-mixed corpus set to the voice interaction device under test equipped with the full-duplex voice interaction system, playing preset background noise to the voice interaction device under test;
testing the voice interaction device under test based on the corpus audio with background noise.
Each corpus audio in the scenario-mixed corpus set is played to the voice interaction device under test through a first playback device;
the preset background noise is played to the voice interaction device under test through a second playback device.
In this embodiment, since two playback devices were prepared in advance, the loudspeaker playing the corpus audio and the loudspeaker playing the background noise play simultaneously. By adding noise to the played speech in this way, the character accuracy and sentence accuracy can additionally be tested.
Whether the state of the device under test meets expectations is judged as follows:
1. Whether the device remains in the sound-pickup state.
2. While a valid corpus audio is playing, whether the system makes a respond decision.
3. After an invalid corpus audio has been played, whether the system makes a discard decision.
4. After a valid corpus audio has been played, whether the system makes an interaction response and whether the response meets expectations.
5. After an exit-related corpus audio has been played, whether the system stops sound pickup.
For step S13, since the device under test is connected to the computer and its log is readable by the computer in real time, the corresponding log information can be obtained by acquiring the work log of the voice interaction device under test.
For example:
1. Play one corpus audio, and record whether it is a valid corpus together with the text of the sentence.
2. While the corpus audio is playing, whether the system gives a decision result and whether the decision result meets expectations.
3. After the corpus audio has been played, whether the system gives an interaction result:
a) whether the recognition result (text) of the response meets expectations;
b) whether the text to be broadcast meets expectations;
c) whether the description of the device-side action meets expectations.
4. The interval from the start of playback to the system returning the first character of the recognition result.
5. The interval from the end of playback to the system returning the full-sentence recognition result.
6. The interval from the end of playback to the system returning the complete interaction result.
More specifically:
1. 2020-05-11 09:00:00 Start playing valid corpus 1; corpus text: "我想听刘德华的歌"
2. 2020-05-11 09:00:01 Real-time recognition result received: "我"
3. 2020-05-11 09:00:01 Real-time recognition result received: "我想"
4. 2020-05-11 09:00:02 Real-time recognition result received: "我想听"
5. 2020-05-11 09:00:02 Decision result received: interrupt
6. 2020-05-11 09:00:03 Real-time recognition result received: "我想听留的"
7. 2020-05-11 09:00:03 Real-time recognition result received: "我想听刘德华"
8. 2020-05-11 09:00:03 Finished playing valid corpus 1
9. 2020-05-11 09:00:04 Real-time recognition result received: "我想听刘德华的歌"
10. 2020-05-11 09:00:05 Interaction result received // containing: the recognition result (text) of the response; the text to be broadcast; the device-side action instruction
1. 2020-05-11 09:01:00 Start playing invalid corpus 1; corpus text: "思必驰对话工场,提供语音识别能力"
2. 2020-05-11 09:01:01 Real-time recognition result received: "思必驰"
3. 2020-05-11 09:01:01 Real-time recognition result received: "思必驰对话"
4. 2020-05-11 09:01:02 Real-time recognition result received: "思必驰对话工场"
5. 2020-05-11 09:01:02 Decision result received: interrupt
6. 2020-05-11 09:01:03 Real-time recognition result received: "思必驰对话工场,提供"
7. 2020-05-11 09:01:03 Real-time recognition result received: "思必驰对话工场,提供语音识别"
8. 2020-05-11 09:01:03 Finished playing invalid corpus 1
9. 2020-05-11 09:01:04 Real-time recognition result received: "思必驰对话工场,提供语音识别能力"
10. 2020-05-11 09:01:05 Decision result received: discard
During the test, real-time recognition results can be received while playback is in progress, as in log entries 2, 3, 4, 6, 7, and 9 (i.e. the sequence numbers 2, 3, 4, 6, 7, and 9 in the log examples above). For a dedicated test item, however, only part of the log information is needed, so the log entries can be partitioned for convenience of operation. For example, entry 1 "2020-05-11 09:00:00 Start playing valid corpus 1; corpus text: 我想听刘德华的歌" and entry 1 "2020-05-11 09:01:00 Start playing invalid corpus 1; corpus text: 思必驰对话工场,提供语音识别能力" are assigned to the first log, which records the valid/invalid attribute recognized for each corpus audio and the corresponding corpus text. Likewise, decision results are received: the "decision result" in entry 5 and entry 10 is assigned to the second log. Specifically, the decision results comprise respond, interrupt, and discard.
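The partitioning of the work log into a first log (corpus validity and text) and a second log (decision results) can be sketched as below. The record layout (`ts`, `event`, `valid`, `result` fields and the `play_start`/`decision`/`play_end` event names) is a hypothetical structured form of the numbered log lines above, not a format defined by the patent.

```python
# Hypothetical structured records mirroring the invalid-corpus log above.
LOG = [
    {"ts": "2020-05-11 09:01:00", "event": "play_start", "valid": False,
     "text": "思必驰对话工场,提供语音识别能力"},
    {"ts": "2020-05-11 09:01:02", "event": "decision", "result": "interrupt"},
    {"ts": "2020-05-11 09:01:03", "event": "play_end"},
    {"ts": "2020-05-11 09:01:05", "event": "decision", "result": "discard"},
]

def split_logs(records):
    """Partition a device work log into the 'first log' (one entry per
    played corpus, carrying its valid/invalid attribute and text) and the
    'second log' (decision results: respond / interrupt / discard)."""
    first = [r for r in records if r["event"] == "play_start"]
    second = [r for r in records if r["event"] == "decision"]
    return first, second
```

Partitioning once up front lets each test item (rejection rate, false response rate, timing) consume only the sub-log it needs.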
For step S14, taking the invalid-corpus log as an example, the final interaction decision obtained is [discard], i.e. no interaction response is made to this input. In other words, all invalid corpora should be discarded, but in actual use the system may mistakenly respond to speech that should not have been responded to. The rejection rate is thus determined as: rejection rate = number of [discard] log entries / number of invalid corpora played.
For step S15, not every decision result meets expectations: entry 5 of the invalid-corpus log, for instance, should not contain an [interrupt] decision result; likewise, if entry 10 of the invalid-corpus log were an interaction result rather than a discard, that too would fail expectations.
Since the first log already records the corresponding corpus texts, e.g. "我想听刘德华的歌" and "思必驰对话工场,提供语音识别能力", the expected decision result can be determined from these texts.
By counting the entries in the second log for which a response decision result was received although none should have been output, the number of false responses is obtained. False response rate = number of false responses / number of corpora played.
"False response rate" and "rejection rate" do not exist in half-duplex systems. Take the false response rate: in half-duplex every request is responded to, whereas in full-duplex only valid requests are responded to, which is why a false response rate exists at all. Similarly, interruption does not exist in half-duplex: while the synthesized audio output by the device is playing, the device's sound pickup is not active, so it cannot be interrupted.
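The two rate formulas of steps S14 and S15 can be written out directly. This is a sketch over (expected, actual) decision pairs derived from the first and second logs; the pair representation is an assumption made for illustration.

```python
def compute_rates(decisions):
    """decisions: list of (expected, actual) pairs, each value being
    'respond' or 'discard'.  Returns (rejection_rate, false_response_rate):
      rejection rate      = discard decisions / invalid corpora played (S14)
      false response rate = unexpected responses / all corpora played (S15)
    """
    total = len(decisions)
    invalid = [(e, a) for e, a in decisions if e == "discard"]
    discarded = sum(1 for e, a in invalid if a == "discard")
    false_resp = sum(1 for e, a in decisions
                     if e == "discard" and a == "respond")
    return discarded / len(invalid), false_resp / total

rej, fr = compute_rates([("respond", "respond"), ("discard", "discard"),
                         ("discard", "respond"), ("respond", "respond")])
```

Here one of the two invalid corpora was wrongly responded to, so the rejection rate is 0.5 and the false response rate is 0.25.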
This embodiment shows that end-to-end testing of a full-duplex voice interaction system is achieved, and that metrics specific to full-duplex characteristics are derived accurately. The test fully covers the characteristics of the full-duplex interaction system, improves the online interaction effect, and provides the necessary data reproduction and metric support, while automated testing reduces labor cost and improves test efficiency. It shortens the optimization cycle of the voice interaction system and reduces trial-and-error cost. In the past, before the interaction success rate could be tested, metrics such as the interaction success rate were obtained only after the product was sold and user feedback collected; now, estimates for each scenario can be obtained through testing before the product is sold.
As an implementation, the work log comprises at least a third log;
-the third log records an interaction result for each corpus audio, the interaction result comprising: a text to be broadcast for the corpus audio and a device action instruction for the corpus audio;
the expected text to be broadcast and the expected device action instruction corresponding to each corpus audio are determined from its corpus text in the first log;
when the text to be broadcast in the interaction result of a corpus audio matches the corresponding expected text to be broadcast, and the device action instruction matches the corresponding expected device action instruction, the interaction result meets expectations;
by counting the entries in the third log whose interaction results meet expectations, the number of successful interactions is obtained, and the interaction success rate is determined based on the number of successful interactions and the total number of corpus audios played.
In this embodiment, entry 10 ("interaction result received") is taken as the third log, which records the interaction result for each corpus audio, specifically {recognition result, text to be broadcast, action instruction}.
Likewise, since the first log holds the corpus text of each corpus audio, the corresponding expected text to be broadcast and expected device action instruction can be determined. For example, for "我想听刘德华的歌" the expected broadcast text is "为您播放《忘情水》" ("Playing Wang Qing Shui for you"), and the action instruction is to invoke the music skill to play 《忘情水》.
The interaction success rate does not depend on the recognition result, because a correct interaction can still be obtained even when recognition is wrong. For example, even if "偶像听刘德华的歌" were recognized by mistake, the actual broadcast text may well still be "为您播放《忘情水》" and the actual action instruction still the invocation of the music skill to play 《忘情水》.
Let 1 denote meeting expectations and 0 not meeting them. Then the interaction result meets expectations when {recognition result, text to be broadcast, action instruction} compared with expectations yields 011 or 111.
Finally, interaction success rate = number of corpora meeting expectations / number of corpora played, giving the interaction success rate of the full-duplex voice interaction system.
This embodiment shows that half-duplex, being serially executed, tests each module separately and obtains the interaction success rate that way, whereas full-duplex executes in parallel, so only the end-to-end test of this method can obtain the interaction success rate.
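The 011/111 rule above can be sketched directly: an interaction counts as successful when the broadcast text and the device action both match expectations, regardless of the recognition bit. The function names are illustrative, not from the patent.

```python
def interaction_ok(recognized_ok, broadcast_ok, action_ok):
    """Patterns 011 and 111: the recognition result may be wrong, but the
    text to be broadcast and the device action instruction must both match."""
    return bool(broadcast_ok and action_ok)

def interaction_success_rate(results):
    """results: list of (recognized_ok, broadcast_ok, action_ok) triples
    from the third log; returns successes / corpora played."""
    ok = sum(1 for r in results if interaction_ok(*r))
    return ok / len(results)

rate = interaction_success_rate([(0, 1, 1), (1, 1, 1), (1, 0, 1)])
```

In this example the 011 and 111 triples count as successes and the 101 triple does not, so the rate is 2/3.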
As an implementation, the third log records an interaction result for each corpus audio, the interaction result further comprising: a recognition result of the corpus audio;
the character accuracy and sentence accuracy of recognition are determined based on the corpus text of each corpus audio in the first log and the recognition result of that corpus audio.
By comparing the response recognition result (text) in the third log (entry 10) with the first log, the "sentence accuracy" and "character accuracy" of recognition can be obtained. The calculation is the same as in half-duplex, except that half-duplex compares entry 1 with entry 9, whereas in full-duplex the comparison is made against entry 10. For example, the full-sentence recognition result may be "云朵的云字怎么写" ("How do you write the yun in yunduo?"), while the full-duplex response content may be "云朵的云".
The difference between full-duplex and half-duplex lies not in playback but in interaction. In half-duplex, interaction is one-to-one: one request yields one response. In full-duplex, interaction is many-to-many: one request may yield several responses, and several requests may yield only one response. This is what affects character and sentence accuracy.
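The character and sentence accuracies mentioned above can be computed in the usual way, comparing the reference corpus text from the first log against the recognition result from the third log. This sketch assumes a standard character-level Levenshtein definition of character accuracy; the patent does not spell out its exact formula.

```python
def edit_distance(a, b):
    """Character-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def char_accuracy(ref, hyp):
    """Character accuracy sketch: 1 - edit distance / reference length."""
    return 1 - edit_distance(ref, hyp) / len(ref)

def sentence_accuracy(pairs):
    """Sentence accuracy sketch: fraction of (ref, hyp) exact matches."""
    return sum(1 for r, h in pairs if r == h) / len(pairs)
```

For Chinese text, character-level comparison is the natural unit, which is why the patent speaks of character rather than word accuracy.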
As an implementation, the work log comprises at least a fourth log, a fifth log, and a sixth log;
-the fourth log records a first timestamp at which each corpus audio finishes playing;
-the fifth log records a second timestamp at which the recognition result is determined after each corpus audio finishes playing;
-the sixth log records a third timestamp of the moment at which the interaction result for each corpus audio is determined;
the recognition response time of the voice interaction device under test is determined based on the time difference between the second timestamp and the first timestamp;
the interaction response time of the voice interaction device under test is determined based on the time difference between the third timestamp and the second timestamp.
Here, the timestamp "2020-05-11 09:00:03" in entry 8 is taken as the fourth log; the timestamp "2020-05-11 09:00:04" in entry 9 as the fifth log; and the timestamp "2020-05-11 09:00:05" carrying the interaction result in entry 10 as the sixth log.
From "2020-05-11 09:00:04" - "2020-05-11 09:00:03" = 1 second, the system's recognition response time is 1 second.
From "2020-05-11 09:00:05" - "2020-05-11 09:00:04" = 1 second, the system's interaction response time is 1 second.
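The two timestamp subtractions above can be reproduced with standard datetime parsing; the timestamp format string matches the log examples, and the rest is a sketch.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def seconds_between(earlier, later):
    """Difference between two work-log timestamps, in seconds."""
    return (datetime.strptime(later, FMT)
            - datetime.strptime(earlier, FMT)).total_seconds()

# Recognition response time: end of playback (4th log) -> full recognition (5th log)
rec_rt = seconds_between("2020-05-11 09:00:03", "2020-05-11 09:00:04")
# Interaction response time: recognition (5th log) -> interaction result (6th log)
int_rt = seconds_between("2020-05-11 09:00:04", "2020-05-11 09:00:05")
```

With second-granularity logs the resolution is 1 s; a real harness would presumably log at millisecond granularity for useful latency figures.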
Fig. 3 is a schematic structural diagram of a test apparatus for a full-duplex voice interaction system according to an embodiment of the present invention. The apparatus can execute the test method for a full-duplex voice interaction system described in any of the above embodiments and is configured in a test device.
The test apparatus for a full-duplex voice interaction system provided by this embodiment comprises: a corpus set determination program module 11, a test program module 12, a log acquisition program module 13, a rejection rate determination program module 14, and a false response rate determination program module 15.
The corpus set determination program module 11 is configured to mix a valid corpus set related to a test scenario with an invalid corpus set unrelated to the test scenario to determine a scenario-mixed corpus set; the test program module 12 is configured to play each corpus audio in the scenario-mixed corpus set to a voice interaction device under test equipped with the full-duplex voice interaction system; the log acquisition program module 13 is configured to acquire a work log of the voice interaction device under test, the work log comprising at least a first log and a second log, wherein the first log records the valid/invalid attribute recognized for each corpus audio and the corresponding corpus text, and the second log records a decision result for each corpus audio, the decision comprising respond and discard; the rejection rate determination program module 14 is configured to determine a rejection rate based on the number of invalid corpus audios in the first log and the number of discard results in the second log; and the false response rate determination program module 15 is configured to count the entries in the second log for which a response decision result was received although none should have been output, to obtain a number of false responses, and to determine a false response rate based on the number of false responses and the total number of corpus audios played.
An embodiment of the present invention further provides a non-volatile computer storage medium storing computer-executable instructions that can execute the test method for a full-duplex voice interaction system of any of the above method embodiments;
As an implementation, the non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
mix a valid corpus set related to a test scenario with an invalid corpus set unrelated to the test scenario to determine a scenario-mixed corpus set;
play each corpus audio in the scenario-mixed corpus set to a voice interaction device under test equipped with the full-duplex voice interaction system;
acquire a work log of the voice interaction device under test, the work log comprising at least a first log and a second log, wherein
-the first log records the valid/invalid attribute recognized for each corpus audio and the corresponding corpus text,
-the second log records a decision result for each corpus audio, the decision comprising respond and discard;
determine a rejection rate based on the number of invalid corpus audios in the first log and the number of discard results in the second log;
count the entries in the second log for which a response decision result was received although none should have been output, to obtain a number of false responses, and determine a false response rate based on the number of false responses and the total number of corpus audios played.
As a non-volatile computer-readable storage medium, it can store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the method in the embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the test method for a full-duplex voice interaction system of any of the above method embodiments.
The non-volatile computer-readable storage medium may comprise a program storage area and a data storage area, wherein the program storage area may store the operating system and the applications required for at least one function, and the data storage area may store data created according to the use of the apparatus, etc. Furthermore, the non-volatile computer-readable storage medium may comprise high-speed random access memory and may also comprise non-volatile memory, e.g. at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the non-volatile computer-readable storage medium may optionally comprise memory remote from the processor, and such remote memory may be connected to the apparatus over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present invention further provides an electronic device, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to:
mix a valid corpus set related to a test scenario with an invalid corpus set unrelated to the test scenario to determine a scenario-mixed corpus set;
play each corpus audio in the scenario-mixed corpus set to a voice interaction device under test equipped with the full-duplex voice interaction system;
acquire a work log of the voice interaction device under test, the work log comprising at least a first log and a second log, wherein
-the first log records the valid/invalid attribute recognized for each corpus audio and the corresponding corpus text,
-the second log records a decision result for each corpus audio, the decision comprising respond and discard;
determine a rejection rate based on the number of invalid corpus audios in the first log and the number of discard results in the second log;
count the entries in the second log for which a response decision result was received although none should have been output, to obtain a number of false responses, and determine a false response rate based on the number of false responses and the total number of corpus audios played.
In some embodiments, the work log comprises at least a third log;
-the third log records an interaction result for each corpus audio, the interaction result comprising: a text to be broadcast for the corpus audio and a device action instruction for the corpus audio;
the expected text to be broadcast and the expected device action instruction corresponding to each corpus audio are determined from its corpus text in the first log;
when the text to be broadcast in the interaction result of a corpus audio matches the corresponding expected text to be broadcast, and the device action instruction matches the corresponding expected device action instruction, the interaction result meets expectations;
by counting the entries in the third log whose interaction results meet expectations, the number of successful interactions is obtained, and the interaction success rate is determined based on the number of successful interactions and the total number of corpus audios played.
In some embodiments, the third log records an interaction result for each corpus audio, the interaction result further comprising: a recognition result of the corpus audio;
the character accuracy and sentence accuracy of recognition are determined based on the corpus text of each corpus audio in the first log and the recognition result of that corpus audio.
In some embodiments, the work log comprises at least a fourth log and a fifth log;
-the fourth log records a first timestamp at which each corpus audio finishes playing;
-the fifth log records a second timestamp at which the recognition result is determined after each corpus audio finishes playing;
the recognition response time of the voice interaction device under test is determined based on the time difference between the second timestamp and the first timestamp.
In some embodiments, the work log comprises at least a sixth log;
-the sixth log records a third timestamp of the moment at which the interaction result for each corpus audio is determined;
the interaction response time of the voice interaction device under test is determined based on the time difference between the third timestamp and the second timestamp.
In some embodiments, the processor is further configured to: while playing each corpus audio in the scenario-mixed corpus set to the voice interaction device under test equipped with the full-duplex voice interaction system, play preset background noise to the voice interaction device under test; and test the voice interaction device under test based on the corpus audio with background noise.
In some embodiments, the playing of preset background noise to the voice interaction device under test while playing each corpus audio in the scenario-mixed corpus set to the voice interaction device under test comprises:
playing each corpus audio in the scenario-mixed corpus set to the voice interaction device under test through a first playback device;
playing the preset background noise to the voice interaction device under test through a second playback device.
Fig. 4 is a schematic diagram of the hardware structure of an electronic device executing a method for correcting voice dialogue according to another embodiment of the present invention. As shown in Fig. 4, the device comprises:
one or more processors 410 and a memory 420, with one processor 410 taken as an example in Fig. 4.
The device executing the method for correcting voice dialogue may further comprise: an input means 430 and an output means 440.
The processor 410, the memory 420, the input means 430, and the output means 440 may be connected by a bus or otherwise; connection by a bus is taken as an example in Fig. 4.
As a non-volatile computer-readable storage medium, the memory 420 can store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the method for correcting voice dialogue in the embodiments of the present invention. By running the non-volatile software programs, instructions, and modules stored in the memory 420, the processor 410 executes the various functional applications and data processing of the server, i.e. implements the method for correcting voice dialogue of the above method embodiments.
The memory 420 may comprise a program storage area and a data storage area, wherein the program storage area may store the operating system and the applications required for at least one function, and the data storage area may store data created according to the use of the apparatus for correcting voice dialogue, etc. Furthermore, the memory 420 may comprise high-speed random access memory and may also comprise non-volatile memory, e.g. at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 420 may optionally comprise memory remote from the processor 410, and such remote memory may be connected over a network to the apparatus for correcting voice dialogue. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input means 430 can receive input digital or character information and generate signals related to the user settings and function control of the apparatus for correcting voice dialogue. The output means 440 may comprise a display device such as a screen.
The one or more modules are stored in the memory 420 and, when executed by the one or more processors 410, perform the correction for voice dialogue of any of the above method embodiments.
The test device of the embodiments of the present invention exists in various forms, including but not limited to:
(1) Mobile communication devices: characterized by mobile communication capability, with voice and data communication as their primary purpose. Such terminals include smartphones, multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices: belonging to the category of personal computers, with computing and processing capability, and generally also with mobile Internet access. Such terminals include PDAs, MIDs, and UMPC devices such as tablet computers.
(3) Portable entertainment devices: capable of displaying and playing multimedia content. These include audio and video players, handheld game consoles, e-book readers, smart toys, and portable in-car navigation devices.
(4) Other electronic apparatuses with data processing capability.
Herein, relational terms such as first and second are used only to distinguish one entity or operation from another and do not necessarily require or imply any such actual relationship or order between those entities or operations. Moreover, the terms "comprise" and "include" cover not only the listed elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Absent further limitation, an element defined by "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that comprises it.
The apparatus embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network nodes. Some or all of the modules may be selected according to actual needs to achieve the objectives of the embodiments, which those of ordinary skill in the art can understand and implement without creative effort.
From the above description of the implementations, those skilled in the art can clearly understand that each implementation may be realized by means of software plus a necessary general-purpose hardware platform, or of course by hardware. Based on this understanding, the above technical solution, in essence or in the part that contributes over the prior art, may be embodied as a software product stored on a computer-readable storage medium such as ROM/RAM, a magnetic disk, or an optical disc, including a number of instructions that cause a computer device (a personal computer, a server, a network device, etc.) to execute the method described in each embodiment or in parts thereof.
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solutions of the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions recorded in the foregoing embodiments or make equivalent substitutions for some of the technical features therein, and that such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

  1. A test method for a full-duplex voice interaction system, for a test device, the method comprising:
    mixing a valid corpus set related to a test scenario with an invalid corpus set unrelated to the test scenario to determine a scenario-mixed corpus set;
    playing each corpus audio in the scenario-mixed corpus set to a voice interaction device under test equipped with the full-duplex voice interaction system;
    acquiring a work log of the voice interaction device under test, the work log comprising at least a first log and a second log, wherein
    -the first log records the valid/invalid attribute recognized for each corpus audio and the corresponding corpus text,
    -the second log records a decision result for each corpus audio, the decision comprising respond and discard;
    determining a rejection rate based on the number of invalid corpus audios in the first log and the number of discard results in the second log;
    counting the entries in the second log for which a response decision result was received although none should have been output, to obtain a number of false responses, and determining a false response rate based on the number of false responses and the total number of corpus audios played.
  2. The method according to claim 1, wherein the work log comprises at least a third log;
    -the third log records an interaction result for each corpus audio, the interaction result comprising: a text to be broadcast for the corpus audio and a device action instruction for the corpus audio;
    the expected text to be broadcast and the expected device action instruction corresponding to each corpus audio are determined from its corpus text in the first log;
    when the text to be broadcast in the interaction result of a corpus audio matches the corresponding expected text to be broadcast, and the device action instruction matches the corresponding expected device action instruction, the interaction result meets expectations;
    the entries in the third log whose interaction results meet expectations are counted to obtain a number of successful interactions, and an interaction success rate is determined based on the number of successful interactions and the total number of corpus audios played.
  3. The method according to claim 2, wherein the third log records an interaction result for each corpus audio, the interaction result further comprising: a recognition result of the corpus audio;
    a character accuracy and a sentence accuracy of recognition are determined based on the corpus text of each corpus audio in the first log and the recognition result of that corpus audio.
  4. The method according to claim 1, wherein the work log comprises at least a fourth log and a fifth log;
    -the fourth log records a first timestamp at which each corpus audio finishes playing;
    -the fifth log records a second timestamp at which a recognition result is determined after each corpus audio finishes playing;
    a recognition response time of the voice interaction device under test is determined based on the time difference between the second timestamp and the first timestamp.
  5. The method according to claim 4, wherein the work log comprises at least a sixth log;
    -the sixth log records a third timestamp of the moment at which the interaction result for each corpus audio is determined;
    an interaction response time of the voice interaction device under test is determined based on the time difference between the third timestamp and the second timestamp.
  6. The method according to claim 1, wherein the method further comprises:
    while playing each corpus audio in the scenario-mixed corpus set to the voice interaction device under test equipped with the full-duplex voice interaction system, playing preset background noise to the voice interaction device under test;
    testing the voice interaction device under test based on the corpus audio with background noise.
  7. The method according to claim 6, wherein the playing of preset background noise to the voice interaction device under test while playing each corpus audio in the scenario-mixed corpus set to the voice interaction device under test comprises:
    playing each corpus audio in the scenario-mixed corpus set to the voice interaction device under test through a first playback device;
    playing the preset background noise to the voice interaction device under test through a second playback device.
  8. A test apparatus for a full-duplex voice interaction system, comprising:
    a corpus set determination program module, configured to mix a valid corpus set related to a test scenario with an invalid corpus set unrelated to the test scenario to determine a scenario-mixed corpus set;
    a test program module, configured to play each corpus audio in the scenario-mixed corpus set to a voice interaction device under test equipped with the full-duplex voice interaction system;
    a log acquisition program module, configured to acquire a work log of the voice interaction device under test, the work log comprising at least a first log and a second log, wherein
    -the first log records the valid/invalid attribute recognized for each corpus audio and the corresponding corpus text,
    -the second log records a decision result for each corpus audio, the decision comprising respond and discard;
    a rejection rate determination program module, configured to determine a rejection rate based on the number of invalid corpus audios in the first log and the number of discard results in the second log;
    a false response rate determination program module, configured to count the entries in the second log for which a response decision result was received although none should have been output, to obtain a number of false responses, and to determine a false response rate based on the number of false responses and the total number of corpus audios played.
  9. An electronic device, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the steps of the method according to any one of claims 1-7.
  10. A storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1-7.
PCT/CN2020/129352 2020-05-20 2020-11-17 Test method and apparatus for full-duplex voice interaction system WO2021232710A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2022569085A JP2023526285A (ja) 2020-05-20 2020-11-17 全二重音声インタラクションシステムのテスト方法及び装置
EP20936925.5A EP4156175A4 (en) 2020-05-20 2020-11-17 TESTING METHOD AND APPARATUS FOR FULL DUPLEX VOICE INTERACTION SYSTEM
US17/990,149 US20230077478A1 (en) 2020-05-20 2022-11-18 Method and apparatus for testing full-duplex speech interaction system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010432769.6A CN113707128B (zh) 2020-05-20 2020-05-20 用于全双工语音交互系统的测试方法及系统
CN202010432769.6 2020-05-20

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/990,149 Continuation US20230077478A1 (en) 2020-05-20 2022-11-18 Method and apparatus for testing full-duplex speech interaction system

Publications (1)

Publication Number Publication Date
WO2021232710A1 true WO2021232710A1 (zh) 2021-11-25

Family

ID=78645351

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/129352 WO2021232710A1 (zh) 2020-05-20 2020-11-17 用于全双工语音交互系统的测试方法及装置

Country Status (5)

Country Link
US (1) US20230077478A1 (zh)
EP (1) EP4156175A4 (zh)
JP (1) JP2023526285A (zh)
CN (1) CN113707128B (zh)
WO (1) WO2021232710A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116013365B (zh) * 2023-03-21 2023-06-02 深圳联友科技有限公司 一种语音全自动化测试的方法

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6477492B1 (en) * 1999-06-15 2002-11-05 Cisco Technology, Inc. System for automated testing of perceptual distortion of prompts from voice response systems
CN106548772A (zh) * 2017-01-16 2017-03-29 上海智臻智能网络科技股份有限公司 语音识别测试系统及方法
CN108228468A (zh) * 2018-02-12 2018-06-29 腾讯科技(深圳)有限公司 一种测试方法、装置、测试设备及存储介质
CN108899012A (zh) * 2018-07-27 2018-11-27 中国电子产品可靠性与环境试验研究所((工业和信息化部电子第五研究所)(中国赛宝实验室)) 语音交互设备评测方法、系统、计算机设备和存储介质
CN109032870A (zh) * 2018-08-03 2018-12-18 百度在线网络技术(北京)有限公司 用于测试设备的方法和装置
CN109256115A (zh) * 2018-10-22 2019-01-22 四川虹美智能科技有限公司 一种智能家电的语音检测系统及方法
CN109360550A (zh) * 2018-12-07 2019-02-19 上海智臻智能网络科技股份有限公司 语音交互系统的测试方法、装置、设备和存储介质
CN110379410A (zh) * 2019-07-22 2019-10-25 苏州思必驰信息科技有限公司 语音响应速度自动分析方法及系统
CN110415681A (zh) * 2019-09-11 2019-11-05 北京声智科技有限公司 一种语音识别效果测试方法及系统
CN111179907A (zh) * 2019-12-31 2020-05-19 深圳Tcl新技术有限公司 语音识别测试方法、装置、设备及计算机可读存储介质

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008233305A (ja) * 2007-03-19 2008-10-02 Toyota Central R&D Labs Inc 音声対話装置、音声対話方法及びプログラム
CN103680519A (zh) * 2012-09-07 2014-03-26 成都林海电子有限责任公司 卫星移动终端语音编解码器全双工语音输出功能测试方法
EP3157418A4 (en) * 2014-06-23 2018-06-20 Hochman, Eldad Izhak Detection of human-machine interaction errors
DE102017211447B4 (de) * 2017-07-05 2021-10-28 Audi Ag Verfahren zum Auswählen eines Listeneintrags aus einer Auswahlliste einer Bedienvorrichtung mittels Sprachbedienung sowie Bedienvorrichtung
US10574597B2 (en) * 2017-09-18 2020-02-25 Microsoft Technology Licensing, Llc Conversational log replay with voice and debugging information
CN109994108B (zh) * 2017-12-29 2023-08-29 微软技术许可有限责任公司 用于聊天机器人和人之间的会话交谈的全双工通信技术
CN108962260A (zh) * 2018-06-25 2018-12-07 福来宝电子(深圳)有限公司 一种多人命令语音识别方法、系统及存储介质
CN109408264A (zh) * 2018-09-28 2019-03-01 北京小米移动软件有限公司 语音助手错误响应的修正方法、装置、设备及存储介质
CN110010121B (zh) * 2019-03-08 2023-12-26 平安科技(深圳)有限公司 验证应答话术的方法、装置、计算机设备和存储介质
CN110198256B (zh) * 2019-06-28 2020-11-27 上海智臻智能网络科技股份有限公司 客户终端核数确定方法及装置、存储介质、终端
CN110807333B (zh) * 2019-10-30 2024-02-06 腾讯科技(深圳)有限公司 一种语义理解模型的语义处理方法、装置及存储介质

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6477492B1 (en) * 1999-06-15 2002-11-05 Cisco Technology, Inc. System for automated testing of perceptual distortion of prompts from voice response systems
CN106548772A (zh) * 2017-01-16 2017-03-29 上海智臻智能网络科技股份有限公司 语音识别测试系统及方法
CN108228468A (zh) * 2018-02-12 2018-06-29 腾讯科技(深圳)有限公司 一种测试方法、装置、测试设备及存储介质
CN108899012A (zh) * 2018-07-27 2018-11-27 中国电子产品可靠性与环境试验研究所((工业和信息化部电子第五研究所)(中国赛宝实验室)) 语音交互设备评测方法、系统、计算机设备和存储介质
CN109032870A (zh) * 2018-08-03 2018-12-18 百度在线网络技术(北京)有限公司 用于测试设备的方法和装置
CN109256115A (zh) * 2018-10-22 2019-01-22 四川虹美智能科技有限公司 一种智能家电的语音检测系统及方法
CN109360550A (zh) * 2018-12-07 2019-02-19 上海智臻智能网络科技股份有限公司 语音交互系统的测试方法、装置、设备和存储介质
CN110379410A (zh) * 2019-07-22 2019-10-25 苏州思必驰信息科技有限公司 语音响应速度自动分析方法及系统
CN110415681A (zh) * 2019-09-11 2019-11-05 北京声智科技有限公司 一种语音识别效果测试方法及系统
CN111179907A (zh) * 2019-12-31 2020-05-19 深圳Tcl新技术有限公司 语音识别测试方法、装置、设备及计算机可读存储介质

Also Published As

Publication number Publication date
EP4156175A1 (en) 2023-03-29
US20230077478A1 (en) 2023-03-16
CN113707128A (zh) 2021-11-26
EP4156175A4 (en) 2023-10-18
JP2023526285A (ja) 2023-06-21
CN113707128B (zh) 2023-06-20

Similar Documents

Publication Publication Date Title
WO2022022536A1 (zh) 音频播放方法、音频播放装置和电子设备
WO2017166651A1 (zh) 语音识别模型训练方法、说话人类型识别方法及装置
US10298640B1 (en) Overlaying personalized content on streaming audio
CN109658935B (zh) 多通道带噪语音的生成方法及系统
US10664774B2 (en) Carpool system
US20230352012A1 (en) Speech skill jumping method for man machine dialogue, electronic device and storage medium
US20190385578A1 (en) Music synthesis method, system, terminal and computer-readable storage medium
CN108012173A (zh) 一种内容识别方法、装置、设备和计算机存储介质
EP4086892A1 (en) Skill voice wake-up method and apparatus
WO2021184794A1 (zh) 对话文本的技能领域确定方法及装置
CN110347848A (zh) 一种演示文稿管理方法及装置
WO2021082133A1 (zh) 人机对话模式切换方法
CN109364477A (zh) 基于语音控制进行打麻将游戏的方法及装置
CN109872726A (zh) 发音评估方法、装置、电子设备和介质
WO2021232710A1 (zh) 用于全双工语音交互系统的测试方法及装置
CN109325180A (zh) 文章摘要推送方法、装置、终端设备、服务器及存储介质
CN110458599A (zh) 测试方法、测试装置及相关产品
WO2021042584A1 (zh) 全双工语音对话方法
CN105897854A (zh) 移动终端闹钟响应方法、装置及系统
CN109712443A (zh) 一种内容跟读方法、装置、存储介质及电子设备
CN106535152A (zh) 一种基于终端的应用数据处理方法、装置及系统
CN109243472A (zh) 一种音频处理方法及音频处理系统
CN109726267B (zh) 用于故事机的故事推荐方法和装置
US10965391B1 (en) Content streaming with bi-directional communication
CN111767083A (zh) 误唤醒音频数据的收集方法、播放设备、电子设备、介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20936925

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022569085

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2020936925

Country of ref document: EP

Effective date: 20221220