WO2021012495A1 - 语音识别结果测试方法、装置、计算机设备和介质 - Google Patents

语音识别结果测试方法、装置、计算机设备和介质

Info

Publication number
WO2021012495A1
WO2021012495A1 · PCT/CN2019/116960 · CN2019116960W
Authority
WO
WIPO (PCT)
Prior art keywords
sub
segment
neural network
feature
features
Prior art date
Application number
PCT/CN2019/116960
Other languages
English (en)
French (fr)
Inventor
刘丽珍
吕小立
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2021012495A1 publication Critical patent/WO2021012495A1/zh

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/26 Speech to text systems
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to a method, device, computer equipment and storage medium for testing speech recognition results.
  • ASR (Automatic Speech Recognition) is the technology of automatically converting human speech into text and is currently in wide use.
  • Speech recognition is a multidisciplinary field, closely connected with acoustics, phonetics, linguistics, digital signal processing theory, information theory, computer science, and many other disciplines. Because speech signals are diverse and complex, a speech recognition system can achieve satisfactory performance only under certain restricted conditions, and that performance depends on many factors. Since these factors differ across application environments, the accuracy of ASR emotion recognition can easily be low in a given application scenario. If the ASR output is not verified, speech recognition errors can easily occur and business requirements may not be met.
  • A speech recognition result testing method, a device, computer equipment, and a medium are provided.
  • A method for testing speech recognition results includes:
  • comparing the speech recognition result of each sub-segment, one by one, with the speech recognition result of each sub-segment carried in the preset standard speech recognition result for the selected application scenario, and counting the proportion of sub-segments whose speech recognition results are consistent, to obtain the accuracy of the speech recognition result in the selected application scenario.
  • A speech recognition result testing device includes:
  • a data acquisition module, used to randomly select user reply voice data based on a preset speech script in any application scenario;
  • a dividing module, configured to obtain the user speech segment in the user reply voice data, divide the user segment into a plurality of sub-segments of a preset time length, and assign sub-segment identifiers;
  • a feature extraction module, used to extract the acoustic features of each sub-segment and obtain the emotion label of each sub-segment from the acoustic features;
  • a splicing and combination module, used to obtain the text data corresponding to each sub-segment through speech recognition, linearly splice the emotion label of each sub-segment with the corresponding text data, and add the sub-segment identifier between the emotion label and the text data, to obtain the speech recognition result of each sub-segment; and
  • a test module, configured to compare, according to the sub-segment identifiers, the speech recognition result of each sub-segment one by one with the speech recognition result of each sub-segment carried in the preset standard speech recognition result for the selected application scenario, and to count the proportion of sub-segments whose speech recognition results are consistent, to obtain the accuracy of the speech recognition result in the selected application scenario.
  • A computer device includes a memory and one or more processors. The memory stores computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to execute the following steps:
  • comparing the speech recognition result of each sub-segment with the speech recognition result of each sub-segment carried in the preset standard speech recognition result for the selected application scenario, and counting the proportion of sub-segments whose speech recognition results are consistent, to obtain the accuracy of the speech recognition result in the selected application scenario.
  • One or more non-volatile computer-readable storage media store computer-readable instructions which, when executed by one or more processors, cause the one or more processors to execute the following steps:
  • comparing the speech recognition result of each sub-segment with the speech recognition result of each sub-segment carried in the preset standard speech recognition result for the selected application scenario, and counting the proportion of sub-segments whose speech recognition results are consistent, to obtain the accuracy of the speech recognition result in the selected application scenario.
  • FIG. 1 is a schematic flowchart of a method for testing a speech recognition result according to one or more embodiments.
  • FIG. 2 is a schematic flowchart of a method for testing a speech recognition result in another embodiment.
  • FIG. 3 is a schematic flowchart of a method for testing a speech recognition result in yet another embodiment.
  • FIG. 4 is a block diagram of a speech recognition result testing device according to one or more embodiments.
  • FIG. 5 is a block diagram of a computer device according to one or more embodiments.
  • As shown in FIG. 1, a method for testing speech recognition results includes the following steps:
  • S100: Randomly select user reply voice data based on a preset speech script in any application scenario.
  • The preset speech script is dialogue script data prepared for different application scenarios. It consists of two parts, questions and answers, and simulates the dialogue between a customer and a salesperson (service staff) in a real environment.
  • Optionally, the speech scripts for the different application scenarios can be collected and stored in a database, in which each application scenario is associated with its corresponding speech script.
  • Application scenarios include loan marketing, repayment collection, loan consulting, and so on.
  • The server simulates the question-and-answer voice data replied on the basis of the preset speech script in a given application scenario.
  • Specifically, an application scenario set can be constructed from the application scenarios that need to be verified, and any application scenario in the set can be selected as the test scenario for the current round.
  • S200: Obtain the user speech segment in the user reply voice data, divide the user segment into a plurality of sub-segments of a preset time length, and assign sub-segment identifiers.
  • The server intercepts the reply voice and divides the user speech segment in the reply voice into sub-segments of a preset time length.
  • Specifically, the preset time length is relatively small, for example 3 to 5 seconds; that is, the user segment is divided into sub-segments of 3 to 5 seconds each.
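The splitting step above is straightforward to prototype. The sketch below is illustrative only and not the patent's implementation: it cuts a mono PCM sample array into fixed-length sub-segments and assigns each one an identifier (letters, to match the A/B/C identifiers used later in this description); the function and parameter names are assumptions.

```python
import numpy as np

def split_into_sub_segments(samples: np.ndarray, sample_rate: int,
                            seg_seconds: float = 4.0):
    """Return (sub_segment_id, sample_chunk) pairs for one user speech segment."""
    seg_len = int(seg_seconds * sample_rate)      # preset length, e.g. 3-5 seconds
    sub_segments = []
    for i, start in enumerate(range(0, len(samples), seg_len)):
        chunk = samples[start:start + seg_len]
        sub_segment_id = chr(ord("A") + i)        # A, B, C, ... (sketch only)
        sub_segments.append((sub_segment_id, chunk))
    return sub_segments
```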
  • S300: Extract the acoustic features of each sub-segment, and obtain the emotion label of each sub-segment from the acoustic features.
  • Acoustic features include the sound waveform, the signal, intonation, and so on.
  • Emotion labels include neutral, happy, sad, angry, surprised, scared, disgusted, excited, and so on.
  • Optionally, a window with a preset time interval can be set to collect acoustic features at a fixed rate, forming an acoustic feature set from which the emotion label is obtained.
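As a rough illustration of collecting acoustic features with a fixed-rate sliding window, the sketch below computes per-frame energy and zero-crossing rate as stand-ins for the waveform and intonation features named above; it is an assumption-laden toy, not the feature set used by the patent.

```python
import numpy as np

def frame_features(samples: np.ndarray, sample_rate: int,
                   win_seconds: float = 0.025, hop_seconds: float = 0.010):
    """Slide a fixed window over one sub-segment and collect simple frame features."""
    win, hop = int(win_seconds * sample_rate), int(hop_seconds * sample_rate)
    feats = []
    for start in range(0, len(samples) - win + 1, hop):
        frame = samples[start:start + win].astype(np.float64)
        energy = float(np.mean(frame ** 2))                        # loudness proxy
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))  # crude voicing/pitch proxy
        feats.append((energy, zcr))
    return np.array(feats)   # the acoustic feature set for this sub-segment
```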
  • S400: Use speech recognition technology to obtain the text data corresponding to each sub-segment, linearly splice the emotion label of each sub-segment with the corresponding text data, and add the sub-segment identifier between the emotion label and the text data, to obtain the speech recognition result of each sub-segment.
  • Taking each sub-segment as the object of study, the emotion label of the sub-segment is linearly spliced with the corresponding text data. This linear splicing can be understood as a "+" operation: the two pieces of data are simply joined together, and the sub-segment identifier is added between them so that the speech recognition results of the individual sub-segments can be accurately distinguished later.
  • For example, if the text data corresponding to a certain sub-segment is "Yes", the emotion label is "Happy", and the sub-segment identifier is A, the resulting speech recognition result is "Yes" A "Happy".
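A minimal sketch of this splicing step is shown below; the function name is an assumption, and the output format simply mirrors the "Yes" A "Happy" example above.

```python
def splice_recognition_result(text: str, sub_segment_id: str, emotion_label: str) -> str:
    """Join recognized text, sub-segment identifier, and emotion label into one result string."""
    # The identifier sits between the text data and the emotion label so that the
    # results of different sub-segments can be told apart later.
    return f'"{text}" {sub_segment_id} "{emotion_label}"'

# splice_recognition_result("Yes", "A", "Happy")  ->  '"Yes" A "Happy"'
```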
  • S500: According to the sub-segment identifiers, compare the speech recognition result of each sub-segment one by one with the speech recognition result of each sub-segment carried in the preset standard speech recognition result for the selected application scenario, and count the proportion of sub-segments whose speech recognition results are consistent, to obtain the accuracy of the speech recognition result in the selected application scenario.
  • The standard speech recognition results are obtained from expert analysis of historical speech scripts. They can also be written into the preset speech script database; that is, the database stores the speech script file, the standard speech recognition result corresponding to each sub-segment, and their correspondence. Each standard speech recognition result carries the text data corresponding to the sub-segment, the sub-segment identifier, and the corresponding emotion label.
  • The user reply voice data for the preset speech script in each application scenario includes multiple sub-segments. The speech recognition result of each sub-segment is compared with the result of the same sub-segment carried in the preset standard speech recognition result for the selected application scenario; the number of sub-segments whose results are consistent is recorded, and the proportion of these sub-segments among all sub-segments in the user reply voice data is calculated. This ratio is the accuracy of the speech recognition result in the selected application scenario.
  • For example, suppose there are three sub-segments (in practice there are far more) whose speech recognition results are: "Hello" A "Happy", "No" B "Neutral", "Goodbye" C "Disgusted", while the corresponding standard speech recognition results are: "Hello" A "Neutral", "No" B "Neutral", "Goodbye" C "Disgusted". Two of the three results match, so the accuracy of the speech recognition result in the selected application scenario is 66.7%. Optionally, after the accuracy of speech recognition and emotion labels in the currently selected application scenario has been tested, a new application scenario can be selected for verification and the above test process repeated.
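The comparison and counting step reduces to matching results by sub-segment identifier and reporting the consistent fraction. The sketch below is illustrative only; the dictionary layout and the example values, which mirror the three-sub-segment example above, are assumptions.

```python
def recognition_accuracy(results: dict, standard: dict) -> float:
    """results and standard map sub-segment id -> (text, emotion_label)."""
    consistent = sum(1 for seg_id, res in results.items()
                     if standard.get(seg_id) == res)
    return consistent / len(results) if results else 0.0

results  = {"A": ("Hello", "Happy"),   "B": ("No", "Neutral"), "C": ("Goodbye", "Disgusted")}
standard = {"A": ("Hello", "Neutral"), "B": ("No", "Neutral"), "C": ("Goodbye", "Disgusted")}
print(f"{recognition_accuracy(results, standard):.1%}")   # 66.7%
```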
  • In the speech recognition result testing method described above, user reply voice data based on a preset speech script is randomly selected in an arbitrary application scenario; the user speech segment in the user reply voice data is divided into multiple sub-segments of a preset time length; the acoustic features of each sub-segment are extracted and the emotion label of each sub-segment is obtained from them; the emotion label is linearly spliced with the user reply voice data and the sub-segment identifier is added; and the speech recognition result corresponding to each sub-segment is compared with the standard speech recognition result, counting the proportion of sub-segments whose results are consistent. In this way the accuracy of the speech recognition results in the selected application scenario can be verified efficiently and accurately.
  • In one embodiment, step S300 includes:
  • S320: Extract the acoustic features of each sub-segment.
  • S340: Input the extracted acoustic features into a trained deep-learning-based neural network model to obtain the emotion label.
  • Acoustic features can be further grouped into temporal structure features, amplitude structure features, fundamental frequency structure features, and formant structure features; the trained deep-learning-based neural network model has learned the correspondence between these features and the associated emotion labels.
  • In one embodiment, step S300 further includes:
  • S312: Obtain reply voice sample data corresponding to different emotion labels.
  • S314: Extract the temporal structure features, amplitude structure features, fundamental frequency structure features, and formant structure features from the reply voice sample data.
  • S316: Use the emotion labels in the reply voice sample data and the corresponding temporal structure features, amplitude structure features, fundamental frequency structure features, and formant structure features as training data, train the deep-learning-based neural network model, and obtain the trained deep-learning-based neural network model.
  • When an emotion label needs to be obtained, the extracted acoustic feature data is input into the emotion label recognition model above to obtain the emotion label corresponding to the sentence, and the emotion label is integrated with the reply voice data to obtain the speech recognition result.
  • In one embodiment, training the deep-learning-based neural network model to obtain the trained deep-learning-based neural network model includes: extracting the emotion labels and the corresponding temporal structure features, amplitude structure features, fundamental frequency structure features, and formant structure features from the training data; training the convolutional neural network part of the deep-learning-based neural network on the extracted feature data to learn local emotion labels; abstracting the local emotion labels through the recurrent neural network part of the network; and learning the global emotion label through the pooling layer of the deep-learning-based neural network, thereby obtaining the trained deep-learning-based neural network model.
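The paragraph above describes a convolutional part that learns local emotion cues, a recurrent part that abstracts them over time, and a pooling layer that yields a global label. The PyTorch sketch below shows one way such an architecture could look; the layer sizes, feature dimension, and label count are assumptions, and this is a sketch of the general idea rather than the patented model.

```python
import torch
import torch.nn as nn

class EmotionTagger(nn.Module):
    def __init__(self, feature_dim: int = 40, num_labels: int = 8):
        super().__init__()
        # Convolutional part: learns local emotion-related patterns over short spans.
        self.conv = nn.Sequential(
            nn.Conv1d(feature_dim, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
        )
        # Recurrent part: abstracts the local patterns across the whole sub-segment.
        self.rnn = nn.GRU(64, 64, batch_first=True, bidirectional=True)
        # Pooling over time gives a global representation, mapped to emotion labels.
        self.classifier = nn.Linear(128, num_labels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time_steps, feature_dim) frame-level acoustic features
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)   # (batch, time, 64)
        h, _ = self.rnn(h)                                 # (batch, time, 128)
        pooled = h.mean(dim=1)                             # global temporal pooling
        return self.classifier(pooled)                     # logits, one per emotion label

# logits = EmotionTagger()(torch.randn(2, 200, 40))  # e.g. two sub-segments of 200 frames
```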
  • In one embodiment, extracting the acoustic features of each sub-segment and obtaining the emotion label of each sub-segment from the acoustic features includes: obtaining the emotion label from the extracted acoustic features of each sub-segment and a qualitative acoustic-feature analysis table associated with the preset emotion labels. The qualitative analysis table carries the emotion labels, the acoustic features, and the qualitative analysis interval data of the acoustic features corresponding to the different emotion labels; the acoustic features include speech rate, average fundamental frequency, fundamental frequency range, intensity, sound quality, fundamental frequency variation, and clarity.
  • Different emotion labels correspond to different qualitative analysis intervals of the acoustic features. The qualitative intervals can be divided in advance into several values according to the type of acoustic feature; for speech rate, for example, the values can be very fast, slightly fast, slightly slow, faster or slower, and very slow. More specifically, the speech rate, average fundamental frequency, fundamental frequency range, intensity, sound quality, fundamental frequency variation, and clarity corresponding to each candidate emotion label are analyzed qualitatively to obtain qualitative analysis results, and the emotion label is then obtained from the acoustic features currently extracted from each sub-segment and the corresponding qualitative analysis results.
  • Further, emotion label feature templates can be constructed from the qualitative analysis results corresponding to the different emotion labels; when emotion label recognition is needed, the collected features are matched against the emotion label feature templates to determine the emotion label.
  • In practice, the qualitative analysis includes rating the speech rate as very fast, slightly fast, slightly slow, faster or slower, or very slow. Based on historical sample data, the average number of words per unit time corresponding to each emotion label can be obtained; according to these averages and the relative speech-rate relationships among the emotion labels, word-count-per-unit-time intervals are set for the qualitative judgment of each emotion label.
  • Similar sample-based interval settings can be used for the remaining features. The qualitative levels for the average fundamental frequency include very high, very high, slightly low, high, and very low; the fundamental frequency range includes very wide and slightly narrow; the intensity includes normal, higher, and lower; the sound quality includes irregular, breathy, resonant, loud and breathy, and muttering; the fundamental frequency variation includes normal, abrupt change on stressed syllables, downward inflection, smooth upward inflection, and downward inflection to the extreme; and the clarity includes precise, tense, unclear, normal, and normal.
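A toy version of the template-matching idea is sketched below: each emotion label gets qualitative intervals for a couple of features, and an extracted feature vector is matched against them. The interval values and feature choices are invented placeholders for illustration, not the patent's actual analysis table.

```python
# emotion label -> (words-per-second interval, average fundamental frequency interval in Hz)
QUALITATIVE_TEMPLATES = {
    "angry":   ((4.0, 7.0), (220.0, 400.0)),   # fast speech, high pitch  (assumed values)
    "sad":     ((0.5, 2.0), (80.0, 160.0)),    # very slow, low pitch     (assumed values)
    "neutral": ((2.0, 4.0), (120.0, 220.0)),   # normal rate and pitch    (assumed values)
}

def match_emotion(words_per_second: float, mean_f0_hz: float) -> str:
    """Return the first emotion label whose qualitative intervals contain both features."""
    for label, ((wps_lo, wps_hi), (f0_lo, f0_hi)) in QUALITATIVE_TEMPLATES.items():
        if wps_lo <= words_per_second <= wps_hi and f0_lo <= mean_f0_hz <= f0_hi:
            return label
    return "neutral"   # fall back when no template matches

print(match_emotion(5.2, 300.0))   # -> "angry" under these assumed intervals
```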
  • In one embodiment, after the accuracy of speech recognition and emotion labels in the selected application scenario has been verified, the method further includes: delaying a preset time and returning to the step of randomly selecting user reply voice data based on a preset speech script in any application scenario.
  • A speech recognition result testing device is provided. The device includes:
  • the data acquisition module 100 is configured to randomly select user response voice data based on preset speech scripts in any application scenario;
  • the dividing module 200 is used to obtain the user segment in the user reply voice data, divide the user segment into a plurality of sub-segments of preset time length, and assign the sub-segment identifiers;
  • the feature extraction module 300 is configured to extract the acoustic features of each sub-segment, and obtain the emotional label of each sub-segment according to the acoustic features;
  • the splicing and combination module 400, used to obtain the text data corresponding to each sub-segment through speech recognition, linearly splice the emotion label of each sub-segment with the corresponding text data, and add the sub-segment identifier between the emotion label and the text data, to obtain the speech recognition result of each sub-segment;
  • the test module 500, used to compare, according to the sub-segment identifiers, the speech recognition result of each sub-segment one by one with the speech recognition result of each sub-segment carried in the preset standard speech recognition result for the selected application scenario, and to count the proportion of sub-segments whose speech recognition results are consistent, to obtain the accuracy of the speech recognition result in the selected application scenario.
  • The speech recognition result testing device described above randomly selects user reply voice data based on a preset speech script in an arbitrary application scenario, divides the user speech segment in the user reply voice data into multiple sub-segments of a preset time length, extracts the acoustic features of each sub-segment, obtains the emotion label of each sub-segment from the acoustic features, linearly splices the emotion label with the user reply voice data while adding the sub-segment identifier, compares the speech recognition result corresponding to each sub-segment with the standard speech recognition result, and counts the proportion of sub-segments whose results are consistent, so that the accuracy of the speech recognition results in the selected application scenario can be verified efficiently and accurately.
  • In one embodiment, the feature extraction module 300 is also used to extract the acoustic features of each sub-segment, and to input the extracted acoustic features into a trained deep-learning-based neural network model to obtain the emotion label.
  • In one embodiment, the feature extraction module 300 is also used to obtain reply voice sample data corresponding to different emotion labels; extract the temporal structure features, amplitude structure features, fundamental frequency structure features, and formant structure features from the reply voice sample data; and use the emotion labels in the reply voice sample data and the corresponding temporal structure features, amplitude structure features, fundamental frequency structure features, and formant structure features as training data to train the deep-learning-based neural network model and obtain the trained deep-learning-based neural network model.
  • In one embodiment, the feature extraction module 300 is also used to extract the emotion labels and the corresponding temporal structure features, amplitude structure features, fundamental frequency structure features, and formant structure features from the training data; train the convolutional neural network part of the deep-learning-based neural network on the extracted feature data to learn local emotion labels; abstract the local emotion labels through the recurrent neural network part of the network; and learn the global emotion label through the pooling layer of the deep-learning-based neural network, obtaining the trained deep-learning-based neural network model.
  • In one embodiment, the feature extraction module 600 is also used to obtain the emotion label from the extracted acoustic features of each sub-segment and the qualitative analysis results of the speech features corresponding to the preset emotion labels; the qualitative acoustic-feature analysis table corresponding to the preset emotion labels carries the emotion labels, the acoustic features, and the qualitative analysis interval data corresponding to the different emotion labels.
  • Acoustic features include speech rate, average fundamental frequency, fundamental frequency range, intensity, sound quality, fundamental frequency change, and clarity.
  • In one embodiment, the speech recognition result testing device described above further includes a loop test module, which is used to delay a preset time and then control the data acquisition module 100, the dividing module 200, the feature extraction module 300, the recognition result combination module 400, and the comparison test module 500 to perform the corresponding operations.
  • Each module in the speech recognition result testing device described above can be implemented wholly or partly in software, in hardware, or in a combination of the two.
  • Each of the foregoing modules may be embedded in, or independent of, the processor of the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can call and execute the operations corresponding to each module.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 5.
  • the computer equipment includes a processor, a memory, a network interface and a database connected through a system bus. Among them, the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, a computer program, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
  • the database of the computer equipment is used to store preset speech scripts and historical expert data.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer program is executed by the processor to realize a speech recognition result test method.
  • Those skilled in the art will understand that the structure shown in FIG. 5 is only a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • A computer device includes a memory and one or more processors. The memory stores computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to implement the steps of the speech recognition result testing method provided in any embodiment of the present application.
  • One or more non-volatile computer-readable storage media store computer-readable instructions which, when executed by one or more processors, cause the one or more processors to implement the steps of the speech recognition result testing method provided in any embodiment of the present application.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

A speech recognition result testing method, the method comprising: randomly selecting user reply voice data based on a preset speech script in any application scenario; dividing the user speech segment in the user reply voice data into multiple sub-segments of a preset time length; extracting the acoustic features of each sub-segment and obtaining the emotion label of each sub-segment according to the acoustic features; linearly splicing the emotion label with the user reply voice data and adding a sub-segment identifier; and comparing the speech recognition result corresponding to each sub-segment with a standard speech recognition result and counting the proportion of sub-segments whose speech recognition results are consistent, so that the accuracy of the speech recognition result in the selected application scenario can be verified efficiently and accurately.

Description

语音识别结果测试方法、装置、计算机设备和介质
相关申请的交叉引用
本申请要求于2019年07月23日提交中国专利局,申请号为2019106670546,申请名称为“语音识别结果测试方法、装置、计算机设备和介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及一种语音识别结果测试方法、装置、计算机设备和存储介质。
背景技术
随着科学技术的发展,人工智能技术应用了越来越多的领域,给人们生产、生活带来便利,语音识别技术作为人工智能技术的重要组成部分也得到的日新月异的发展与应用。
在语音识别技术中,ASR(Automatic Speech Recognition,自动语音识别技术)是目前比较广泛使用的技术,具体来说,ASR是一种将人的语音转换为文本的技术。语音识别是一个多学科交叉的领域,它与声学、语音学、语言学、数字信号处理理论、信息论、计算机科学等众多学科紧密相连。由于语音信号的多样性和复杂性,语音识别系统只能在一定的限制条件下获得满意的性能且语音识别系统的性能多个因素。又由于在不同应用环境下多种因素情况不同,很容易造成在不同应用场景下ASR情感识别的正确率低的情况,若不对ASR进行验证,很容易造成语音识别出错,无法满足业务需求。
因此,有必要提供一种准确的语音识别结果测试方案。
发明内容
根据本申请公开的各种实施例,提供一种语音识别结果测试方法、装置、计算机设备和介质。
一种语音识别结果测试方法,包括:
随机选择任意应用场景下基于预设话术脚本的用户答复语音数据;
获取所述用户答复语音数据中用户话段,将所述用户话段分为多个预设时间长度的子话段,并分配子话段标识;
提取各子话段的声学特征,根据声学特征获取各子话段的情感标签;
采用语音识别技术获取所述各子话段对应的文本数据,将各子话段的情感标签与对应的文本数据线性拼接,并添加所述子话段标识于所述情感标签与所述文本数据之间,得到各子话段的语音识别结果;及
根据所述子话段标识,将所述各子话段的语音识别结果与已选择应用场景下预设标准 语音识别结果中携带的各子话段的语音识别结果逐一对比,计数语音识别结果一致的子话段占比,得到已选择应用场景下语音识别结果的准确度。
一种语音识别结果测试装置,包括:
数据获取模块,用于随机选择任意应用场景下基于预设话术脚本的用户答复语音数据;
划分模块,用于获取所述用户答复语音数据中用户话段,将所述用户话段分为多个预设时间长度的子话段,并分配子话段标识;
特征提取模块,用于提取各子话段的声学特征,根据声学特征获取各子话段的情感标签;
拼接组合模块,用于采用语音识别技术获取所述各子话段对应的文本数据,将各子话段的情感标签与对应的文本数据线性拼接,并添加所述子话段标识于所述情感标签与所述文本数据之间,得到各子话段的语音识别结果;及
测试模块,用于根据所述子话段标识,将所述各子话段的语音识别结果与已选择应用场景下预设标准语音识别结果中携带的各子话段的语音识别结果逐一对比,计数语音识别结果一致的子话段占比,得到已选择应用场景下语音识别结果的准确度。
一种计算机设备,包括存储器和一个或多个处理器,所述存储器中储存有计算机可读指令,所述计算机可读指令被所述处理器执行时,使得所述一个或多个处理器执行以下步骤:
随机选择任意应用场景下基于预设话术脚本的用户答复语音数据;
获取所述用户答复语音数据中用户话段,将所述用户话段分为多个预设时间长度的子话段,并分配子话段标识;
提取各子话段的声学特征,根据声学特征获取各子话段的情感标签;
采用语音识别技术获取所述各子话段对应的文本数据,将各子话段的情感标签与对应的文本数据线性拼接,并添加所述子话段标识于所述情感标签与所述文本数据之间,得到各子话段的语音识别结果;及
根据所述子话段标识,将所述各子话段的语音识别结果与已选择应用场景下预设标准语音识别结果中携带的各子话段的语音识别结果逐一对比,计数语音识别结果一致的子话段占比,得到已选择应用场景下语音识别结果的准确度。
一个或多个存储有计算机可读指令的非易失性计算机可读存储介质,计算机可读指令被一个或多个处理器执行时,使得一个或多个处理器执行以下步骤:
随机选择任意应用场景下基于预设话术脚本的用户答复语音数据;
获取所述用户答复语音数据中用户话段,将所述用户话段分为多个预设时间长度的子话段,并分配子话段标识;
提取各子话段的声学特征,根据声学特征获取各子话段的情感标签;
采用语音识别技术获取所述各子话段对应的文本数据,将各子话段的情感标签与对应 的文本数据线性拼接,并添加所述子话段标识于所述情感标签与所述文本数据之间,得到各子话段的语音识别结果;及
根据所述子话段标识,将所述各子话段的语音识别结果与已选择应用场景下预设标准语音识别结果中携带的各子话段的语音识别结果逐一对比,计数语音识别结果一致的子话段占比,得到已选择应用场景下语音识别结果的准确度。
本申请的一个或多个实施例的细节在下面的附图和描述中提出。本申请的其它特征和优点将从说明书、附图以及权利要求书变得明显。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其它的附图。
图1为根据一个或多个实施例中语音识别结果测试方法的流程示意图。
图2为再一个实施例中语音识别结果测试方法的流程示意图。
图3为又一个实施例中语音识别结果测试方法的流程示意图。
图4为根据一个或多个实施例中语音识别结果测试装置的框图。
图5为根据一个或多个实施例中计算机设备的框图。
具体实施方式
为了使本申请的技术方案及优点更加清楚明白,以下结合附图及实施例,对本申请进行进一步详细说明。应当理解,此处描述的具体实施例仅仅用以解释本申请,并不用于限定本申请。
如图1所示,提供了一种语音识别结果测试方法,包括以下步骤:
S100:随机选择任意应用场景下基于预设话术脚本的用户答复语音数据。
预设话术脚本是基于不同应用场景下编写对话脚本数据,其具体包括问和答两部分数据,其模拟真实环境下客户与业务员(服务人员)之间的对话。非必要的,可以将不同应用场景下的话术脚本归集存储到一个数据库中,在该数据库中存储有不同应用场景下对应的话术脚本。应用场景包括贷款营销、催还款、贷款咨询等。服务器模拟在某一个应用场景下,基于预设话术脚本答复的问答语音数据。具体来说,针对需要验证的应用场景可以构建成一个应用场景集合,在应用场景集合中选择任意一个应用场景作为本轮测试场景。
S200:获取用户答复语音数据中用户话段,将用户话段分为多个预设时间长度的子话段,并分配子话段标识。
服务器对答复语音进行截取,将答复语音中用户话段划分为预设时间长度的子话段。具体的,预设时间长度比较小,例如3-5秒;即将用户话段分为3-5秒长度的子话段。
S300:提取各子话段的声学特征,根据声学特征获取各子话段的情感标签。
声学特征包括声波、信号以及语调等。情感标签包括中立、开心、伤心、生气、惊喜、害怕、厌恶、兴奋等。非必要的,可以设置预设时间间隔的窗口,以固定频率采集声学特征,构成声学特征集,根据声学特征集获取的情感标签。
S400:采用语音识别技术获取各子话段对应的文本数据,将各子话段的情感标签与对应的文本数据线性拼接,并添加子话段标识于情感标签与文本数据之间,得到各子话段的语音识别结果。
以每个子话段作为研究对象,将子话段的情感标签与对应的文本数据线性拼接,这个线性拼接过程可以理解为“+”的过程,即将两部分数据拼凑在一起,另外在两者之间添加子话段标识,以便后续能够准确区分出各子话段的语音识别结果。具体来说,线性拼接的过程可以简单理解为将文本数据拼接情感标签,例如某个子话段对应的文本数据是“可以”,情感标签是“开心”,该子话段标识为A,则得到的语音识别结果为“可以”A“开心”。
S500:根据子话段标识,将各子话段的语音识别结果与已选择应用场景下预设标准语音识别结果中携带的各子话段的语音识别结果逐一对比,计数语音识别结果一致的子话段占比,得到已选择应用场景下语音识别结果的准确度。
标准语音识别结果是基于专家经验数据分析历史话术脚本得出的。其同样可以写入到预设话术脚本数据库中,即在预设话术脚本数据库内存储有话术脚本文件-各子话段对应的标准语音识别结果及其对应关系,在标准语音识别结果中携带有子话段对应文本数据、子话段标识符以及对应的情感标签。每个应用场景对应的预设话术脚本的用户答复语音数据中包括多个子话段,记录比较各子话段的语音识别结果与已选择应用场景下预设标准语音识别结果中携带的各子话段的语音识别结果一致的子话段数量,并且计算这部分子话段数量占整个用户答复语音数据包括子话段的比例,得到该比例即为已选择应用场景下语音识别结果的准确度。例如当前有3个子话段(实际情况远大于这个数量),得到各个子话段的语音识别结果为:你好A开心、不要B中立、再见C厌恶;对应的标准语音识别结果中包括:你好A中立、不要B中立、再见C厌恶,则得到已选择应用场景下语音识别结果的准确度为66.7%。非必要的,在测试完当前已选择应用场景下语音识别以及情感标签准确度之后,可以重新选择新的应用场景进行验证,重复上述语音识别结果测试过程。
上述语音识别结果测试方法,随机选择任意应用场景下基于预设话术脚本的用户答复语音数据,将用户答复语音数据中用户话段分为多个预设时间长度的子话段,提取各子话段的声学特征,根据声学特征获取各子话段的情感标签,将情感标签与用户答复语音数据线性拼接,并且添加子话段标识,将各个子话段对应的语音识别结果与标准语音识别结果比较,计数语音识别结果一致的子话段占比,可以高效且准确验证已选择应用场景下语音识别结果的准确性。
如图2所示,在其中一个实施例中,步骤S300包括:
S320:提取各子话段的声学特征。
S340:将提取的声学特征输入已训练的基于深度学习的神经网络模型,得到情感标签。
声学特征进一步可以归类为时间构造特征、振幅构造特征、基频构造特征以及共振峰构造特征,在已训练的基于深度学习的神经网络模型中,训练得到有上述特征以及对应情感标签之间对应关系。
如图3所示,在其中一个实施例中,步骤S300还包括:
S312:获取不同情感标签对应的答复语音样本数据。
S314:提取答复语音样本数据中时间构造特征、振幅构造特征、基频构造特征以及共振峰构造特征。
S316:将答复语音样本数据中情感标签以及对应的时间构造特征、振幅构造特征、基频构造特征以及共振峰构造特征作为训练数据,训练基于深度学习的神经网络模型,得到已训练的基于深度学习的神经网络模型。
当需要获取情感标签时,将提取的声学特征数据输入至上述情感标签识别模型,得到句子对应的情感标签,将情感标签与答复的语音数据整合,即得到语音识别结果。
在其中一个实施例中,训练基于深度学习的神经网络模型,得到已训练的基于深度学习的神经网络模型包括:提取训练数据中情感标签以及对应的时间构造特征、振幅构造特征、基频构造特征以及共振峰构造特征;根据提取的特征数据训练基于深度学习的神经网络中的卷积神经网络部分学习的局部情感标签;通过卷积神经网络中的循环神经网络部分、对局部情感标签进行抽象,并通过基于深度学习的神经网络中池化层学习到全局情感标签,得到已训练的基于深度学习的神经网络模型。
在其中一个实施例中,提取各子话段的声学特征,根据声学特征获取各子话段的情感标签包括:根据提取各子话段的声学特征以及预设情感标签对应的声学特征定性分析表,得到情感标签;其中,预设情感标签对应的声学特征定性分析表中携带有情感标签、声学特征以及不同情感标签对应声学特征的定性分析区间数据,声学特征包括语速、平均基频、基频范围、强度、音质、基频变化以及清晰度。
不同情感标签对应不同声学特征的定性分析区间,定性分析区间具体可以是根据声学特征类型预先划分几个区间值,例如针对语速,可以划分为很快、稍快、稍慢、较快或较慢、非常慢。更具体来说,针对待选情感标签对应的包括语速、平均基频、基频范围、强度、音质、基频变化以及清晰度情况定性分析,得到定性分析结果,根据当前提取的各子话段的声学特征以及对应的定性分析结果得到,情感标签。进一步的,可以根据不同情感标签对应的定性分析结果分别构建情感标签特征模板,当需要进行情感标签识别时,将采集到的特征与情感标签特征模板匹配,确定情感标签。在实际应用中,定性分析包括:语速设定为很快、稍快、稍慢、较快或较慢、非常慢,其具体可以根据历史样本数据,获取不同情感标签对应的单位时间内平均词语个数,根据不同情感标签对应的单位时间内平均词语个数以及不同情感标签对应语速相对大小关系,设定不同情感标签定性判定对应的单位时间内词语个数区间。下述针对平均基频、基频范围、强度、音质、基频变化以及清晰 度的判定都可以采用上述类似基于样本数据以及相对关系划设定性判定区间的方式实现平均基准基于采集的声音数据进行分析,其定性分析程度包括非常高、非常高、稍低、很高、非常低;基频范围包括很宽、稍窄;强度包括正常、较高、较低;音质包括:不规则、带呼吸声、引起共鸣的、带呼吸声响亮、嘟嚷;基频变化包括:正常、重读音节突变、向下变形、平滑向上变形、向下变到极点;清晰度包括:精确的、紧张的、不清楚的、正常、正常。其具体如下表格:
Figure PCTCN2019116960-appb-000001
在其中一个实施例中,验证已选择应用场景下语音识别以及情感标签准确性之后,还包括:延时预设时间,返回随机选择任意应用场景下基于预设话术脚本的用户答复语音数据的步骤。
在进行常规环境的下语音识别测试之外,还可以针对性进行噪声换将下语音测试,其具体可以采集已选择应用场景中在噪声环境下基于预设话术脚本的用户答复语音数据,将采集的用户答复语音数据作为检测参数重复上述测试过程,得到噪声环境下的语音识别测试。进一步的,还可以测试远距离条件下语音识别效果,其同样只需将远距离条件下采集的用户答复语音数据作为测试数据,重复上述测试过程实现。
应该理解的是,虽然图1-3的流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,这些步骤可以以其它的顺序执行。而且,图1-3中的至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,这些子步骤或者阶段的执行顺序也不必然是依次进行,而是可以与其它步骤或者其它步骤的子步骤或者阶段的至少一部分轮流或者交替地执行
如图4所示,一种语音识别结果测试装置,装置包括:
数据获取模块100,用于随机选择任意应用场景下基于预设话术脚本的用户答复语音数据;
划分模块200,用于获取用户答复语音数据中用户话段,将用户话段分为多个预设时间长度的子话段,并分配子话段标识;
特征提取模块300,用于提取各子话段的声学特征,根据声学特征获取各子话段的情感标签;
拼接组合模块400,用于采用语音识别技术获取各子话段对应的文本数据,将各子话段的情感标签与对应的文本数据线性拼接,并添加子话段标识于情感标签与文本数据之间,得到各子话段的语音识别结果;
测试模块500,用于根据子话段标识,将各子话段的语音识别结果与已选择应用场景下预设标准语音识别结果中携带的各子话段的语音识别结果逐一对比,计数语音识别结果一致的子话段占比,得到已选择应用场景下语音识别结果的准确度。
上述语音识别结果测试装置,随机选择任意应用场景下基于预设话术脚本的用户答复语音数据,将用户答复语音数据中用户话段分为多个预设时间长度的子话段,提取各子话段的声学特征,根据声学特征获取各子话段的情感标签,将情感标签与用户答复语音数据线性拼接,并且添加子话段标识,将各个子话段对应的语音识别结果与标准语音识别结果比较,计数语音识别结果一致的子话段占比,可以高效且准确验证已选择应用场景下语音识别结果的准确性。
在其中一个实施例中,特征提取模块300还用于提取各子话段的声学特征;将提取的声学特征输入已训练的基于深度学习的神经网络模型,得到情感标签。
在其中一个实施例中,特征提取模块300还用于获取不同情感标签对应的答复语音样本数据;提取答复语音样本数据中时间构造特征、振幅构造特征、基频构造特征以及共振峰构造特征;将答复语音样本数据中情感标签以及对应的时间构造特征、振幅构造特征、基频构造特征以及共振峰构造特征作为训练数据,训练基于深度学习的神经网络模型,得到已训练的基于深度学习的神经网络模型。
在其中一个实施例中,特征提取模块300还用于提取训练数据中情感标签以及对应的时间构造特征、振幅构造特征、基频构造特征以及共振峰构造特征;根据提取的特征数据训练神经网络中的卷积神经网络部分学习的局部情感标签;通过卷积神经网络中的循环神经网络部分、对局部情感标签进行抽象,并通过基于深度学习的神经网络中池化层学习到全局情感标签,得到已训练的基于深度学习的神经网络模型。
在其中一个实施例中,特征提取模块600还用于根据提取各子话段的声学特征以及预设情感标签对应的语音特征定性分析结果,得到情感标签;其中,预设情感标签对应的声学特征定性分析表中携带有情感标签、声学特征以及不同情感标签对应声学特征的定性分析区间数据,声学特征包括语速、平均基频、基频范围、强度、音质、基频变化以及清晰度。
在其中一个实施例中,上述语音识别结果测试装置还包括循环测试模块,用于延时预设时间,控制数据获取模块100、划分模块200、特征提取模块300、识别结果组合模块400以及比较测试模块500执行对应操作。
关于语音识别结果测试装置的具体限定可以参见上文中对于语音识别结果测试方法的限定,在此不再赘述。上述语音识别结果测试装置中的各个模块可全部或部分通过软件、硬件及其组合来实现。上述各模块可以硬件形式内嵌于或独立于计算机设备中的处理器中,也可以以软件形式存储于计算机设备中的存储器中,以便于处理器调用执行以上各个模块对应的操作。
在一个实施例中,提供了一种计算机设备,该计算机设备可以是服务器,其内部结构图可以如图5所示。该计算机设备包括通过系统总线连接的处理器、存储器、网络接口和数据库。其中,该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统、计算机程序和数据库。该内存储器为非易失性存储介质中的操作系统和计算机程序的运行提供环境。该计算机设备的数据库用于存储预设话术脚本以及历史专家数据。该计算机设备的网络接口用于与外部的终端通过网络连接通信。该计算机程序被处理器执行时以实现一种语音识别结果测试方法。
本领域技术人员可以理解,图5中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备的限定,具体的计算机设备可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。
一种计算机设备,包括存储器和一个或多个处理器,存储器中储存有计算机可读指令,计算机可读指令被处理器执行时,使得一个或多个处理器实现本申请任意一个实施例中提供的语音识别结果测试方法的步骤。
一个或多个存储有计算机可读指令的非易失性计算机可读存储介质,计算机可读指令被一个或多个处理器执行时,使得一个或多个处理器实现本申请任意一个实施例中提供的语音识别结果测试方法的步骤。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机可读指令来指令相关的硬件来完成,所述的计算机可读指令可存储于一非易失性计算机可读取存储介质中,该计算机可读指令在执行时,可包括如上述各方法的实施例的流程。其中,本申请所提供的各实施例中所使用的对存储器、存储、数据库或其它介质的任何引用,均可包括非易失性和/或易失性存储器。非易失性存储器可包括只读存储器(ROM)、可编程ROM(PROM)、电可编程ROM(EPROM)、电可擦除可编程ROM(EEPROM)或闪存。易失性存储器可包括随机存取存储器(RAM)或者外部高速缓冲存储器。作为说明而非局限,RAM以多种形式可得,诸如静态RAM(SRAM)、动态RAM(DRAM)、同步DRAM(SDRAM)、双数据率SDRAM(DDRSDRAM)、增强型SDRAM(ESDRAM)、同步链路(Synchlink)DRAM(SLDRAM)、存储器总线(Rambus)直接 RAM(RDRAM)、直接存储器总线动态RAM(DRDRAM)、以及存储器总线动态RAM(RDRAM)等。
以上实施例的各技术特征可以进行任意的组合,为使描述简洁,未对上述实施例中的各个技术特征所有可能的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为是本说明书记载的范围。
以上所述实施例仅表达了本申请的几种实施方式,其描述较为具体和详细,但并不能因此而理解为对发明专利范围的限制。应当指出的是,对于本领域的普通技术人员来说,在不脱离本申请构思的前提下,还可以做出若干变形和改进,这些都属于本申请的保护范围。因此,本申请专利的保护范围应以所附权利要求为准。

Claims (20)

  1. 一种语音识别结果测试方法,包括:
    随机选择任意应用场景下基于预设话术脚本的用户答复语音数据;
    获取所述用户答复语音数据中用户话段,将所述用户话段分为多个预设时间长度的子话段,并分配子话段标识;
    提取各子话段的声学特征,根据声学特征获取各子话段的情感标签;
    采用语音识别技术获取所述各子话段对应的文本数据,将各子话段的情感标签与对应的文本数据线性拼接,并添加所述子话段标识于所述情感标签与所述文本数据之间,得到各子话段的语音识别结果;及
    根据所述子话段标识,将所述各子话段的语音识别结果与已选择应用场景下预设标准语音识别结果中携带的各子话段的语音识别结果逐一对比,计数语音识别结果一致的子话段占比,得到已选择应用场景下语音识别结果的准确度。
  2. 根据权利要求1所述的方法,其特征在于,所述提取各子话段的声学特征,根据声学特征获取各子话段的情感标签包括:
    提取各子话段的声学特征;及
    将提取的声学特征输入已训练的基于深度学习的神经网络模型,得到情感标签。
  3. 根据权利要求2所述的方法,其特征在于,还包括:
    获取不同情感标签对应的答复语音样本数据;
    提取所述答复语音样本数据中时间构造特征、振幅构造特征、基频构造特征以及共振峰构造特征;及
    将所述答复语音样本数据中情感标签以及对应的时间构造特征、振幅构造特征、基频构造特征以及共振峰构造特征作为训练数据,训练基于深度学习的神经网络模型,得到已训练的基于深度学习的神经网络模型。
  4. 根据权利要求3所述的方法,其特征在于,所述训练基于深度学习的神经网络模型,得到已训练的基于深度学习的神经网络模型包括:
    提取所述训练数据中情感标签以及对应的时间构造特征、振幅构造特征、基频构造特征以及共振峰构造特征;
    根据提取的特征数据训练基于深度学习的神经网络中的卷积神经网络部分学习的局部情感标签;及
    通过卷积神经网络中的循环神经网络部分、对所述局部情感标签进行抽象,并通过基于深度学习的神经网络中池化层学习到全局情感标签,得到已训练的基于深度学习的神经网络模型。
  5. 根据权利要求1所述的方法,其特征在于,所述提取各子话段的声学特征,根据声学特征获取各子话段的情感标签包括:
    根据提取各子话段的声学特征以及预设情感标签对应的声学特征定性分析表,得到情 感标签,其中,所述预设情感标签对应的声学特征定性分析表中携带有情感标签、声学特征以及不同情感标签对应声学特征的定性分析区间数据,所述声学特征包括语速、平均基频、基频范围、强度、音质、基频变化以及清晰度。
  6. 根据权利要求5所述的方法,其特征在于,所述根据提取各子话段的声学特征以及预设情感标签对应的声学特征定性分析表,得到情感标签包括:
    根据预设情感标签对应的声学特征定性分析表分别构建情感标签特征模板;及
    将提取各子话段的声学特征与所述情感标签特征模板匹配,得到情感标签。
  7. 根据权利要求1所述的方法,其特征在于,所述验证已选择应用场景下语音识别以及情感标签准确性之后,还包括:
    延时预设时间,返回所述随机选择任意应用场景下基于预设话术脚本的用户答复语音数据的步骤。
  8. 根据权利要求1所述的方法,所述随机选择任意应用场景下基于预设话术脚本的用户答复语音数据包括:
    获取需验证的应用场景;
    根据所述需验证的应用场景构建应用场景集合;及
    在所述应用场景集合中,随机选择任意应用场景下基于预设话术脚本的用户答复语音数据。
  9. 根据权利要求1所述的方法,所述提取各子话段的声学特征,根据声学特征获取各子话段的情感标签包括:
    获取预设时间间隔的窗口;
    根据所述预设时间间隔的窗口以固定频率采集各子话段的声学特征,构成声学特征集;及
    根据所述声学特征集,获取各子话段的情感标签。
  10. 一种语音识别结果测试装置,包括:
    数据获取模块,用于随机选择任意应用场景下基于预设话术脚本的用户答复语音数据;
    划分模块,用于获取所述用户答复语音数据中用户话段,将所述用户话段分为多个预设时间长度的子话段,并分配子话段标识;
    特征提取模块,用于提取各子话段的声学特征,根据声学特征获取各子话段的情感标签;
    拼接组合模块,用于采用语音识别技术获取所述各子话段对应的文本数据,将各子话段的情感标签与对应的文本数据线性拼接,并添加所述子话段标识于所述情感标签与所述文本数据之间,得到各子话段的语音识别结果;及
    测试模块,用于根据所述子话段标识,将所述各子话段的语音识别结果与已选择应用场景下预设标准语音识别结果中携带的各子话段的语音识别结果逐一对比,计数语音识别 结果一致的子话段占比,得到已选择应用场景下语音识别结果的准确度。
  11. 根据权利要求10所述的装置,其特征在于,所述特征提取模块还用于提取各子话段的声学特征;及将提取的声学特征输入已训练的基于深度学习的神经网络模型,得到情感标签。
  12. 根据权利要求10所述的装置,其特征在于,所述特征提取模块还用于获取不同情感标签对应的答复语音样本数据;提取所述答复语音样本数据中时间构造特征、振幅构造特征、基频构造特征以及共振峰构造特征;及将所述答复语音样本数据中情感标签以及对应的时间构造特征、振幅构造特征、基频构造特征以及共振峰构造特征作为训练数据,训练基于深度学习的神经网络模型,得到已训练的基于深度学习的神经网络模型。
  13. 根据权利要求10所述的装置,其特征在于,所述特征提取模块还用于提取所述训练数据中情感标签以及对应的时间构造特征、振幅构造特征、基频构造特征以及共振峰构造特征;根据提取的特征数据训练基于深度学习的神经网络中的卷积神经网络部分学习的局部情感标签;及通过卷积神经网络中的循环神经网络部分、对所述局部情感标签进行抽象,并通过基于深度学习的神经网络中池化层学习到全局情感标签,得到已训练的基于深度学习的神经网络模型。
  14. 一种计算机设备,包括存储器及一个或多个处理器,所述存储器中储存有计算机可读指令,所述计算机可读指令被所述一个或多个处理器执行时,使得所述一个或多个处理器执行以下步骤:
    随机选择任意应用场景下基于预设话术脚本的用户答复语音数据;
    获取所述用户答复语音数据中用户话段,将所述用户话段分为多个预设时间长度的子话段,并分配子话段标识;
    提取各子话段的声学特征,根据声学特征获取各子话段的情感标签;
    采用语音识别技术获取所述各子话段对应的文本数据,将各子话段的情感标签与对应的文本数据线性拼接,并添加所述子话段标识于所述情感标签与所述文本数据之间,得到各子话段的语音识别结果;及
    根据所述子话段标识,将所述各子话段的语音识别结果与已选择应用场景下预设标准语音识别结果中携带的各子话段的语音识别结果逐一对比,计数语音识别结果一致的子话段占比,得到已选择应用场景下语音识别结果的准确度。
  15. 根据权利要求14所述的计算机设备,其特征在于,所述处理器执行所述计算机可读指令时还执行以下步骤:
    提取各子话段的声学特征;及
    将提取的声学特征输入已训练的基于深度学习的神经网络模型,得到情感标签。
  16. 根据权利要求14所述的计算机设备,其特征在于,所述处理器执行所述计算机可读指令时还执行以下步骤:
    获取不同情感标签对应的答复语音样本数据;
    提取所述答复语音样本数据中时间构造特征、振幅构造特征、基频构造特征以及共振峰构造特征;及
    将所述答复语音样本数据中情感标签以及对应的时间构造特征、振幅构造特征、基频构造特征以及共振峰构造特征作为训练数据,训练基于深度学习的神经网络模型,得到已训练的基于深度学习的神经网络模型。
  17. 一个或多个存储有计算机可读指令的非易失性计算机可读存储介质,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器执行以下步骤:
    随机选择任意应用场景下基于预设话术脚本的用户答复语音数据;
    获取所述用户答复语音数据中用户话段,将所述用户话段分为多个预设时间长度的子话段,并分配子话段标识;
    提取各子话段的声学特征,根据声学特征获取各子话段的情感标签;
    采用语音识别技术获取所述各子话段对应的文本数据,将各子话段的情感标签与对应的文本数据线性拼接,并添加所述子话段标识于所述情感标签与所述文本数据之间,得到各子话段的语音识别结果;及
    根据所述子话段标识,将所述各子话段的语音识别结果与已选择应用场景下预设标准语音识别结果中携带的各子话段的语音识别结果逐一对比,计数语音识别结果一致的子话段占比,得到已选择应用场景下语音识别结果的准确度。
  18. 根据权利要求17所述的存储介质,其特征在于,所述计算机可读指令被所述处理器执行时还执行以下步骤:
    提取各子话段的声学特征;及
    将提取的声学特征输入已训练的基于深度学习的神经网络模型,得到情感标签。
  19. 根据权利要求17所述的存储介质,其特征在于,所述计算机可读指令被所述处理器执行时还执行以下步骤:
    获取不同情感标签对应的答复语音样本数据;
    提取所述答复语音样本数据中时间构造特征、振幅构造特征、基频构造特征以及共振峰构造特征;及
    将所述答复语音样本数据中情感标签以及对应的时间构造特征、振幅构造特征、基频构造特征以及共振峰构造特征作为训练数据,训练基于深度学习的神经网络模型,得到已训练的基于深度学习的神经网络模型。
  20. 根据权利要求17所述的存储介质,其特征在于,所述计算机可读指令被所述处理器执行时还执行以下步骤:
    提取所述训练数据中情感标签以及对应的时间构造特征、振幅构造特征、基频构造特征以及共振峰构造特征;
    根据提取的特征数据训练基于深度学习的神经网络中的卷积神经网络部分学习的局部情感标签;及
    通过卷积神经网络中的循环神经网络部分、对所述局部情感标签进行抽象,并通过基于深度学习的神经网络中池化层学习到全局情感标签,得到已训练的基于深度学习的神经网络模型。
PCT/CN2019/116960 2019-07-23 2019-11-11 语音识别结果测试方法、装置、计算机设备和介质 WO2021012495A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910667054.6A CN110556098B (zh) 2019-07-23 2019-07-23 语音识别结果测试方法、装置、计算机设备和介质
CN201910667054.6 2019-07-23

Publications (1)

Publication Number Publication Date
WO2021012495A1 true WO2021012495A1 (zh) 2021-01-28

Family

ID=68735961

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/116960 WO2021012495A1 (zh) 2019-07-23 2019-11-11 语音识别结果测试方法、装置、计算机设备和介质

Country Status (2)

Country Link
CN (1) CN110556098B (zh)
WO (1) WO2021012495A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021134550A1 (zh) * 2019-12-31 2021-07-08 李庆远 多个语音识别输出的人类合并和训练
CN111522943A (zh) * 2020-03-25 2020-08-11 平安普惠企业管理有限公司 逻辑节点的自动化测试方法、装置、设备及存储介质
CN112349290B (zh) * 2021-01-08 2021-04-20 北京海天瑞声科技股份有限公司 一种基于三元组的语音识别准确率计算方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110295607A1 (en) * 2010-05-31 2011-12-01 Akash Krishnan System and Method for Recognizing Emotional State from a Speech Signal
CN104464757A (zh) * 2014-10-28 2015-03-25 科大讯飞股份有限公司 语音评测方法和语音评测装置
CN105741832A (zh) * 2016-01-27 2016-07-06 广东外语外贸大学 一种基于深度学习的口语评测方法和系统
CN107767881A (zh) * 2016-08-15 2018-03-06 中国移动通信有限公司研究院 一种语音信息的满意度的获取方法和装置
CN109272993A (zh) * 2018-08-21 2019-01-25 中国平安人寿保险股份有限公司 语音类别的识别方法、装置、计算机设备和存储介质

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9870765B2 (en) * 2016-06-03 2018-01-16 International Business Machines Corporation Detecting customers with low speech recognition accuracy by investigating consistency of conversation in call-center
CN106548772A (zh) * 2017-01-16 2017-03-29 上海智臻智能网络科技股份有限公司 语音识别测试系统及方法
CN108538296A (zh) * 2017-03-01 2018-09-14 广东神马搜索科技有限公司 语音识别测试方法及测试终端
CN107086040B (zh) * 2017-06-23 2021-03-02 歌尔股份有限公司 语音识别能力测试方法和装置
CN107452404A (zh) * 2017-07-31 2017-12-08 哈尔滨理工大学 语音情感识别的优选方法
CN108777141B (zh) * 2018-05-31 2022-01-25 康键信息技术(深圳)有限公司 测试装置、测试的方法及存储介质

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110295607A1 (en) * 2010-05-31 2011-12-01 Akash Krishnan System and Method for Recognizing Emotional State from a Speech Signal
CN104464757A (zh) * 2014-10-28 2015-03-25 科大讯飞股份有限公司 语音评测方法和语音评测装置
CN105741832A (zh) * 2016-01-27 2016-07-06 广东外语外贸大学 一种基于深度学习的口语评测方法和系统
CN107767881A (zh) * 2016-08-15 2018-03-06 中国移动通信有限公司研究院 一种语音信息的满意度的获取方法和装置
CN109272993A (zh) * 2018-08-21 2019-01-25 中国平安人寿保险股份有限公司 语音类别的识别方法、装置、计算机设备和存储介质

Also Published As

Publication number Publication date
CN110556098A (zh) 2019-12-10
CN110556098B (zh) 2023-04-18

Similar Documents

Publication Publication Date Title
US10176811B2 (en) Neural network-based voiceprint information extraction method and apparatus
WO2021128741A1 (zh) 语音情绪波动分析方法、装置、计算机设备及存储介质
US10706873B2 (en) Real-time speaker state analytics platform
WO2021164147A1 (zh) 基于人工智能的服务评价方法、装置、设备及存储介质
US10043517B2 (en) Audio-based event interaction analytics
WO2021051607A1 (zh) 视频数据的欺诈检测方法、装置、计算机设备和存储介质
US20150095031A1 (en) System and method for crowdsourcing of word pronunciation verification
WO2021012495A1 (zh) 语音识别结果测试方法、装置、计算机设备和介质
US8447603B2 (en) Rating speech naturalness of speech utterances based on a plurality of human testers
CN104903954A (zh) 使用基于人工神经网络的亚语音单位区分的说话人验证及识别
US10755595B1 (en) Systems and methods for natural language processing for speech content scoring
US11354754B2 (en) Generating self-support metrics based on paralinguistic information
US20140195239A1 (en) Systems and Methods for an Automated Pronunciation Assessment System for Similar Vowel Pairs
US20230177835A1 (en) Relationship modeling and key feature detection based on video data
US10283142B1 (en) Processor-implemented systems and methods for determining sound quality
WO2020056995A1 (zh) 语音流利度识别方法、装置、计算机设备及可读存储介质
CN111901627B (zh) 视频处理方法、装置、存储介质及电子设备
Kopparapu Non-linguistic analysis of call center conversations
CN112966082A (zh) 音频质检方法、装置、设备以及存储介质
CN110782902A (zh) 音频数据确定方法、装置、设备和介质
KR20210071713A (ko) 스피치 스킬 피드백 시스템
CN104700831B (zh) 分析音频文件的语音特征的方法和装置
CN109408175B (zh) 通用高性能深度学习计算引擎中的实时交互方法及系统
CN112434953A (zh) 一种基于计算机数据处理的客服人员考核方法和装置
Szekrényes et al. Classification of formal and informal dialogues based on turn-taking and intonation using deep neural networks

Legal Events

Date Code Title Description

121  Ep: the epo has been informed by wipo that ep was designated in this application. Ref document number: 19938799; Country of ref document: EP; Kind code of ref document: A1

NENP  Non-entry into the national phase. Ref country code: DE

122  Ep: pct application non-entry in european phase. Ref document number: 19938799; Country of ref document: EP; Kind code of ref document: A1