CN110556098B - Voice recognition result testing method and device, computer equipment and medium - Google Patents

Voice recognition result testing method and device, computer equipment and medium

Info

Publication number
CN110556098B
CN110556098B (application CN201910667054.6A)
Authority
CN
China
Prior art keywords
sub
speech
emotion
voice
recognition result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910667054.6A
Other languages
Chinese (zh)
Other versions
CN110556098A (en)
Inventor
刘丽珍
吕小立
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910667054.6A priority Critical patent/CN110556098B/en
Priority to PCT/CN2019/116960 priority patent/WO2021012495A1/en
Publication of CN110556098A publication Critical patent/CN110556098A/en
Application granted granted Critical
Publication of CN110556098B publication Critical patent/CN110556098B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/26 Speech to text systems
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for particular use, for comparison or discrimination
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The application relates to the technical field of artificial intelligence and is applied to the speech recognition industry. It provides a speech recognition result testing method, apparatus, computer device and storage medium. User reply voice data based on a preset dialogue script in any application scenario is randomly selected; the user speech segment in the user reply voice data is divided into several sub-speech segments of preset length; the acoustic features of each sub-speech segment are extracted and the emotion label of each sub-speech segment is obtained from them; the emotion label is linearly spliced with the corresponding text data and a sub-speech segment identifier is added; the speech recognition result of each sub-speech segment is then compared with the standard speech recognition result, and the proportion of sub-speech segments whose speech recognition results are consistent is counted. In this way, the accuracy of the speech recognition result in the selected application scenario can be verified efficiently and accurately.

Description

Voice recognition result testing method and device, computer equipment and medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for testing speech recognition results, a computer device, and a storage medium.
Background
With the development of science and technology, artificial intelligence is applied in more and more fields, bringing convenience to people's work and daily life. As an important component of artificial intelligence, speech recognition technology is likewise being developed and applied.
Among speech recognition technologies, ASR (Automatic Speech Recognition), which converts human speech into text, is currently the most widely used. Speech recognition is a multidisciplinary field closely connected with acoustics, phonetics, linguistics, digital signal processing theory, information theory, computer science and so on. Because of the diversity and complexity of speech signals, a speech recognition system can only achieve satisfactory performance under certain constraints, and its performance depends on many factors. Since these factors differ across application environments, the accuracy of ASR emotion recognition can easily be low in a particular application scenario; if the ASR output is not verified, recognition errors can easily go unnoticed and the business requirements cannot be met.
Therefore, it is necessary to provide an accurate speech recognition result test scheme.
Disclosure of Invention
In view of the above, it is necessary to provide a method, an apparatus, a computer device and a storage medium for accurately testing speech recognition results.
A method of testing speech recognition results, the method comprising:
randomly selecting user reply voice data based on a preset dialogue script in any application scenario;
acquiring a user speech segment in the user reply voice data, dividing the user speech segment into a plurality of sub-speech segments of preset length, and assigning sub-speech segment identifiers;
extracting the acoustic features of each sub-speech segment, and acquiring the emotion label of each sub-speech segment according to the acoustic features;
acquiring the text data corresponding to each sub-speech segment by using a speech recognition technology, linearly splicing the emotion label of each sub-speech segment with the corresponding text data, and adding the sub-speech segment identifier between the emotion label and the text data to obtain the speech recognition result of each sub-speech segment; and
comparing the speech recognition result of each sub-speech segment, according to its sub-speech segment identifier, with the speech recognition result of the corresponding sub-speech segment carried in the preset standard speech recognition result for the selected application scenario, and counting the proportion of sub-speech segments whose speech recognition results are consistent, to obtain the accuracy of the speech recognition result in the selected application scenario.
In one embodiment, the extracting the acoustic features of the sub-speech segments, and the obtaining the emotion labels of the sub-speech segments according to the acoustic features includes:
extracting acoustic features of each sub-speech segment;
and inputting the extracted acoustic features into a trained neural network model based on deep learning to obtain the emotion label.
In one embodiment, the speech recognition result testing method further includes:
acquiring reply voice sample data corresponding to different emotion tags;
extracting time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics in the reply voice sample data;
and training a deep learning-based neural network model by taking the emotion labels in the reply voice sample data and the corresponding time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics as training data to obtain the trained deep learning-based neural network model.
In one embodiment, training the deep-learning-based neural network model to obtain the trained deep-learning-based neural network model includes:
extracting the emotion labels in the training data and the corresponding time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics;
training the convolutional neural network part of the deep-learning-based neural network to learn local emotion labels according to the extracted feature data;
and abstracting the local emotion labels through the recurrent neural network part and learning a global emotion label through the pooling layer of the deep-learning-based neural network, to obtain the trained deep-learning-based neural network model.
In one embodiment, the extracting the acoustic features of the sub-speech segments and the obtaining the emotion labels of the sub-speech segments according to the acoustic features includes:
obtaining the emotion label according to the extracted acoustic features of each sub-speech segment and a qualitative acoustic feature analysis table corresponding to the preset emotion labels;
the qualitative acoustic feature analysis table corresponding to the preset emotion labels carries the emotion labels, the acoustic features and the qualitative analysis interval data of the acoustic features corresponding to different emotion labels, wherein the acoustic features include speech rate, average fundamental frequency, fundamental frequency range, intensity, voice quality, fundamental frequency variation and clarity.
In one embodiment, after verifying the accuracy of the speech recognition and the emotion tag in the selected application scenario, the method further includes:
waiting for a preset delay, and returning to the step of randomly selecting user reply voice data based on a preset dialogue script in any application scenario.
A speech recognition result testing apparatus, the apparatus comprising:
the data acquisition module is used for randomly selecting user reply voice data based on a preset dialogue script in any application scenario;
the dividing module is used for acquiring a user speech segment in the user reply voice data, dividing the user speech segment into a plurality of sub-speech segments of preset length, and assigning sub-speech segment identifiers;
the feature extraction module is used for extracting the acoustic features of each sub-speech segment and acquiring the emotion label of each sub-speech segment according to the acoustic features;
the splicing combination module is used for acquiring the text data corresponding to each sub-speech segment by using a speech recognition technology, linearly splicing the emotion label of each sub-speech segment with the corresponding text data, and adding the sub-speech segment identifier between the emotion label and the text data to obtain the speech recognition result of each sub-speech segment;
and the test module is used for comparing the speech recognition result of each sub-speech segment, according to its sub-speech segment identifier, with the speech recognition result of the corresponding sub-speech segment carried in the preset standard speech recognition result for the selected application scenario, and counting the proportion of sub-speech segments whose speech recognition results are consistent, to obtain the accuracy of the speech recognition result in the selected application scenario.
A computer device comprising a memory storing a computer program and a processor implementing the steps of the method as described above when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method as described above.
According to the speech recognition result testing method, apparatus, computer device and storage medium, user reply voice data based on a preset dialogue script in any application scenario is randomly selected; the user speech segment in the user reply voice data is divided into a plurality of sub-speech segments of preset length; the acoustic features of each sub-speech segment are extracted and the emotion label of each sub-speech segment is obtained from them; the emotion label is linearly spliced with the corresponding text data and a sub-speech segment identifier is added; the speech recognition result of each sub-speech segment is compared with the standard speech recognition result, and the proportion of sub-speech segments whose speech recognition results are consistent is counted. In this way, the accuracy of the speech recognition result in the selected application scenario can be verified efficiently and accurately.
Drawings
FIG. 1 is a flow diagram illustrating a method for testing speech recognition results in one embodiment;
FIG. 2 is a flowchart illustrating a method for testing speech recognition results according to another embodiment;
FIG. 3 is a flowchart illustrating a method for testing speech recognition results according to another embodiment;
FIG. 4 is a block diagram showing the structure of a speech recognition result testing apparatus according to an embodiment;
FIG. 5 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clearly understood, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
As shown in fig. 1, there is provided a speech recognition result testing method, including the steps of:
S100: randomly selecting user reply voice data based on a preset dialogue script in any application scenario.
The preset dialogue script is dialogue script data written for different application scenarios; specifically, it contains question-and-answer data simulating a conversation between a customer and an operator (service staff) in a real environment. Optionally, the dialogue scripts for different application scenarios can be collected and stored in a database, so that each application scenario has a corresponding dialogue script in the database. Application scenarios include loan marketing, repayment reminders, loan consultation and the like. The server simulates the question-and-answer voice data replied on the basis of the preset dialogue script in a given application scenario. Specifically, an application scenario set can be constructed from the scenarios that need to be verified, and any scenario in the set is selected as the current test scenario, as sketched below.
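As an illustration only, the following Python sketch shows one way such a random selection could be organised. The SCENARIO_SCRIPTS layout, the file names and the pick_test_case helper are hypothetical and are not part of the patent.

```python
import random

# Hypothetical in-memory layout: each application scenario maps to a list of
# pre-recorded user reply voice files produced from its preset dialogue script.
SCENARIO_SCRIPTS = {
    "loan_marketing": ["loan_marketing/reply_001.wav", "loan_marketing/reply_002.wav"],
    "repayment_reminder": ["repayment_reminder/reply_001.wav"],
    "loan_consultation": ["loan_consultation/reply_001.wav"],
}

def pick_test_case(scenario_set=None):
    """Randomly choose a scenario to test and one user reply recording for it."""
    scenarios = scenario_set or list(SCENARIO_SCRIPTS)
    scenario = random.choice(scenarios)                 # current test scenario
    reply_audio = random.choice(SCENARIO_SCRIPTS[scenario])
    return scenario, reply_audio
```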
S200: acquiring a user speech segment in the user reply voice data, dividing the user speech segment into a plurality of sub-speech segments of preset length, and assigning sub-speech segment identifiers.
The server intercepts the reply voice and divides the user speech segment in it into sub-speech segments of preset length. Specifically, the preset length is relatively short, for example 3-5 seconds; that is, the user speech segment is divided into sub-speech segments 3-5 seconds long.
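A minimal sketch of this segmentation step, assuming the user speech segment is already available as an array of audio samples. The 4-second default, the split_into_sub_segments name and the letter-based identifiers are illustrative choices, not requirements of the patent.

```python
import string

def split_into_sub_segments(samples, sample_rate, seg_seconds=4):
    """Split a user speech segment (raw samples) into fixed-length sub-segments
    and assign each one an identifier (A, B, ..., Z, AA, BB, ...)."""
    seg_len = int(seg_seconds * sample_rate)
    sub_segments = {}
    for index, start in enumerate(range(0, len(samples), seg_len)):
        seg_id = string.ascii_uppercase[index % 26] * (index // 26 + 1)
        sub_segments[seg_id] = samples[start:start + seg_len]
    return sub_segments
```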
S300: extracting the acoustic features of each sub-speech segment, and acquiring the emotion label of each sub-speech segment according to the acoustic features.
The acoustic features include the waveform, signal and pitch information of the speech, and the like. The emotion labels include neutral, happy, sad, angry, surprised, afraid, disgusted, excited and the like. Optionally, a window with a preset time interval can be set and acoustic features collected at a fixed frequency to form an acoustic feature set, and the emotion label is then obtained from the acoustic feature set, as sketched below.
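The sketch below illustrates the optional sliding-window collection of acoustic features. The window and hop lengths and the particular descriptors (frame energy and zero-crossing rate) are assumptions for illustration, not features prescribed by the patent.

```python
import numpy as np

def collect_acoustic_features(samples, sample_rate, win_seconds=0.5, hop_seconds=0.25):
    """Collect simple frame-level acoustic descriptors over a sliding window,
    forming the acoustic feature set for one sub-speech segment."""
    win, hop = int(win_seconds * sample_rate), int(hop_seconds * sample_rate)
    features = []
    for start in range(0, max(len(samples) - win, 0) + 1, hop):
        frame = np.asarray(samples[start:start + win], dtype=float)
        energy = float(np.mean(frame ** 2))                        # amplitude-related
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame)))) / 2)  # crude pitch/clarity proxy
        features.append((energy, zcr))
    return features
```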
S400: acquiring the text data corresponding to each sub-speech segment by using a speech recognition technology, linearly splicing the emotion label of each sub-speech segment with the corresponding text data, and adding the sub-speech segment identifier between the emotion label and the text data to obtain the speech recognition result of each sub-speech segment.
Each sub-speech segment is treated as an object of study. The emotion label of a sub-speech segment is linearly spliced with its corresponding text data; the linear splicing process can be understood as simple concatenation, that is, the two parts of data are joined together, with the sub-speech segment identifier added between them so that the speech recognition result of each sub-speech segment can be accurately distinguished later. Specifically, linear splicing can simply be understood as concatenating the text data with the emotion label: for example, if the text data corresponding to a certain sub-speech segment is "ok", its emotion label is "happy" and its identifier is A, the resulting speech recognition result is "ok A happy".
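A one-line sketch of the linear splicing described above. The space separator, the ordering (text, identifier, emotion label) and the function name are assumptions added for readability; the patent only requires that the identifier be placed between the emotion label and the text data.

```python
def splice_recognition_result(text, seg_id, emotion):
    """Linear splicing of step S400: concatenate the recognized text and the
    emotion label with the sub-speech segment identifier in between, e.g.
    splice_recognition_result("ok", "A", "happy") -> "ok A happy"."""
    return f"{text} {seg_id} {emotion}"
```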
S500: comparing the speech recognition result of each sub-speech segment, according to its sub-speech segment identifier, with the speech recognition result of the corresponding sub-speech segment carried in the preset standard speech recognition result for the selected application scenario, and counting the proportion of sub-speech segments whose speech recognition results are consistent, to obtain the accuracy of the speech recognition result in the selected application scenario.
The standard speech recognition results are obtained by analysing the dialogue script on the basis of expert experience data. They can also be written into the preset dialogue script database; that is, the database stores the correspondence between each dialogue script file and the standard speech recognition result of each of its sub-speech segments, and each standard speech recognition result carries the text data corresponding to the sub-speech segment, the sub-speech segment identifier and the corresponding emotion label. The user reply voice data of the preset dialogue script for each application scenario contains several sub-speech segments. The number of sub-speech segments whose speech recognition result is consistent with the corresponding standard speech recognition result in the selected application scenario is recorded, and the ratio of this number to the total number of sub-speech segments in the user reply voice data is calculated; this ratio is the accuracy of the speech recognition result in the selected application scenario. For example, suppose there are currently 3 sub-speech segments (in practice there are far more) whose speech recognition results are "hello A happy", "don't want B neutral" and "goodbye C aversion", while the corresponding standard speech recognition results are "hello A neutral", "don't want B neutral" and "goodbye C aversion"; two of the three results match, so the accuracy of the speech recognition result in the selected application scenario is 66.7%. Optionally, after the accuracy of speech recognition and emotion labels in the currently selected application scenario has been tested, a new application scenario can be selected and the speech recognition result testing process repeated.
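The comparison and accuracy statistic of step S500 can be sketched as follows, reusing the three-segment example above. The dictionary layout keyed by sub-speech segment identifier is an assumption made for the sketch.

```python
def recognition_accuracy(test_results, standard_results):
    """Compare each sub-speech segment's recognition result with the standard
    result carrying the same identifier and return the ratio of matches."""
    matches = sum(
        1 for seg_id, result in test_results.items()
        if standard_results.get(seg_id) == result
    )
    return matches / len(test_results) if test_results else 0.0

# Worked example mirroring the 3-sub-segment illustration above (hypothetical text):
test = {"A": "hello A happy", "B": "don't want B neutral", "C": "goodbye C aversion"}
standard = {"A": "hello A neutral", "B": "don't want B neutral", "C": "goodbye C aversion"}
print(recognition_accuracy(test, standard))  # 2 of 3 match -> about 0.667
```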
According to the speech recognition result testing method, user reply voice data based on a preset dialogue script in any application scenario is randomly selected; the user speech segment in the user reply voice data is divided into a plurality of sub-speech segments of preset length; the acoustic features of each sub-speech segment are extracted and the emotion label of each sub-speech segment is obtained from them; the emotion label is linearly spliced with the corresponding text data and a sub-speech segment identifier is added; the speech recognition result of each sub-speech segment is compared with the standard speech recognition result, and the proportion of sub-speech segments whose speech recognition results are consistent is counted. In this way, the accuracy of the speech recognition result in the selected application scenario can be verified efficiently and accurately.
As shown in fig. 2, in one embodiment, step S300 includes:
S320: extracting the acoustic features of each sub-speech segment.
S340: inputting the extracted acoustic features into a trained deep-learning-based neural network model to obtain the emotion label.
The acoustic features can be further divided into time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics, and the trained deep-learning-based neural network model has learned the correspondence between these features and the corresponding emotion labels.
As shown in fig. 3, in one embodiment, step S300 further includes:
S312: acquiring reply voice sample data corresponding to different emotion labels.
S314: extracting the time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics in the reply voice sample data.
S316: training a deep-learning-based neural network model by taking the emotion labels in the reply voice sample data and the corresponding time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics as training data, to obtain the trained deep-learning-based neural network model.
The trained model serves as an emotion label recognition model. When an emotion label needs to be acquired, the extracted acoustic feature data is input into this model to obtain the emotion label corresponding to the utterance, and the emotion label is spliced with the corresponding text data to obtain the speech recognition result.
In one embodiment, training the deep-learning-based neural network model to obtain the trained deep-learning-based neural network model includes: extracting the emotion labels in the training data and the corresponding time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics; training the convolutional neural network part of the deep-learning-based neural network to learn local emotion labels according to the extracted feature data; and abstracting the local emotion labels through the recurrent neural network part and learning a global emotion label through the pooling layer of the deep-learning-based neural network, to obtain the trained deep-learning-based neural network model.
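A hedged sketch of such a model in PyTorch, assuming frame-level acoustic feature vectors as input: a convolutional part captures local emotion cues, a recurrent part abstracts them over time, and a pooling layer aggregates a global emotion label. The layer sizes, the GRU choice for the recurrent part and the average pooling are illustrative assumptions, not values specified by the patent.

```python
import torch
import torch.nn as nn

class EmotionTagger(nn.Module):
    """Sketch of a CNN + RNN + pooling emotion-label model over acoustic features."""

    def __init__(self, feature_dim=40, num_emotions=8):
        super().__init__()
        self.conv = nn.Sequential(                      # local emotion cues
            nn.Conv1d(feature_dim, 64, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.rnn = nn.GRU(64, 64, batch_first=True)     # temporal abstraction
        self.pool = nn.AdaptiveAvgPool1d(1)             # global pooling over time
        self.classifier = nn.Linear(64, num_emotions)

    def forward(self, x):                               # x: (batch, time, feature_dim)
        h = self.conv(x.transpose(1, 2))                # -> (batch, 64, time)
        h, _ = self.rnn(h.transpose(1, 2))              # -> (batch, time, 64)
        g = self.pool(h.transpose(1, 2)).squeeze(-1)    # -> (batch, 64)
        return self.classifier(g)                       # emotion-label logits
```

In keeping with the embodiment above, such a model would be trained on the emotion labels and construction characteristics extracted from the reply voice sample data, typically with a cross-entropy loss; the loss choice is an assumption of this sketch.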
In one embodiment, extracting the acoustic features of each sub-speech segment and obtaining the emotion label of each sub-speech segment according to the acoustic features includes: obtaining the emotion label according to the extracted acoustic features of each sub-speech segment and a qualitative acoustic feature analysis table corresponding to the preset emotion labels; the qualitative acoustic feature analysis table corresponding to the preset emotion labels carries the emotion labels, the acoustic features and the qualitative analysis interval data of the acoustic features corresponding to different emotion labels, and the acoustic features include speech rate, average fundamental frequency, fundamental frequency range, intensity, voice quality, fundamental frequency variation and clarity.
Different emotion labels correspond to different qualitative analysis intervals of the acoustic features. The qualitative analysis intervals can be divided in advance into several interval values according to the type of acoustic feature; for example, speech rate can be graded as very fast, slightly slow, fast or slow, and very slow. More specifically, qualitative analysis is performed on the speech rate, average fundamental frequency, fundamental frequency range, intensity, voice quality, fundamental frequency variation and clarity corresponding to each candidate emotion label to obtain a qualitative analysis result, and the emotion label is then determined from the acoustic features currently extracted from each sub-speech segment together with the corresponding qualitative analysis results. Furthermore, emotion label feature templates can be constructed from the qualitative analysis results of the different emotion labels; when an emotion label needs to be recognized, the collected features are matched against the emotion label feature templates to determine the emotion label. In practical applications, the qualitative analysis is carried out as follows. For speech rate, the grades are very fast, slightly slow, fast or slow, and very slow; specifically, the average number of words per unit time corresponding to each emotion label can be obtained from historical sample data, and the word-count interval per unit time for each emotion label, together with the relative order of the speech rates of the different emotion labels, is used to set the qualitative grade of each emotion label. The judgments for average fundamental frequency, fundamental frequency range, intensity, voice quality, fundamental frequency variation and clarity are made in a similar way, by drawing judgment intervals from sample data and relative relations and analysing the collected sound data against them. The grades of average fundamental frequency include very high, high, slightly low and very low; the fundamental frequency range includes very wide and narrow; the intensity includes normal, higher and lower; the voice quality includes irregular voicing, breathy, resonant, and breathy and blaring; the fundamental frequency variation includes normal, abrupt changes on stressed syllables, downward inflections, smooth upward inflections, and extreme downward inflections; and the clarity includes precise, tense, slurred and normal. The grades are specified in the table below.
(Table: qualitative analysis intervals of the acoustic features for each emotion label; reproduced in the original filing as Figure BDA0002140470220000081.)
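Since the table itself is only available as an image in the filing, the sketch below merely illustrates how a qualitative-interval lookup of this kind could be matched against extracted features. Every interval value in QUALITATIVE_TABLE is a placeholder, not data from the patent.

```python
# Placeholder intervals: (lower bound, upper bound) per acoustic feature and emotion label.
QUALITATIVE_TABLE = {
    "happy":   {"speech_rate": (4.5, 7.0), "mean_f0": (180, 260), "intensity": (65, 80)},
    "sad":     {"speech_rate": (1.5, 3.0), "mean_f0": (100, 160), "intensity": (45, 60)},
    "neutral": {"speech_rate": (3.0, 4.5), "mean_f0": (140, 200), "intensity": (55, 70)},
}

def match_emotion(features):
    """Return the emotion label whose qualitative intervals cover the most features.
    `features` is a dict such as {"speech_rate": 5.1, "mean_f0": 210, "intensity": 70}."""
    def score(intervals):
        return sum(lo <= features.get(name, float("nan")) <= hi
                   for name, (lo, hi) in intervals.items())
    return max(QUALITATIVE_TABLE, key=lambda label: score(QUALITATIVE_TABLE[label]))
```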
in one embodiment, after verifying the accuracy of speech recognition and emotion tag in the selected application scenario, the method further includes: and delaying preset time, and returning to the step of randomly selecting the user reply voice data based on the preset dialogues under any application scene.
In addition to speech recognition testing in a normal environment, speech recognition testing under noise can be carried out in a targeted manner: user reply voice data based on the preset dialogue script is collected in a noisy environment for the selected application scenario, and the test process is repeated with the collected user reply voice data as the test input, giving the speech recognition test result for the noisy environment. Furthermore, the speech recognition effect under far-field conditions can be tested simply by using user reply voice data collected at a distance as the test data and repeating the test process.
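A small sketch of how a noisy test input could be simulated for this purpose. Mixing synthetic white noise at a chosen signal-to-noise ratio is an assumed stand-in for recording replies in a genuinely noisy environment, and add_noise is a hypothetical helper, not part of the patent.

```python
import numpy as np

def add_noise(samples, snr_db=10, rng=None):
    """Create a noisy copy of user reply audio at roughly the given SNR (in dB)."""
    rng = rng or np.random.default_rng()
    clean = np.asarray(samples, dtype=float)
    signal_power = np.mean(clean ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=clean.shape)
    return clean + noise
```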
It should be understood that although the steps in the flow diagrams of FIGS. 1-3 are shown in an order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the execution order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in FIGS. 1-3 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
As shown in fig. 4, a speech recognition result testing apparatus includes:
the data acquisition module 100 is configured to randomly select user reply voice data based on a preset dialogue script in any application scenario;
the dividing module 200 is configured to acquire a user speech segment in the user reply voice data, divide the user speech segment into a plurality of sub-speech segments of preset length, and assign sub-speech segment identifiers;
the feature extraction module 300 is configured to extract the acoustic features of each sub-speech segment and acquire the emotion label of each sub-speech segment according to the acoustic features;
the splicing combination module 400 is configured to acquire the text data corresponding to each sub-speech segment by using a speech recognition technology, linearly splice the emotion label of each sub-speech segment with the corresponding text data, and add the sub-speech segment identifier between the emotion label and the text data to obtain the speech recognition result of each sub-speech segment;
and the test module 500 is configured to compare the speech recognition result of each sub-speech segment, according to its sub-speech segment identifier, with the speech recognition result of the corresponding sub-speech segment carried in the preset standard speech recognition result for the selected application scenario, count the proportion of sub-speech segments whose speech recognition results are consistent, and obtain the accuracy of the speech recognition result in the selected application scenario.
According to the speech recognition result testing apparatus, user reply voice data based on a preset dialogue script in any application scenario is randomly selected; the user speech segment in the user reply voice data is divided into a plurality of sub-speech segments of preset length; the acoustic features of each sub-speech segment are extracted and the emotion label of each sub-speech segment is obtained from them; the emotion label is linearly spliced with the corresponding text data and a sub-speech segment identifier is added; the speech recognition result of each sub-speech segment is compared with the standard speech recognition result, and the proportion of sub-speech segments whose speech recognition results are consistent is counted. In this way, the accuracy of the speech recognition result in the selected application scenario can be verified efficiently and accurately.
In one embodiment, the feature extraction module 300 is further configured to extract acoustic features of each sub-speech segment; and inputting the extracted acoustic features into a trained neural network model based on deep learning to obtain the emotion label.
In one embodiment, the feature extraction module 300 is further configured to obtain reply voice sample data corresponding to different emotion tags; extracting time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics in the reply voice sample data; and training a neural network model based on deep learning by taking the emotion labels in the reply voice sample data and the corresponding time structural features, amplitude structural features, fundamental frequency structural features and formant structural features as training data to obtain the trained neural network model based on deep learning.
In one embodiment, the feature extraction module 300 is further configured to extract the emotion labels in the training data and the corresponding time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics; train the convolutional neural network part of the deep-learning-based neural network to learn local emotion labels according to the extracted feature data; and abstract the local emotion labels through the recurrent neural network part and learn a global emotion label through the pooling layer of the deep-learning-based neural network, to obtain the trained deep-learning-based neural network model.
In one embodiment, the feature extraction module 300 is further configured to obtain the emotion label according to the extracted acoustic features of each sub-speech segment and the qualitative acoustic feature analysis table corresponding to the preset emotion labels; the qualitative acoustic feature analysis table corresponding to the preset emotion labels carries the emotion labels, the acoustic features and the qualitative analysis interval data of the acoustic features corresponding to different emotion labels, and the acoustic features include speech rate, average fundamental frequency, fundamental frequency range, intensity, voice quality, fundamental frequency variation and clarity.
In one embodiment, the speech recognition result testing apparatus further includes a loop testing module, configured to wait for a preset delay and then control the data acquisition module 100, the dividing module 200, the feature extraction module 300, the splicing combination module 400 and the test module 500 to execute the corresponding operations again.
For the specific definition of the speech recognition result testing apparatus, reference may be made to the definition of the speech recognition result testing method above, and details are not repeated here. All or part of the modules in the speech recognition result testing apparatus may be implemented by software, by hardware, or by a combination of the two. The modules may be embedded in or independent of a processor of the computer device in the form of hardware, or stored in a memory of the computer device in the form of software, so that the processor can invoke and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server whose internal structure may be as shown in FIG. 5. The computer device includes a processor, a memory, a network interface and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing preset dialogue scripts and historical expert data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a speech recognition result testing method.
Those skilled in the art will appreciate that the structure shown in FIG. 5 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a particular computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
randomly selecting user reply voice data based on a preset dialogue script in any application scenario;
acquiring a user speech segment in the user reply voice data, dividing the user speech segment into a plurality of sub-speech segments of preset length, and assigning sub-speech segment identifiers;
extracting the acoustic features of each sub-speech segment, and acquiring the emotion label of each sub-speech segment according to the acoustic features;
acquiring the text data corresponding to each sub-speech segment by using a speech recognition technology, linearly splicing the emotion label of each sub-speech segment with the corresponding text data, and adding the sub-speech segment identifier between the emotion label and the text data to obtain the speech recognition result of each sub-speech segment;
and comparing the speech recognition result of each sub-speech segment, according to its sub-speech segment identifier, with the speech recognition result of the corresponding sub-speech segment carried in the preset standard speech recognition result for the selected application scenario, and counting the proportion of sub-speech segments whose speech recognition results are consistent, to obtain the accuracy of the speech recognition result in the selected application scenario.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
extracting acoustic features of each sub-speech segment; and inputting the extracted acoustic features into a trained neural network model based on deep learning to obtain the emotion label.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
acquiring reply voice sample data corresponding to different emotion tags; extracting time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics in the reply voice sample data; and training a neural network model based on deep learning by taking the emotion labels in the reply voice sample data and the corresponding time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics as training data to obtain the trained neural network model based on deep learning.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
extracting the emotion labels in the training data and the corresponding time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics; training the convolutional neural network part of the deep-learning-based neural network to learn local emotion labels according to the extracted feature data; and abstracting the local emotion labels through the recurrent neural network part and learning a global emotion label through the pooling layer of the deep-learning-based neural network, to obtain the trained deep-learning-based neural network model.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
obtaining the emotion label according to the extracted acoustic features of each sub-speech segment and the qualitative acoustic feature analysis table corresponding to the preset emotion labels; the qualitative acoustic feature analysis table corresponding to the preset emotion labels carries the emotion labels, the acoustic features and the qualitative analysis interval data of the acoustic features corresponding to different emotion labels, and the acoustic features include speech rate, average fundamental frequency, fundamental frequency range, intensity, voice quality, fundamental frequency variation and clarity.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
waiting for a preset delay, and returning to the step of randomly selecting user reply voice data based on a preset dialogue script in any application scenario.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
randomly selecting user reply voice data based on a preset dialogue script in any application scenario;
acquiring a user speech segment in the user reply voice data, dividing the user speech segment into a plurality of sub-speech segments of preset length, and assigning sub-speech segment identifiers;
extracting the acoustic features of each sub-speech segment, and acquiring the emotion label of each sub-speech segment according to the acoustic features;
acquiring the text data corresponding to each sub-speech segment by using a speech recognition technology, linearly splicing the emotion label of each sub-speech segment with the corresponding text data, and adding the sub-speech segment identifier between the emotion label and the text data to obtain the speech recognition result of each sub-speech segment;
and comparing the speech recognition result of each sub-speech segment, according to its sub-speech segment identifier, with the speech recognition result of the corresponding sub-speech segment carried in the preset standard speech recognition result for the selected application scenario, and counting the proportion of sub-speech segments whose speech recognition results are consistent, to obtain the accuracy of the speech recognition result in the selected application scenario.
In one embodiment, the computer program when executed by the processor further performs the steps of:
extracting acoustic features of each sub-speech segment; and inputting the extracted acoustic features into a trained neural network model based on deep learning to obtain the emotion label.
In one embodiment, the computer program when executed by the processor further performs the steps of:
acquiring reply voice sample data corresponding to different emotion labels; extracting time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics in the reply voice sample data; and training a neural network model based on deep learning by taking the emotion labels in the reply voice sample data and the corresponding time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics as training data to obtain the trained neural network model based on deep learning.
In one embodiment, the computer program when executed by the processor further performs the steps of:
extracting the emotion labels in the training data and the corresponding time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics; training the convolutional neural network part of the deep-learning-based neural network to learn local emotion labels according to the extracted feature data; and abstracting the local emotion labels through the recurrent neural network part and learning a global emotion label through the pooling layer of the deep-learning-based neural network, to obtain the trained deep-learning-based neural network model.
In one embodiment, the computer program when executed by the processor further performs the steps of:
obtaining the emotion label according to the extracted acoustic features of each sub-speech segment and the qualitative acoustic feature analysis table corresponding to the preset emotion labels; the qualitative acoustic feature analysis table corresponding to the preset emotion labels carries the emotion labels, the acoustic features and the qualitative analysis interval data of the acoustic features corresponding to different emotion labels, and the acoustic features include speech rate, average fundamental frequency, fundamental frequency range, intensity, voice quality, fundamental frequency variation and clarity.
In one embodiment, the computer program when executed by the processor further performs the steps of:
waiting for a preset delay, and returning to the step of randomly selecting user reply voice data based on a preset dialogue script in any application scenario.
It will be understood by those skilled in the art that all or part of the processes of the methods in the above embodiments may be implemented by instructing the relevant hardware through a computer program. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered to be within the scope of this specification.
The above examples express only several embodiments of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make several variations and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method of testing speech recognition results, the method comprising:
randomly selecting user reply voice data based on a preset dialogue script in any application scenario;
acquiring a user speech segment in the user reply voice data, dividing the user speech segment into a plurality of sub-speech segments of preset length, and assigning sub-speech segment identifiers;
extracting the acoustic features of each sub-speech segment, and acquiring the emotion label of each sub-speech segment according to the acoustic features;
acquiring the text data corresponding to each sub-speech segment by using a speech recognition technology, linearly splicing the emotion label of each sub-speech segment with the corresponding text data, and adding the sub-speech segment identifier between the emotion label and the text data to obtain the speech recognition result of each sub-speech segment; and comparing the speech recognition result of each sub-speech segment, according to its sub-speech segment identifier, with the speech recognition result of the corresponding sub-speech segment carried in the preset standard speech recognition result for the selected application scenario, and counting the proportion of sub-speech segments whose speech recognition results are consistent, to obtain the accuracy of the speech recognition result in the selected application scenario.
2. The method of claim 1, wherein the extracting the acoustic features of the sub-speech segments, and the obtaining the emotion label of each sub-speech segment according to the acoustic features comprises:
extracting acoustic features of each sub-speech segment;
and inputting the extracted acoustic features into a trained neural network model based on deep learning to obtain the emotion label.
3. The method of claim 2, further comprising:
acquiring reply voice sample data corresponding to different emotion labels;
extracting time construction features, amplitude construction features, fundamental frequency construction features and formant construction features in the reply voice sample data;
and training a deep learning-based neural network model by taking the emotion labels in the reply voice sample data and the corresponding time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics as training data to obtain the trained deep learning-based neural network model.
4. The method of claim 3, wherein training the deep learning based neural network model, and wherein obtaining the trained deep learning based neural network model comprises:
extracting emotion labels in the training data and corresponding time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics;
training the convolutional neural network part of the deep-learning-based neural network to learn local emotion labels according to the extracted feature data;
abstracting the local emotion labels through the recurrent neural network part, and learning a global emotion label through the pooling layer of the deep-learning-based neural network, to obtain the trained deep-learning-based neural network model.
5. The method of claim 1, wherein the extracting the acoustic features of each sub-speech segment, and the obtaining the emotion label of each sub-speech segment according to the acoustic features comprises:
obtaining the emotion label according to the extracted acoustic features of each sub-speech segment and a qualitative acoustic feature analysis table corresponding to the preset emotion labels;
the qualitative acoustic feature analysis table corresponding to the preset emotion labels carries the emotion labels, the acoustic features and the qualitative analysis interval data of the acoustic features corresponding to different emotion labels, wherein the acoustic features include speech rate, average fundamental frequency, fundamental frequency range, intensity, voice quality, fundamental frequency variation and clarity.
6. The method of claim 1, after obtaining the accuracy of the speech recognition result in the selected application scenario, further comprising:
waiting for a preset delay, and returning to the step of randomly selecting user reply voice data based on a preset dialogue script in any application scenario.
7. A speech recognition result testing apparatus, comprising:
the data acquisition module is used for randomly selecting user reply voice data based on a preset dialogue script in any application scenario;
the dividing module is used for acquiring a user speech segment in the user reply voice data, dividing the user speech segment into a plurality of sub-speech segments of preset length, and assigning sub-speech segment identifiers;
the feature extraction module is used for extracting the acoustic features of each sub-speech segment and acquiring the emotion label of each sub-speech segment according to the acoustic features;
the splicing combination module is used for acquiring the text data corresponding to each sub-speech segment by using a speech recognition technology, linearly splicing the emotion label of each sub-speech segment with the corresponding text data, and adding the sub-speech segment identifier between the emotion label and the text data to obtain the speech recognition result of each sub-speech segment;
and the test module is used for comparing the speech recognition result of each sub-speech segment, according to its sub-speech segment identifier, with the speech recognition result of the corresponding sub-speech segment carried in the preset standard speech recognition result for the selected application scenario, and counting the proportion of sub-speech segments whose speech recognition results are consistent, to obtain the accuracy of the speech recognition result in the selected application scenario.
8. The apparatus of claim 7, wherein the feature extraction module is further configured to extract acoustic features of each sub-speech segment, and input the extracted acoustic features into a trained neural network model based on deep learning to obtain the emotion label.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN201910667054.6A 2019-07-23 2019-07-23 Voice recognition result testing method and device, computer equipment and medium Active CN110556098B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910667054.6A CN110556098B (en) 2019-07-23 2019-07-23 Voice recognition result testing method and device, computer equipment and medium
PCT/CN2019/116960 WO2021012495A1 (en) 2019-07-23 2019-11-11 Method and device for verifying speech recognition result, computer apparatus, and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910667054.6A CN110556098B (en) 2019-07-23 2019-07-23 Voice recognition result testing method and device, computer equipment and medium

Publications (2)

Publication Number Publication Date
CN110556098A CN110556098A (en) 2019-12-10
CN110556098B true CN110556098B (en) 2023-04-18

Family

ID=68735961

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910667054.6A Active CN110556098B (en) 2019-07-23 2019-07-23 Voice recognition result testing method and device, computer equipment and medium

Country Status (2)

Country Link
CN (1) CN110556098B (en)
WO (1) WO2021012495A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021134550A1 (en) * 2019-12-31 2021-07-08 李庆远 Manual combination and training of multiple speech recognition outputs
CN111522943A (en) * 2020-03-25 2020-08-11 平安普惠企业管理有限公司 Automatic test method, device, equipment and storage medium for logic node
CN112349290B (en) * 2021-01-08 2021-04-20 北京海天瑞声科技股份有限公司 Triple-based speech recognition accuracy rate calculation method

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8595005B2 (en) * 2010-05-31 2013-11-26 Simple Emotion, Inc. System and method for recognizing emotional state from a speech signal
CN104464757B (en) * 2014-10-28 2019-01-18 科大讯飞股份有限公司 Speech evaluating method and speech evaluating device
CN105741832B (en) * 2016-01-27 2020-01-07 广东外语外贸大学 Spoken language evaluation method and system based on deep learning
US9870765B2 (en) * 2016-06-03 2018-01-16 International Business Machines Corporation Detecting customers with low speech recognition accuracy by investigating consistency of conversation in call-center
CN107767881B (en) * 2016-08-15 2020-08-18 中国移动通信有限公司研究院 Method and device for acquiring satisfaction degree of voice information
CN106548772A (en) * 2017-01-16 2017-03-29 上海智臻智能网络科技股份有限公司 Speech recognition test system and method
CN108538296A (en) * 2017-03-01 2018-09-14 广东神马搜索科技有限公司 Speech recognition test method and test terminal
CN107086040B (en) * 2017-06-23 2021-03-02 歌尔股份有限公司 Voice recognition capability test method and device
CN107452404A (en) * 2017-07-31 2017-12-08 哈尔滨理工大学 The method for optimizing of speech emotion recognition
CN108777141B (en) * 2018-05-31 2022-01-25 康键信息技术(深圳)有限公司 Test apparatus, test method, and storage medium
CN109272993A (en) * 2018-08-21 2019-01-25 中国平安人寿保险股份有限公司 Recognition methods, device, computer equipment and the storage medium of voice class

Also Published As

Publication number Publication date
CN110556098A (en) 2019-12-10
WO2021012495A1 (en) 2021-01-28

Similar Documents

Publication Publication Date Title
CN110781916B (en) Fraud detection method, apparatus, computer device and storage medium for video data
WO2021128741A1 (en) Voice emotion fluctuation analysis method and apparatus, and computer device and storage medium
CN110120224B (en) Method and device for constructing bird sound recognition model, computer equipment and storage medium
Shahin et al. Novel cascaded Gaussian mixture model-deep neural network classifier for speaker identification in emotional talking environments
WO2020177380A1 (en) Voiceprint detection method, apparatus and device based on short text, and storage medium
CN110556098B (en) Voice recognition result testing method and device, computer equipment and medium
US10573307B2 (en) Voice interaction apparatus and voice interaction method
CN104903954A (en) Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination
CN111311327A (en) Service evaluation method, device, equipment and storage medium based on artificial intelligence
CN109087667B (en) Voice fluency recognition method and device, computer equipment and readable storage medium
CN111597818B (en) Call quality inspection method, device, computer equipment and computer readable storage medium
CN109272993A (en) Recognition methods, device, computer equipment and the storage medium of voice class
CN109658921B (en) Voice signal processing method, equipment and computer readable storage medium
CN111080109A (en) Customer service quality evaluation method and device and electronic equipment
CN111182162A (en) Telephone quality inspection method, device, equipment and storage medium based on artificial intelligence
CN110797032B (en) Voiceprint database establishing method and voiceprint identification method
CN111145733A (en) Speech recognition method, speech recognition device, computer equipment and computer readable storage medium
US11238289B1 (en) Automatic lie detection method and apparatus for interactive scenarios, device and medium
CN112863489B (en) Speech recognition method, apparatus, device and medium
CN110136726A (en) A kind of estimation method, device, system and the storage medium of voice gender
CN112232276A (en) Emotion detection method and device based on voice recognition and image recognition
Heracleous et al. Speech emotion recognition in noisy and reverberant environments
CN111565254B (en) Call data quality inspection method and device, computer equipment and storage medium
Szekrényes et al. Classification of formal and informal dialogues based on turn-taking and intonation using deep neural networks
Poorjam et al. Quality control in remote speech data collection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant