CN110556098B - Voice recognition result testing method and device, computer equipment and medium - Google Patents

Voice recognition result testing method and device, computer equipment and medium

Info

Publication number
CN110556098B
CN110556098B (application CN201910667054.6A)
Authority
CN
China
Prior art keywords
sub
speech
emotion
voice
recognition result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910667054.6A
Other languages
Chinese (zh)
Other versions
CN110556098A (en)
Inventor
刘丽珍
吕小立
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910667054.6A priority Critical patent/CN110556098B/en
Priority to PCT/CN2019/116960 priority patent/WO2021012495A1/en
Publication of CN110556098A publication Critical patent/CN110556098A/en
Application granted granted Critical
Publication of CN110556098B publication Critical patent/CN110556098B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/26 Speech to text systems
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for particular use, for comparison or discrimination
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The application relates to the technical field of artificial intelligence and is applied to the speech recognition industry. It provides a speech recognition result testing method, apparatus, computer device and storage medium. User reply voice data based on a preset dialogue script in any application scenario is randomly selected; the user speech segment in the user reply voice data is divided into several sub-speech segments of preset length; the acoustic features of each sub-speech segment are extracted and the emotion label of each sub-speech segment is obtained from them; the emotion label is linearly spliced with the corresponding text data and a sub-speech segment identifier is added; the speech recognition result of each sub-speech segment is then compared with the standard speech recognition result, and the proportion of sub-speech segments whose speech recognition results are consistent is counted. In this way, the accuracy of the speech recognition result in the selected application scenario can be verified efficiently and accurately.

Description

Voice recognition result testing method and device, computer equipment and medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for testing speech recognition results, a computer device, and a storage medium.
Background
With the development of science and technology, artificial intelligence is applied in more and more fields, bringing convenience to people's work and daily life. As an important component of artificial intelligence, speech recognition technology is likewise being developed and applied.
Among speech recognition technologies, ASR (Automatic Speech Recognition), which converts human speech into text, is currently the most widely used. Speech recognition is a multidisciplinary field closely connected with acoustics, phonetics, linguistics, digital signal processing theory, information theory, computer science and so on. Because of the diversity and complexity of speech signals, a speech recognition system can only achieve satisfactory performance under certain constraints, and its performance depends on many factors. Since these factors differ across application environments, the accuracy of ASR emotion recognition can easily be low in a particular application scenario; if the ASR output is not verified, recognition errors can easily go unnoticed and the business requirements cannot be met.
Therefore, it is necessary to provide an accurate speech recognition result test scheme.
Disclosure of Invention
In view of the above, it is necessary to provide a method, an apparatus, a computer device and a storage medium for accurately testing speech recognition results.
A method of testing speech recognition results, the method comprising:
randomly selecting user reply voice data based on a preset dialogue script in any application scenario;
acquiring a user speech segment in the user reply voice data, dividing the user speech segment into a plurality of sub-speech segments of preset length, and assigning sub-speech segment identifiers;
extracting the acoustic features of each sub-speech segment, and acquiring the emotion label of each sub-speech segment according to the acoustic features;
acquiring the text data corresponding to each sub-speech segment by using a speech recognition technology, linearly splicing the emotion label of each sub-speech segment with the corresponding text data, and adding the sub-speech segment identifier between the emotion label and the text data to obtain the speech recognition result of each sub-speech segment; and
comparing the speech recognition result of each sub-speech segment, according to its sub-speech segment identifier, with the speech recognition result of the corresponding sub-speech segment carried in the preset standard speech recognition result for the selected application scenario, and counting the proportion of sub-speech segments whose speech recognition results are consistent, to obtain the accuracy of the speech recognition result in the selected application scenario.
In one embodiment, the extracting the acoustic features of the sub-speech segments, and the obtaining the emotion labels of the sub-speech segments according to the acoustic features includes:
extracting acoustic features of each sub-speech segment;
and inputting the extracted acoustic features into a trained neural network model based on deep learning to obtain the emotion label.
In one embodiment, the speech recognition result testing method further includes:
acquiring reply voice sample data corresponding to different emotion tags;
extracting time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics in the reply voice sample data;
and training a deep learning-based neural network model by taking the emotion labels in the reply voice sample data and the corresponding time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics as training data to obtain the trained deep learning-based neural network model.
In one embodiment, training the deep-learning-based neural network model to obtain the trained deep-learning-based neural network model includes:
extracting the emotion labels in the training data and the corresponding time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics;
training the convolutional neural network part of the deep-learning-based neural network to learn local emotion labels according to the extracted feature data;
and abstracting the local emotion labels through the recurrent neural network part and learning a global emotion label through the pooling layer of the deep-learning-based neural network, to obtain the trained deep-learning-based neural network model.
In one embodiment, the extracting the acoustic features of the sub-speech segments and the obtaining the emotion labels of the sub-speech segments according to the acoustic features includes:
obtaining the emotion label according to the extracted acoustic features of each sub-speech segment and a qualitative acoustic feature analysis table corresponding to the preset emotion labels;
the qualitative acoustic feature analysis table corresponding to the preset emotion labels carries the emotion labels, the acoustic features and the qualitative analysis interval data of the acoustic features corresponding to different emotion labels, wherein the acoustic features include speech rate, average fundamental frequency, fundamental frequency range, intensity, voice quality, fundamental frequency variation and clarity.
In one embodiment, after verifying the accuracy of the speech recognition and the emotion tag in the selected application scenario, the method further includes:
waiting for a preset delay, and returning to the step of randomly selecting user reply voice data based on a preset dialogue script in any application scenario.
A speech recognition result testing apparatus, the apparatus comprising:
the data acquisition module is used for randomly selecting user reply voice data based on a preset dialogue script in any application scenario;
the dividing module is used for acquiring a user speech segment in the user reply voice data, dividing the user speech segment into a plurality of sub-speech segments of preset length, and assigning sub-speech segment identifiers;
the feature extraction module is used for extracting the acoustic features of each sub-speech segment and acquiring the emotion label of each sub-speech segment according to the acoustic features;
the splicing combination module is used for acquiring the text data corresponding to each sub-speech segment by using a speech recognition technology, linearly splicing the emotion label of each sub-speech segment with the corresponding text data, and adding the sub-speech segment identifier between the emotion label and the text data to obtain the speech recognition result of each sub-speech segment;
and the test module is used for comparing the speech recognition result of each sub-speech segment, according to its sub-speech segment identifier, with the speech recognition result of the corresponding sub-speech segment carried in the preset standard speech recognition result for the selected application scenario, and counting the proportion of sub-speech segments whose speech recognition results are consistent, to obtain the accuracy of the speech recognition result in the selected application scenario.
A computer device comprising a memory storing a computer program and a processor implementing the steps of the method as described above when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method as described above.
According to the speech recognition result testing method, apparatus, computer device and storage medium, user reply voice data based on a preset dialogue script in any application scenario is randomly selected; the user speech segment in the user reply voice data is divided into a plurality of sub-speech segments of preset length; the acoustic features of each sub-speech segment are extracted and the emotion label of each sub-speech segment is obtained from them; the emotion label is linearly spliced with the corresponding text data and a sub-speech segment identifier is added; the speech recognition result of each sub-speech segment is compared with the standard speech recognition result, and the proportion of sub-speech segments whose speech recognition results are consistent is counted. In this way, the accuracy of the speech recognition result in the selected application scenario can be verified efficiently and accurately.
Drawings
FIG. 1 is a flow diagram illustrating a method for testing speech recognition results in one embodiment;
FIG. 2 is a flowchart illustrating a method for testing speech recognition results according to another embodiment;
FIG. 3 is a flowchart illustrating a method for testing speech recognition results according to another embodiment;
FIG. 4 is a block diagram showing the structure of a speech recognition result testing apparatus according to an embodiment;
FIG. 5 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clearly understood, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
As shown in fig. 1, there is provided a speech recognition result testing method, including the steps of:
S100: randomly selecting user reply voice data based on a preset dialogue script in any application scenario.
The preset dialogue script is dialogue script data written for different application scenarios; specifically, it contains question-and-answer data simulating a conversation between a customer and an operator (service staff) in a real environment. Optionally, the dialogue scripts for different application scenarios can be collected and stored in a database, so that each application scenario has a corresponding dialogue script in the database. Application scenarios include loan marketing, repayment reminders, loan consultation and the like. The server simulates the question-and-answer voice data replied on the basis of the preset dialogue script in a given application scenario. Specifically, an application scenario set can be constructed from the scenarios that need to be verified, and any scenario in the set is selected as the current test scenario, as sketched below.
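As an illustration only, the following Python sketch shows one way such a random selection could be organised. The SCENARIO_SCRIPTS layout, the file names and the pick_test_case helper are hypothetical and are not part of the patent.

```python
import random

# Hypothetical in-memory layout: each application scenario maps to a list of
# pre-recorded user reply voice files produced from its preset dialogue script.
SCENARIO_SCRIPTS = {
    "loan_marketing": ["loan_marketing/reply_001.wav", "loan_marketing/reply_002.wav"],
    "repayment_reminder": ["repayment_reminder/reply_001.wav"],
    "loan_consultation": ["loan_consultation/reply_001.wav"],
}

def pick_test_case(scenario_set=None):
    """Randomly choose a scenario to test and one user reply recording for it."""
    scenarios = scenario_set or list(SCENARIO_SCRIPTS)
    scenario = random.choice(scenarios)                 # current test scenario
    reply_audio = random.choice(SCENARIO_SCRIPTS[scenario])
    return scenario, reply_audio
```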
S200: acquiring a user speech segment in the user reply voice data, dividing the user speech segment into a plurality of sub-speech segments of preset length, and assigning sub-speech segment identifiers.
The server intercepts the reply voice and divides the user speech segment in it into sub-speech segments of preset length. Specifically, the preset length is relatively short, for example 3-5 seconds; that is, the user speech segment is divided into sub-speech segments 3-5 seconds long.
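A minimal sketch of this segmentation step, assuming the user speech segment is already available as an array of audio samples. The 4-second default, the split_into_sub_segments name and the letter-based identifiers are illustrative choices, not requirements of the patent.

```python
import string

def split_into_sub_segments(samples, sample_rate, seg_seconds=4):
    """Split a user speech segment (raw samples) into fixed-length sub-segments
    and assign each one an identifier (A, B, ..., Z, AA, BB, ...)."""
    seg_len = int(seg_seconds * sample_rate)
    sub_segments = {}
    for index, start in enumerate(range(0, len(samples), seg_len)):
        seg_id = string.ascii_uppercase[index % 26] * (index // 26 + 1)
        sub_segments[seg_id] = samples[start:start + seg_len]
    return sub_segments
```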
S300: extracting the acoustic features of each sub-speech segment, and acquiring the emotion label of each sub-speech segment according to the acoustic features.
The acoustic features include the waveform, signal and pitch information of the speech, and the like. The emotion labels include neutral, happy, sad, angry, surprised, afraid, disgusted, excited and the like. Optionally, a window with a preset time interval can be set and acoustic features collected at a fixed frequency to form an acoustic feature set, and the emotion label is then obtained from the acoustic feature set, as sketched below.
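The sketch below illustrates the optional sliding-window collection of acoustic features. The window and hop lengths and the particular descriptors (frame energy and zero-crossing rate) are assumptions for illustration, not features prescribed by the patent.

```python
import numpy as np

def collect_acoustic_features(samples, sample_rate, win_seconds=0.5, hop_seconds=0.25):
    """Collect simple frame-level acoustic descriptors over a sliding window,
    forming the acoustic feature set for one sub-speech segment."""
    win, hop = int(win_seconds * sample_rate), int(hop_seconds * sample_rate)
    features = []
    for start in range(0, max(len(samples) - win, 0) + 1, hop):
        frame = np.asarray(samples[start:start + win], dtype=float)
        energy = float(np.mean(frame ** 2))                        # amplitude-related
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame)))) / 2)  # crude pitch/clarity proxy
        features.append((energy, zcr))
    return features
```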
S400: acquiring the text data corresponding to each sub-speech segment by using a speech recognition technology, linearly splicing the emotion label of each sub-speech segment with the corresponding text data, and adding the sub-speech segment identifier between the emotion label and the text data to obtain the speech recognition result of each sub-speech segment.
Each sub-speech segment is treated as an object of study. The emotion label of a sub-speech segment is linearly spliced with its corresponding text data; the linear splicing process can be understood as simple concatenation, that is, the two parts of data are joined together, with the sub-speech segment identifier added between them so that the speech recognition result of each sub-speech segment can be accurately distinguished later. Specifically, linear splicing can simply be understood as concatenating the text data with the emotion label: for example, if the text data corresponding to a certain sub-speech segment is "ok", its emotion label is "happy" and its identifier is A, the resulting speech recognition result is "ok A happy".
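A one-line sketch of the linear splicing described above. The space separator, the ordering (text, identifier, emotion label) and the function name are assumptions added for readability; the patent only requires that the identifier be placed between the emotion label and the text data.

```python
def splice_recognition_result(text, seg_id, emotion):
    """Linear splicing of step S400: concatenate the recognized text and the
    emotion label with the sub-speech segment identifier in between, e.g.
    splice_recognition_result("ok", "A", "happy") -> "ok A happy"."""
    return f"{text} {seg_id} {emotion}"
```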
S500: comparing the speech recognition result of each sub-speech segment, according to its sub-speech segment identifier, with the speech recognition result of the corresponding sub-speech segment carried in the preset standard speech recognition result for the selected application scenario, and counting the proportion of sub-speech segments whose speech recognition results are consistent, to obtain the accuracy of the speech recognition result in the selected application scenario.
The standard speech recognition results are obtained by analysing the dialogue script on the basis of expert experience data. They can also be written into the preset dialogue script database; that is, the database stores the correspondence between each dialogue script file and the standard speech recognition result of each of its sub-speech segments, and each standard speech recognition result carries the text data corresponding to the sub-speech segment, the sub-speech segment identifier and the corresponding emotion label. The user reply voice data of the preset dialogue script for each application scenario contains several sub-speech segments. The number of sub-speech segments whose speech recognition result is consistent with the corresponding standard speech recognition result in the selected application scenario is recorded, and the ratio of this number to the total number of sub-speech segments in the user reply voice data is calculated; this ratio is the accuracy of the speech recognition result in the selected application scenario. For example, suppose there are currently 3 sub-speech segments (in practice there are far more) whose speech recognition results are "hello A happy", "don't want B neutral" and "goodbye C aversion", while the corresponding standard speech recognition results are "hello A neutral", "don't want B neutral" and "goodbye C aversion"; two of the three results match, so the accuracy of the speech recognition result in the selected application scenario is 66.7%. Optionally, after the accuracy of speech recognition and emotion labels in the currently selected application scenario has been tested, a new application scenario can be selected and the speech recognition result testing process repeated.
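The comparison and accuracy statistic of step S500 can be sketched as follows, reusing the three-segment example above. The dictionary layout keyed by sub-speech segment identifier is an assumption made for the sketch.

```python
def recognition_accuracy(test_results, standard_results):
    """Compare each sub-speech segment's recognition result with the standard
    result carrying the same identifier and return the ratio of matches."""
    matches = sum(
        1 for seg_id, result in test_results.items()
        if standard_results.get(seg_id) == result
    )
    return matches / len(test_results) if test_results else 0.0

# Worked example mirroring the 3-sub-segment illustration above (hypothetical text):
test = {"A": "hello A happy", "B": "don't want B neutral", "C": "goodbye C aversion"}
standard = {"A": "hello A neutral", "B": "don't want B neutral", "C": "goodbye C aversion"}
print(recognition_accuracy(test, standard))  # 2 of 3 match -> about 0.667
```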
According to the speech recognition result testing method, user reply voice data based on a preset dialogue script in any application scenario is randomly selected; the user speech segment in the user reply voice data is divided into a plurality of sub-speech segments of preset length; the acoustic features of each sub-speech segment are extracted and the emotion label of each sub-speech segment is obtained from them; the emotion label is linearly spliced with the corresponding text data and a sub-speech segment identifier is added; the speech recognition result of each sub-speech segment is compared with the standard speech recognition result, and the proportion of sub-speech segments whose speech recognition results are consistent is counted. In this way, the accuracy of the speech recognition result in the selected application scenario can be verified efficiently and accurately.
As shown in fig. 2, in one embodiment, step S300 includes:
S320: extracting the acoustic features of each sub-speech segment.
S340: inputting the extracted acoustic features into a trained deep-learning-based neural network model to obtain the emotion label.
The acoustic features can be further divided into time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics, and the trained deep-learning-based neural network model has learned the correspondence between these features and the corresponding emotion labels.
As shown in fig. 3, in one embodiment, step S300 further includes:
S312: acquiring reply voice sample data corresponding to different emotion labels.
S314: extracting the time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics in the reply voice sample data.
S316: training a deep-learning-based neural network model by taking the emotion labels in the reply voice sample data and the corresponding time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics as training data, to obtain the trained deep-learning-based neural network model.
The trained model serves as an emotion label recognition model. When an emotion label needs to be acquired, the extracted acoustic feature data is input into this model to obtain the emotion label corresponding to the utterance, and the emotion label is spliced with the corresponding text data to obtain the speech recognition result.
In one embodiment, training the deep-learning-based neural network model to obtain the trained deep-learning-based neural network model includes: extracting the emotion labels in the training data and the corresponding time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics; training the convolutional neural network part of the deep-learning-based neural network to learn local emotion labels according to the extracted feature data; and abstracting the local emotion labels through the recurrent neural network part and learning a global emotion label through the pooling layer of the deep-learning-based neural network, to obtain the trained deep-learning-based neural network model.
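A hedged sketch of such a model in PyTorch, assuming frame-level acoustic feature vectors as input: a convolutional part captures local emotion cues, a recurrent part abstracts them over time, and a pooling layer aggregates a global emotion label. The layer sizes, the GRU choice for the recurrent part and the average pooling are illustrative assumptions, not values specified by the patent.

```python
import torch
import torch.nn as nn

class EmotionTagger(nn.Module):
    """Sketch of a CNN + RNN + pooling emotion-label model over acoustic features."""

    def __init__(self, feature_dim=40, num_emotions=8):
        super().__init__()
        self.conv = nn.Sequential(                      # local emotion cues
            nn.Conv1d(feature_dim, 64, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.rnn = nn.GRU(64, 64, batch_first=True)     # temporal abstraction
        self.pool = nn.AdaptiveAvgPool1d(1)             # global pooling over time
        self.classifier = nn.Linear(64, num_emotions)

    def forward(self, x):                               # x: (batch, time, feature_dim)
        h = self.conv(x.transpose(1, 2))                # -> (batch, 64, time)
        h, _ = self.rnn(h.transpose(1, 2))              # -> (batch, time, 64)
        g = self.pool(h.transpose(1, 2)).squeeze(-1)    # -> (batch, 64)
        return self.classifier(g)                       # emotion-label logits
```

In keeping with the embodiment above, such a model would be trained on the emotion labels and construction characteristics extracted from the reply voice sample data, typically with a cross-entropy loss; the loss choice is an assumption of this sketch.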
In one embodiment, extracting the acoustic features of each sub-speech segment and obtaining the emotion label of each sub-speech segment according to the acoustic features includes: obtaining the emotion label according to the extracted acoustic features of each sub-speech segment and a qualitative acoustic feature analysis table corresponding to the preset emotion labels; the qualitative acoustic feature analysis table corresponding to the preset emotion labels carries the emotion labels, the acoustic features and the qualitative analysis interval data of the acoustic features corresponding to different emotion labels, and the acoustic features include speech rate, average fundamental frequency, fundamental frequency range, intensity, voice quality, fundamental frequency variation and clarity.
Different emotion labels correspond to different qualitative analysis intervals of the acoustic features. The qualitative analysis intervals can be divided in advance into several interval values according to the type of acoustic feature; for example, speech rate can be graded as very fast, slightly slow, fast or slow, and very slow. More specifically, qualitative analysis is performed on the speech rate, average fundamental frequency, fundamental frequency range, intensity, voice quality, fundamental frequency variation and clarity corresponding to each candidate emotion label to obtain a qualitative analysis result, and the emotion label is then determined from the acoustic features currently extracted from each sub-speech segment together with the corresponding qualitative analysis results. Furthermore, emotion label feature templates can be constructed from the qualitative analysis results of the different emotion labels; when an emotion label needs to be recognized, the collected features are matched against the emotion label feature templates to determine the emotion label. In practical applications, the qualitative analysis is carried out as follows. For speech rate, the grades are very fast, slightly slow, fast or slow, and very slow; specifically, the average number of words per unit time corresponding to each emotion label can be obtained from historical sample data, and the word-count interval per unit time for each emotion label, together with the relative order of the speech rates of the different emotion labels, is used to set the qualitative grade of each emotion label. The judgments for average fundamental frequency, fundamental frequency range, intensity, voice quality, fundamental frequency variation and clarity are made in a similar way, by drawing judgment intervals from sample data and relative relations and analysing the collected sound data against them. The grades of average fundamental frequency include very high, high, slightly low and very low; the fundamental frequency range includes very wide and narrow; the intensity includes normal, higher and lower; the voice quality includes irregular voicing, breathy, resonant, and breathy and blaring; the fundamental frequency variation includes normal, abrupt changes on stressed syllables, downward inflections, smooth upward inflections, and extreme downward inflections; and the clarity includes precise, tense, slurred and normal. The grades are specified in the table below.
(Table: qualitative analysis intervals of the acoustic features for each emotion label; reproduced in the original filing as Figure BDA0002140470220000081.)
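Since the table itself is only available as an image in the filing, the sketch below merely illustrates how a qualitative-interval lookup of this kind could be matched against extracted features. Every interval value in QUALITATIVE_TABLE is a placeholder, not data from the patent.

```python
# Placeholder intervals: (lower bound, upper bound) per acoustic feature and emotion label.
QUALITATIVE_TABLE = {
    "happy":   {"speech_rate": (4.5, 7.0), "mean_f0": (180, 260), "intensity": (65, 80)},
    "sad":     {"speech_rate": (1.5, 3.0), "mean_f0": (100, 160), "intensity": (45, 60)},
    "neutral": {"speech_rate": (3.0, 4.5), "mean_f0": (140, 200), "intensity": (55, 70)},
}

def match_emotion(features):
    """Return the emotion label whose qualitative intervals cover the most features.
    `features` is a dict such as {"speech_rate": 5.1, "mean_f0": 210, "intensity": 70}."""
    def score(intervals):
        return sum(lo <= features.get(name, float("nan")) <= hi
                   for name, (lo, hi) in intervals.items())
    return max(QUALITATIVE_TABLE, key=lambda label: score(QUALITATIVE_TABLE[label]))
```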
in one embodiment, after verifying the accuracy of speech recognition and emotion tag in the selected application scenario, the method further includes: and delaying preset time, and returning to the step of randomly selecting the user reply voice data based on the preset dialogues under any application scene.
In addition to speech recognition testing in a normal environment, speech recognition testing under noise can be carried out in a targeted manner: user reply voice data based on the preset dialogue script is collected in a noisy environment for the selected application scenario, and the test process is repeated with the collected user reply voice data as the test input, giving the speech recognition test result for the noisy environment. Furthermore, the speech recognition effect under far-field conditions can be tested simply by using user reply voice data collected at a distance as the test data and repeating the test process.
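A small sketch of how a noisy test input could be simulated for this purpose. Mixing synthetic white noise at a chosen signal-to-noise ratio is an assumed stand-in for recording replies in a genuinely noisy environment, and add_noise is a hypothetical helper, not part of the patent.

```python
import numpy as np

def add_noise(samples, snr_db=10, rng=None):
    """Create a noisy copy of user reply audio at roughly the given SNR (in dB)."""
    rng = rng or np.random.default_rng()
    clean = np.asarray(samples, dtype=float)
    signal_power = np.mean(clean ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=clean.shape)
    return clean + noise
```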
It should be understood that although the steps in the flow diagrams of FIGS. 1-3 are shown in an order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the execution order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in FIGS. 1-3 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
As shown in fig. 4, a speech recognition result testing apparatus includes:
the data acquisition module 100 is configured to randomly select user reply voice data based on a preset dialogue script in any application scenario;
the dividing module 200 is configured to acquire a user speech segment in the user reply voice data, divide the user speech segment into a plurality of sub-speech segments of preset length, and assign sub-speech segment identifiers;
the feature extraction module 300 is configured to extract the acoustic features of each sub-speech segment and acquire the emotion label of each sub-speech segment according to the acoustic features;
the splicing combination module 400 is configured to acquire the text data corresponding to each sub-speech segment by using a speech recognition technology, linearly splice the emotion label of each sub-speech segment with the corresponding text data, and add the sub-speech segment identifier between the emotion label and the text data to obtain the speech recognition result of each sub-speech segment;
and the test module 500 is configured to compare the speech recognition result of each sub-speech segment, according to its sub-speech segment identifier, with the speech recognition result of the corresponding sub-speech segment carried in the preset standard speech recognition result for the selected application scenario, count the proportion of sub-speech segments whose speech recognition results are consistent, and obtain the accuracy of the speech recognition result in the selected application scenario.
According to the speech recognition result testing apparatus, user reply voice data based on a preset dialogue script in any application scenario is randomly selected; the user speech segment in the user reply voice data is divided into a plurality of sub-speech segments of preset length; the acoustic features of each sub-speech segment are extracted and the emotion label of each sub-speech segment is obtained from them; the emotion label is linearly spliced with the corresponding text data and a sub-speech segment identifier is added; the speech recognition result of each sub-speech segment is compared with the standard speech recognition result, and the proportion of sub-speech segments whose speech recognition results are consistent is counted. In this way, the accuracy of the speech recognition result in the selected application scenario can be verified efficiently and accurately.
In one embodiment, the feature extraction module 300 is further configured to extract acoustic features of each sub-speech segment; and inputting the extracted acoustic features into a trained neural network model based on deep learning to obtain the emotion label.
In one embodiment, the feature extraction module 300 is further configured to obtain reply voice sample data corresponding to different emotion tags; extracting time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics in the reply voice sample data; and training a neural network model based on deep learning by taking the emotion labels in the reply voice sample data and the corresponding time structural features, amplitude structural features, fundamental frequency structural features and formant structural features as training data to obtain the trained neural network model based on deep learning.
In one embodiment, the feature extraction module 300 is further configured to extract the emotion labels in the training data and the corresponding time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics; train the convolutional neural network part of the deep-learning-based neural network to learn local emotion labels according to the extracted feature data; and abstract the local emotion labels through the recurrent neural network part and learn a global emotion label through the pooling layer of the deep-learning-based neural network, to obtain the trained deep-learning-based neural network model.
In one embodiment, the feature extraction module 300 is further configured to obtain the emotion label according to the extracted acoustic features of each sub-speech segment and the qualitative acoustic feature analysis table corresponding to the preset emotion labels; the qualitative acoustic feature analysis table corresponding to the preset emotion labels carries the emotion labels, the acoustic features and the qualitative analysis interval data of the acoustic features corresponding to different emotion labels, and the acoustic features include speech rate, average fundamental frequency, fundamental frequency range, intensity, voice quality, fundamental frequency variation and clarity.
In one embodiment, the speech recognition result testing apparatus further includes a loop testing module, configured to wait for a preset delay and then control the data acquisition module 100, the dividing module 200, the feature extraction module 300, the splicing combination module 400 and the test module 500 to execute the corresponding operations again.
For the specific definition of the speech recognition result testing apparatus, reference may be made to the definition of the speech recognition result testing method above, and details are not repeated here. All or part of the modules in the speech recognition result testing apparatus may be implemented by software, by hardware, or by a combination of the two. The modules may be embedded in or independent of a processor of the computer device in the form of hardware, or stored in a memory of the computer device in the form of software, so that the processor can invoke and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server whose internal structure may be as shown in FIG. 5. The computer device includes a processor, a memory, a network interface and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing preset dialogue scripts and historical expert data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a speech recognition result testing method.
Those skilled in the art will appreciate that the structure shown in FIG. 5 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a particular computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
randomly selecting user reply voice data based on a preset dialogue script in any application scenario;
acquiring a user speech segment in the user reply voice data, dividing the user speech segment into a plurality of sub-speech segments of preset length, and assigning sub-speech segment identifiers;
extracting the acoustic features of each sub-speech segment, and acquiring the emotion label of each sub-speech segment according to the acoustic features;
acquiring the text data corresponding to each sub-speech segment by using a speech recognition technology, linearly splicing the emotion label of each sub-speech segment with the corresponding text data, and adding the sub-speech segment identifier between the emotion label and the text data to obtain the speech recognition result of each sub-speech segment;
and comparing the speech recognition result of each sub-speech segment, according to its sub-speech segment identifier, with the speech recognition result of the corresponding sub-speech segment carried in the preset standard speech recognition result for the selected application scenario, and counting the proportion of sub-speech segments whose speech recognition results are consistent, to obtain the accuracy of the speech recognition result in the selected application scenario.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
extracting acoustic features of each sub-speech segment; and inputting the extracted acoustic features into a trained neural network model based on deep learning to obtain the emotion label.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
acquiring reply voice sample data corresponding to different emotion tags; extracting time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics in the reply voice sample data; and training a neural network model based on deep learning by taking the emotion labels in the reply voice sample data and the corresponding time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics as training data to obtain the trained neural network model based on deep learning.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
extracting the emotion labels in the training data and the corresponding time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics; training the convolutional neural network part of the deep-learning-based neural network to learn local emotion labels according to the extracted feature data; and abstracting the local emotion labels through the recurrent neural network part and learning a global emotion label through the pooling layer of the deep-learning-based neural network, to obtain the trained deep-learning-based neural network model.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
obtaining the emotion label according to the extracted acoustic features of each sub-speech segment and the qualitative acoustic feature analysis table corresponding to the preset emotion labels; the qualitative acoustic feature analysis table corresponding to the preset emotion labels carries the emotion labels, the acoustic features and the qualitative analysis interval data of the acoustic features corresponding to different emotion labels, and the acoustic features include speech rate, average fundamental frequency, fundamental frequency range, intensity, voice quality, fundamental frequency variation and clarity.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
waiting for a preset delay, and returning to the step of randomly selecting user reply voice data based on a preset dialogue script in any application scenario.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
randomly selecting user reply voice data based on a preset dialogue script in any application scenario;
acquiring a user speech segment in the user reply voice data, dividing the user speech segment into a plurality of sub-speech segments of preset length, and assigning sub-speech segment identifiers;
extracting the acoustic features of each sub-speech segment, and acquiring the emotion label of each sub-speech segment according to the acoustic features;
acquiring the text data corresponding to each sub-speech segment by using a speech recognition technology, linearly splicing the emotion label of each sub-speech segment with the corresponding text data, and adding the sub-speech segment identifier between the emotion label and the text data to obtain the speech recognition result of each sub-speech segment;
and comparing the speech recognition result of each sub-speech segment, according to its sub-speech segment identifier, with the speech recognition result of the corresponding sub-speech segment carried in the preset standard speech recognition result for the selected application scenario, and counting the proportion of sub-speech segments whose speech recognition results are consistent, to obtain the accuracy of the speech recognition result in the selected application scenario.
In one embodiment, the computer program when executed by the processor further performs the steps of:
extracting acoustic features of each sub-speech segment; and inputting the extracted acoustic features into a trained neural network model based on deep learning to obtain the emotion label.
In one embodiment, the computer program when executed by the processor further performs the steps of:
acquiring reply voice sample data corresponding to different emotion labels; extracting time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics in the reply voice sample data; and training a neural network model based on deep learning by taking the emotion labels in the reply voice sample data and the corresponding time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics as training data to obtain the trained neural network model based on deep learning.
In one embodiment, the computer program when executed by the processor further performs the steps of:
extracting the emotion labels in the training data and the corresponding time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics; training the convolutional neural network part of the deep-learning-based neural network to learn local emotion labels according to the extracted feature data; and abstracting the local emotion labels through the recurrent neural network part and learning a global emotion label through the pooling layer of the deep-learning-based neural network, to obtain the trained deep-learning-based neural network model.
In one embodiment, the computer program when executed by the processor further performs the steps of:
obtaining the emotion label according to the extracted acoustic features of each sub-speech segment and the qualitative acoustic feature analysis table corresponding to the preset emotion labels; the qualitative acoustic feature analysis table corresponding to the preset emotion labels carries the emotion labels, the acoustic features and the qualitative analysis interval data of the acoustic features corresponding to different emotion labels, and the acoustic features include speech rate, average fundamental frequency, fundamental frequency range, intensity, voice quality, fundamental frequency variation and clarity.
In one embodiment, the computer program when executed by the processor further performs the steps of:
waiting for a preset delay, and returning to the step of randomly selecting user reply voice data based on a preset dialogue script in any application scenario.
It will be understood by those skilled in the art that all or part of the processes of the methods in the above embodiments may be implemented by instructing the relevant hardware through a computer program. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered to be within the scope of this specification.
The above examples express only several embodiments of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make several variations and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method of testing speech recognition results, the method comprising:
randomly selecting user reply voice data based on a preset dialogue script in any application scenario;
acquiring a user speech segment in the user reply voice data, dividing the user speech segment into a plurality of sub-speech segments of preset length, and assigning sub-speech segment identifiers;
extracting the acoustic features of each sub-speech segment, and acquiring the emotion label of each sub-speech segment according to the acoustic features;
acquiring the text data corresponding to each sub-speech segment by using a speech recognition technology, linearly splicing the emotion label of each sub-speech segment with the corresponding text data, and adding the sub-speech segment identifier between the emotion label and the text data to obtain the speech recognition result of each sub-speech segment; and comparing the speech recognition result of each sub-speech segment, according to its sub-speech segment identifier, with the speech recognition result of the corresponding sub-speech segment carried in the preset standard speech recognition result for the selected application scenario, and counting the proportion of sub-speech segments whose speech recognition results are consistent, to obtain the accuracy of the speech recognition result in the selected application scenario.
2. The method of claim 1, wherein the extracting the acoustic features of the sub-speech segments, and the obtaining the emotion label of each sub-speech segment according to the acoustic features comprises:
extracting acoustic features of each sub-speech segment;
and inputting the extracted acoustic features into a trained neural network model based on deep learning to obtain the emotion label.
3. The method of claim 2, further comprising:
acquiring reply voice sample data corresponding to different emotion labels;
extracting time construction features, amplitude construction features, fundamental frequency construction features and formant construction features in the reply voice sample data;
and training a deep learning-based neural network model by taking the emotion labels in the reply voice sample data and the corresponding time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics as training data to obtain the trained deep learning-based neural network model.
4. The method of claim 3, wherein training the deep learning based neural network model, and wherein obtaining the trained deep learning based neural network model comprises:
extracting emotion labels in the training data and corresponding time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics;
training the convolutional neural network part of the deep-learning-based neural network to learn local emotion labels according to the extracted feature data;
abstracting the local emotion labels through the recurrent neural network part, and learning a global emotion label through the pooling layer of the deep-learning-based neural network, to obtain the trained deep-learning-based neural network model.
5. The method of claim 1, wherein the extracting the acoustic features of each sub-speech segment, and the obtaining the emotion label of each sub-speech segment according to the acoustic features comprises:
obtaining the emotion label according to the extracted acoustic features of each sub-speech segment and a qualitative acoustic feature analysis table corresponding to the preset emotion labels;
the qualitative acoustic feature analysis table corresponding to the preset emotion labels carries the emotion labels, the acoustic features and the qualitative analysis interval data of the acoustic features corresponding to different emotion labels, wherein the acoustic features include speech rate, average fundamental frequency, fundamental frequency range, intensity, voice quality, fundamental frequency variation and clarity.
6. The method of claim 1, after obtaining the accuracy of the speech recognition result in the selected application scenario, further comprising:
waiting for a preset delay, and returning to the step of randomly selecting user reply voice data based on a preset dialogue script in any application scenario.
7. A speech recognition result testing apparatus, comprising:
the data acquisition module is used for randomly selecting user reply voice data based on a preset dialogue script in any application scenario;
the dividing module is used for acquiring a user speech segment in the user reply voice data, dividing the user speech segment into a plurality of sub-speech segments of preset length, and assigning sub-speech segment identifiers;
the feature extraction module is used for extracting the acoustic features of each sub-speech segment and acquiring the emotion label of each sub-speech segment according to the acoustic features;
the splicing combination module is used for acquiring the text data corresponding to each sub-speech segment by using a speech recognition technology, linearly splicing the emotion label of each sub-speech segment with the corresponding text data, and adding the sub-speech segment identifier between the emotion label and the text data to obtain the speech recognition result of each sub-speech segment;
and the test module is used for comparing the speech recognition result of each sub-speech segment, according to its sub-speech segment identifier, with the speech recognition result of the corresponding sub-speech segment carried in the preset standard speech recognition result for the selected application scenario, and counting the proportion of sub-speech segments whose speech recognition results are consistent, to obtain the accuracy of the speech recognition result in the selected application scenario.
8. The apparatus of claim 7, wherein the feature extraction module is further configured to extract acoustic features of each sub-speech segment, and input the extracted acoustic features into a trained neural network model based on deep learning to obtain the emotion label.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN201910667054.6A 2019-07-23 2019-07-23 Voice recognition result testing method and device, computer equipment and medium Active CN110556098B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910667054.6A CN110556098B (en) 2019-07-23 2019-07-23 Voice recognition result testing method and device, computer equipment and medium
PCT/CN2019/116960 WO2021012495A1 (en) 2019-07-23 2019-11-11 Method and device for verifying speech recognition result, computer apparatus, and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910667054.6A CN110556098B (en) 2019-07-23 2019-07-23 Voice recognition result testing method and device, computer equipment and medium

Publications (2)

Publication Number Publication Date
CN110556098A CN110556098A (en) 2019-12-10
CN110556098B true CN110556098B (en) 2023-04-18

Family

ID=68735961

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910667054.6A Active CN110556098B (en) 2019-07-23 2019-07-23 Voice recognition result testing method and device, computer equipment and medium

Country Status (2)

Country Link
CN (1) CN110556098B (en)
WO (1) WO2021012495A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021134550A1 (en) * 2019-12-31 2021-07-08 李庆远 Manual combination and training of multiple speech recognition outputs
CN111522943A (en) * 2020-03-25 2020-08-11 平安普惠企业管理有限公司 Automatic test method, device, equipment and storage medium for logic node
CN112349290B (en) * 2021-01-08 2021-04-20 北京海天瑞声科技股份有限公司 Triple-based speech recognition accuracy rate calculation method

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8595005B2 (en) * 2010-05-31 2013-11-26 Simple Emotion, Inc. System and method for recognizing emotional state from a speech signal
CN104464757B (en) * 2014-10-28 2019-01-18 科大讯飞股份有限公司 Speech evaluating method and speech evaluating device
CN105741832B (en) * 2016-01-27 2020-01-07 广东外语外贸大学 Spoken language evaluation method and system based on deep learning
US9870765B2 (en) * 2016-06-03 2018-01-16 International Business Machines Corporation Detecting customers with low speech recognition accuracy by investigating consistency of conversation in call-center
CN107767881B (en) * 2016-08-15 2020-08-18 中国移动通信有限公司研究院 Method and device for acquiring satisfaction degree of voice information
CN106548772A (en) * 2017-01-16 2017-03-29 上海智臻智能网络科技股份有限公司 Speech recognition test system and method
CN108538296A (en) * 2017-03-01 2018-09-14 广东神马搜索科技有限公司 Speech recognition test method and test terminal
CN107086040B (en) * 2017-06-23 2021-03-02 歌尔股份有限公司 Voice recognition capability test method and device
CN107452404A (en) * 2017-07-31 2017-12-08 哈尔滨理工大学 The method for optimizing of speech emotion recognition
CN108777141B (en) * 2018-05-31 2022-01-25 康键信息技术(深圳)有限公司 Test apparatus, test method, and storage medium
CN109272993A (en) * 2018-08-21 2019-01-25 中国平安人寿保险股份有限公司 Recognition methods, device, computer equipment and the storage medium of voice class

Also Published As

Publication number Publication date
CN110556098A (en) 2019-12-10
WO2021012495A1 (en) 2021-01-28

Similar Documents

Publication Publication Date Title
CN110781916B (en) Fraud detection method, apparatus, computer device and storage medium for video data
WO2021128741A1 (en) Voice emotion fluctuation analysis method and apparatus, and computer device and storage medium
CN110120224B (en) Method and device for constructing bird sound recognition model, computer equipment and storage medium
Shahin et al. Novel cascaded Gaussian mixture model-deep neural network classifier for speaker identification in emotional talking environments
WO2020177380A1 (en) Voiceprint detection method, apparatus and device based on short text, and storage medium
CN110556098B (en) Voice recognition result testing method and device, computer equipment and medium
US10573307B2 (en) Voice interaction apparatus and voice interaction method
CN104903954A (en) Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination
CN111311327A (en) Service evaluation method, device, equipment and storage medium based on artificial intelligence
CN109087667B (en) Voice fluency recognition method and device, computer equipment and readable storage medium
CN111597818B (en) Call quality inspection method, device, computer equipment and computer readable storage medium
CN109272993A (en) Recognition methods, device, computer equipment and the storage medium of voice class
CN109658921B (en) Voice signal processing method, equipment and computer readable storage medium
CN111080109A (en) Customer service quality evaluation method and device and electronic equipment
CN111182162A (en) Telephone quality inspection method, device, equipment and storage medium based on artificial intelligence
CN110797032B (en) Voiceprint database establishing method and voiceprint identification method
CN111145733A (en) Speech recognition method, speech recognition device, computer equipment and computer readable storage medium
US11238289B1 (en) Automatic lie detection method and apparatus for interactive scenarios, device and medium
CN112863489B (en) Speech recognition method, apparatus, device and medium
CN110136726A (en) A kind of estimation method, device, system and the storage medium of voice gender
CN112232276A (en) Emotion detection method and device based on voice recognition and image recognition
Heracleous et al. Speech emotion recognition in noisy and reverberant environments
CN111565254B (en) Call data quality inspection method and device, computer equipment and storage medium
Szekrényes et al. Classification of formal and informal dialogues based on turn-taking and intonation using deep neural networks
Poorjam et al. Quality control in remote speech data collection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant