CN110556098B - Voice recognition result testing method and device, computer equipment and medium - Google Patents
- Publication number
- CN110556098B (application number CN201910667054.6A)
- Authority
- CN
- China
- Prior art keywords
- sub
- speech
- emotion
- voice
- recognition result
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The application relates to the field of artificial intelligence and is applied in the speech recognition industry. It provides a speech recognition result testing method, apparatus, computer device, and storage medium. User reply voice data based on a preset dialogue script in any application scenario is randomly selected; the user speech segment in the user reply voice data is divided into a plurality of sub-speech segments of preset length; the acoustic features of each sub-speech segment are extracted; the emotion label of each sub-speech segment is obtained according to the acoustic features; the emotion label is linearly spliced with the corresponding text data, with a sub-speech segment identifier added; the speech recognition result of each sub-speech segment is compared with the standard speech recognition result; and the proportion of sub-speech segments whose speech recognition results are consistent is counted, so that the accuracy of speech recognition results in the selected application scenario can be verified efficiently and accurately.
Description
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for testing speech recognition results, a computer device, and a storage medium.
Background
With the development of science and technology, artificial intelligence has been applied in more and more fields, bringing convenience to people's production and daily life. Speech recognition technology, as an important component of artificial intelligence, has likewise seen rapid development and wide application.
Among speech recognition technologies, ASR (Automatic Speech Recognition) is currently the most widely used; specifically, ASR converts human speech into text. Speech recognition is a multidisciplinary field closely connected to many disciplines, such as acoustics, phonetics, linguistics, digital signal processing theory, information theory, and computer science. Owing to the diversity and complexity of speech signals, speech recognition systems achieve satisfactory performance only under certain constraints, and their performance is affected by many factors. Because these factors vary across application environments, ASR accuracy (including emotion recognition accuracy) can easily be low in particular application scenarios; if the ASR output is not verified, recognition errors can go unnoticed, and business requirements cannot be met.
Therefore, it is necessary to provide an accurate speech recognition result test scheme.
Disclosure of Invention
In view of the above, it is necessary to provide an accurate speech recognition result testing method, apparatus, computer device, and storage medium.
A method of testing speech recognition results, the method comprising:
randomly selecting user reply voice data based on a preset dialogue script in any application scenario;
acquiring a user speech segment in the user reply speech data, dividing the user speech segment into a plurality of sub-speech segments with preset time length, and distributing sub-speech segment identifiers;
extracting the acoustic features of the sub-speech segments, and acquiring the emotion labels of the sub-speech segments according to the acoustic features;
acquiring text data corresponding to each sub-speech segment by using speech recognition technology, linearly splicing the emotion label of each sub-speech segment with the corresponding text data, and adding the sub-speech segment identifier between the emotion label and the text data to obtain the speech recognition result of each sub-speech segment; and

comparing the speech recognition result of each sub-speech segment, one by one according to the sub-speech segment identifier, with the speech recognition result of the corresponding sub-speech segment carried in the preset standard speech recognition result for the selected application scenario, and counting the proportion of sub-speech segments whose speech recognition results are consistent, to obtain the accuracy of the speech recognition results in the selected application scenario.
In one embodiment, the extracting the acoustic features of the sub-speech segments, and the obtaining the emotion labels of the sub-speech segments according to the acoustic features includes:
extracting acoustic features of each sub-speech segment;
and inputting the extracted acoustic features into a trained neural network model based on deep learning to obtain the emotion label.
In one embodiment, the speech recognition result testing method further includes:
acquiring reply voice sample data corresponding to different emotion tags;
extracting time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics in the reply voice sample data;
and training a deep learning-based neural network model by taking the emotion labels in the reply voice sample data and the corresponding time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics as training data to obtain the trained deep learning-based neural network model.
In one embodiment, training the deep-learning-based neural network model to obtain the trained deep-learning-based neural network model includes:
extracting emotion labels in the training data and corresponding time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics;
learning local emotion labels through the convolutional-neural-network part of the deep-learning-based neural network according to the extracted feature data;
and abstracting the local emotion labels through the recurrent-neural-network part, and learning a global emotion label through a pooling layer of the deep-learning-based neural network, to obtain the trained deep-learning-based neural network model.
In one embodiment, the extracting the acoustic features of the sub-speech segments and the obtaining the emotion labels of the sub-speech segments according to the acoustic features includes:
obtaining emotion labels according to the extracted acoustic features of the sub-speech sections and an acoustic feature qualitative analysis table corresponding to the preset emotion labels;
the acoustic-feature qualitative analysis table corresponding to the preset emotion labels carries the emotion labels, the acoustic features, and the qualitative analysis interval data of the acoustic features corresponding to the different emotion labels, wherein the acoustic features include speech rate, average fundamental frequency, fundamental frequency range, intensity, voice quality, fundamental frequency change, and clarity.
In one embodiment, after verifying the accuracy of the speech recognition and the emotion tag in the selected application scenario, the method further includes:
delaying for a preset time, and returning to the step of randomly selecting user reply voice data based on a preset dialogue script in any application scenario.
A speech recognition result testing apparatus, the apparatus comprising:
the data acquisition module is used for randomly selecting user reply voice data based on a preset dialogue script in any application scenario;
the dividing module is used for acquiring a user speech segment in the user reply speech data, dividing the user speech segment into a plurality of sub-speech segments with preset time length and distributing sub-speech segment identifiers;
the feature extraction module is used for extracting the acoustic features of the sub-speech segments and acquiring the emotion labels of the sub-speech segments according to the acoustic features;
the splicing combination module is used for acquiring the text data corresponding to each sub-speech segment by using speech recognition technology, linearly splicing the emotion label of each sub-speech segment with the corresponding text data, and adding the sub-speech segment identifier between the emotion label and the text data to obtain the speech recognition result of each sub-speech segment;
and the test module is used for comparing the speech recognition result of each sub-speech segment, one by one according to the sub-speech segment identifier, with the speech recognition result of the corresponding sub-speech segment carried in the preset standard speech recognition result for the selected application scenario, counting the proportion of sub-speech segments whose speech recognition results are consistent, and obtaining the accuracy of the speech recognition results in the selected application scenario.
A computer device comprising a memory storing a computer program and a processor implementing the steps of the method as described above when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method as described above.
According to the speech recognition result testing method, apparatus, computer device, and storage medium, user reply voice data based on a preset dialogue script in any application scenario is randomly selected; the user speech segment in the user reply voice data is divided into a plurality of sub-speech segments of preset length; the acoustic features of each sub-speech segment are extracted; the emotion label of each sub-speech segment is obtained according to the acoustic features; the emotion label is linearly spliced with the corresponding text data, with a sub-speech segment identifier added; the speech recognition result corresponding to each sub-speech segment is compared with the standard speech recognition result; and the proportion of sub-speech segments whose speech recognition results are consistent is counted, so that the accuracy of speech recognition results in the selected application scenario can be verified efficiently and accurately.
Drawings
FIG. 1 is a flow diagram illustrating a method for testing speech recognition results in one embodiment;
FIG. 2 is a flowchart illustrating a method for testing speech recognition results according to another embodiment;
FIG. 3 is a flowchart illustrating a method for testing speech recognition results according to another embodiment;
FIG. 4 is a block diagram showing the structure of a speech recognition result testing apparatus according to an embodiment;
FIG. 5 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clearly understood, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
As shown in fig. 1, there is provided a speech recognition result testing method, including the steps of:
s100: and randomly selecting the user reply voice data based on the preset dialogues in any application scene.
The preset dialogue script refers to script data written for different application scenarios; specifically, it contains question-and-answer data simulating a conversation between a customer and an operator (service staff) in a real environment. Optionally, the dialogue scripts for the different application scenarios may be collected in advance and stored in a database. Application scenarios include loan marketing, payment collection, loan consultation, and the like. The server simulates question-and-answer voice data produced in response to the preset dialogue script in a given application scenario. Specifically, an application scenario set may be constructed from the application scenarios that need to be verified, and any one scenario in the set is selected as the current test scenario.
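The random selection step above can be sketched as follows. This is a minimal illustration: the database layout, scenario names, and file names are assumptions for demonstration, not details from the patent.

```python
import random

# Hypothetical scenario database: application scenario -> recorded user replies.
SCRIPT_DB = {
    "loan marketing": ["audio_lm_001.wav", "audio_lm_002.wav"],
    "payment collection": ["audio_pc_001.wav"],
    "loan consultation": ["audio_lc_001.wav", "audio_lc_002.wav"],
}

def select_test_sample(scenario_set, db, rng=random):
    """Pick a random scenario from the set to verify, then a random
    user-reply recording produced under that scenario's dialogue script."""
    scenario = rng.choice(sorted(scenario_set))
    recording = rng.choice(db[scenario])
    return scenario, recording

scenario, recording = select_test_sample(set(SCRIPT_DB), SCRIPT_DB)
```

Passing the random source in as a parameter makes the selection reproducible in tests by substituting a seeded generator.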
S200: and acquiring a user speech segment in the user reply speech data, dividing the user speech segment into a plurality of sub-speech segments with preset time length, and distributing sub-speech segment identifiers.
The server intercepts the reply voice data and divides the user speech segment within it into sub-speech segments of a preset length, assigning each an identifier. The preset length is relatively short, for example 3-5 seconds; that is, the user speech segment is divided into sub-speech segments 3-5 seconds long.
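A sketch of this division step, assuming fixed-length segments and single-letter identifiers (the patent's later example uses letters A, B, C; the exact identifier scheme is not specified):

```python
def split_into_subsegments(total_seconds, segment_seconds=4.0):
    """Divide a user speech segment into sub-segments of a preset length
    (e.g. 3-5 s) and assign each an identifier A, B, C, ...
    (letter identifiers are an assumption; breaks past 26 segments)."""
    segments = []
    start, idx = 0.0, 0
    while start < total_seconds:
        end = min(start + segment_seconds, total_seconds)
        segments.append({"id": chr(ord("A") + idx), "start": start, "end": end})
        start, idx = end, idx + 1
    return segments

segs = split_into_subsegments(10.0, 4.0)  # sub-segments A, B, C
```

A real implementation would slice the audio samples themselves; here only the time boundaries and identifiers are computed.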
S300: and extracting the acoustic features of the sub-speech segments, and acquiring the emotion labels of the sub-speech segments according to the acoustic features.
The acoustic features include sound waves, signals, tones, and the like. The emotion labels include neutral, happy, sad, angry, surprised, fearful, averse, excited, and the like. Optionally, a window with a preset time interval may be set, and acoustic features collected at a fixed frequency to form an acoustic feature set, with the emotion label obtained from that set.
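The windowed feature collection can be sketched as below. The two features computed (RMS energy and zero-crossing rate) are stand-ins chosen for simplicity; a production system would extract the pitch, formant, and other features the patent names.

```python
import numpy as np

def windowed_features(signal, sr, win_seconds=0.5):
    """Collect simple acoustic features over fixed-length windows:
    RMS energy and zero-crossing rate per window."""
    win = int(sr * win_seconds)
    feats = []
    for i in range(0, len(signal) - win + 1, win):
        frame = signal[i:i + win]
        rms = float(np.sqrt(np.mean(frame ** 2)))
        # each sign change between adjacent samples counts as one crossing
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame)))) / 2)
        feats.append({"rms": rms, "zcr": zcr})
    return feats

sr = 8000
t = np.arange(sr) / sr                    # 1 second of audio
sig = 0.5 * np.sin(2 * np.pi * 220 * t)   # 220 Hz test tone
feats = windowed_features(sig, sr, 0.5)   # two 0.5 s windows
```

For a pure tone of amplitude 0.5 the per-window RMS is 0.5/√2, which gives a quick sanity check on the extraction.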
S400: and acquiring text data corresponding to each sub-speech section by adopting a speech recognition technology, linearly splicing the emotion label of each sub-speech section with the corresponding text data, and adding the sub-speech section label between the emotion label and the text data to obtain a speech recognition result of each sub-speech section.
Taking each sub-speech segment as the object of study, the emotion label of the sub-speech segment is linearly spliced with the corresponding text data. Linear splicing can be understood as an addition ("plus") operation: the two parts of data are joined together, and the sub-speech segment identifier is added between them so that the speech recognition result of each sub-speech segment can be distinguished accurately later. For example, if the text data corresponding to a certain sub-speech segment is "ok", its emotion label is "happy", and its identifier is A, the resulting speech recognition result is "happy A ok".
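The splicing step is a one-line string operation; the space-separated "tag identifier text" format here is an assumption for readability, since the patent only requires the identifier to sit between the emotion label and the text.

```python
def splice_result(emotion_tag, segment_id, text):
    """Linear splicing: join the emotion label and the recognized text,
    with the sub-speech segment identifier between them (format assumed)."""
    return f"{emotion_tag} {segment_id} {text}"

result = splice_result("happy", "A", "ok")  # -> "happy A ok"
```

Keeping the identifier inside the spliced string lets a later comparison step align each result with its standard counterpart even when results are stored as flat strings.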
S500: and comparing the voice recognition result of each sub-speech segment with the voice recognition result of each sub-speech segment carried in the preset standard voice recognition result in the selected application scene one by one according to the sub-speech segment identification, and counting the sub-speech segment occupation ratios with consistent voice recognition results to obtain the accuracy of the voice recognition result in the selected application scene.
The standard speech recognition results are derived from analyzing the dialogue script based on expert empirical data. They may also be written into the preset dialogue script database; that is, the database stores, for each dialogue script file, the standard speech recognition result corresponding to each sub-speech segment, where each standard result carries the text data of the sub-speech segment, the sub-speech segment identifier, and the corresponding emotion label. The user reply voice data for the preset dialogue script in each application scenario comprises a plurality of sub-speech segments. The number of sub-speech segments whose speech recognition results are consistent with the standard speech recognition results for the selected application scenario is counted, and the ratio of that number to the total number of sub-speech segments in the user reply voice data is calculated; this ratio is the accuracy of the speech recognition results in the selected application scenario. For example, suppose there are currently three sub-speech segments (in practice there are far more) whose speech recognition results are "happy A hello", "neutral B don't want it", and "aversion C goodbye", while the corresponding standard speech recognition results are "neutral A hello", "neutral B don't want it", and "aversion C goodbye". Segments B and C match, so the accuracy of the speech recognition results in the selected application scenario is 66.7%.
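The counting step reduces to an exact-match comparison keyed on the sub-speech segment identifier. A minimal sketch, with the patent's three-segment example (one emotion-label mismatch) as input:

```python
def recognition_accuracy(results, standards):
    """Compare each sub-segment's (emotion, text) result with the standard
    result sharing its identifier; return the fraction that match exactly."""
    matched = sum(1 for seg_id, res in results.items()
                  if standards.get(seg_id) == res)
    return matched / len(results)

# Segment A's emotion label differs from the standard; B and C match.
results = {"A": ("happy", "hello"),
           "B": ("neutral", "don't want it"),
           "C": ("aversion", "goodbye")}
standards = {"A": ("neutral", "hello"),
             "B": ("neutral", "don't want it"),
             "C": ("aversion", "goodbye")}
acc = recognition_accuracy(results, standards)  # 2/3, i.e. about 66.7%
```

Storing results as (emotion, text) pairs keyed by identifier, rather than flat strings, makes the "consistent" test unambiguous.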
Optionally, after testing the accuracy of the speech recognition and emotion tag in the currently selected application scenario, a new application scenario may be reselected for verification, and the speech recognition result testing process may be repeated.
According to the speech recognition result testing method, user reply voice data based on a preset dialogue script in any application scenario is randomly selected; the user speech segment in the user reply voice data is divided into a plurality of sub-speech segments of preset length; the acoustic features of each sub-speech segment are extracted; the emotion label of each sub-speech segment is obtained according to the acoustic features; the emotion label is linearly spliced with the corresponding text data, with a sub-speech segment identifier added; the speech recognition result corresponding to each sub-speech segment is compared with the standard speech recognition result; and the proportion of sub-speech segments whose speech recognition results are consistent is counted, so that the accuracy of speech recognition results in the selected application scenario can be verified efficiently and accurately.
As shown in fig. 2, in one embodiment, step S300 includes:
s320: and extracting the acoustic features of each sub-speech segment.
S340: and inputting the extracted acoustic features into a trained neural network model based on deep learning to obtain the emotion label.
The acoustic features can be further divided into time construction features, amplitude construction features, fundamental frequency construction features, and formant construction features; the correspondence between these features and the emotion labels is obtained through training in the trained deep-learning-based neural network model.
As shown in fig. 3, in one embodiment, step S300 further includes:
s312: and acquiring reply voice sample data corresponding to different emotion labels.
S314: and extracting time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics in the reply voice sample data.
S316: and training a neural network model based on deep learning by taking the emotion labels in the reply voice sample data and the corresponding time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics as training data to obtain the trained neural network model based on deep learning.
When an emotion label needs to be acquired, the extracted acoustic feature data is input into the trained model, which serves as the emotion label recognition model, to obtain the emotion label corresponding to the utterance; the emotion label is then integrated with the reply data to obtain the speech recognition result.
In one embodiment, training the deep-learning-based neural network model to obtain the trained model includes: extracting the emotion labels in the training data together with the corresponding time construction features, amplitude construction features, fundamental frequency construction features, and formant construction features; learning local emotion labels through the convolutional-neural-network part of the deep-learning-based network according to the extracted feature data; and abstracting the local emotion labels through the recurrent-neural-network part and learning a global emotion label through a pooling layer of the network, to obtain the trained deep-learning-based neural network model.
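The convolution-then-recurrence-then-pooling structure can be illustrated with a small NumPy forward pass. All layer sizes are invented for the sketch (the patent specifies none), and the random weights stand in for trained parameters; the point is only the data flow: the convolution captures local patterns, the recurrent pass abstracts them over time, and mean pooling yields a global emotion score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: time steps, features, conv filters, RNN units, emotion classes.
T, F, C, H, K = 20, 12, 8, 16, 5

Wc = rng.standard_normal((3, F, C)) * 0.1  # width-3 conv kernel over time
Wx = rng.standard_normal((C, H)) * 0.1     # RNN input weights
Wh = rng.standard_normal((H, H)) * 0.1     # RNN recurrent weights
Wo = rng.standard_normal((H, K)) * 0.1     # output projection

def forward(x):
    """CNN part -> local patterns; RNN part -> temporal abstraction;
    pooling -> global (utterance-level) emotion distribution."""
    conv = np.stack([np.tanh(np.einsum("wf,wfc->c", x[t:t + 3], Wc))
                     for t in range(T - 2)])          # (T-2, C)
    h, states = np.zeros(H), []
    for c in conv:                                    # simple recurrent pass
        h = np.tanh(c @ Wx + h @ Wh)
        states.append(h)
    pooled = np.mean(states, axis=0)                  # mean pooling over time
    logits = pooled @ Wo
    p = np.exp(logits - logits.max())                 # softmax over K emotions
    return p / p.sum()

probs = forward(rng.standard_normal((T, F)))          # shape (K,), sums to 1
```

A trained implementation would learn Wc, Wx, Wh, and Wo by backpropagation against the emotion labels in the reply voice sample data.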
In one embodiment, extracting the acoustic features of each sub-speech segment and obtaining the emotion label of each sub-speech segment according to the acoustic features includes: obtaining the emotion label according to the extracted acoustic features of the sub-speech segment and an acoustic-feature qualitative analysis table corresponding to the preset emotion labels. The table carries the emotion labels, the acoustic features, and the qualitative analysis interval data of the acoustic features corresponding to the different emotion labels; the acoustic features include speech rate, average fundamental frequency, fundamental frequency range, intensity, voice quality, fundamental frequency change, and clarity.
Different emotion labels correspond to different qualitative analysis intervals of the acoustic features. The qualitative analysis intervals may be divided in advance into several levels according to the type of acoustic feature; for example, speech rate may be divided into very fast, slightly fast, slightly slow, and very slow. More specifically, qualitative analysis is performed on the conditions corresponding to each candidate emotion label, including speech rate, average fundamental frequency, fundamental frequency range, intensity, voice quality, fundamental frequency change, and clarity, to obtain a qualitative analysis result; the emotion label is then obtained from the currently extracted acoustic features of each sub-speech segment and the corresponding qualitative analysis results. Furthermore, emotion label feature templates may be constructed from the qualitative analysis results for the different emotion labels; when emotion label recognition is required, the collected features are matched against the templates to determine the emotion label. In practical applications, the speech-rate levels can be determined by obtaining, from historical sample data, the average number of words per unit time corresponding to different emotion labels, and setting word-count intervals per unit time according to the relative speech rates of those labels.
Judgments for average fundamental frequency, fundamental frequency range, intensity, voice quality, fundamental frequency change, and clarity can be made in a similar way, by drawing judgment intervals from sample data and relative relationships. For the average fundamental frequency, analyzed from the collected sound data, the qualitative levels include very high, slightly high, slightly low, and very low; the fundamental frequency range includes very wide and narrow; the intensity includes normal, higher, and lower; the voice quality includes irregular voicing, breathy, resonant, and breathy with a loud chest tone; the fundamental frequency change includes normal, abrupt changes on stressed syllables, downward inflections, smooth upward inflections, and extreme downward inflections; and the clarity includes precise, tense, slurred, and normal. These correspondences are specified in a table (not reproduced here).
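The template-matching variant described above can be sketched as a dictionary lookup. The qualitative level values in the templates are invented for demonstration; the patent's actual correspondence table is not reproduced in the text.

```python
# Illustrative emotion-label feature templates (values are assumptions).
TEMPLATES = {
    "anger":   {"speech_rate": "slightly fast", "intensity": "higher",
                "pitch_range": "very wide"},
    "sadness": {"speech_rate": "slightly slow", "intensity": "lower",
                "pitch_range": "narrow"},
    "neutral": {"speech_rate": "normal", "intensity": "normal",
                "pitch_range": "narrow"},
}

def match_emotion(observed):
    """Pick the emotion whose template shares the most qualitative
    feature values with the observed qualitative analysis result."""
    def score(tmpl):
        return sum(1 for k, v in tmpl.items() if observed.get(k) == v)
    return max(TEMPLATES, key=lambda e: score(TEMPLATES[e]))

label = match_emotion({"speech_rate": "slightly fast",
                       "intensity": "higher",
                       "pitch_range": "very wide"})
```

This rule-based path trades the trained model's flexibility for transparency: every decision can be traced back to a row of the qualitative analysis table.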
in one embodiment, after verifying the accuracy of speech recognition and emotion tag in the selected application scenario, the method further includes: and delaying preset time, and returning to the step of randomly selecting the user reply voice data based on the preset dialogues under any application scene.
In addition to speech recognition testing under normal conditions, targeted testing under noise can be performed: user reply voice data based on the preset dialogue script is collected in a noisy environment for the selected application scenario, and the test process above is repeated with the collected data as the test input. Furthermore, the speech recognition effect under long-distance (far-field) conditions can be tested simply by using user reply voice data collected at a distance as the test data and repeating the test process.
It should be understood that although the various steps in the flow diagrams of figs. 1-3 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited in order and may be performed in other orders. Moreover, at least some of the steps in figs. 1-3 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and the order of performance of these sub-steps or stages is not necessarily sequential; they may be performed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
As shown in fig. 4, a speech recognition result testing apparatus includes:
the data acquisition module 100 is configured to randomly select user reply voice data based on a preset dialogue script in any application scenario;
a dividing module 200, configured to obtain a user speech segment in the user reply speech data, divide the user speech segment into multiple sub-speech segments with preset time lengths, and allocate sub-speech segment identifiers;
the feature extraction module 300 is configured to extract acoustic features of each sub-speech segment, and obtain emotion tags of each sub-speech segment according to the acoustic features;
the splicing and combining module 400 is configured to acquire text data corresponding to each sub-session by using a speech recognition technology, linearly splice the emotion tag of each sub-session with the corresponding text data, and add the sub-session identifier between the emotion tag and the text data to obtain a speech recognition result of each sub-session;
and the test module 500 is configured to compare the voice recognition result of each sub-speech segment with the voice recognition result of each sub-speech segment carried in the preset standard voice recognition result in the selected application scene one by one according to the sub-speech segment identifier, count the sub-speech segment occupation ratios with the consistent voice recognition result, and obtain the accuracy of the voice recognition result in the selected application scene.
The speech recognition result testing apparatus randomly selects user reply voice data based on a preset dialogue script in any application scenario, divides the user speech segment in that data into multiple sub-speech segments of a preset time length, extracts the acoustic features of each sub-speech segment, obtains each segment's emotion label from those features, linearly splices each emotion label with the corresponding recognised text and adds the sub-speech segment identifier, compares the resulting per-segment speech recognition result with the standard speech recognition result, and counts the proportion of sub-speech segments with consistent results, so that the accuracy of the speech recognition result in the selected application scenario can be verified effectively and accurately.
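The division and splicing performed by modules 200 and 400 can be illustrated in plain Python. The `seg-000` identifier scheme and the `|` separator below are assumptions for the example; the patent does not fix a concrete format:

```python
def split_into_segments(samples, sample_rate, seg_seconds=2.0):
    """Divide a speech signal into fixed-length sub-segments and assign identifiers."""
    seg_len = int(sample_rate * seg_seconds)
    chunks = [samples[i:i + seg_len] for i in range(0, len(samples), seg_len)]
    return [(f"seg-{idx:03d}", chunk) for idx, chunk in enumerate(chunks)]

def splice_result(emotion_label, segment_id, text):
    """Linearly splice: emotion label, then segment identifier, then recognised text."""
    return f"{emotion_label}|{segment_id}|{text}"
```

Because the identifier sits between the label and the text, each per-segment result string is self-describing and can later be matched against the standard result by identifier alone.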
In one embodiment, the feature extraction module 300 is further configured to extract acoustic features of each sub-speech segment; and inputting the extracted acoustic features into a trained neural network model based on deep learning to obtain the emotion label.
In one embodiment, the feature extraction module 300 is further configured to obtain reply voice sample data corresponding to different emotion labels; extract the time construction features, amplitude construction features, fundamental frequency construction features, and formant construction features in the reply voice sample data; and train a deep-learning-based neural network model with the emotion labels in the reply voice sample data and the corresponding time, amplitude, fundamental frequency, and formant construction features as training data, obtaining the trained deep-learning-based neural network model.
In one embodiment, the feature extraction module 300 is further configured to extract the emotion labels in the training data and the corresponding time, amplitude, fundamental frequency, and formant construction features; train the convolutional neural network part of the deep-learning-based neural network to learn local emotion labels from the extracted feature data; and abstract the local emotion labels through the recurrent neural network part, learning the global emotion label through the pooling layer of the deep-learning-based neural network to obtain the trained deep-learning-based neural network model.
In one embodiment, the feature extraction module 300 is further configured to obtain the emotion label according to the extracted acoustic features of each sub-speech segment and an acoustic feature qualitative analysis table corresponding to preset emotion labels; the table carries the emotion labels, the acoustic features, and the qualitative analysis interval data of the acoustic features corresponding to different emotion labels, where the acoustic features include speech rate, average fundamental frequency, fundamental frequency range, intensity, voice quality, fundamental frequency change, and clarity.
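A table-driven lookup of this kind might be sketched as follows. The interval values and the two features shown are invented placeholders, not the qualitative analysis data of the patent:

```python
# Hypothetical qualitative-analysis table: for each emotion label, a (low, high)
# interval per acoustic feature. All numbers here are illustrative only.
QUALITATIVE_TABLE = {
    "angry":   {"speech_rate": (4.5, 8.0), "mean_f0": (220.0, 400.0)},
    "neutral": {"speech_rate": (2.0, 4.5), "mean_f0": (100.0, 220.0)},
}

def label_from_table(features, table=QUALITATIVE_TABLE):
    """Pick the emotion label whose qualitative intervals match the most features."""
    def score(intervals):
        # A missing feature compares as NaN, which falls outside every interval.
        return sum(low <= features.get(name, float("nan")) <= high
                   for name, (low, high) in intervals.items())
    return max(table, key=lambda label: score(table[label]))
```

A real table would cover all seven features named above (speech rate, average fundamental frequency, fundamental frequency range, intensity, voice quality, fundamental frequency change, clarity) and more labels.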
In one embodiment, the speech recognition result testing apparatus further includes a loop testing module configured to delay a preset time and control the data acquisition module 100, the dividing module 200, the feature extraction module 300, the splicing and combining module 400, and the test module 500 to execute their corresponding operations.
For the specific definition of the speech recognition result testing apparatus, reference may be made to the definition of the speech recognition result testing method above, which is not repeated here. All or part of the modules in the apparatus may be implemented by software, hardware, or a combination thereof. The modules may be embedded in hardware form in, or independent of, a processor in the computer device, or stored in software form in a memory of the computer device, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server whose internal structure may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and computer program in the non-volatile storage medium. The database of the computer device is used for storing preset dialogue scripts and historical expert data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a speech recognition result testing method.
Those skilled in the art will appreciate that the architecture shown in fig. 5 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply; a particular computing device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
randomly selecting user reply voice data based on a preset dialogue script in any application scenario;
acquiring a user speech segment in the user reply voice data, dividing the user speech segment into multiple sub-speech segments of a preset time length, and allocating sub-speech segment identifiers;
extracting the acoustic features of each sub-speech segment, and obtaining the emotion label of each sub-speech segment according to the acoustic features;
acquiring the text data corresponding to each sub-speech segment using speech recognition technology, linearly splicing the emotion label of each sub-speech segment with the corresponding text data, and adding the sub-speech segment identifier between the emotion label and the text data to obtain the speech recognition result of each sub-speech segment;
and comparing the speech recognition result of each sub-speech segment, by sub-speech segment identifier, one by one against the result for that segment carried in the preset standard speech recognition result for the selected application scenario, and counting the proportion of sub-speech segments with consistent results to obtain the accuracy of the speech recognition result in the selected application scenario.
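The final comparison step — match per-segment results to the standard result by identifier and take the proportion that agree — reduces to a few lines. Keying both sides by segment identifier in dictionaries is an assumption about the data layout for this sketch:

```python
def recognition_accuracy(results, standard):
    """Compare per-segment speech recognition results against the standard result
    by sub-speech segment identifier; return the proportion that match exactly."""
    if not standard:
        return 0.0
    matches = sum(results.get(seg_id) == expected
                  for seg_id, expected in standard.items())
    return matches / len(standard)
```

A segment missing from `results` simply fails to match, so incomplete output lowers the accuracy rather than raising an error.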
In one embodiment, the processor, when executing the computer program, further performs the steps of:
extracting acoustic features of each sub-speech segment; and inputting the extracted acoustic features into a trained neural network model based on deep learning to obtain the emotion label.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
acquiring reply voice sample data corresponding to different emotion tags; extracting time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics in the reply voice sample data; and training a neural network model based on deep learning by taking the emotion labels in the reply voice sample data and the corresponding time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics as training data to obtain the trained neural network model based on deep learning.
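The four feature groups named above (time, amplitude, fundamental frequency, and formant construction features) are not specified in detail. As a hedged sketch in the same spirit, a few rough numpy descriptors — the autocorrelation F0 estimate and the spectral-peak "formant proxy" are simplifications for illustration, not the patent's method:

```python
import numpy as np

def basic_acoustic_features(x, sample_rate):
    """Rough per-segment descriptors: time (duration), amplitude (RMS, range),
    fundamental frequency (autocorrelation estimate), and a crude dominant
    spectral peak standing in for formant structure."""
    feats = {
        "duration_s": len(x) / sample_rate,
        "rms": float(np.sqrt(np.mean(x ** 2))),
        "amp_range": float(np.max(x) - np.min(x)),
    }
    # Autocorrelation-based F0 estimate, searched in a typical voice range (60-400 Hz).
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    lo, hi = sample_rate // 400, sample_rate // 60
    lag = lo + int(np.argmax(ac[lo:hi]))
    feats["f0_hz"] = sample_rate / lag
    # Dominant spectral peak (very rough formant proxy).
    spec = np.abs(np.fft.rfft(x))
    feats["peak_hz"] = float(np.fft.rfftfreq(len(x), 1 / sample_rate)[np.argmax(spec)])
    return feats
```

Feature vectors of this kind, paired with their emotion labels, would form the training data described above.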
In one embodiment, the processor, when executing the computer program, further performs the steps of:
extracting the emotion labels in the training data and the corresponding time construction features, amplitude construction features, fundamental frequency construction features, and formant construction features; training the convolutional neural network part of the deep-learning-based neural network to learn local emotion labels from the extracted feature data; abstracting the local emotion labels through the recurrent neural network part, and learning the global emotion label through the pooling layer of the deep-learning-based neural network to obtain the trained deep-learning-based neural network model.
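As an untrained, toy-scale sketch of this architecture — convolution for local emotion evidence, a recurrent pass to abstract it through time, then pooling for the global label — the shapes, random weights, and label set below are all assumptions; a real model would be trained by backpropagation on the construction features described above:

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT_DIM, N_KERNELS, HIDDEN = 4, 6, 8
LABELS = ["neutral", "happy", "angry"]
params = {
    "kernels": rng.standard_normal((N_KERNELS, 3, FEAT_DIM)) * 0.1,  # width-3 conv
    "w_in": rng.standard_normal((N_KERNELS, HIDDEN)) * 0.1,
    "w_rec": rng.standard_normal((HIDDEN, HIDDEN)) * 0.1,
    "w_out": rng.standard_normal((HIDDEN, len(LABELS))) * 0.1,
}

def conv1d(x, kernels):
    """CNN part: detect local patterns in the frame-level feature sequence."""
    n_k, width, _ = kernels.shape
    out = np.array([[np.sum(x[t:t + width] * k) for k in kernels]
                    for t in range(len(x) - width + 1)])
    return np.maximum(out, 0.0)  # ReLU -> (frames', n_k)

def rnn(seq, w_in, w_rec):
    """Recurrent part: carry local evidence through time."""
    h, states = np.zeros(w_rec.shape[0]), []
    for frame in seq:
        h = np.tanh(frame @ w_in + h @ w_rec)
        states.append(h)
    return np.array(states)

def predict_emotion(x, params=params, labels=LABELS):
    local = conv1d(x, params["kernels"])            # local emotion evidence
    states = rnn(local, params["w_in"], params["w_rec"])
    pooled = states.mean(axis=0)                    # pooling: global summary
    logits = pooled @ params["w_out"]
    probs = np.exp(logits - logits.max())
    return labels[int(np.argmax(probs))], probs / probs.sum()
```

Mean pooling over the recurrent states is one simple way to turn the per-frame (local) evidence into a single global emotion decision for the whole sub-speech segment.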
In one embodiment, the processor, when executing the computer program, further performs the steps of:
obtaining the emotion label according to the extracted acoustic features of each sub-speech segment and an acoustic feature qualitative analysis table corresponding to preset emotion labels; the table carries the emotion labels, the acoustic features, and the qualitative analysis interval data of the acoustic features corresponding to different emotion labels, where the acoustic features include speech rate, average fundamental frequency, fundamental frequency range, intensity, voice quality, fundamental frequency change, and clarity.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
delaying a preset time and returning to the step of randomly selecting user reply voice data based on a preset dialogue script in any application scenario.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
randomly selecting user reply voice data based on a preset dialogue script in any application scenario;
acquiring a user speech segment in the user reply voice data, dividing the user speech segment into multiple sub-speech segments of a preset time length, and allocating sub-speech segment identifiers;
extracting the acoustic features of each sub-speech segment, and obtaining the emotion label of each sub-speech segment according to the acoustic features;
acquiring the text data corresponding to each sub-speech segment using speech recognition technology, linearly splicing the emotion label of each sub-speech segment with the corresponding text data, and adding the sub-speech segment identifier between the emotion label and the text data to obtain the speech recognition result of each sub-speech segment;
and comparing the speech recognition result of each sub-speech segment, by sub-speech segment identifier, one by one against the result for that segment carried in the preset standard speech recognition result for the selected application scenario, and counting the proportion of sub-speech segments with consistent results to obtain the accuracy of the speech recognition result in the selected application scenario.
In one embodiment, the computer program when executed by the processor further performs the steps of:
extracting acoustic features of each sub-speech segment; and inputting the extracted acoustic features into a trained neural network model based on deep learning to obtain the emotion label.
In one embodiment, the computer program when executed by the processor further performs the steps of:
acquiring reply voice sample data corresponding to different emotion labels; extracting time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics in the reply voice sample data; and training a neural network model based on deep learning by taking the emotion labels in the reply voice sample data and the corresponding time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics as training data to obtain the trained neural network model based on deep learning.
In one embodiment, the computer program when executed by the processor further performs the steps of:
extracting the emotion labels in the training data and the corresponding time construction features, amplitude construction features, fundamental frequency construction features, and formant construction features; training the convolutional neural network part of the deep-learning-based neural network to learn local emotion labels from the extracted feature data; abstracting the local emotion labels through the recurrent neural network part, and learning the global emotion label through the pooling layer of the deep-learning-based neural network to obtain the trained deep-learning-based neural network model.
In one embodiment, the computer program when executed by the processor further performs the steps of:
obtaining the emotion label according to the extracted acoustic features of each sub-speech segment and an acoustic feature qualitative analysis table corresponding to preset emotion labels; the table carries the emotion labels, the acoustic features, and the qualitative analysis interval data of the acoustic features corresponding to different emotion labels, where the acoustic features include speech rate, average fundamental frequency, fundamental frequency range, intensity, voice quality, fundamental frequency change, and clarity.
In one embodiment, the computer program when executed by the processor further performs the steps of:
delaying a preset time and returning to the step of randomly selecting user reply voice data based on a preset dialogue script in any application scenario.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by instructing relevant hardware through a computer program; the computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of technical features, it should be considered within the scope of this specification.
The above examples express only several embodiments of the present application, and their description is relatively specific and detailed, but they are not to be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.
Claims (10)
1. A method of testing speech recognition results, the method comprising:
randomly selecting user reply voice data based on a preset dialogue script in any application scenario;
acquiring a user speech segment in the user reply speech data, dividing the user speech segment into a plurality of sub-speech segments with preset time length, and distributing sub-speech segment identifiers;
extracting the acoustic features of each sub-speech segment, and acquiring the emotion label of each sub-speech segment according to the acoustic features;
acquiring text data corresponding to each sub-speech segment by adopting a speech recognition technology, linearly splicing the emotion label of each sub-speech segment with the corresponding text data, and adding the sub-speech segment identifier between the emotion label and the text data to obtain a speech recognition result of each sub-speech segment; and comparing the voice recognition result of each sub-speech segment with the voice recognition result of each sub-speech segment carried in the preset standard voice recognition result in the selected application scene one by one according to the sub-speech segment identification, and counting the proportion of the sub-speech segments with consistent voice recognition results to obtain the accuracy of the voice recognition result in the selected application scene.
2. The method of claim 1, wherein the extracting the acoustic features of the sub-speech segments, and the obtaining the emotion label of each sub-speech segment according to the acoustic features comprises:
extracting acoustic features of each sub-speech segment;
and inputting the extracted acoustic features into a trained neural network model based on deep learning to obtain the emotion label.
3. The method of claim 2, further comprising:
acquiring reply voice sample data corresponding to different emotion labels;
extracting time construction features, amplitude construction features, fundamental frequency construction features and formant construction features in the reply voice sample data;
and training a deep learning-based neural network model by taking the emotion labels in the reply voice sample data and the corresponding time construction characteristics, amplitude construction characteristics, fundamental frequency construction characteristics and formant construction characteristics as training data to obtain the trained deep learning-based neural network model.
4. The method of claim 3, wherein training the deep learning based neural network model, and wherein obtaining the trained deep learning based neural network model comprises:
extracting the emotion labels in the training data and the corresponding time construction features, amplitude construction features, fundamental frequency construction features, and formant construction features;
training the convolutional neural network part of the deep-learning-based neural network to learn local emotion labels according to the extracted feature data;
abstracting the local emotion labels through the recurrent neural network part of the deep-learning-based neural network, and learning the global emotion label through the pooling layer to obtain the trained deep-learning-based neural network model.
5. The method of claim 1, wherein the extracting the acoustic features of each sub-speech segment, and the obtaining the emotion label of each sub-speech segment according to the acoustic features comprises:
obtaining the emotion label according to the extracted acoustic features of each sub-speech segment and an acoustic feature qualitative analysis table corresponding to preset emotion labels;
wherein the acoustic feature qualitative analysis table carries the emotion labels, the acoustic features, and the qualitative analysis interval data of the acoustic features corresponding to different emotion labels, and the acoustic features include speech rate, average fundamental frequency, fundamental frequency range, intensity, voice quality, fundamental frequency change, and clarity.
6. The method of claim 1, after obtaining the accuracy of the speech recognition result in the selected application scenario, further comprising:
delaying a preset time, and returning to the step of randomly selecting user reply voice data based on the preset dialogue script in any application scenario.
7. A speech recognition result testing apparatus, comprising:
the data acquisition module is used for randomly selecting user reply voice data based on a preset dialogue script in any application scenario;
the dividing module is used for acquiring a user speech segment in the user reply speech data, dividing the user speech segment into a plurality of sub-speech segments with preset time length and distributing sub-speech segment identifiers;
the feature extraction module is used for extracting the acoustic features of the sub-speech segments and acquiring the emotion labels of the sub-speech segments according to the acoustic features;
the splicing and combining module is used for acquiring the text data corresponding to each sub-speech segment using speech recognition technology, linearly splicing the emotion label of each sub-speech segment with the corresponding text data, and adding the sub-speech segment identifier between the emotion label and the text data to obtain the speech recognition result of each sub-speech segment;
and the test module is used for comparing the speech recognition result of each sub-speech segment, by sub-speech segment identifier, one by one against the result for that segment carried in the preset standard speech recognition result for the selected application scenario, and counting the proportion of sub-speech segments with consistent results to obtain the accuracy of the speech recognition result in the selected application scenario.
8. The apparatus of claim 7, wherein the feature extraction module is further configured to extract acoustic features of each sub-speech segment, and input the extracted acoustic features into a trained neural network model based on deep learning to obtain the emotion label.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910667054.6A CN110556098B (en) | 2019-07-23 | 2019-07-23 | Voice recognition result testing method and device, computer equipment and medium |
PCT/CN2019/116960 WO2021012495A1 (en) | 2019-07-23 | 2019-11-11 | Method and device for verifying speech recognition result, computer apparatus, and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910667054.6A CN110556098B (en) | 2019-07-23 | 2019-07-23 | Voice recognition result testing method and device, computer equipment and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110556098A CN110556098A (en) | 2019-12-10 |
CN110556098B true CN110556098B (en) | 2023-04-18 |
Family
ID=68735961
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910667054.6A Active CN110556098B (en) | 2019-07-23 | 2019-07-23 | Voice recognition result testing method and device, computer equipment and medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110556098B (en) |
WO (1) | WO2021012495A1 (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021134550A1 (en) * | 2019-12-31 | 2021-07-08 | 李庆远 | Manual combination and training of multiple speech recognition outputs |
CN111522943A (en) * | 2020-03-25 | 2020-08-11 | 平安普惠企业管理有限公司 | Automatic test method, device, equipment and storage medium for logic node |
CN112349290B (en) * | 2021-01-08 | 2021-04-20 | 北京海天瑞声科技股份有限公司 | Triple-based speech recognition accuracy rate calculation method |
Family Cites Families (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8595005B2 (en) * | 2010-05-31 | 2013-11-26 | Simple Emotion, Inc. | System and method for recognizing emotional state from a speech signal |
CN104464757B (en) * | 2014-10-28 | 2019-01-18 | 科大讯飞股份有限公司 | Speech evaluating method and speech evaluating device |
CN105741832B (en) * | 2016-01-27 | 2020-01-07 | 广东外语外贸大学 | Spoken language evaluation method and system based on deep learning |
US9870765B2 (en) * | 2016-06-03 | 2018-01-16 | International Business Machines Corporation | Detecting customers with low speech recognition accuracy by investigating consistency of conversation in call-center |
CN107767881B (en) * | 2016-08-15 | 2020-08-18 | 中国移动通信有限公司研究院 | Method and device for acquiring satisfaction degree of voice information |
CN106548772A (en) * | 2017-01-16 | 2017-03-29 | 上海智臻智能网络科技股份有限公司 | Speech recognition test system and method |
CN108538296A (en) * | 2017-03-01 | 2018-09-14 | 广东神马搜索科技有限公司 | Speech recognition test method and test terminal |
CN107086040B (en) * | 2017-06-23 | 2021-03-02 | 歌尔股份有限公司 | Voice recognition capability test method and device |
CN107452404A (en) * | 2017-07-31 | 2017-12-08 | 哈尔滨理工大学 | The method for optimizing of speech emotion recognition |
CN108777141B (en) * | 2018-05-31 | 2022-01-25 | 康键信息技术(深圳)有限公司 | Test apparatus, test method, and storage medium |
CN109272993A (en) * | 2018-08-21 | 2019-01-25 | 中国平安人寿保险股份有限公司 | Recognition methods, device, computer equipment and the storage medium of voice class |
- 2019-07-23 CN CN201910667054.6A patent/CN110556098B/en active Active
- 2019-11-11 WO PCT/CN2019/116960 patent/WO2021012495A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN110556098A (en) | 2019-12-10 |
WO2021012495A1 (en) | 2021-01-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110781916B (en) | Fraud detection method, apparatus, computer device and storage medium for video data | |
WO2021128741A1 (en) | Voice emotion fluctuation analysis method and apparatus, and computer device and storage medium | |
CN110120224B (en) | Method and device for constructing bird sound recognition model, computer equipment and storage medium | |
Shahin et al. | Novel cascaded Gaussian mixture model-deep neural network classifier for speaker identification in emotional talking environments | |
WO2020177380A1 (en) | Voiceprint detection method, apparatus and device based on short text, and storage medium | |
CN110556098B (en) | Voice recognition result testing method and device, computer equipment and medium | |
US10573307B2 (en) | Voice interaction apparatus and voice interaction method | |
CN104903954A (en) | Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination | |
CN111311327A (en) | Service evaluation method, device, equipment and storage medium based on artificial intelligence | |
CN109087667B (en) | Voice fluency recognition method and device, computer equipment and readable storage medium | |
CN111597818B (en) | Call quality inspection method, device, computer equipment and computer readable storage medium | |
CN109272993A (en) | Recognition methods, device, computer equipment and the storage medium of voice class | |
CN109658921B (en) | Voice signal processing method, equipment and computer readable storage medium | |
CN111080109A (en) | Customer service quality evaluation method and device and electronic equipment | |
CN111182162A (en) | Telephone quality inspection method, device, equipment and storage medium based on artificial intelligence | |
CN110797032B (en) | Voiceprint database establishing method and voiceprint identification method | |
CN111145733A (en) | Speech recognition method, speech recognition device, computer equipment and computer readable storage medium | |
US11238289B1 (en) | Automatic lie detection method and apparatus for interactive scenarios, device and medium | |
CN112863489B (en) | Speech recognition method, apparatus, device and medium | |
CN110136726A (en) | A kind of estimation method, device, system and the storage medium of voice gender | |
CN112232276A (en) | Emotion detection method and device based on voice recognition and image recognition | |
Heracleous et al. | Speech emotion recognition in noisy and reverberant environments | |
CN111565254B (en) | Call data quality inspection method and device, computer equipment and storage medium | |
Szekrényes et al. | Classification of formal and informal dialogues based on turn-taking and intonation using deep neural networks | |
Poorjam et al. | Quality control in remote speech data collection |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||