CN110211567A

CN110211567A - Voice recognition terminal evaluation system and method

Info

Publication number: CN110211567A
Application number: CN201910393143.6A
Authority: CN
Inventors: 傅蓉蓉; 刘毓伟; 李玮; 董千洲; 张小雨
Original assignee: China Academy of Information and Communications Technology CAICT
Current assignee: China Academy of Information and Communications Technology CAICT
Priority date: 2019-05-13
Filing date: 2019-05-13
Publication date: 2019-09-06

Abstract

The present invention provides a voice recognition terminal evaluation system and method. The system includes: a voice playing device for outputting a test voice corpus; a terminal under test for recognizing the test voice corpus under different test environments, including a noise test environment, to obtain a recognition result; a noise generating device for generating the noise required for testing; an image acquisition device for capturing an image of the recognition result to obtain a speech recognition image and sending the speech recognition image to a control device; and the control device, for converting a test text corpus into the test voice corpus by a speech synthesis method, performing image recognition on the speech recognition image based on a deep learning algorithm to obtain the recognition result, and comparing the recognition result with preset labeled data to obtain a comparison result, the comparison result being used to indicate the speech recognition performance of the terminal under test. The scheme uses automated testing, supports repeatability testing, and reduces labor cost by using functional tests whose results are compared by a deep learning algorithm.

Description

Voice recognition terminal evaluation system and method

Technical field

The present invention relates to the technical field of speech recognition, and in particular to a voice recognition terminal evaluation system and method.

Background

With the rapid development and growth of the Internet of Things, voice interaction has become one of the more mature ways of accessing it, and speech recognition has become one of the most popular technologies in the consumer technology market. As voice interaction technology matures and reaches the market, speech recognition testing is receiving more and more attention. Speech recognition testing is still at a developing stage, however: most tests must be completed manually, which gives poor test results and consumes labor.

Summary of the invention

Embodiments of the present invention provide a voice recognition terminal evaluation system and method, which solve the technical problems of poor test results and high labor cost caused by manual speech recognition testing in the prior art.

The voice recognition terminal evaluation system provided by an embodiment of the present invention includes: a control device, a terminal under test, an image acquisition device, a voice playing device and a noise generating device, wherein the control device is connected to the image acquisition device and the voice playing device;

wherein the control device is configured to: convert a test text corpus into a test voice corpus by a speech synthesis method;

the voice playing device is configured to: output the test voice corpus;

the terminal under test is configured to: recognize, under different test environments, the test voice corpus output by the voice playing device to obtain a recognition result, the different test environments including a noise test environment;

the noise generating device is configured to: generate the noise required for testing in the noise test environment;

the image acquisition device is configured to: capture an image of the recognition result to obtain a speech recognition image, and send the speech recognition image to the control device;

the control device is further configured to: perform image recognition on the speech recognition image based on a deep learning algorithm to obtain the recognition result, and compare the recognition result with preset labeled data to obtain a comparison result, the comparison result being used to indicate the speech recognition performance of the terminal under test.

The voice recognition terminal evaluation method further provided by an embodiment of the present invention includes:

a control device converts a test text corpus into a test voice corpus by a speech synthesis method;

a voice playing device outputs the test voice corpus;

a noise generating device generates the noise required for testing in a noise test environment;

a terminal under test recognizes, under different test environments, the test voice corpus output by the voice playing device to obtain a recognition result, the different test environments including the noise test environment;

an image acquisition device captures an image of the recognition result to obtain a speech recognition image, and sends the speech recognition image to the control device;

the control device performs image recognition on the speech recognition image based on a deep learning algorithm to obtain the recognition result, and compares the recognition result with preset labeled data to obtain a comparison result, the comparison result being used to indicate the speech recognition performance of the terminal under test;

wherein the control device is connected to the image acquisition device and the voice playing device.

An embodiment of the present invention further provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method described above when executing the computer program.

An embodiment of the present invention further provides a computer-readable storage medium storing a computer program for executing the method described above.

In an embodiment of the present invention, the control device converts a test text corpus into a test voice corpus by a speech synthesis method; the voice playing device outputs the test voice corpus; the noise generating device generates the noise required for testing in the noise test environment; the terminal under test recognizes, under different test environments including the noise test environment, the test voice corpus output by the voice playing device to obtain a recognition result; the image acquisition device captures an image of the recognition result to obtain a speech recognition image; and the control device performs image recognition on the speech recognition image based on a deep learning algorithm to obtain the recognition result, and compares the recognition result with preset labeled data to obtain a comparison result that indicates the speech recognition performance of the terminal under test. Compared with the prior art, the present invention uses automated testing, supports repeatability testing, and reduces labor cost by using functional tests whose results are compared by a deep learning algorithm.

Brief description of the drawings

In order to explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a structural block diagram of a voice recognition terminal evaluation system provided by an embodiment of the present invention;

Fig. 2 is a schematic diagram of device placement positions provided by an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are obviously only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

In an embodiment of the present invention, a voice recognition terminal evaluation system is provided. As shown in Fig. 1, the system includes: a control device 4, a terminal under test 3, an image acquisition device 1, a voice playing device 5 and a noise generating device 6, wherein the control device 4 is connected to the image acquisition device 1 and the voice playing device 5;

wherein the control device 4 is configured to: convert a test text corpus into a test voice corpus by a speech synthesis method (TTS, i.e. Text To Speech);

the voice playing device 5 is configured to: output the test voice corpus;

the terminal under test 3 is configured to: recognize, under different test environments, the test voice corpus output by the voice playing device to obtain a recognition result, the different test environments including a noise test environment;

the noise generating device 6 is configured to: generate the noise required for testing in the noise test environment;

the image acquisition device 1 is configured to: capture an image of the recognition result to obtain a speech recognition image, and send the speech recognition image to the control device;

the control device 4 is further configured to: perform image recognition on the speech recognition image based on a deep learning algorithm to obtain the recognition result, and compare the recognition result with preset labeled data to obtain a comparison result, the comparison result being used to indicate the speech recognition performance of the terminal under test.

Here, the preset labeled data refers to an existing set of expected correct image data for the voice device under test, corresponding to the test corpus. Comparing the recognition result with the preset labeled data to obtain a comparison result means: applying a recognition algorithm to compare the images of the voice device under test captured during the test with the expected correct images, obtaining the comparison result and computing the accuracy rate.
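
The comparison and accuracy statistic described above can be illustrated with a minimal sketch. It assumes that the image recognition step has already turned each captured image into text, and that the labeled data is available as expected text; the function names `normalize` and `accuracy` and the sample sentences are hypothetical and are not part of the described system.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lower-case, apply NFKC normalization and drop whitespace and punctuation."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"[\W_]+", "", text)


def accuracy(recognized: list[str], expected: list[str]) -> float:
    """Fraction of test items whose recognized text matches the labeled text."""
    if len(recognized) != len(expected):
        raise ValueError("recognized and expected must have the same length")
    if not expected:
        return 0.0
    hits = sum(normalize(r) == normalize(e) for r, e in zip(recognized, expected))
    return hits / len(expected)


# Made-up sample data: text read from the captured images vs. labeled text.
recognized = ["Turn on the light.", "play some music", "what time is it"]
expected = ["turn on the light", "Play some music!", "What is the weather"]
print(accuracy(recognized, expected))
```

Exact matching after normalization is only one possible comparison rule; a deployed system might compare images directly, as the text describes.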

In an embodiment of the present invention, as shown in Fig. 1, the image acquisition device 1 may be a high-speed camera mounted on a bracket above the terminal under test 3. The voice playing device 5 may be an artificial mouth, the noise generating device 6 may be a high-fidelity loudspeaker, and the control device 4 may be a computer device.

As shown in Fig. 1, the terminal under test 3 and the control device 4 should be in the same wireless local area network 7. The wireless connection may include, but is not limited to, a 3G/4G connection, a WiFi connection, a Bluetooth connection, and other wireless connection types currently known or developed in the future.

As shown in Fig. 1, the system may further include a test bench 2, on which the terminal under test 3 is placed.

In an embodiment of the present invention, the noise generating device 6 may be set and adjusted manually so that it generates the required noise. Alternatively, the noise generating device 6 may be connected to the control device 4: noise generation parameters are set in the control device 4 and sent to the noise generating device 6, which then generates the corresponding noise according to those parameters. In this way the noise can be generated automatically.
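
As an illustration of sending noise generation parameters from the control device to the noise generating device, the following sketch serializes a small parameter record to JSON and decodes it again. The text does not specify which parameters exist or how they are transmitted, so the `NoiseParams` fields and the JSON encoding are assumptions made for the example.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class NoiseParams:
    scene: str         # hypothetical label, e.g. "office" or "mall"
    level_db: float    # hypothetical target sound level in dB
    duration_s: float  # hypothetical playback duration in seconds


def encode(params: NoiseParams) -> bytes:
    """Serialize the parameters so the control device can send them."""
    return json.dumps(asdict(params), sort_keys=True).encode("utf-8")


def decode(payload: bytes) -> NoiseParams:
    """Rebuild the parameters on the noise generating device side."""
    return NoiseParams(**json.loads(payload.decode("utf-8")))


message = encode(NoiseParams(scene="office", level_db=55.0, duration_s=30.0))
print(message)
print(decode(message))
```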

In an embodiment of the present invention, the terminal under test 3, the voice playing device 5 and the noise generating device 6 may be placed as shown in Fig. 2. The distance between the terminal under test 3 and the artificial mouth depends on the type of terminal: if the terminal under test 3 is a mobile phone, the horizontal distance may be kept at 50 cm; if it is a smart speaker, the horizontal distance to the artificial mouth may be 3 m and 5 m. Depending on the test requirements, if a particular environment is required (shopping mall, kindergarten, office, etc.), the noise generating device 6 may be placed at a horizontal distance of 1.5 m from the terminal under test 3. A smart speaker is generally equipped with noise-suppressing microphones, so when the terminal under test 3 is a smart speaker, the angle between the voice playing device 5 and the noise generating device 6 may be set to 45°, 90°, 135° or 180°, and tests may be carried out at these angular positions on both the left and right sides.
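
The angular placements can be turned into coordinates with a little trigonometry. The sketch below assumes a coordinate frame that the text does not define: the terminal under test sits at the origin, the artificial mouth lies on the positive x axis, and the angle is measured at the terminal between the direction of the artificial mouth and the direction of the noise source. The function name `noise_positions` is hypothetical.

```python
import math


def noise_positions(distance_m: float = 1.5,
                    angles_deg: tuple[float, ...] = (45, 90, 135, 180)):
    """Return (side, angle, x, y) tuples for the noise source.

    Assumed frame: terminal under test at (0, 0), artificial mouth on the
    positive x axis; "left" is counter-clockwise, "right" is clockwise.
    """
    positions = []
    for angle in angles_deg:
        for side, sign in (("left", 1), ("right", -1)):
            theta = math.radians(sign * angle)
            x = distance_m * math.cos(theta)
            y = distance_m * math.sin(theta)
            positions.append((side, angle, round(x, 3), round(y, 3)))
    return positions


for row in noise_positions():
    print(row)
```

At 180° the left and right positions coincide, so a test plan built this way has seven distinct noise positions rather than eight.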

In an embodiment of the present invention, because terminals under test 3 differ (they may be smart voice devices, smart televisions, Bluetooth headsets, smartphones, smart speakers, etc.), their test requirements also differ, as shown in Table 1.

Table 1

On this basis, test dimensions can be set when carrying out a speech recognition test. Specifically, the control device is further configured to: set multiple test dimensions according to the characteristics of the terminal under test;

perform image recognition on the speech recognition image according to the multiple test dimensions to obtain multiple recognition results corresponding to the multiple test dimensions; compare these recognition results with the corresponding preset labeled data to obtain multiple comparison results corresponding to the multiple test dimensions; and statistically analyze these comparison results to obtain a statistical analysis result, the statistical analysis result being used to indicate the speech recognition performance of the terminal under test.

In an embodiment of the present invention, the test corpus in the control device 4 should be designed so that the test speech is consistent with real application scenarios. The user factors that a voice recognition terminal may face include, but are not limited to, different users, language category, accent, pronunciation, speaking rate, vocabulary, context, distance and noise environment, and each of these factors should be fully considered when designing the corpus. The corpus content should also be selected according to the requirements of the object under test. Voice recognition terminals are mostly used in the home, in vehicles and in public places, and the content may cover daily life and entertainment, business meetings, and so on. The specific standard corpus set is shown in Table 2.

Table 2

On this basis, when a speech recognition test is carried out, the control device selects multiple test text corpora, which form the test corpus set. For example, the test corpus set may be chosen from the designed standard corpus set taking gender, region and language into account.
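
Choosing a test corpus set by gender, region and language amounts to filtering the standard corpus set. The sketch below shows one way this could look; the `CorpusItem` fields, the label values and the sample sentences are hypothetical, since the contents of Table 2 are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusItem:
    text: str
    gender: str    # speaker gender used for synthesis
    region: str    # accent or region label
    language: str  # language label


def select_corpus(standard_set, *, gender=None, region=None, language=None):
    """Keep the items that match every criterion that is not None."""
    wanted = {"gender": gender, "region": region, "language": language}
    return [
        item for item in standard_set
        if all(value is None or getattr(item, field) == value
               for field, value in wanted.items())
    ]


# Made-up standard corpus set.
standard_set = [
    CorpusItem("Play some music", "female", "north", "zh"),
    CorpusItem("What is the weather today", "male", "south", "zh"),
    CorpusItem("Set an alarm for seven", "female", "south", "en"),
]
test_set = select_corpus(standard_set, gender="female", language="zh")
print([item.text for item in test_set])
```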

For example, according to the test requirements of the built-in voice assistant of a smartphone, four test dimensions may be chosen as test items: voice wake-up, semantic understanding, user delivery rate and service coverage.

The system carries out the speech recognition test as follows:

the control device is configured to: convert multiple test text corpora into multiple test voice corpora by a speech synthesis method;

the voice playing device (which may be a dummy head) is configured to: output the multiple test voice corpora in sequence;

the terminal under test (which may be a smartphone) is configured to: recognize in sequence, under different test environments, the multiple test voice corpora output by the voice playing device to obtain multiple recognition results, the different test environments including a noise test environment.

The noise generating device (which may be a high-fidelity sound system) is configured to: generate the noise required for testing in the noise test environment.

For example, according to the characteristics of the device under test, three test environments are provided: a quiet environment and two background-noise environments (shopping mall and office). The horizontal distance between the dummy head and the smartphone is 50 cm, and the specific placement is shown in Fig. 2. Because the microphone of the smartphone has no noise suppression function, there is no requirement on the angle between the high-fidelity sound system (noise) and the line through the dummy head and the smartphone.

The image acquisition device is configured to: capture images of the multiple recognition results to obtain multiple speech recognition images, and send the multiple speech recognition images to the control device;

the control device is further configured to: assign different weights to the multiple test dimensions according to the characteristics of the terminal under test (the weight parameters can be adjusted flexibly according to the device under test); perform image recognition on the multiple speech recognition images according to the multiple test dimensions to obtain multiple recognition results corresponding to the multiple test voice corpora and the multiple test dimensions; compare these recognition results with the corresponding preset labeled data to obtain multiple comparison results corresponding to the multiple test voice corpora and the multiple test dimensions, each comparison result being the number of successful recognitions of the test voice corpora; and obtain a statistical analysis result from the weights of the multiple test dimensions and the success counts corresponding to the multiple comparison results, the statistical analysis result being used to indicate the speech recognition performance of the terminal under test.

In some embodiments, taking the service coverage test dimension as an example, there are N test cases in total (i.e., test voice corpora). The control device outputs the test cases one by one through the artificial mouth, while the high-speed camera continuously photographs the feedback of the mobile phone voice assistant. The computer device compares the feedback result with the labeled data using an image algorithm and outputs the result. If the test case is completed, the test for that case ends; otherwise the artificial mouth continues to output the test case for a repeated test, until the test ends.
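
The play, capture, compare and repeat cycle described above can be sketched as a simple loop. This is an illustration only: the hardware is replaced by a stand-in function, the feedback is assumed to be available as text after image recognition, and the `max_attempts` limit is an assumption, because the text does not say how many repetitions are allowed before a test case is ended.

```python
from typing import Callable


def run_test_cases(cases: list[str],
                   play_and_capture: Callable[[str], str],
                   labels: dict[str, str],
                   max_attempts: int = 3) -> dict[str, bool]:
    """Play each test case until its feedback matches the label or attempts run out."""
    results = {}
    for case in cases:
        passed = False
        for _ in range(max_attempts):
            # Stands for: artificial mouth output, camera capture, image recognition.
            feedback = play_and_capture(case)
            if feedback == labels[case]:
                passed = True
                break
        results[case] = passed
    return results


# Stand-in for the hardware: "call mom" only succeeds on its second attempt,
# and "book a taxi" never succeeds.
attempts: dict[str, int] = {}


def fake_play_and_capture(case: str) -> str:
    attempts[case] = attempts.get(case, 0) + 1
    if case == "call mom" and attempts[case] < 2:
        return "sorry, I did not catch that"
    return {"open maps": "maps opened", "call mom": "calling mom"}.get(case, "unknown")


labels = {"open maps": "maps opened", "call mom": "calling mom", "book a taxi": "taxi booked"}
results = run_test_cases(list(labels), fake_play_and_capture, labels)
print(results)
```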

Automated testing of the voice recognition terminal can be achieved by the above method.

Specifically, the control device is configured to:

obtain the statistical analysis result from the weights of the multiple test dimensions and the success counts corresponding to the multiple comparison results, according to the following formula:

where Score is the statistical analysis result; k is the total number of test dimensions; each test dimension i, i = 1, 2, ..., k, has a weight; n_Success is the number of successful recognitions of the test voice corpora under each test dimension; and N is the number of test voice corpora.

If the weight of each dimension is 0.25, the final score of last voice assistant test result is

By statistically analyzing each test dimension with the formula and summarizing the results into an evaluation report, the evaluation result can be seen intuitively.
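
The weighted statistic can be illustrated in code. The formula itself is not reproduced in this text, so the sketch assumes a form consistent with the definitions given: the score is the sum over the k test dimensions of each weight multiplied by that dimension's success rate n_Success / N, with the weights summing to 1. The function name `weighted_score` and the success counts are made up for the example.

```python
def weighted_score(weights: list[float], successes: list[int], total: int) -> float:
    """Weighted sum of per-dimension success rates (assumed form of the score)."""
    if len(weights) != len(successes):
        raise ValueError("one weight is needed per test dimension")
    if total <= 0:
        raise ValueError("total must be positive")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights are assumed to sum to 1")
    return sum(w * n / total for w, n in zip(weights, successes))


# Four dimensions with equal weights of 0.25 and N = 100 test voice corpora
# per dimension; the success counts are invented for illustration.
score = weighted_score([0.25] * 4, [90, 80, 70, 60], 100)
print(round(score, 4))
```

Whether the original formula also scales the result (for example to a 100-point scale) cannot be determined from the text.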

Likewise, in some embodiments, a test dimension may be further divided into sub-dimensions, each assigned its own weight for scoring. For example, the voice wake-up test item may be divided into false wake-up, wake-up response time and wake-up rate, each assigned a different weight. One scoring method for the voice wake-up test item is as follows. The wake-up rate score is computed from n_Always, the number of correct wake-ups, and N, the total number of wake-up tests. The false wake-up rate score is computed from n_Accidentally, the number of times the device under test is falsely woken, and N, the test duration in hours (N ≥ 24 is required). The wake-up time is T = T2 - T1, where T1 is the moment the utterance ends and T2 is the moment the terminal under test begins to respond. The wake-up time test item can be tested repeatedly to accumulate a large amount of test data, which is then analyzed and scored in segments using a piecewise function. The benefit is that results are evaluated by relative values rather than absolute values. For example, some test cases measure wake-up delay and response time, on which network delay has a certain influence; converting the measurements to scores through the piecewise function eliminates the influence of network delay. Finally, the three sub-test items are each multiplied by their respective weights and summed to give the final voice wake-up test result.
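
A minimal sketch of this sub-item scoring follows. The scoring formulas are not reproduced in this text, so everything numeric here is an assumption: the wake-up rate is taken as correct wake-ups divided by wake-up tests, the false wake-up rate as false wake-ups per hour, the mapping from false wake-up rate to a score is invented, and the piecewise thresholds and the sub-item weights are hypothetical.

```python
def wake_rate(n_correct: int, n_tests: int) -> float:
    """Assumed form: correct wake-ups divided by wake-up attempts."""
    return n_correct / n_tests


def false_wakes_per_hour(n_false: int, hours: float) -> float:
    """Assumed form: false wake-ups divided by test duration in hours."""
    if hours < 24:
        raise ValueError("the text requires a test duration of at least 24 hours")
    return n_false / hours


def wake_time_score(t1_s: float, t2_s: float) -> float:
    """Piecewise score for the wake-up time T = T2 - T1; thresholds are hypothetical."""
    t = t2_s - t1_s
    if t <= 0.5:
        return 1.0
    if t <= 1.0:
        return 0.8
    if t <= 2.0:
        return 0.5
    return 0.0


def wake_up_result(rate_score, false_score, time_score, weights=(0.4, 0.3, 0.3)):
    """Weighted sum of the three sub-item scores; the weights are hypothetical."""
    w_rate, w_false, w_time = weights
    return w_rate * rate_score + w_false * false_score + w_time * time_score


rate = wake_rate(95, 100)
# Hypothetical mapping from a false wake-up rate to a score in [0, 1].
false_score = max(0.0, 1.0 - false_wakes_per_hour(6, 24))
time_score = wake_time_score(t1_s=10.0, t2_s=10.75)
print(wake_up_result(rate, false_score, time_score))
```

Scoring the wake-up time through bands rather than raw seconds is what makes the result a relative value: two measurements that differ only by a small network delay fall into the same band and receive the same score.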

Based on the same inventive concept, an embodiment of the present invention further provides a voice recognition terminal evaluation method, as described in the following embodiment. Because the principle by which the method solves the problem is similar to that of the voice recognition terminal evaluation system, the implementation of the method may refer to the implementation of the system, and repeated details are not described again.

The voice recognition terminal evaluation method includes:

a voice playing device outputs a test voice corpus;

In conclusion voice recognition terminal evaluation system proposed by the present invention and method are shot by using high-speed camera The automatic test without interface may be implemented in the image alignment algorithm of photo and utilization based on deep learning, and will test Range can reduce cost of labor to subjective testing item from objective examination further expansion, and can accomplish reperformance test. On the other hand, quantifiable standards of grading are used when finally being evaluated, expansibility is strong, has a wide range of application, tester Flexible modulation parameter can be needed according to test.

Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system or a computer program product. The present invention may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM and optical storage) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to embodiments of the present invention. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more processes of a flowchart and/or one or more blocks of a block diagram.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more processes of a flowchart and/or one or more blocks of a block diagram.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of a flowchart and/or one or more blocks of a block diagram.

The foregoing is only a preferred embodiment of the present invention and is not intended to limit the present invention. For those skilled in the art, various modifications and variations may be made to the embodiments of the present invention. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A voice recognition terminal evaluation system, characterized by comprising: a control device, a terminal under test, an image acquisition device, a voice playing device and a noise generating device, wherein the control device is connected to the image acquisition device and the voice playing device;

wherein the control device is configured to: convert a test text corpus into a test voice corpus by a speech synthesis method;

the voice playing device is configured to: output the test voice corpus;

the terminal under test is configured to: recognize, under different test environments, the test voice corpus output by the voice playing device to obtain a recognition result, the different test environments including a noise test environment;

the image acquisition device is configured to: capture an image of the recognition result to obtain a speech recognition image, and send the speech recognition image to the control device;

the control device is further configured to: perform image recognition on the speech recognition image based on a deep learning algorithm to obtain the recognition result, and compare the recognition result with preset labeled data to obtain a comparison result, the comparison result being used to indicate the speech recognition performance of the terminal under test.

2. The voice recognition terminal evaluation system according to claim 1, characterized in that the image acquisition device is a high-speed camera, the voice playing device is an artificial mouth, and the noise generating device is a high-fidelity loudspeaker.

3. The voice recognition terminal evaluation system according to claim 1, characterized by further comprising: a test bench;

the terminal under test is placed on the test bench.

4. The voice recognition terminal evaluation system according to claim 1, characterized in that the control device is further connected to the noise generating device;

the control device is further configured to: set noise generation parameters, and send the noise generation parameters to the noise generating device;

the noise generating device is specifically configured to: generate the corresponding noise according to the noise generation parameters.

5. The voice recognition terminal evaluation system according to claim 1, characterized in that the control device is further configured to: set multiple test dimensions according to the characteristics of the terminal under test;

perform image recognition on the speech recognition image according to the multiple test dimensions to obtain multiple recognition results corresponding to the multiple test dimensions; compare the multiple recognition results with the corresponding preset labeled data to obtain multiple comparison results corresponding to the multiple test dimensions; and statistically analyze the multiple comparison results to obtain a statistical analysis result, the statistical analysis result being used to indicate the speech recognition performance of the terminal under test.

6. The voice recognition terminal evaluation system according to claim 5, characterized in that there are multiple test text corpora;

the control device is configured to: convert the multiple test text corpora into multiple test voice corpora by a speech synthesis method;

the voice playing device is configured to: output the multiple test voice corpora in sequence;

the terminal under test is configured to: recognize in sequence, under different test environments, the multiple test voice corpora output by the voice playing device to obtain multiple recognition results, the different test environments including a noise test environment;

the image acquisition device is configured to: capture images of the multiple recognition results to obtain multiple speech recognition images, and send the multiple speech recognition images to the control device;

the control device is further configured to: assign different weights to the multiple test dimensions according to the characteristics of the terminal under test; perform image recognition on the multiple speech recognition images according to the multiple test dimensions to obtain multiple recognition results corresponding to the multiple test voice corpora and the multiple test dimensions; compare these recognition results with the corresponding preset labeled data to obtain multiple comparison results corresponding to the multiple test voice corpora and the multiple test dimensions, each comparison result being the number of successful recognitions of the test voice corpora; and obtain a statistical analysis result from the weights of the multiple test dimensions and the success counts corresponding to the multiple comparison results, the statistical analysis result being used to indicate the speech recognition performance of the terminal under test.

7. The voice recognition terminal evaluation system according to claim 6, characterized in that the control device is specifically configured to:

obtain the statistical analysis result from the weights of the multiple test dimensions and the success counts corresponding to the multiple comparison results, according to the following formula:

where Score is the statistical analysis result; k is the total number of test dimensions; each test dimension i, i = 1, 2, ..., k, has a weight; n_Success is the number of successful recognitions of the test voice corpora under each test dimension; and N is the number of test voice corpora.

8. A voice recognition terminal evaluation method, characterized by comprising:

a voice playing device outputs a test voice corpus;

a terminal under test recognizes, under different test environments, the test voice corpus output by the voice playing device to obtain a recognition result, the different test environments including a noise test environment;

an image acquisition device captures an image of the recognition result to obtain a speech recognition image, and sends the speech recognition image to a control device;

the control device performs image recognition on the speech recognition image based on a deep learning algorithm to obtain the recognition result, and compares the recognition result with preset labeled data to obtain a comparison result, the comparison result being used to indicate the speech recognition performance of the terminal under test.

9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method of claim 8 when executing the computer program.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program for executing the method of claim 8.