CN111899740A

CN111899740A - Voice recognition system crowdsourcing test case generation method based on test requirements

Info

Publication number: CN111899740A
Application number: CN202010714647.6A
Authority: CN
Inventors: 王晓冰; 吉品; 王兴亚; 倪烨
Original assignee: Shenzhen Muzhi Technology Co ltd
Current assignee: Shenzhen Muzhi Technology Co ltd
Priority date: 2020-07-23
Filing date: 2020-07-23
Publication date: 2020-11-06

Abstract

A method for generating crowdsourced test cases for a speech recognition system based on test requirements, in which the speech recognition system under test is tested by means of crowdsourcing. The test requirements submitted by the task requester are recognized and extracted; using a speech recognition keyword list from a preset keyword library, text relevant to the content of the test cases is filtered out of the recognized text fragments and serves as candidate test requirements. The invention comprises three components: a speech recognition test parameter collection module, a text extraction module and a test case generation module. The input of the invention is a textual test requirement, and features are extracted from the user's requirement to generate a test template. The invention has the following beneficial effects: it meets the multi-dimensional and diverse requirements of ASR testing, and it lowers the threshold for crowdsourced workers to participate in testing. Participation in a test task is not limited by location or identity; a crowdsourced worker can register and take part given a test tool and the ability to test, which reduces testing overhead and shortens the product development cycle.

Description

Voice recognition system crowdsourcing test case generation method based on test requirements

Technical Field

The invention belongs to the field of software testing, and in particular relates to crowdsourced testing and test case generation. The invention tests a given speech recognition system by means of crowdsourcing: before crowdsourced workers are recruited, test cases must be generated from the test requirements provided by the task requester, and crowdsourcing tasks are then distributed around those test cases.

Background

Artificial intelligence software such as automatic speech recognition (ASR) systems cannot be built without large amounts of audio data. A large amount of data is needed whether the robustness of the system is verified at the data ingestion layer or its accuracy is verified at the core algorithm layer. In the ASR training stage, large-scale data collection and cleaning are difficult to complete with high quality; in the ASR evaluation stage, the quantity and quality of the test set also directly influence the performance evaluation result. The sound source, the environment, the recording distance, and the speaker's speed, volume and timbre all influence the recognition result. ASR testing therefore has multi-dimensional and diverse requirements, and crowdsourcing, as a distributed problem-solving mechanism, is well suited to meeting them.

Owing to the low cost and high efficiency of crowdsourced testing, various commercial crowdsourced testing platforms have appeared in recent years. Crowdsourced testing is gradually becoming a mainstream testing mode, but it is mostly used for functional testing of ordinary systems. Crowdsourced testing mainly involves three classes of participants: 1) the task requester: a person who has test requirements and needs to publish test tasks on the crowdsourced testing platform; 2) the crowdsourcing platform: an intermediary between task requesters and crowdsourced workers, which must provide technical support and supervision for the whole crowdsourced testing process; 3) the crowdsourced worker: a person who participates for financial reward or to improve their own skills, and who must submit test reports and review the test reports submitted by others. A crowdsourced worker's participation in a test task is not limited by location or identity; anyone who has a test tool and the ability to test can register and participate.

At present, test methods for speech recognition systems, both in China and abroad, are mainly divided into objective testing and subjective testing. In objective testing, a prepared audio test set is used, a system test tool performs the recognition, and the recognition rate is then counted manually. This method cannot confirm the compatibility of the recording system and does not lend itself to porting tests across manufacturers and devices. Subjective testing can be divided into simulated testing and live spoken testing. The difference between the two is that simulated testing uses a playback device to play recorded speech material, whereas live spoken testing organizes several speakers to read the test corpus aloud; a professional device then captures the speech, the speech recognition system recognizes it, and the result is recorded. Subjective testing suffers from excessive randomness, wastes human resources, and makes tests hard to reproduce. It also takes longer than objective testing, which lengthens the product development cycle.

To address the shortcomings of existing speech recognition test methods, many researchers have proposed automated speech recognition testing. For example, a text file containing the test keywords is converted into a speech file using text-to-speech (TTS) technology, the speech file is played through a playback device for recognition, and an automated test tool monitors the speech recognition system in real time to obtain the recognition result and write it to a file. A result-counting tool is then called automatically to compare the recognition result with the annotated file and judge whether the result is correct. After recognition is finished, summary statistics are compiled automatically and output as a CSV file for testers to review. Although this method avoids the tedium of manual recording, the synthesized speech sounds mechanical, differs from the human voice in timbre and intonation, and is far removed from real application scenarios. With crowdsourced testing, a large number of workers in real application scenarios can be recruited directly, and they are likely to be real users of the product in the future. These workers differ in timbre, manner of speaking and environment, which meets the multi-dimensional and diverse requirements of speech recognition testing. To lower the threshold for crowdsourced workers to take part in speech recognition testing and to improve the quality of the final test results, test cases need to be designed before the test task starts. Crowdsourced workers can then read the test cases and test directly by following their steps.

Based on the above work, the invention generates test cases suitable for a speech recognition system from the test requirements submitted by the task requester. The test procedure and method are simple and easy to understand for professional testers, but people who have never worked with a speech recognition system lack the relevant background knowledge and, left to their own devices, may fail to complete the test task on time. The invention therefore builds on existing speech recognition test methods, character recognition, crowdsourced testing and other technologies to provide a technique for generating crowdsourced test cases for speech recognition systems based on test requirements, shortening the research and development cycle of speech recognition algorithms.

Disclosure of Invention

The problem the invention aims to solve is as follows: using crowdsourcing for speech recognition testing is a new attempt and therefore presents new challenges. Unlike conventional functional tests, speech recognition tests have special requirements, including the test environment, the noise type and so on, which raise the threshold for crowdsourced workers to take part in test tasks. To recruit more workers and so ensure the completeness of the test, the invention automatically generates test cases with complete information from the test requirements, helping workers complete the test task better.

The technical solution of the invention is as follows: a crowdsourced test case generation method based on speech recognition test requirements, characterized in that multiple speech recognition test cases with complete content can be generated from a test requirement text.

The generation method comprises the following three modules/steps:

1) Collecting speech recognition test parameters: relevant material on speech recognition testing is collected through multiple channels such as books and the internet, then summarized and analyzed to form a set of speech recognition test parameters that is general and flexible. As shown in fig. 1 and 2, speech recognition needs to be tested along multiple dimensions from two angles (sound pickup and recognition). Apart from personal factors such as timbre and intonation, factors such as distance, volume and signal-to-noise ratio can be designed in advance to guide workers, so that the test is complete. Based on existing material, reasonable values are summarized below for several factors that influence speech recognition in common settings (home and office).

Distance of the speaker (sound source) from the device: 1 m, 3 m and 5 m.

Recommended speech volume: 57 dB, 65 dB, 70 dB

Recommended noise level: 47 dB, 55 dB, 60 dB

Recommended signal-to-noise ratio (speech level minus noise level): 10 dB, 5 dB

Angle between sound source and device: 0 degrees, 30 degrees, 45 degrees, 90 degrees.

2) A text extraction module: the test requirements submitted by the task requester are recognized and extracted; using the speech recognition keyword list from a preset keyword library, text relevant to the content of the test cases is filtered out of the recognized text fragments and taken as candidate test requirements. First, the speech recognition keyword list must be built: based on existing material, word2vec is used for text similarity analysis, and high-frequency keywords are extracted to form the keyword library, which is then manually screened and supplemented. If the test requirement is in PDF format, optical character recognition (OCR) is used to extract the text. After all text fragments have been extracted, each one is filtered by keyword, with the keywords taken from the speech recognition keyword list. The recognized text is handled in units of sentences, and a sentence is selected if it contains any keyword. After keyword filtering, the results are ranked by their probability of being relevant to the test requirement, where the probability for each text fragment is the ratio of the number of keywords it contains to the total number of keywords. Finally, key-value pairs are formed with the keyword as the index and the texts containing that keyword as the content. The keyword library includes: language type, dialect, Chinese, English, mixed, conversation mode, single speaker, multiple speakers, spaced, continuous, wake-up word, lexicon coverage, medical, legal, financial, general, environment, home, office, etc.

3) A test case generation module: the generation module first needs to determine the test case template. A test case template typically includes a test case ID, test purpose, preconditions, input data and expected results. The test purpose is either the wake-up rate in environment XX or the recognition rate in environment XX. Wake-up means that, after the voice command issued by the user (the wake-up word preset in the device or software) is recognized, the device moves from a dormant state to a state of waiting for commands; it can be regarded as a form of recognition. Voice wake-up is usually the first step of human-computer interaction, and the device or software starts recording for speech recognition after detecting the wake-up word. The preconditions are the speaker's angle, distance and volume, and the noise volume. The input data are divided into speech and noise type; the worker's spoken content, the language type and the expected result are specified by the test requirement. The items in the test case are matched against the keywords in the key-value pairs output by the text extraction module to find the corresponding text content. Each sentence is then parsed structurally to produce a parse tree. The leaf nodes of a sentence's parse tree are the word segmentation results of the sentence, and the parent of each leaf node is that word's part of speech. Higher-ranked and simpler clauses are displayed first as candidate content for filling in the test case, for crowdsourcing platform staff to select. In this technique, shift-reduce parsing is mainly used to analyze Chinese sentences.
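The three steps above end in a filled test case template. As a minimal sketch of how the recommended parameter values could be cross-combined and poured into such a template, the following Python example may help. The class `AsrTestCase`, the function `generate_cases`, the `TC-001` ID format and the rule of keeping only speech/noise pairs whose difference is a recommended signal-to-noise ratio are all assumptions made for illustration; the source does not specify them.

```python
from dataclasses import dataclass
from itertools import product

# Recommended values taken from the parameter list in step 1.
DISTANCES_M = (1, 3, 5)
SPEECH_LEVELS_DB = (57, 65, 70)
NOISE_LEVELS_DB = (47, 55, 60)
RECOMMENDED_SNR_DB = (10, 5)
ANGLES_DEG = (0, 30, 45, 90)


@dataclass
class AsrTestCase:
    """Hypothetical rendering of the template fields named in step 3."""
    case_id: str
    purpose: str          # wake-up rate or recognition rate in an environment
    angle_deg: int        # preconditions
    distance_m: int
    speech_db: int
    noise_db: int
    noise_type: str       # input data
    speech_content: str
    language_type: str
    expected_result: str


def generate_cases(metric, environment, noise_type, speech_content,
                   language_type, expected_result):
    """Cross-combine the environment parameters, one test case per combination."""
    if metric not in ("wake-up rate", "recognition rate"):
        raise ValueError(f"unknown test purpose: {metric}")
    cases = []
    for angle, distance, speech, noise in product(
        ANGLES_DEG, DISTANCES_M, SPEECH_LEVELS_DB, NOISE_LEVELS_DB
    ):
        # Assumption: signal-to-noise ratio = speech level minus noise level,
        # and only combinations with a recommended ratio are kept.
        if speech - noise not in RECOMMENDED_SNR_DB:
            continue
        cases.append(AsrTestCase(
            case_id=f"TC-{len(cases) + 1:03d}",
            purpose=f"{metric} in {environment} environment",
            angle_deg=angle,
            distance_m=distance,
            speech_db=speech,
            noise_db=noise,
            noise_type=noise_type,
            speech_content=speech_content,
            language_type=language_type,
            expected_result=expected_result,
        ))
    return cases


cases = generate_cases("wake-up rate", "home", "television",
                       "Hello device", "Chinese", "device wakes up")
print(len(cases), cases[0].case_id, cases[0].purpose)
```

In this sketch the free-text fields (noise type, spoken content, language type, expected result) stand in for the content a platform worker would select from the extracted requirement text.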

The invention is characterized in that:

1. In the field of speech recognition testing, the application of crowdsourcing to performance evaluation is proposed for the first time.

2. Computer vision technology is applied to the generation of crowdsourced test cases for the first time.

3. In the field of crowdsourced testing, a method for automatically extracting test case content from test requirements is proposed for the first time, improving the working efficiency of the crowdsourcing platform.

Drawings

Fig. 1 is a general flow chart of the implementation of the present invention.

Fig. 2 is a flow chart of the text extraction module of key step 2.

Fig. 3 is a prototype diagram of key step 3.

Fig. 4 and 5 are mind maps of speech recognition testing.

Detailed Description

The embodiments of the present invention are described below with reference to specific examples, and other advantages and effects of the present invention will be readily apparent to those skilled in the art from the disclosure of the present specification.

The method generates test cases for crowdsourced speech recognition testing based on test requirements. It mainly adopts image understanding and Chinese word segmentation technologies, and relies on specific key technologies including optical character recognition (OCR), word2vec, TF-IDF (term frequency-inverse document frequency) and syntactic structure parsing.

1. Recognizing text information

In the present invention, OCR is used to recognize the rich textual information present in the test requirements. OCR refers to the process in which an electronic device (e.g., a scanner or digital camera) examines printed characters, determines their shapes by detecting patterns of dark and light, and then translates the shapes into computer text using character recognition methods. That is, for printed characters, the text of a paper document is converted optically into a black-and-white dot-matrix image file, and recognition software converts the characters in the image into text format for further editing and processing by word-processing software.

2. Generating a corpus of keywords

In the invention, the words in each item of the training data set are mapped to vectors and their frequency of occurrence is assessed, thereby generating a keyword library for speech recognition testing. The same technique is also applied in the final test case generation module. Word2vec is a family of related models used to produce word vectors. These models are shallow, two-layer neural networks. The network takes a word as input and is trained to predict the words in adjacent positions. After training, the word2vec model can be used to map each word to a vector that represents word-to-word relationships; this vector is the hidden layer of the neural network.
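To make the "predict the words in adjacent positions" idea concrete, the sketch below generates the (input word, neighbouring word) pairs on which such a network would be trained. It assumes the skip-gram variant of word2vec and a hypothetical helper named `skipgram_pairs`; the source does not say which variant or window size is used, and no network training is shown here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (input_word, context_word) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Illustrative, already-segmented requirement text.
tokens = ["wake", "up", "word", "test"]
for pair in skipgram_pairs(tokens, window=1):
    print(pair)
```

Each pair is one training example: given the first word, the network is asked to predict the second. The learned hidden-layer weights for a word then serve as its vector.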

3. Candidate element identification

In the invention, each sentence produced by the text extraction module is treated as an element, and TF-IDF is used to judge the importance of the element, so that options for crowdsourcing platform staff to select are generated in the test case generation module. TF-IDF is a weighting technique commonly used in information retrieval and data mining. TF stands for term frequency, and IDF stands for inverse document frequency. TF-IDF evaluates the importance of a word to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases with the frequency with which it appears across the corpus.
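The weighting just described can be sketched in a few lines of Python. The example uses one common textbook form, term frequency multiplied by log(N / document frequency); the source does not state which TF-IDF variant it uses, and the function name `tf_idf` and the pre-tokenized sample sentences are assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute TF-IDF weights for a list of tokenized documents.

    Returns one dict per document mapping each term to tf * log(N / df).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Each extracted sentence is treated as one document (element).
sentences = [
    ["wake", "word", "home"],
    ["wake", "word", "office"],
    ["noise", "office"],
]
for sentence, scores in zip(sentences, tf_idf(sentences)):
    print(sentence, scores)
```

A term that occurs in every sentence receives weight zero under this form, which matches the intuition that it does not help distinguish one element from another.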

4. Syntactic structure parsing

In the invention, each sentence identified as an element by the text extraction module is parsed structurally, and the constituents labeled in the parse tree, such as phrases, clauses and nouns, are offered as options for crowdsourcing platform staff to select. Sentences are parsed with the shift-reduce method. A shift-reduce parser repeatedly pushes the next input word onto a stack. If the top n items on the stack match the n items on the right-hand side of a production, they are all popped off the stack and the item on the left-hand side of the production is pushed onto the stack. Replacing the top n items with a single item in this way is a reduce operation. The operation can be applied only to the top of the stack, so any reduction of items lower in the stack must be completed before subsequent items are pushed. When all input has been consumed and only one item remains on the stack, a parse tree rooted at the S node, the parser finishes. The shift-reduce parser builds the parse tree during this process: each time it pops n items off the stack, it combines them into a partial parse tree and push that tree back onto the stack.
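The shift and reduce operations described above can be illustrated with a small greedy parser. This is a sketch under stated assumptions: the toy English grammar and lexicon, the tuple-based tree representation and the function name `shift_reduce_parse` are all hypothetical, whereas the invention parses Chinese sentences with a grammar that the source does not list. The parser is greedy and does not backtrack, so it can fail on sentences that a more complete parser would accept.

```python
# Toy grammar: (left-hand side, right-hand side) productions.
GRAMMAR = [
    ("S", ("NP", "VP")),
    ("NP", ("Det", "N")),
    ("VP", ("V", "NP")),
]
# Toy lexicon mapping each word to its part of speech.
LEXICON = {"the": "Det", "worker": "N", "device": "N", "tests": "V"}


def shift_reduce_parse(tokens, grammar=GRAMMAR, lexicon=LEXICON):
    """Greedy shift-reduce parse; returns a nested-tuple tree or None on failure.

    A tree node is (label, child, ...); a part-of-speech node is (tag, word).
    """
    stack = []
    for token in tokens:
        # Shift: push the next word, wrapped in its part-of-speech node.
        stack.append((lexicon[token], token))
        # Reduce: repeatedly replace the top n items that match a production.
        reduced = True
        while reduced:
            reduced = False
            for lhs, rhs in grammar:
                n = len(rhs)
                if len(stack) >= n and tuple(node[0] for node in stack[-n:]) == rhs:
                    children = stack[-n:]
                    del stack[-n:]
                    stack.append((lhs, *children))  # partial parse tree
                    reduced = True
                    break
    # Success only if one tree rooted at S remains.
    if len(stack) == 1 and stack[0][0] == "S":
        return stack[0]
    return None


tree = shift_reduce_parse("the worker tests the device".split())
print(tree)
```

In the resulting tree the leaves are the words and the parent of each leaf is its part of speech, which is the structure the test case generation module relies on when offering phrases and clauses for selection.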

In the invention, the text extraction part uses a publicly available OCR service to recognize the text content of the test requirements. The keyword list used for keyword filtering and the test parameters used for test case generation are extracted manually from the collected material. Sentences containing any keyword are selected, and the text content is stored as "keyword: sentence" key-value pairs in a one-to-many relationship. The relevance of each sentence to the keywords is calculated as the ratio of the number of keywords contained in the text to the total number of keywords. In the test case generation module, a crowdsourcing platform worker only needs to select content from the related texts and fill in the test case template to generate multiple test cases with one click.
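The filtering, relevance scoring and one-to-many storage described in this paragraph can be sketched as follows. The function name `filter_sentences`, the five-keyword subset and the sample sentences are assumptions for illustration, and matching is done by simple case-insensitive substring search, which the source does not prescribe.

```python
from collections import defaultdict

# Small illustrative subset of the keyword library.
KEYWORDS = ["dialect", "english", "wake up word", "home", "office"]


def filter_sentences(sentences, keywords=KEYWORDS):
    """Keep sentences containing any keyword.

    Returns (index, ranked): `index` maps each keyword to the sentences that
    contain it (one-to-many); `ranked` lists (score, sentence) pairs, highest
    first, where score = keywords contained / total number of keywords.
    """
    index = defaultdict(list)
    ranked = []
    for sentence in sentences:
        lowered = sentence.lower()
        hits = [kw for kw in keywords if kw in lowered]
        if not hits:
            continue  # sentence is unrelated to the test case content
        ranked.append((len(hits) / len(keywords), sentence))
        for kw in hits:
            index[kw].append(sentence)
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return dict(index), ranked


requirement = [
    "Test the wake up word at home and in the office.",
    "Commands are spoken in English.",
    "Delivery is due next month.",
]
index, ranked = filter_sentences(requirement)
for score, sentence in ranked:
    print(f"{score:.1f}  {sentence}")
```

Substring matching is the simplest choice and can produce false hits (for example, "home" inside "homework"); a production system would more likely match on segmented words.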

Claims

1. A method for generating crowdsourced test cases for a speech recognition system based on test requirements, characterized in that a given speech recognition system is tested by means of crowdsourcing; the test requirements submitted by a task requester are recognized and extracted; using a speech recognition keyword list from a preset keyword library, text relevant to the content of the test cases is filtered out of the recognized text fragments; the text is then cross-combined with speech recognition environment parameters and taken as candidate test requirements; and a test module is generated, so that defects in the speech recognition system are fully detected.

2. The test requirement extraction method according to claim 1, characterized in that, for a requirement document submitted by a client, keyword extraction is performed first: word2vec is used for text similarity analysis, and high-frequency keywords are extracted to form a keyword library, which is then manually screened and supplemented; if the test requirement is in PDF format, optical character recognition (OCR) is used to extract the text; each sentence produced by the text extraction module is treated as an element, and TF-IDF is used to judge the importance of the element, so that options for crowdsourcing platform staff to select are generated in the test case generation module.

3. The speech recognition test parameters according to claim 1, characterized in that relevant material on speech recognition testing is collected through multiple channels such as books and the internet and is summarized and analyzed to form a set of general and flexible speech recognition test parameters covering multiple dimensions, such as the distance between the sound source and the device, the speech volume, the noise level, the signal-to-noise ratio, and the angle between the sound source and the device.

4. The test case generation method according to claim 1, characterized in that a specific test case template is used; the test case template includes a test case ID, a test purpose, preconditions, input data and an expected result; the test purpose is either the wake-up rate in environment XX or the recognition rate in environment XX; the preconditions are the speaker's angle, distance and volume, and the noise volume; the input data are divided into speech and noise type, the speaker's spoken content and the language type; and the specific test case requirement text is generated by matching against the test requirement keywords to find the corresponding text content for crowdsourcing platform staff to select.