CN113535913B - Answer scoring method and device, electronic equipment and storage medium - Google Patents

Publication number
CN113535913B
Authority
CN
China
Prior art keywords
wake
word
preset
audio
answer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110614234.5A
Other languages
Chinese (zh)
Other versions
CN113535913A (en
Inventor
梁华东
李鑫
胡铭铭
黄倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202110614234.5A priority Critical patent/CN113535913B/en
Publication of CN113535913A publication Critical patent/CN113535913A/en
Application granted granted Critical
Publication of CN113535913B publication Critical patent/CN113535913B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3343 Query execution using phonetics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Abstract

The application discloses an answer scoring method and apparatus, an electronic device, and a storage medium. The answer scoring method includes: performing wake-up word detection on answer audio to obtain a detection result, where the answer audio is collected while a user answers a preset question, the detection result includes at least one target wake-up word, the at least one target wake-up word comes from a wake-up word set, and the wake-up word set is obtained based on a preset answer to the preset question; and matching the detection result against the preset answer to obtain an answer score. This scheme can improve the efficiency and accuracy of answer scoring.

Description

Answer scoring method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a method and apparatus for scoring answers, an electronic device, and a storage medium.
Background
In real life, questionnaire-scoring scenarios such as cognitive impairment screening and mental health testing are common. At present, question-and-answer scoring is generally carried out either through manual face-to-face questioning, which is inefficient, or by a customer service robot that transcribes speech into text and performs keyword matching on the text. In the latter case, the spoken-language quality of the person being tested directly affects the accuracy of transcription and therefore the answer score. In view of this, how to improve the efficiency and accuracy of answer scoring is a problem in urgent need of a solution.
Disclosure of Invention
The application mainly solves the technical problem of providing an answer scoring method and device, electronic equipment and storage medium, and can improve the efficiency and accuracy of answer scoring.
To solve the above technical problem, a first aspect of the present application provides an answer scoring method, including: performing wake-up word detection on answer audio to obtain a detection result, where the answer audio is collected while a user answers a preset question, the detection result includes at least one target wake-up word, the at least one target wake-up word comes from a wake-up word set, and the wake-up word set is obtained based on a preset answer to the preset question; and matching the detection result against the preset answer to obtain an answer score.
To solve the above technical problem, a second aspect of the present application provides an answer scoring apparatus, including: a wake-up detection module configured to perform wake-up word detection on answer audio to obtain a detection result, where the answer audio is collected while a user answers a preset question, the detection result includes at least one target wake-up word, the at least one target wake-up word comes from a wake-up word set, and the wake-up word set is obtained based on a preset answer to the preset question; and an answer scoring module configured to match the detection result against the preset answer to obtain an answer score.
In order to solve the above technical problem, a third aspect of the present application provides an electronic device, which includes a memory and a processor coupled to each other, where the memory stores program instructions, and the processor is configured to execute the program instructions to implement the answer scoring method in the first aspect.
In order to solve the above technical problem, a fourth aspect of the present application provides a computer-readable storage medium storing program instructions executable by a processor for implementing the answer scoring method in the above first aspect.
According to the above scheme, wake-up word detection is performed on the answer audio to obtain a detection result, where the answer audio is collected while the user answers a preset question, the detection result includes at least one target wake-up word, the at least one target wake-up word comes from a wake-up word set, and the wake-up word set is obtained based on the preset answer to the preset question. On this basis, the detection result is matched against the preset answer to obtain the answer score. On the one hand, scoring only requires collecting the audio of the user answering the preset question, so the scoring process stays as close as possible to natural human interaction. On the other hand, the target wake-up words are obtained by running wake-up word detection on the answer audio and are then matched against the preset answer, without transcribing the entire answer audio, which reduces the influence of spoken-language quality on the answer score as far as possible. The efficiency and accuracy of answer scoring can therefore be improved.
Drawings
FIG. 1 is a flow chart of an embodiment of the answer scoring method of the present application;
FIG. 2 is a schematic diagram of a framework for voice wake-based answer scoring;
FIG. 3 is a process diagram of one embodiment of the answer scoring method of the application;
FIG. 4 is a flowchart illustrating an embodiment of step S11 in FIG. 1;
FIG. 5 is a flow diagram of one embodiment of obtaining a wake-up threshold;
FIG. 6 is a flowchart illustrating the step S11 of FIG. 1 according to another embodiment;
FIG. 7 is a flow chart of another embodiment of the answer scoring method of the application;
FIG. 8 is a schematic diagram of a frame of an embodiment of the answer scoring apparatus of the application;
FIG. 9 is a schematic diagram of a frame of an embodiment of an electronic device of the present application;
FIG. 10 is a schematic diagram of a frame of an embodiment of a computer-readable storage medium of the present application.
Detailed Description
The following describes embodiments of the present application in detail with reference to the drawings.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, interfaces, techniques, etc., in order to provide a thorough understanding of the present application.
The terms "system" and "network" are often used interchangeably herein. The term "and/or" herein merely describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" may mean: A exists alone, both A and B exist, or B exists alone. In addition, the character "/" herein generally indicates an "or" relationship between the preceding and following associated objects. Further, "a plurality" herein means two or more.
Referring to fig. 1, fig. 1 is a flowchart illustrating an answer scoring method according to an embodiment of the application. It should be noted that any answer scoring method embodiment of the present application may be applied to any questionnaire scoring scenario such as cognitive impairment screening, mental health testing, and post-operative follow-up, and is not limited herein. Specifically, the method may include the steps of:
Step S11: Perform wake-up word detection on the answer audio to obtain a detection result.
In this embodiment of the present disclosure, the answer audio is collected while the user answers the preset question, the detection result includes at least one target wake-up word, and the at least one target wake-up word comes from a wake-up word set, where the wake-up word set is obtained based on a preset answer to the preset question.
In one implementation scenario, the wake-up word set may be created based on the scoring words of the preset answer. Taking the cognitive impairment screening scenario as an example, the preset question may be: "Imagine you have plenty of 1-yuan, 5-yuan, and 10-yuan money. You now need to pay me 13 yuan. Please tell me 3 ways to pay. I will not give you change, so you must pay me exactly 13 yuan." The scoring point of this question is whether the user can give three or more payment combinations, and the full score is 3 points. For example, the preset answer may be "one 10 yuan and three 1 yuan; two 5 yuan and three 1 yuan; thirteen 1 yuan; one 5 yuan and eight 1 yuan". If the user gives three or more correct payment combinations, 3 points are awarded; two correct combinations earn 2 points; one correct combination earns 1 point; all other cases earn 0 points. On this basis, "1 yuan", "5 yuan", "10 yuan", "one", "two", "three", "eight", and "thirteen" can all be regarded as scoring words in the preset answer, so these scoring words can be used directly as preset wake-up words to create the wake-up word set ("1 yuan", "5 yuan", "10 yuan", "one", "two", "three", "eight", "thirteen"). Other cases are handled similarly and are not enumerated here.
In another implementation scenario, to improve the robustness of answer scoring, the preset wake-up words may include at least one of a first wake-up word and a second wake-up word, where the first wake-up word is obtained by synonym expansion of a scoring word based on preset initials and finals, and the second wake-up word is obtained by converting the first wake-up word into a preset dialect.
In a specific implementation scenario, the preset initials and finals may be set according to actual needs. For example, data investigation and analysis show that the wake-up words popular in current voice wake-up interaction take the following forms and variations on them: "small" + word (e.g., small, love), reduplicated words (e.g., questions), non-reduplicated words (e.g., dingdong), and "name + name" combinations (e.g., small). In addition, the pronunciation of one Chinese character is one syllable (tone + initial + final). In terms of tone, the first tone (tone 1) or the level tones (tone 1 plus tone 2) are popular; in terms of initials, zero initials (y, w) are popular; and in terms of finals, simple finals are popular. Taking the foregoing preset question as an example, first wake-up words such as 1 kuai, 5 kuai, 10 kuai, 1 yuan, 5 yuan, 10 yuan, 13 yuan, 2 yuan, and 8 yuan can be obtained according to the level tones (one, three, ten, thirteen, yuan), zero initials (one, five), and simple finals (one), together with the synonym expansion of "one" to "one" via a retroflex sound and the synonym expansion of "one" to "one" via a simple final. Other preset questions are handled similarly and are not enumerated here.
In another specific implementation scenario, the preset dialect may be set according to actual needs, for example the Hefei dialect, the Nanjing dialect, or the Hangzhou dialect. Still taking the above preset question as an example, "kuai", "yuan", and the like can be converted by the Hefei dialect into "coin", "son", "fur", and the like.
In yet another specific implementation scenario, the scoring words may further be expanded by combination. Still taking the foregoing preset question as an example, wake-up words such as "five plus three" and "ten plus three ones" can be obtained through combination expansion, which is not limited here. Other preset questions are handled similarly and are not enumerated here.
It should be noted that the wake-up word set of a preset question may be created before the user's answers are scored. That is, once the preset questions and their preset answers have been obtained, the wake-up word set of each preset question can be created.
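The construction of a wake-up word set described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `build_wake_word_set`, the two mapping arguments, and the sample entries are all hypothetical, and the synonym and dialect mappings are assumed to be prepared by hand in advance.

```python
def build_wake_word_set(score_words, synonym_map=None, dialect_map=None):
    """Return an ordered, de-duplicated list of preset wake-up words.

    score_words: scoring words taken from the preset answer.
    synonym_map: scoring word -> first wake-up words (synonym expansions).
    dialect_map: first wake-up word -> second wake-up words (dialect forms).
    Both maps are hypothetical, hand-built inputs.
    """
    synonym_map = synonym_map or {}
    dialect_map = dialect_map or {}
    wake_words = []

    def add(word):
        # Keep insertion order and skip duplicates.
        if word not in wake_words:
            wake_words.append(word)

    for score_word in score_words:
        add(score_word)
        for first in synonym_map.get(score_word, []):
            add(first)
            # Second wake-up words are derived from first wake-up words.
            for second in dialect_map.get(first, []):
                add(second)
    return wake_words


example = build_wake_word_set(
    ["1 yuan", "three"],
    synonym_map={"1 yuan": ["1 kuai"]},
    dialect_map={"1 kuai": ["1 kuai (dialect form)"]},
)
```

Because the set depends only on the preset question and its preset answer, it can be built once offline and reused for every user.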
In one implementation scenario, refer also to FIG. 2, which is a schematic diagram of a framework for answer scoring based on voice wake-up. As shown in FIG. 2, after the answer audio for a preset question is collected, VAD (Voice Activity Detection) endpoint processing may be performed on it to locate the speech start and end positions, so that the voiced segments can be extracted and wake-up word detection performed only on them. When the users are people with reduced spoken-expression ability, such as the elderly, the answer audio may contain many silent or noisy segments caused by long thinking and pauses; extracting the voiced segments greatly reduces their influence on wake-up word detection, which helps improve the real-time performance of detection and reduce resource consumption. For the specific process of endpoint processing, refer to the technical details of VAD, which are not repeated here.
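Since the text defers to general VAD techniques, the following is only a minimal sketch of endpoint location using a frame-energy threshold, one of the simplest VAD approaches. The function name `voiced_span`, the frame length, and the energy threshold are assumptions chosen for illustration; a production system would more likely use a trained VAD model.

```python
def voiced_span(samples, frame_len=160, energy_threshold=0.01):
    """Locate the speech start and end positions with a frame-energy threshold.

    samples: sequence of floats in [-1.0, 1.0].
    Returns (start, end) sample indices spanning the first to the last
    voiced frame, or None when no frame exceeds the threshold.
    """
    voiced_frames = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        # Mean squared amplitude of the frame.
        energy = sum(x * x for x in frame) / len(frame)
        if energy > energy_threshold:
            voiced_frames.append((start, start + len(frame)))
    if not voiced_frames:
        return None
    return voiced_frames[0][0], voiced_frames[-1][1]


# Two silent frames, two loud frames, one silent frame.
audio = [0.0] * 320 + [0.5] * 320 + [0.0] * 160
span = voiced_span(audio)
```

Only the samples inside the returned span would then be passed to wake-up word detection.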
In one implementation scenario, with continued reference to FIG. 2, wake-up word detection may be performed based on a wake-up engine and the wake-up word set. The wake-up engine may include, but is not limited to, an HMM-GMM (Hidden Markov Model-Gaussian Mixture Model) or a deep neural network (e.g., a convolutional neural network, a long short-term memory network, or a depthwise separable convolutional neural network). For the specific process of wake-up word detection using an HMM-GMM or a deep neural network, refer to the technical details of voice wake-up, which are not repeated here.
Step S12: Match the detection result against the preset answer to obtain an answer score.
Specifically, wake-up word detection on the answer audio identifies which preset wake-up words of the wake-up word set occur in it, namely the target wake-up words. On this basis, the detection result containing the target wake-up words can be matched against the preset answer to obtain the answer score for the preset question.
In one implementation scenario, with continued reference to FIG. 2, according to the scoring rule of the preset question, if the preset question is a keyword-detection question (e.g., a recall question or a picture-naming question), the detection result may be matched directly against the preset answer to obtain the answer score. Taking picture naming as an example, the preset question may ask the user to name pictures of four animals (peacock, zebra, butterfly, and tiger), the preset answer may be "peacock, zebra, butterfly, tiger", and the wake-up word set created from it may be ("peacock", "zebra", "butterfly", "tiger"). Suppose the user names the zebra and the tiger but says that the other pictures are not recognized. Wake-up word detection then yields a detection result containing the target wake-up words "zebra" and "tiger". Keyword matching of this detection result against the preset answer succeeds for "zebra" and "tiger", so the answer score for the preset question is determined to be 2 points. Other cases are handled similarly and are not enumerated here.
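The keyword matching in the picture-naming example can be sketched as follows. The function name `keyword_score` and the one-point-per-match rule are assumptions made for illustration; the text only states that two matched keywords yield 2 points.

```python
def keyword_score(detected_words, preset_answer, points_per_match=1):
    """Score a keyword-detection question by counting matched keywords.

    detected_words: target wake-up words found in the answer audio.
    preset_answer: keywords of the preset answer.
    A keyword repeated by the user is counted only once.
    """
    matched = set(detected_words) & set(preset_answer)
    return points_per_match * len(matched)


preset = ["peacock", "zebra", "butterfly", "tiger"]
score = keyword_score(["zebra", "tiger"], preset)
```

Using a set intersection means that saying "zebra" twice, or mentioning a word outside the preset answer, does not change the score.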
In one implementation scenario, with continued reference to FIG. 2, according to the scoring rule of the preset question, if the preset question is a pattern-extraction-and-combination question (e.g., number reading or making change), the detection result may be processed according to the rule corresponding to the question type and then matched against the preset answer to obtain the user's answer score for the preset question. Still taking the aforementioned preset question as an example ("Imagine you have plenty of 1-yuan, 5-yuan, and 10-yuan money. You now need to pay me 13 yuan. Please tell me 3 ways to pay. I will not give you change, so you must pay me exactly 13 yuan."), suppose the detection result includes the following target wake-up words: five yuan, one yuan, five yuan, ten yuan, coin, two sheets, five yuan, coin, thirteen, and coin. Fuzzy pattern extraction on the detection result yields the following combinations: <ten yuan, coin>, <two yuan, five yuan, coin>, and <thirteen, coin>. These combinations can then be fuzzy matched against the preset answer; for example, <two yuan, five yuan, coin> can be fuzzy matched to <2 yuan, 5 yuan, 3 yuan, 1 yuan>, and the other combinations are matched by analogy, which is not repeated here. In addition, when answering the preset question, the user may mention denominations that are irrelevant to the preset answer, for example one 7-kuai and one 5-kuai, where the 7 kuai would otherwise be split into two 1-kuai and one 5-kuai. In this case, a special identifier (such as 4) can be used to replace the irrelevant denomination (such as 7 kuai) to improve the precision of fuzzy matching.
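The scoring rule for the payment question (3 points for three or more correct combinations, 2 for two, 1 for one, 0 otherwise) can be sketched as follows. This sketch assumes that fuzzy pattern extraction has already normalized each detected combination into (count, denomination) pairs; that normalization step, the name `combination_score`, and the pair representation are hypothetical and not specified by the text.

```python
# Preset answer: every way to pay exactly 13 yuan with 1-, 5-, and 10-yuan
# money, written as sets of (count, denomination) pairs.
PRESET_ANSWER = {
    frozenset({(1, 10), (3, 1)}),
    frozenset({(2, 5), (3, 1)}),
    frozenset({(13, 1)}),
    frozenset({(1, 5), (8, 1)}),
}


def combination_score(combinations, preset=PRESET_ANSWER, max_score=3):
    """Count distinct correct payment combinations, capped at max_score.

    combinations: iterable of combinations, each an iterable of
    (count, denomination) pairs produced by an upstream extraction step.
    """
    correct = {frozenset(combo) for combo in combinations} & preset
    return min(max_score, len(correct))


answers = [
    [(1, 10), (3, 1)],
    [(3, 1), (2, 5)],   # pair order inside a combination does not matter
    [(13, 1)],
]
full_score = combination_score(answers)
```

A combination containing an irrelevant denomination, such as `[(1, 7), (1, 5), (1, 1)]`, simply fails to match any preset combination and contributes nothing.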
In one implementation scenario, refer also to FIG. 3, which is a process diagram of an embodiment of the answer scoring method of the present application. As shown in FIG. 3, in addition to preset questions answered by voice, preset questions answered by touch drawing may be included. For example, the pentagon question in the MMSE (Mini-Mental State Examination) requires the user to draw two pentagons that intersect to form a quadrilateral, with a vertex of one pentagon lying inside the other. In this case, the user's touch data for such a preset question may be acquired and preprocessed (e.g., redundant trace point removal, stroke segmentation, stroke order determination, stroke trace smoothing, and redundant stroke removal), and the preprocessed touch data may then be scored by a discrimination engine to obtain the answer score for the question. It should be noted that the discrimination engine may include, but is not limited to, discrimination rules or a discrimination model (such as a support vector machine, a logistic regression model, a naive Bayes model, or a random forest model), which is not limited here.
In one implementation scenario, with continued reference to FIG. 3, after the user has answered all preset questions, the answer scores of all preset questions can be aggregated into a composite score. Because the score is produced immediately, it can assist analysis in application scenarios such as cognitive impairment screening, mental health testing, and post-operative follow-up. Taking cognitive impairment screening as an example, preset questions can be designed to test memory, language, visuospatial ability, executive function, calculation, comprehension and judgment, and so on. The answer scores for these different aspects help analyze whether the user's cognitive function is impaired in each aspect, and cognitive impairment can be considered present when such impairment affects the user's daily or social abilities. It should be noted that, in the cognitive impairment screening scenario, the preset questions may come from the MMSE, the MoCA-B (Montreal Cognitive Assessment-Basic), and the like, which is not limited here. Other scenarios are handled similarly and are not enumerated here.
According to the above scheme, wake-up word detection is performed on the answer audio to obtain a detection result, where the answer audio is collected while the user answers a preset question, the detection result includes at least one target wake-up word, the at least one target wake-up word comes from a wake-up word set, and the wake-up word set is obtained based on the preset answer to the preset question. On this basis, the detection result is matched against the preset answer to obtain the answer score. On the one hand, scoring only requires collecting the audio of the user answering the preset question, so the scoring process stays as close as possible to natural human interaction. On the other hand, the target wake-up words are obtained by running wake-up word detection on the answer audio and are then matched against the preset answer, without transcribing the entire answer audio, which reduces the influence of spoken-language quality on the answer score as far as possible. The efficiency and accuracy of answer scoring can therefore be improved.
Referring to fig. 4, fig. 4 is a flowchart illustrating an embodiment of step S11 in fig. 1. Specifically, the method may include the steps of:
Step S41: Perform wake-up word detection on the answer audio to obtain the wake-up excitation of at least one candidate wake-up word.
In this embodiment of the present disclosure, the at least one candidate wake-up word comes from a wake-up word set, and the wake-up word set includes a plurality of preset wake-up words and a wake-up threshold corresponding to each preset wake-up word, where the wake-up threshold is obtained by testing with sample audio related to the preset wake-up word. It should be noted that the sample audio may be collected before answer scoring, so that the wake-up word set of each preset question is available before answer scoring.
In one implementation scenario, the sample audio related to a preset wake-up word may include first audio, second audio, and third audio, where the first audio contains the preset wake-up word, the second audio contains a first reference word of the preset wake-up word, and the third audio contains a second reference word of the preset wake-up word. The first reference word has the same meaning as the preset wake-up word but a different pronunciation, and the second reference word has the same pronunciation as the preset wake-up word but a different meaning. In this way, the wake-up threshold of the preset wake-up word can be determined from the first, second, and third audio together, which improves the accuracy of the wake-up threshold and, in turn, the accuracy of wake-up word detection.
In a specific implementation scenario, after the wake-up word set of a preset question is obtained, for each preset wake-up word, a first reference word (synonymous with the preset wake-up word but pronounced differently) and a second reference word (homophonous with the preset wake-up word but different in meaning) may be obtained, and first audio containing the preset wake-up word, second audio containing the first reference word, and third audio containing the second reference word may be collected.
In another specific implementation scenario, still taking the aforementioned preset question as an example ("Imagine you have plenty of 1-yuan, 5-yuan, and 10-yuan money. You now need to pay me 13 yuan. Please tell me 3 ways to pay. I will not give you change, so you must pay me exactly 13 yuan."), the wake-up word set can be obtained as described in the foregoing embodiment, which is not repeated here. Taking the preset wake-up word "coin" as an example, a first reference word "steel coin", which has the same meaning as the preset wake-up word but a different pronunciation, and a second reference word "hard pen", which has the same pronunciation as the preset wake-up word but a different meaning, may be obtained. First audio containing the preset wake-up word "coin" is then collected (for example, "please give me my change in coins, not paper money" or "this is the new version of the one-yuan coin"), second audio containing the first reference word is collected (for example, "lend me a steel coin" or "I have no steel coins on hand, so I will give you change in paper money"), and third audio containing the second reference word is collected (for example, "this old-strait hard-pen handwriting is very good" or "I have not practiced hard-pen writing for a long time"). Other scenarios are handled similarly and are not enumerated here.
Referring to fig. 5 in combination, fig. 5 is a flowchart illustrating an embodiment of obtaining a wake-up threshold. The method specifically comprises the following steps:
step S51: and respectively performing wake-up test by using the first audio, the second audio and the third audio to obtain a first data distribution of a preset wake-up word, a second data distribution of a first reference word and a third data distribution of a second reference word.
Specifically, the first data distribution comprises a first initial threshold and a first volume mean, the second data distribution comprises a second initial threshold and a second volume mean, and the third data distribution comprises a third initial threshold and a third volume mean. For ease of description, the first initial threshold may be denoted as S 0 The second initial threshold is denoted as S 1 And the third initial threshold is recorded as S 2 Similarly, the first volume average may be noted as v 0 The second volume average is recorded as v 1 And record the third volume average as v 2 The first data distribution of the preset wake-up word can be expressed as (S 0 ,v 0 ) The second data distribution of the first reference word may represent (S 1 ,v 1 ) The third data distribution of the second reference word may be expressed as (S 2 ,v 2 )。
In one implementation scenario, the volume average value of the voiced segments in the test audio may be counted to obtain the volume average value. The voiced segments in the test audio may be obtained by the VAD endpoint processing described above, and will not be described in detail herein. It should be noted that, in the case that the wake-up word is a preset wake-up word, the test audio represents a first audio, and the volume average represents a first volume average v 0 In the case where the wake-up word is the first reference word, the test audio represents the second audio, and the volume average represents the second volume average v 1 And in the case that the wake-up word is the second reference word, the test audio represents the third audio, and the volume average represents the third volume average v 2
In one implementation scenario, wake-up test can be performed on wake-up words by using test audio, wake-up success rates respectively corresponding to different test wake-up thresholds are counted and selected, and the test wake-up threshold with the wake-up success rate higher than a preset threshold is selected as an initial threshold. Specifically, after the wake-up test is performed on the wake-up word by using the test audio, wake-up excitation of the wake-up word can be obtained, wherein the higher the wake-up excitation is, the greater the possibility that the wake-up word is contained in the test audio is, and the lower the wake-up excitation is, the lower the possibility that the wake-up word is contained in the test audio is. In this case, different test wake-up thresholds (e.g., 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, etc.) may be set, and wake-up success rates corresponding to wake-up words in the case of selecting different test wake-up thresholds may be counted. For example, if the wake-up stimulus is greater than the test wake-up threshold, it may be determined that the test audio includes a wake-up word, and if the test audio actually includes a wake-up word, the wake-up may be considered successful, based on which the different test wake-up thresholds may be respectively counted And selecting a test wake-up threshold with the wake-up success rate higher than a preset threshold as an initial threshold (for example, if the wake-up success rate corresponding to the test wake-up threshold 0.6 is higher than the preset threshold, the test wake-up threshold is used as the initial threshold). 
It should be noted that, when the wake-up word is the preset wake-up word, the test audio is the first audio and the initial threshold is the first initial threshold $S_0$; when the wake-up word is the first reference word, the test audio is the second audio and the initial threshold is the second initial threshold $S_1$; and when the wake-up word is the second reference word, the test audio is the third audio and the initial threshold is the third initial threshold $S_2$.
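As a non-limiting illustration, the selection of an initial threshold described above may be sketched in Python as follows. The function names, the representation of each test clip as an (excitation, contains-word) pair, the definition of the success rate over clips that actually contain the wake-up word, and the choice of the largest qualifying threshold are all assumptions made for this sketch; the disclosure does not specify them.

```python
from typing import Iterable, Optional, Sequence, Tuple

# Each test clip: (wake-up excitation, whether the clip really contains the word).
Sample = Tuple[float, bool]


def wake_success_rate(samples: Sequence[Sample], threshold: float) -> float:
    """Share of word-containing clips whose excitation exceeds the threshold."""
    positives = [excitation for excitation, has_word in samples if has_word]
    if not positives:
        return 0.0
    woken = sum(1 for excitation in positives if excitation > threshold)
    return woken / len(positives)


def select_initial_threshold(
    samples: Sequence[Sample],
    test_thresholds: Iterable[float],
    preset_rate: float,
) -> Optional[float]:
    """Return a test threshold whose success rate is above preset_rate.

    Assumption: when several thresholds qualify, the largest one is kept,
    since a stricter threshold is less prone to false wake-ups.
    """
    qualifying = [
        t for t in test_thresholds if wake_success_rate(samples, t) > preset_rate
    ]
    return max(qualifying) if qualifying else None
```

The same routine would be run three times, once each on the first, second and third audio, to obtain $S_0$, $S_1$ and $S_2$.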
Step S52: obtaining a first adjustment weight based on the difference between the first data distribution and the second data distribution, and obtaining a second adjustment weight based on the difference between the first data distribution and the third data distribution.
In one implementation, the difference between the first data distribution and the second data distribution may be measured to obtain a distribution gap $d_1$, the threshold difference $S_1 - S_0$ between the second initial threshold and the first initial threshold may be obtained, and the volume difference $v_1 - v_0$ between the second volume average and the first volume average may be obtained. The first adjustment weight $f_1$ is then obtained based on the distribution gap $d_1$, the threshold difference $S_1 - S_0$ and the volume difference $v_1 - v_0$, where $f_1$ is negatively correlated with the distribution gap $d_1$, positively correlated with the threshold difference $S_1 - S_0$, and negatively correlated with the volume difference $v_1 - v_0$. Specifically, the first adjustment weight $f_1$ can be expressed as:
$$f_1 = \frac{(S_1 - S_0)/d_1}{(v_1 - v_0)/v_0} \tag{1}$$
In one implementation, the difference between the first data distribution and the third data distribution may be measured to obtain a distribution gap $d_2$, the threshold difference $S_2 - S_0$ between the third initial threshold and the first initial threshold may be obtained, and the volume difference $v_2 - v_0$ between the third volume average and the first volume average may be obtained. The second adjustment weight $f_2$ is then obtained based on the distribution gap $d_2$, the threshold difference $S_2 - S_0$ and the volume difference $v_2 - v_0$, where $f_2$ is negatively correlated with the distribution gap $d_2$, positively correlated with the threshold difference $S_2 - S_0$, and negatively correlated with the volume difference $v_2 - v_0$. Specifically, the second adjustment weight $f_2$ can be expressed as:
$$f_2 = \frac{(S_2 - S_0)/d_2}{(v_2 - v_0)/v_0} \tag{2}$$
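Formulas (1) and (2) share the same form, so a single helper can serve both reference words. The following Python sketch is illustrative only; the function name, the argument names and the guard against zero denominators are assumptions, and the numbers in the usage note are invented for the example.

```python
def adjustment_weight(
    s_ref: float,  # initial threshold of the reference word (S_1 or S_2)
    s_0: float,    # first initial threshold of the preset wake-up word
    v_ref: float,  # volume average of the reference word (v_1 or v_2)
    v_0: float,    # first volume average of the preset wake-up word
    d: float,      # distribution gap (d_1 or d_2)
) -> float:
    """Evaluate f = ((S_ref - S_0) / d) / ((v_ref - v_0) / v_0)."""
    if d == 0 or v_0 == 0 or v_ref == v_0:
        # The formula is undefined here; how to handle this is not stated
        # in the text, so this sketch simply refuses to compute.
        raise ValueError("adjustment weight is undefined for a zero denominator")
    return ((s_ref - s_0) / d) / ((v_ref - v_0) / v_0)
```

For instance, with a threshold difference of 0.1, a distribution gap of 0.2, and volume averages of 66 against 60, the weight is roughly (0.1 / 0.2) / (6 / 60) = 5; doubling the distribution gap halves it.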
In addition, the distribution gap may be measured using the JS (Jensen-Shannon) divergence. Taking the distribution gap between the first data distribution and the second data distribution as an example, the JS divergence between the two can be expressed as:
$$JS(P_{g1} \,\|\, P_{g2}) = \frac{1}{2} KL\!\left(P_{g1} \,\Big\|\, \frac{P_{g1} + P_{g2}}{2}\right) + \frac{1}{2} KL\!\left(P_{g2} \,\Big\|\, \frac{P_{g1} + P_{g2}}{2}\right) \tag{3}$$
In the above formula (3), KL represents the KL divergence function, $P_{g1}$ represents the distribution function of the first data distribution, and $P_{g2}$ represents the distribution function of the second data distribution. Note that the KL divergence may be expressed as:
$$KL(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} \tag{4}$$
The distribution gap between the first data distribution and the third data distribution can be measured in a similar way, which is not repeated here. In this manner, the distribution gap between the data distributions is calculated, and the adjustment weight is set to be negatively correlated with the distribution gap, positively correlated with the threshold difference, and negatively correlated with the volume difference. As a result, the adjustment weight increases when the distribution gap between the preset wake-up word and its reference word is smaller, when the volume gap is smaller, or when the threshold difference is larger, which helps distinguish the preset wake-up word from its reference word and thereby helps improve the wake-up success rate.
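The JS divergence measurement may be sketched as follows. This is a minimal illustration that assumes each data distribution has already been reduced to a discrete probability vector over the same bins (for example, a normalized histogram); that representation, and the function names, are assumptions of the sketch rather than details given in the text.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(P || Q) for discrete distributions; terms with P(x) = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """JS(P || Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M), with M = (P + Q) / 2."""
    if len(p) != len(q):
        raise ValueError("both distributions must be defined over the same bins")
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    # M(x) > 0 wherever P(x) > 0 or Q(x) > 0, so no division by zero occurs.
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

With the natural logarithm, the value lies between 0 (identical distributions) and log 2 (distributions with disjoint support), and it is symmetric in its two arguments, which makes it convenient as a gap measure.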
Step S53: adjusting the first initial threshold by using the first adjustment weight and the second adjustment weight to obtain the wake-up threshold of the preset wake-up word.
Specifically, the first adjustment weight and the second adjustment weight may be used to determine a first adjustment ratio, where the first adjustment ratio is positively correlated with both the first adjustment weight and the second adjustment weight, and the sum of the first initial threshold and the product of the adjustment step size and the first adjustment ratio is used as the wake-up threshold. For ease of description, the wake-up threshold may be denoted as $S_m$, which can be expressed as:
$$S_m = S_0 + \frac{2 f_1 f_2}{f_1 + f_2} \times 50 \tag{5}$$
In the above formula (5), $2 f_1 f_2 / (f_1 + f_2)$ represents the first adjustment ratio, and 50 represents the adjustment step size. The adjustment step size may also be set to 30, 40, etc. according to actual needs, and is not limited here. In this manner, the first adjustment ratio is determined from the first adjustment weight and the second adjustment weight and is positively correlated with both, and the sum of the first initial threshold and the product of the adjustment step size and the first adjustment ratio is used as the wake-up threshold. In other words, starting from its first initial threshold, the wake-up threshold of the preset wake-up word is positively correlated with the first adjustment weight of the first reference word and the second adjustment weight of the second reference word, which further improves the accuracy of wake-up word detection.
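Formula (5) combines the two adjustment weights through what is, in effect, their harmonic mean. A minimal Python sketch follows; the function name, the default step of 50 taken from the formula, and the zero-sum guard are choices made for this illustration.

```python
def wake_threshold(s_0: float, f_1: float, f_2: float, step: float = 50.0) -> float:
    """Evaluate S_m = S_0 + (2 * f_1 * f_2 / (f_1 + f_2)) * step."""
    if f_1 + f_2 == 0:
        # Undefined in the formula; the text does not say how to treat it.
        raise ValueError("first adjustment ratio is undefined when f_1 + f_2 == 0")
    first_adjustment_ratio = 2 * f_1 * f_2 / (f_1 + f_2)
    return s_0 + first_adjustment_ratio * step
```

For example, with illustrative values $S_0 = 600$, $f_1 = 2$ and $f_2 = 6$, the first adjustment ratio is 3 and the wake-up threshold is 600 + 3 × 50 = 750; with a step of 30 it would be 690.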
In one implementation scenario, as described above, the preset wake-up word includes at least one of a first wake-up word and a second wake-up word, where the first wake-up word is obtained by synonymously expanding the score word based on a preset initial consonant, and the second wake-up word is obtained by performing dialect conversion on the first wake-up word based on a preset dialect. On this basis, the wake-up threshold of the first wake-up word is higher than that of the second wake-up word: giving the first wake-up word a higher wake-up threshold reduces false wake-ups, and giving the second wake-up word a lower wake-up threshold improves the wake-up success rate.
In a specific implementation scenario, for both the first wake-up word (e.g. unary) and the second wake-up word (e.g. coin), the corresponding wake-up threshold $S_m$ may be obtained by the steps described above. On this basis, for the first wake-up word (e.g. unary), a preset up-adjustment value (e.g., 70) may further be added to its wake-up threshold $S_m$ (e.g., 780) to give the updated wake-up threshold of the first wake-up word (e.g., 850); and for the second wake-up word (e.g. coin), a preset down-adjustment value (e.g., 70) may further be subtracted from its wake-up threshold $S_m$ (e.g., 670) to give the updated wake-up threshold of the second wake-up word (e.g., 600). Other cases are similar and are not enumerated here.
In this manner, wake-up tests are performed with the first audio, the second audio and the third audio respectively to obtain the first data distribution of the preset wake-up word, the second data distribution of the first reference word and the third data distribution of the second reference word; the first adjustment weight is obtained based on the difference between the first data distribution and the second data distribution, and the second adjustment weight is obtained based on the difference between the first data distribution and the third data distribution. On this basis, the first initial threshold is adjusted by the first adjustment weight and the second adjustment weight to obtain the wake-up threshold of the preset wake-up word. This reduces the interference that homophonic but non-synonymous words and synonymous but differently pronounced words cause to the preset wake-up word during wake-up word detection, and helps improve the wake-up success rate.
In one implementation scenario, still taking as an example the preset question "Imagine you have plenty of money in 1-yuan, 5-yuan and 10-yuan denominations. Now you need to pay me 13 yuan; please tell me 3 ways to pay. I cannot give change, so you need to pay me exactly 13 yuan", when the answer audio is "2 5-yuan and 3 1-yuan", the wake-up excitation of the candidate wake-up word "2", the wake-up excitation of the candidate wake-up word "5 yuan", the wake-up excitation of the candidate wake-up word "3" and the wake-up excitation of the candidate wake-up word "1 yuan" can be obtained. Other cases are similar and are not enumerated here.
Step S42: for each candidate wake word, determining whether to take the candidate wake word as a target wake word based on a magnitude relation between wake excitation and a wake threshold corresponding to the candidate wake word.
Specifically, for each candidate wake-up word, if its wake-up excitation is greater than the wake-up threshold corresponding to that candidate wake-up word, the candidate wake-up word may be taken as a target wake-up word; otherwise, if its wake-up excitation is not greater than the corresponding wake-up threshold, the candidate wake-up word is not taken as a target wake-up word. Still taking as an example the preset question "Imagine you have plenty of money in 1-yuan, 5-yuan and 10-yuan denominations. Now you need to pay me 13 yuan; please tell me 3 ways to pay. I cannot give change, so you need to pay me exactly 13 yuan", for the candidate wake-up word "2", if its wake-up excitation is greater than its corresponding wake-up threshold, "2" may be added to the detection result as a target wake-up word; otherwise "2" is not taken as a target wake-up word. The candidate wake-up words "5 yuan", "3" and "1 yuan" are handled in the same way and are not described one by one here.
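The comparison in step S42 may be sketched as follows. The dictionary-based representation of excitations and thresholds, the function name, and all numeric values are hypothetical and chosen only to make the example concrete.

```python
from typing import Dict, List


def select_target_wake_words(
    excitations: Dict[str, float],  # candidate wake-up word -> wake-up excitation
    thresholds: Dict[str, float],   # preset wake-up word -> wake-up threshold
) -> List[str]:
    """Keep candidates whose excitation is strictly greater than their threshold."""
    return [
        word
        for word, excitation in excitations.items()
        if word in thresholds and excitation > thresholds[word]
    ]


# Hypothetical values for the answer "2 5-yuan and 3 1-yuan".
detected = select_target_wake_words(
    {"2": 0.9, "5 yuan": 0.8, "3": 0.7, "1 yuan": 0.4},
    {"2": 0.6, "5 yuan": 0.6, "3": 0.6, "1 yuan": 0.6, "10 yuan": 0.6},
)
```

Here "1 yuan" is dropped because its excitation does not exceed its threshold, so the detection result contains "2", "5 yuan" and "3". An excitation exactly equal to the threshold is also rejected, matching the "not greater than" wording.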
According to the above scheme, wake-up word detection is performed on the answer audio to obtain the wake-up excitation of at least one candidate wake-up word, where the at least one candidate wake-up word comes from the wake-up word set, the wake-up word set includes a plurality of preset wake-up words and the wake-up threshold corresponding to each preset wake-up word, and each wake-up threshold is obtained through a test on sample audio related to the preset wake-up word. On this basis, whether each candidate wake-up word is taken as a target wake-up word is determined from the magnitude relationship between its wake-up excitation and its corresponding wake-up threshold. Because the wake-up threshold of each preset wake-up word is obtained from a test on sample audio related to that word, deciding whether to wake up in combination with this threshold can improve the wake-up success rate while reducing the false wake-up rate.
Referring to fig. 6, fig. 6 is a flowchart illustrating another embodiment of step S11 in fig. 1. In the embodiment of the disclosure, in the answer scoring process, the wake-up threshold value can be adaptively adjusted according to the actual environment. Specifically, the method may include the steps of:
Step S61: performing wake-up word detection on the answer audio to obtain the wake-up excitation of at least one candidate wake-up word.
In the embodiment of the disclosure, at least one candidate wake-up word is from a wake-up word set, and the wake-up word set includes a plurality of preset wake-up words and wake-up thresholds corresponding to the preset wake-up words, where the wake-up thresholds are obtained by using a sample audio test related to the preset wake-up words, and specifically, reference may be made to the related description in the foregoing embodiment of the disclosure, and details are not repeated herein.
Step S62: obtaining the measured volume average of the user in the question-answering environment.
In one implementation scenario, the measured volume average may be obtained before the user answers the preset question. For example, the user may be prompted that answer scoring is about to start and asked to input a section of environmental test audio through the microphone; VAD endpoint processing may then be performed on the environmental test audio, and the mean of the volume amplitudes of its voiced segments may be used as the measured volume average.
In another implementation scenario, as described above, if there are multiple preset questions in the whole answer scoring process, for each preset question, the measured volume average value may be obtained through statistics based on the preset questions that the user has answered. For example, when the user is about to answer the second preset question, the VAD endpoint processing can be performed on the answer audio when the user answers the first preset question, and the average value of the volume amplitude of the voiced segment is used as the measured volume average value; when the user is about to answer the third preset question, VAD endpoint processing can be performed on the answer audio of the user when the user answers the first preset question and the second preset question, and the average value of the volume amplitude of the voiced segments is taken as the measured volume average value, and the like, and no one-to-one example is given here.
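The second scenario amounts to keeping a running mean over the voiced segments of all answers collected so far. The Python sketch below illustrates this; it assumes that VAD endpoint processing has already been applied and that each voiced segment is available as a sequence of amplitude samples. The class and method names are hypothetical.

```python
from typing import Iterable, Optional, Sequence


class MeasuredVolumeTracker:
    """Running mean of absolute amplitudes over voiced segments of answered questions."""

    def __init__(self) -> None:
        self._amplitude_sum = 0.0
        self._sample_count = 0

    def add_answer(self, voiced_segments: Iterable[Sequence[float]]) -> None:
        """Accumulate the voiced segments (already extracted by VAD) of one answer."""
        for segment in voiced_segments:
            for amplitude in segment:
                self._amplitude_sum += abs(amplitude)
                self._sample_count += 1

    def measured_volume_average(self) -> Optional[float]:
        """Mean volume amplitude so far, or None before any voiced audio is seen."""
        if self._sample_count == 0:
            return None
        return self._amplitude_sum / self._sample_count
```

Before the second preset question, the tracker holds only the first answer; before the third, it holds the first and second answers, and so on. Averaging per sample rather than per answer is a design choice of this sketch.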
Step S63: determining a second adjustment ratio based on the measured volume average.
In the embodiment of the disclosure, the second adjustment ratio is positively correlated with the measured volume average value, that is, the larger the measured volume average value is, the larger the second adjustment ratio is, whereas the smaller the measured volume average value is, the smaller the second adjustment ratio is. For convenience of description, the second adjustment ratio may be denoted as λ, and the second adjustment ratio λ may be expressed as:
$$\lambda = 1 + \frac{v_3 - v_0}{v_0} \times 50 \tag{6}$$
In the above formula (6), $v_3$ represents the measured volume average, and $v_0$ represents the first volume average of the preset wake-up word, which is obtained based on the first audio containing the preset wake-up word; reference may be made to the related description in the foregoing disclosed embodiments, which is not repeated here. In addition, 50 represents the adjustment step size, which may also be set to 30, 40, etc. according to practical application requirements, and is not limited here.
In one implementation scenario, in a case where the measured volume average value is obtained before the user answers the preset question, the second adjustment proportion of each preset wake-up word in each wake-up word set may be determined based on the measured volume average value. That is, before the preset questions are answered, the second adjustment ratio of the wake-up threshold corresponding to each preset wake-up word in the wake-up word set corresponding to each preset question can be obtained.
In another implementation scenario, in a case where the measured volume average value is obtained based on the preset question statistics that have been answered before each preset question, for each preset question, the second adjustment proportion of each preset wake-up word in the wake-up word set corresponding to the preset question may be determined based on the measured volume average value obtained for the preset question statistics. That is, each time a preset question is answered, the second adjustment ratio of each preset wake-up word in the wake-up word set corresponding to the preset question is determined based on the measured volume average value obtained by statistics of the answered preset questions.
Step S64: adjusting the wake-up threshold by using the second adjustment ratio.
In one implementation scenario, when the measured volume average is obtained before the user answers the preset question, the wake-up threshold corresponding to each preset wake-up word may be adjusted by using the second adjustment ratio. As in the previously disclosed embodiments, the wake-up threshold may be denoted as $S_m$; for ease of distinction, the adjusted wake-up threshold may be denoted as $S_f$, which can be expressed as:
$$S_f = S_m \cdot \lambda \tag{7}$$
In another implementation scenario, when the measured volume average is obtained from the already answered preset questions before each preset question, then for each preset question, the wake-up threshold corresponding to each preset wake-up word in the wake-up word set of that preset question may be adjusted by using its second adjustment ratio; the specific calculation is as shown in the above formula (7).
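Formulas (6) and (7) together can be sketched in a few lines of Python. The function names and the dictionary layout below are assumptions made for illustration; in particular, the sketch lets each preset wake-up word carry its own first volume average $v_0$, which is one reading of the text.

```python
from typing import Dict


def second_adjustment_ratio(v_3: float, v_0: float, step: float = 50.0) -> float:
    """Evaluate lambda = 1 + ((v_3 - v_0) / v_0) * step, per formula (6)."""
    if v_0 == 0:
        raise ValueError("first volume average v_0 must be non-zero")
    return 1 + ((v_3 - v_0) / v_0) * step


def adjust_thresholds(
    thresholds: Dict[str, float],      # preset wake-up word -> S_m
    first_volumes: Dict[str, float],   # preset wake-up word -> v_0
    measured_volume: float,            # v_3, measured in the answering environment
    step: float = 50.0,
) -> Dict[str, float]:
    """Return S_f = S_m * lambda for every preset wake-up word, per formula (7)."""
    return {
        word: s_m * second_adjustment_ratio(measured_volume, first_volumes[word], step)
        for word, s_m in thresholds.items()
    }
```

When the measured volume equals $v_0$, the ratio is 1 and the threshold is unchanged. As formula (6) is written, the step size strongly amplifies relative volume changes: a 5% rise in volume with a step of 50 yields a ratio of 3.5.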
Step S65: for each candidate wake-up word, determining whether to take the candidate wake-up word as a target wake-up word based on the magnitude relationship between its wake-up excitation and the wake-up threshold corresponding to the candidate wake-up word.
Reference may be made specifically to the foregoing disclosed embodiments, and details are not repeated here.
According to the above scheme, before determining whether a candidate wake-up word is taken as a target wake-up word based on its corresponding wake-up threshold, the measured volume average of the user in the question-answering environment is obtained, and a second adjustment ratio that is positively correlated with the measured volume average is determined from it. The wake-up threshold is then adjusted by the second adjustment ratio, so that the wake-up threshold can be adapted to the actual environment during answer scoring, which helps improve the accuracy of answer scoring.
Referring to fig. 7, fig. 7 is a flowchart illustrating an answer scoring method according to another embodiment of the application.
Specifically, the method may include the steps of:
Step S71: performing wake-up word detection on the answer audio to obtain a detection result.
In the embodiment of the disclosure, the answer audio is collected when the user answers the preset question, the detection result includes at least one target wake-up word, and the at least one target wake-up word is from a wake-up word set, the wake-up word set is obtained based on a preset answer of the preset question, the wake-up word set includes a plurality of preset wake-up words and wake-up thresholds corresponding to the preset wake-up words, and the wake-up thresholds are obtained by using a sample audio test related to the preset wake-up words.
Step S72: the wake-up threshold is modified based on the difference between the detection result and the scoring result.
In the embodiment of the disclosure, the scoring result includes at least one actual wake-up word, and the at least one actual wake-up word is included in the answer audio and is from the wake-up word set. That is, the actual wake-up word is a preset wake-up word actually contained in the answer audio.
Specifically, referring to fig. 3 in combination, by comparing the detection result with the scoring result, the preset wake-up words that should have been awakened but were not, and the preset wake-up words that should not have been awakened but were, can be determined. On this basis, the wake-up thresholds corresponding to these two kinds of preset wake-up words can be corrected. Specifically, the wake-up threshold of a preset wake-up word that should have been awakened but was not can be lowered, and the wake-up threshold of a preset wake-up word that should not have been awakened but was can be raised, so as to improve the wake-up success rate and reduce the false wake-up rate.
In one implementation scenario, still taking as an example the preset question "Imagine you have plenty of money in 1-yuan, 5-yuan and 10-yuan denominations. Now you need to pay me 13 yuan; please tell me 3 ways to pay. I cannot give change, so you need to pay me exactly 13 yuan", for which the corresponding wake-up word set is described in the foregoing disclosed embodiments, suppose the answer audio is "2 5-yuan and 3 1-yuan", i.e. the preset wake-up words that should be awakened are "2", "5 yuan", "3" and "1 yuan", while the detection result includes the 4 target wake-up words "2", "5 yuan", "3" and "10 yuan". In this case, the preset wake-up word that should have been awakened but was not is "1 yuan", and the preset wake-up word that should not have been awakened but was is "10 yuan"; the wake-up threshold corresponding to the preset wake-up word "1 yuan" can therefore be appropriately lowered, and the wake-up threshold corresponding to the preset wake-up word "10 yuan" can be appropriately raised. Other cases are similar and are not enumerated here.
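The correction in step S72 can be viewed as two set differences between the detection result and the scoring result. The following Python sketch illustrates the idea; the function name, the fixed correction amount `delta` and all numeric values are assumptions, since the text only says that the thresholds are "appropriately" lowered or raised.

```python
from typing import Dict, Set


def correct_thresholds(
    thresholds: Dict[str, float],  # preset wake-up word -> wake-up threshold
    detected: Set[str],            # target wake-up words in the detection result
    actual: Set[str],              # actual wake-up words in the scoring result
    delta: float,                  # hypothetical correction amount
) -> Dict[str, float]:
    """Lower thresholds of missed words and raise thresholds of falsely woken words."""
    missed = actual - detected        # should have been awakened but were not
    false_wakes = detected - actual   # should not have been awakened but were
    corrected = dict(thresholds)
    for word in missed & corrected.keys():
        corrected[word] -= delta
    for word in false_wakes & corrected.keys():
        corrected[word] += delta
    return corrected


# Hypothetical thresholds for the 13-yuan example.
corrected = correct_thresholds(
    {"2": 700.0, "5 yuan": 700.0, "3": 700.0, "1 yuan": 700.0, "10 yuan": 700.0},
    detected={"2", "5 yuan", "3", "10 yuan"},
    actual={"2", "5 yuan", "3", "1 yuan"},
    delta=20.0,
)
```

In this example only "1 yuan" (missed) and "10 yuan" (falsely awakened) change, to 680 and 720 respectively; correctly detected words keep their thresholds.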
In another implementation scenario, embodiments of the present disclosure may be specifically executed in the early stage of application, so that the wake-up threshold is corrected in time if it is still not accurate enough in the early stage of application. Alternatively, embodiments of the present disclosure may specifically be executed during the testing phase to perform internal testing prior to application, so as to correct the wake-up threshold in time if it is still not accurate enough.
According to the above scheme, wake-up word detection is performed on the answer audio to obtain a detection result, and the wake-up threshold is corrected based on the difference between the detection result and the scoring result. This improves the accuracy of the wake-up threshold, so that the wake-up success rate can be improved while the false wake-up rate is reduced.
Referring to fig. 8, fig. 8 is a schematic diagram illustrating an answer scoring apparatus 80 according to an embodiment of the application. The answer scoring apparatus 80 includes a wake-up detection module 81 and an answer scoring module 82. The wake-up detection module 81 is configured to perform wake-up word detection on the answer audio to obtain a detection result, where the answer audio is collected when the user answers a preset question, the detection result includes at least one target wake-up word, the at least one target wake-up word comes from a wake-up word set, and the wake-up word set is obtained based on a preset answer to the preset question. The answer scoring module 82 is configured to obtain an answer score by matching the detection result with the preset answer.
According to the scheme, in the answer scoring process, answer scoring can be achieved only by collecting answer audios of the preset questions of the user, so that the answer scoring is close to a human interaction form as much as possible, at least one target wake-up word in the answer audios can be obtained only by detecting the wake-up word of the answer audios, the answer scoring is obtained by matching the at least one target wake-up word with the preset answers, voice transcription of the whole answer audios is not needed, influence of spoken language quality on the answer scoring is reduced as much as possible, and therefore efficiency and accuracy of the answer scoring can be improved.
In some disclosed embodiments, the wake-up detection module 81 includes a wake-up word detection sub-module, configured to perform wake-up word detection on the answer audio to obtain the wake-up excitation of at least one candidate wake-up word, where the at least one candidate wake-up word comes from the wake-up word set, the wake-up word set includes a plurality of preset wake-up words and the wake-up thresholds corresponding to the preset wake-up words, and the wake-up thresholds are obtained through tests on sample audio related to the preset wake-up words. The wake-up detection module 81 further includes a target wake-up word obtaining sub-module, configured to determine, for each candidate wake-up word, whether to take the candidate wake-up word as a target wake-up word based on the magnitude relationship between its wake-up excitation and the wake-up threshold corresponding to the candidate wake-up word.
Therefore, since the wake-up threshold corresponding to each preset wake-up word is obtained from a test on sample audio related to that preset wake-up word, deciding whether to wake up in combination with the wake-up threshold can improve the wake-up success rate while reducing the false wake-up rate.
In some disclosed embodiments, the sample audio related to the preset wake-up word includes a first audio, a second audio and a third audio. The first audio contains the preset wake-up word, the second audio contains a first reference word of the preset wake-up word, and the third audio contains a second reference word of the preset wake-up word; the first reference word is synonymous with the preset wake-up word but pronounced differently, and the second reference word is homophonic with the preset wake-up word but not synonymous with it.
Therefore, the sample audio related to the preset wake-up word is set to include the first audio, the second audio and the third audio, where the first audio contains the preset wake-up word, the second audio contains the first reference word, and the third audio contains the second reference word; the first reference word is synonymous with the preset wake-up word but pronounced differently, and the second reference word is homophonic with the preset wake-up word but not synonymous with it. The wake-up threshold of the preset wake-up word can thus be determined from the first audio, the second audio and the third audio together, which improves the accuracy of the wake-up threshold and of wake-up word detection.
In some disclosed embodiments, the answer scoring apparatus 80 includes a threshold obtaining module. The threshold obtaining module includes a wake-up test sub-module, configured to perform wake-up tests using the first audio, the second audio and the third audio respectively, to obtain a first data distribution of the preset wake-up word, a second data distribution of the first reference word and a third data distribution of the second reference word, where the first data distribution includes a first initial threshold and a first volume average, the second data distribution includes a second initial threshold and a second volume average, and the third data distribution includes a third initial threshold and a third volume average. The threshold obtaining module further includes a weight obtaining sub-module, configured to obtain a first adjustment weight based on the difference between the first data distribution and the second data distribution and to obtain a second adjustment weight based on the difference between the first data distribution and the third data distribution. The threshold obtaining module further includes an initial adjustment sub-module, configured to adjust the first initial threshold by using the first adjustment weight and the second adjustment weight to obtain the wake-up threshold of the preset wake-up word.
Therefore, wake-up tests are performed with the first audio, the second audio and the third audio respectively to obtain the first data distribution of the preset wake-up word, the second data distribution of the first reference word and the third data distribution of the second reference word; the first adjustment weight is obtained based on the difference between the first data distribution and the second data distribution, and the second adjustment weight is obtained based on the difference between the first data distribution and the third data distribution. On this basis, the first initial threshold is adjusted by the first adjustment weight and the second adjustment weight to obtain the wake-up threshold of the preset wake-up word, which reduces the interference that homophonic but non-synonymous words and synonymous but differently pronounced words cause to the preset wake-up word during wake-up word detection, and helps improve the wake-up success rate.
In some disclosed embodiments, the wake-up test sub-module includes a mean statistics unit, configured to compute the mean of the volume amplitudes of the voiced segments in the test audio to obtain the volume average. The wake-up test sub-module further includes a threshold test unit, configured to perform a wake-up test on the wake-up word using the test audio, count the wake-up success rates corresponding to different test wake-up thresholds, and select a test wake-up threshold whose wake-up success rate is higher than a preset threshold as the initial threshold. When the wake-up word is the preset wake-up word, the test audio is the first audio, the volume average is the first volume average, and the initial threshold is the first initial threshold; when the wake-up word is the first reference word, the test audio is the second audio, the volume average is the second volume average, and the initial threshold is the second initial threshold; when the wake-up word is the second reference word, the test audio is the third audio, the volume average is the third volume average, and the initial threshold is the third initial threshold.
Therefore, the volume average is obtained by computing the mean of the volume amplitudes of the voiced segments in the test audio, the wake-up word is subjected to a wake-up test using the test audio, the wake-up success rates corresponding to different test wake-up thresholds are counted, and a test wake-up threshold whose wake-up success rate is higher than the preset threshold is selected as the initial threshold, which improves the accuracy of the data distribution.
In some disclosed embodiments, the weight obtaining sub-module includes a gap measurement unit, configured to measure the difference between the first data distribution and a candidate data distribution to obtain a distribution gap, where the candidate data distribution includes a candidate initial threshold and a candidate volume average. The weight obtaining sub-module further includes a difference calculation unit, configured to obtain the threshold difference between the candidate initial threshold and the first initial threshold and to obtain the volume difference between the candidate volume average and the first volume average. The weight obtaining sub-module further includes a weight calculation unit, configured to obtain an adjustment weight based on the distribution gap, the threshold difference and the volume difference. The adjustment weight is negatively correlated with the distribution gap, positively correlated with the threshold difference, and negatively correlated with the volume difference; when the candidate data distribution is the second data distribution, the adjustment weight is the first adjustment weight, and when the candidate data distribution is the third data distribution, the adjustment weight is the second adjustment weight.
Therefore, by calculating the distribution gap between the data distributions and setting the adjustment weight to be negatively correlated with the distribution gap, positively correlated with the threshold difference, and negatively correlated with the volume difference, the adjustment weight increases when the distribution gap between the preset wake-up word and its reference word is smaller, when the volume gap is smaller, or when the threshold difference is larger. This helps distinguish the preset wake-up word from its reference word and thereby helps improve the wake-up success rate.
In some disclosed embodiments, the adjustment initiation sub-module includes a first ratio acquisition unit configured to determine a first adjustment ratio using the first adjustment weight and the second adjustment weight, where the first adjustment weight and the second adjustment weight are both positively correlated with the first adjustment ratio; the adjustment initiation sub-module includes a wake-up threshold calculation unit for taking the sum of the first initial threshold and the product of the adjustment step size and the first adjustment ratio as the wake-up threshold.
Therefore, the first adjustment ratio is determined from the first adjustment weight and the second adjustment weight, both of which are positively correlated with it, and the wake-up threshold is taken as the first initial threshold plus the product of the adjustment step size and the first adjustment ratio. That is, starting from the first initial threshold, the wake-up threshold of the preset wake-up word is positively correlated with the first adjustment weight of the first reference word and the second adjustment weight of the second reference word, which further improves the accuracy of wake-up word detection.
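The threshold calculation described above can be sketched as follows. The sum-of-product form comes from the text; how the two weights are combined into the first adjustment ratio is not specified beyond the positive correlation, so the simple average used here is an assumption, and the function names are hypothetical.

```python
def first_adjustment_ratio(first_weight: float, second_weight: float) -> float:
    # Assumption: a plain average, which is positively correlated with both weights.
    return (first_weight + second_weight) / 2.0


def wake_threshold(
    first_initial_threshold: float,
    adjustment_step: float,
    first_weight: float,
    second_weight: float,
) -> float:
    """Wake-up threshold = first initial threshold + adjustment step * first adjustment ratio."""
    ratio = first_adjustment_ratio(first_weight, second_weight)
    return first_initial_threshold + adjustment_step * ratio
```

With an initial threshold of 0.5, a step of 0.1 and weights of 0.4 and 0.6, the ratio is 0.5 and the resulting wake-up threshold is approximately 0.55.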
In some disclosed embodiments, the wake-up detection module 81 includes a volume measurement sub-module configured to obtain a measured volume average value of the user in the question-answering environment; the wake-up detection module 81 includes a second ratio acquisition sub-module for determining a second adjustment ratio based on the measured volume average value, where the second adjustment ratio is positively correlated with the measured volume average value; the wake-up detection module 81 includes a threshold adjustment sub-module for adjusting the wake-up threshold using the second adjustment ratio.
Therefore, before determining whether to take a candidate wake-up word as a target wake-up word based on its wake-up threshold, the measured volume average value of the user in the question-answering environment is acquired, a second adjustment ratio positively correlated with that measured volume average value is determined, and the wake-up threshold is adjusted using the second adjustment ratio. The wake-up threshold can thus be adaptively adjusted during answer scoring, which improves the accuracy of the answer score.
In some disclosed embodiments, the volume measurement sub-module is specifically configured to obtain the measured volume average value before the user answers the preset questions; the second ratio acquisition sub-module is specifically configured to determine, based on the measured volume average value, a second adjustment ratio for each preset wake-up word in each wake-up word set; the threshold adjustment sub-module is specifically configured to adjust, for each preset wake-up word, the corresponding wake-up threshold using the second adjustment ratio of that preset wake-up word.
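A minimal sketch of this adaptive adjustment is given below. The text only states that the second adjustment ratio is positively correlated with the measured volume average value; taking the ratio as the measured value divided by a reference volume average value (for example, the one obtained during the wake-up test) and applying it multiplicatively are assumptions, and the function names are hypothetical.

```python
def second_adjustment_ratio(measured_volume_mean: float, reference_volume_mean: float) -> float:
    """Hypothetical ratio that grows with the measured volume average value."""
    if reference_volume_mean <= 0:
        raise ValueError("reference_volume_mean must be positive")
    return measured_volume_mean / reference_volume_mean


def adjust_wake_threshold(wake_threshold: float, ratio: float) -> float:
    # Assumption: the second adjustment ratio scales the threshold multiplicatively.
    return wake_threshold * ratio
```

Under these assumptions a user who speaks louder than the reference raises the threshold, and a quieter user lowers it.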
Therefore, before the user answers the preset questions, the wake-up thresholds corresponding to the preset wake-up words in the wake-up word sets corresponding to the preset questions are uniformly and adaptively adjusted, and the accuracy of the wake-up thresholds can be further improved on the basis of reducing the complexity of the adaptive adjustment.
In some disclosed embodiments, the volume measurement sub-module is specifically configured to, for each preset question, compute the measured volume average value from the preset questions that the user has already answered; the second ratio acquisition sub-module is specifically configured to determine, for each preset question, a second adjustment ratio for each preset wake-up word in the wake-up word set corresponding to that preset question, based on the measured volume average value computed for that preset question; the threshold adjustment sub-module is specifically configured to adjust, for each preset question, the corresponding wake-up thresholds using the second adjustment ratio of each preset wake-up word in the corresponding wake-up word set.
Therefore, in the process that the user answers the preset questions, the self-adaptive adjustment is performed on the wake-up threshold value corresponding to each preset wake-up word in the wake-up word set corresponding to each question, so that the precision of the self-adaptive adjustment can be improved, and the accuracy of the wake-up threshold value can be improved as much as possible.
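The per-question variant can be sketched with a running statistic over the questions already answered. This is only an illustration: the class `RunningVolumeMean`, the function `thresholds_for_next_question`, the multiplicative ratio, and the choice to leave thresholds unchanged before any question has been answered are all assumptions not stated in the text.

```python
from typing import Dict, Optional


class RunningVolumeMean:
    """Tracks the average volume over the preset questions answered so far."""

    def __init__(self) -> None:
        self._total = 0.0
        self._count = 0

    def update(self, answer_volume_mean: float) -> None:
        self._total += answer_volume_mean
        self._count += 1

    @property
    def mean(self) -> Optional[float]:
        # None until at least one question has been answered.
        return self._total / self._count if self._count else None


def thresholds_for_next_question(
    base_thresholds: Dict[str, float],
    tracker: RunningVolumeMean,
    reference_volume_mean: float,
) -> Dict[str, float]:
    """Scale every wake-up threshold of the next question's wake-up word set."""
    measured = tracker.mean
    if measured is None:
        return dict(base_thresholds)  # nothing measured yet: keep thresholds as tested
    ratio = measured / reference_volume_mean  # assumed form of the second adjustment ratio
    return {word: threshold * ratio for word, threshold in base_thresholds.items()}
```

Calling `tracker.update(...)` after each answer and `thresholds_for_next_question(...)` before the next one gives each preset question thresholds that reflect the user's volume so far.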
In some disclosed embodiments, the wake word set is created based on a score word of a preset answer, and the preset wake word includes at least one of a first wake word and a second wake word; the first wake-up words are obtained by synonymously expanding score words based on preset initial consonants, and the second wake-up words are obtained by performing dialect conversion on the first wake-up words based on preset dialects.
Therefore, the wake-up word set is created based on the score word of the preset answer, the preset wake-up word comprises at least one of a first wake-up word and a second wake-up word, the first wake-up word is obtained by synonymously expanding the score word based on the preset initial consonant, and the second wake-up word is obtained by performing dialect conversion on the first wake-up word based on the preset dialect, so that the robustness of the answer score can be improved.
In some disclosed embodiments, the wake threshold of the first wake word is higher than the wake threshold of the second wake word.
Therefore, by assigning a higher wake-up threshold to the first wake-up word, false wake-up can be reduced, and by assigning a lower wake-up threshold to the second wake-up word, wake-up success rate can be improved.
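As an illustration, a wake-up word set with these two kinds of preset wake-up words might be assembled as follows. The synonym and dialect tables are hypothetical inputs supplied by the caller, the default threshold values are placeholders, and treating the score word itself as a first wake-up word is an assumption; the role of the preset initial consonant in the expansion is not modelled.

```python
from typing import Dict, List


def build_wake_word_set(
    score_words: List[str],
    synonyms: Dict[str, List[str]],
    dialect_variants: Dict[str, List[str]],
    first_threshold: float = 0.6,
    second_threshold: float = 0.5,
) -> Dict[str, float]:
    """Map each preset wake-up word to its wake-up threshold."""
    if first_threshold <= second_threshold:
        raise ValueError("first wake-up words need the higher threshold")
    wake_thresholds: Dict[str, float] = {}
    for score_word in score_words:
        # First wake-up words: the score word (assumption) plus its synonym expansions.
        first_words = [score_word, *synonyms.get(score_word, [])]
        for first_word in first_words:
            wake_thresholds.setdefault(first_word, first_threshold)
        # Second wake-up words: dialect conversions of the first wake-up words.
        for first_word in first_words:
            for second_word in dialect_variants.get(first_word, []):
                wake_thresholds.setdefault(second_word, second_threshold)
    return wake_thresholds
```

In practice the fixed defaults would be replaced by the per-word thresholds obtained from the wake-up tests; the sketch only preserves the ordering between the two kinds of words.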
Referring to fig. 9, fig. 9 is a schematic framework diagram of an embodiment of the electronic device 90 of the application. The electronic device 90 comprises a memory 91 and a processor 92 coupled to each other; the memory 91 stores program instructions, and the processor 92 is configured to execute the program instructions to implement the steps of any of the answer scoring method embodiments described above. Specifically, the electronic device 90 may include, but is not limited to, a desktop computer, a notebook computer, a mobile phone, a tablet computer, and the like.
Specifically, the processor 92 is configured to control itself and the memory 91 to implement the steps of any of the answer scoring method embodiments described above. The processor 92 may also be referred to as a CPU (Central Processing Unit). The processor 92 may be an integrated circuit chip with signal processing capabilities. The processor 92 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general-purpose processor may be a microprocessor or any conventional processor. In addition, the processor 92 may be implemented jointly by multiple integrated circuit chips.
According to the above scheme, answer scoring requires only the answer audio collected while the user answers the preset question, so the scoring process stays as close as possible to natural human interaction. At least one target wake-up word is obtained simply by performing wake-up word detection on the answer audio, and the answer score is obtained by matching the at least one target wake-up word against the preset answer. Because the whole answer audio does not need to be transcribed, the influence of spoken-language quality on the answer score is reduced as much as possible, which improves both the efficiency and the accuracy of answer scoring.
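The overall flow can be sketched under a few assumptions. The wake-up excitations are taken as already produced by some wake-up word detector (not shown), a candidate is accepted when its excitation is at least its wake-up threshold, and the score is the fraction of score words hit; the comparison direction, the scoring rule and all names are illustrative and not taken from the text.

```python
from typing import Dict, List, Set


def select_target_wake_words(
    wake_excitations: Dict[str, float],
    wake_thresholds: Dict[str, float],
) -> Set[str]:
    """Keep candidates whose excitation reaches their wake-up threshold (assumed direction)."""
    return {
        word
        for word, excitation in wake_excitations.items()
        if word in wake_thresholds and excitation >= wake_thresholds[word]
    }


def answer_score(
    target_words: Set[str],
    wake_word_to_score_word: Dict[str, str],
    score_words: List[str],
    full_score: float = 100.0,
) -> float:
    """Hypothetical matching rule: share of the preset answer's score words that were hit."""
    if not score_words:
        raise ValueError("score_words must not be empty")
    hit = {wake_word_to_score_word[w] for w in target_words if w in wake_word_to_score_word}
    return full_score * len(hit & set(score_words)) / len(score_words)
```

No transcription step appears anywhere in this flow: only per-word excitations and thresholds are compared, which is the point made in the paragraph above.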
Referring to fig. 10, fig. 10 is a schematic framework diagram of an embodiment of the computer readable storage medium 100 of the present application. The computer readable storage medium 100 stores program instructions 101 executable by a processor, the program instructions 101 being used to implement the steps of any of the answer scoring method embodiments described above.
According to the above scheme, answer scoring requires only the answer audio collected while the user answers the preset question, so the scoring process stays as close as possible to natural human interaction. At least one target wake-up word is obtained simply by performing wake-up word detection on the answer audio, and the answer score is obtained by matching the at least one target wake-up word against the preset answer. Because the whole answer audio does not need to be transcribed, the influence of spoken-language quality on the answer score is reduced as much as possible, which improves both the efficiency and the accuracy of answer scoring.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
The foregoing description of the various embodiments emphasizes the differences between them; for parts that are the same or similar, the embodiments may be referred to one another, and these parts are not repeated here for brevity.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical, or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

Claims (14)

1. A method of answer scoring comprising:
performing wake-up word detection on the answer audio to obtain a detection result; the answer audio is collected when a user answers a preset question, the detection result comprises at least one target wake-up word, the at least one target wake-up word is from a wake-up word set, and the wake-up word set is obtained based on the preset answer of the preset question;
matching the detection result with the preset answer to obtain an answer score;
wherein the performing wake-up word detection on the answer audio to obtain a detection result comprises:
performing wake-up word detection on the answer audio to obtain a wake-up excitation of at least one candidate wake-up word; the at least one candidate wake-up word is from the wake-up word set, the wake-up word set comprises a plurality of preset wake-up words and wake-up thresholds corresponding to the preset wake-up words, the wake-up thresholds are obtained by testing with sample audio related to the preset wake-up words, and the wake-up excitation of a candidate wake-up word represents the possibility that the answer audio contains the corresponding candidate wake-up word;
for each candidate wake-up word, determining whether to take the candidate wake-up word as the target wake-up word based on the magnitude relation between the wake-up excitation and the wake-up threshold corresponding to the candidate wake-up word.
2. The method of claim 1, wherein the sample audio associated with the preset wake word comprises a first audio, a second audio, and a third audio;
the first audio comprises the preset wake-up word, the second audio comprises a first reference word of the preset wake-up word, the third audio comprises a second reference word of the preset wake-up word, the first reference word and the preset wake-up word are synonymous with different sounds, and the second reference word and the preset wake-up word are synonymous with each other.
3. The method according to claim 2, wherein the wake threshold acquisition step of the preset wake word comprises:
respectively performing wake-up tests by using the first audio, the second audio and the third audio to obtain first data distribution of the preset wake-up word, second data distribution of the first reference word and third data distribution of the second reference word; the first data distribution comprises a first initial threshold value and a first volume mean value, the second data distribution comprises a second initial threshold value and a second volume mean value, and the third data distribution comprises a third initial threshold value and a third volume mean value;
obtaining a first adjustment weight based on the difference between the first data distribution and the second data distribution, and obtaining a second adjustment weight based on the difference between the first data distribution and the third data distribution;
and adjusting the first initial threshold by using the first adjustment weight and the second adjustment weight to obtain the wake-up threshold of the preset wake-up word.
4. The method of claim 3, wherein performing a wake-up test with the first audio, the second audio, and the third audio, respectively, to obtain a first data distribution of the preset wake-up word, a second data distribution of the first reference word, and a third data distribution of the second reference word, comprises:
computing the mean of the volume amplitudes of the voiced segments in the test audio to obtain a volume mean value; and
performing a wake-up test on a wake-up word by using the test audio, counting the wake-up success rates respectively corresponding to different test wake-up thresholds, and selecting a test wake-up threshold whose wake-up success rate is higher than a preset threshold as an initial threshold;
wherein, in the case that the wake-up word is the preset wake-up word, the test audio is the first audio, the volume mean value is the first volume mean value, and the initial threshold is the first initial threshold; in the case that the wake-up word is the first reference word, the test audio is the second audio, the volume mean value is the second volume mean value, and the initial threshold is the second initial threshold; and in the case that the wake-up word is the second reference word, the test audio is the third audio, the volume mean value is the third volume mean value, and the initial threshold is the third initial threshold.
5. The method of claim 3, wherein the obtaining a first adjustment weight based on the difference between the first data distribution and the second data distribution, or the obtaining a second adjustment weight based on the difference between the first data distribution and the third data distribution, comprises:
measuring the difference between the first data distribution and a candidate data distribution to obtain a distribution gap; the candidate data distribution comprises a candidate initial threshold and a candidate volume mean value; and
acquiring a threshold difference value between the candidate initial threshold value and the first initial threshold value, and acquiring a volume difference value between the candidate volume average value and the first volume average value;
obtaining an adjustment weight based on the distribution gap, the threshold difference and the volume difference;
the adjustment weight is inversely related to the distribution gap, the adjustment weight is inversely related to the threshold difference, the adjustment weight is inversely related to the volume difference, and the adjustment weight is the first adjustment weight when the candidate data distribution is the second data distribution, and the adjustment weight is the second adjustment weight when the candidate data distribution is the third data distribution.
6. The method of claim 3, wherein the adjusting the first initial threshold using the first adjustment weight and the second adjustment weight to obtain the wake-up threshold of the preset wake-up word comprises:
determining a first adjustment proportion by using the first adjustment weight and the second adjustment weight; wherein, the first adjustment weight and the second adjustment weight are positively correlated with the first adjustment proportion;
and taking the sum of the product of the adjustment step length and the first adjustment proportion and the first initial threshold value as the wake-up threshold value.
7. The method of claim 1, wherein before the determining, for each candidate wake-up word, whether to take the candidate wake-up word as the target wake-up word based on the magnitude relation between the wake-up excitation and the wake-up threshold corresponding to the candidate wake-up word, the method further comprises:
acquiring an actually measured volume average value of a user in a question answering environment;
determining a second adjustment ratio based on the measured volume average; wherein the second adjustment ratio is positively correlated with the measured volume average;
and adjusting the wake-up threshold by using the second adjustment proportion.
8. The method of claim 7, wherein there are a plurality of the preset questions, and each preset question corresponds to a wake-up word set; the acquiring the measured volume average value of the user in the question answering environment comprises:
before a user answers the preset questions, acquiring the measured volume average value;
the determining a second adjustment ratio based on the measured volume average value includes:
based on the measured volume average value, respectively determining a second adjustment proportion of each preset wake-up word in each wake-up word set;
the adjusting the wake-up threshold using the second adjustment ratio includes:
and for each preset wake-up word, adjusting a corresponding wake-up threshold value by using a second adjustment proportion of the preset wake-up word.
9. The method of claim 7, wherein there are a plurality of the preset questions, and each preset question corresponds to a wake-up word set; the acquiring the measured volume average value of the user in the question answering environment comprises:
for each preset question, computing the measured volume average value based on the preset questions already answered by the user;
the determining a second adjustment ratio based on the measured volume average value includes:
for each preset question, determining a second adjustment ratio of each preset wake-up word in the wake-up word set corresponding to the preset question, based on the measured volume average value computed for the preset question;
the adjusting the wake-up threshold using the second adjustment ratio includes:
for each preset question, adjusting the corresponding wake-up threshold by using the second adjustment ratio of each preset wake-up word in the corresponding wake-up word set.
10. The method of claim 1, wherein the set of wake words is created based on the score words of the preset answer, and the preset wake words include at least one of a first wake word and a second wake word;
the first wake-up word is obtained by synonymously expanding the score word based on a preset initial consonant, and the second wake-up word is obtained by performing dialect conversion on the first wake-up word based on a preset dialect.
11. The method of claim 10, wherein a wake threshold of the first wake word is higher than a wake threshold of the second wake word.
12. An answer scoring apparatus, comprising:
the wake-up detection module is used for carrying out wake-up word detection on the answer audio to obtain a detection result; the answer audio is collected when a user answers a preset question, the detection result comprises at least one target wake-up word, the at least one target wake-up word is from a wake-up word set, and the wake-up word set is obtained based on the preset answer of the preset question;
the answer scoring module is used for matching the detection result with the preset answer to obtain an answer score;
the wake-up detection module comprises a wake-up word detection sub-module and a target wake-up word acquisition sub-module, wherein the wake-up word detection sub-module is used for carrying out wake-up word detection on the answer audio to obtain wake-up excitation of at least one candidate wake-up word, the at least one candidate wake-up word is from a wake-up word set, the wake-up word set comprises a plurality of preset wake-up words and wake-up thresholds corresponding to the preset wake-up words, the wake-up thresholds are obtained by utilizing sample audio tests related to the preset wake-up words, and the wake-up excitation of the candidate wake-up words represents the possibility that the answer audio contains the corresponding candidate wake-up words; the target wake-up word acquisition submodule is used for determining whether to take the candidate wake-up word as the target wake-up word or not based on the magnitude relation between the wake-up excitation and the wake-up threshold corresponding to the candidate wake-up word for each candidate wake-up word.
13. An electronic device comprising a memory and a processor coupled to each other, the memory having program instructions stored therein, the processor being configured to execute the program instructions to implement the answer scoring method of any one of claims 1 to 11.
14. A computer readable storage medium, characterized in that it stores program instructions executable by a processor, the program instructions being used for implementing the answer scoring method of any one of claims 1 to 11.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110614234.5A CN113535913B (en) 2021-06-02 2021-06-02 Answer scoring method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110614234.5A CN113535913B (en) 2021-06-02 2021-06-02 Answer scoring method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113535913A CN113535913A (en) 2021-10-22
CN113535913B true CN113535913B (en) 2023-12-01

Family

ID=78095006

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110614234.5A Active CN113535913B (en) 2021-06-02 2021-06-02 Answer scoring method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113535913B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102999161A (en) * 2012-11-13 2013-03-27 安徽科大讯飞信息科技股份有限公司 Implementation method and application of voice awakening module
CN106782536A (en) * 2016-12-26 2017-05-31 北京云知声信息技术有限公司 A kind of voice awakening method and device
CN109815491A (en) * 2019-01-08 2019-05-28 平安科技(深圳)有限公司 Answer methods of marking, device, computer equipment and storage medium
CN111126553A (en) * 2019-12-25 2020-05-08 平安银行股份有限公司 Intelligent robot interviewing method, equipment, storage medium and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070010330A1 (en) * 2005-01-04 2007-01-11 Justin Cooper System and method forming interactive gaming over a TV network
TW202011384A (en) * 2018-09-13 2020-03-16 廣達電腦股份有限公司 Speech correction system and speech correction method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
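Improvement of the Lucene similarity retrieval algorithm applied to question answering systems; 白菊; 何聚厚; Computer Technology and Development (No. 11); 85-88 *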

Also Published As

Publication number Publication date
CN113535913A (en) 2021-10-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant