CN108536668B - Wake-up word evaluation method and device, storage medium and electronic equipment

Info

Publication number: CN108536668B
Application number: CN201810159653.2A
Authority: CN
Inventors: 吴国兵; 潘嘉; 王海坤
Original assignee: iFlytek Co Ltd
Current assignee: iFlytek Co Ltd
Priority date: 2018-02-26
Filing date: 2018-02-26
Publication date: 2022-06-07
Anticipated expiration: 2038-02-26
Also published as: CN108536668A

Abstract

The disclosure provides a wake-up word evaluation method and device, a storage medium, and an electronic device. The method comprises: acquiring a word to be evaluated that is input by a user; extracting evaluation features of the word to be evaluated, wherein the evaluation features represent the distinguishability of the word to be evaluated at an acoustic level and/or a semantic level; and taking the evaluation features of the word to be evaluated as input to a pre-constructed wake-up word evaluation model, and determining from the model's processing whether the word to be evaluated is suitable for use as a wake-up word. The scheme improves the accuracy of the wake-up word evaluation result and, in turn, the wake-up effect of the wake-up word set by the user.

Description

Wake-up word evaluation method and device, storage medium and electronic equipment

Technical Field

The present disclosure relates to the field of speech signal processing technologies, and in particular, to a method and an apparatus for assessing a wake-up word, a storage medium, and an electronic device.

Background

Voice wake-up technology is an important branch of speech signal processing and has important applications in smart homes, intelligent robots, in-vehicle systems, smartphones, and the like.

In practical application, the intelligent terminal captures voice data input by a user and uses a pre-constructed wake-up model to perform wake-up word recognition. If the voice data is recognized as the wake-up word, the wake-up succeeds; otherwise, the wake-up fails.

To improve the user experience, the user can set a personalized wake-up word as needed. To ensure the wake-up effect, a wake-up word evaluation is performed when the user sets the wake-up word, to judge whether the wake-up word set by the user is appropriate.

The current wake word evaluation is mainly implemented according to experience or rules. Specifically, a word to be evaluated set by a user can be acquired, whether the word to be evaluated meets a preset evaluation condition or not is judged, and if yes, the word to be evaluated is suitable to be used as a wake-up word. For example, the preset evaluation condition may include: the length of the word exceeds the preset length; and/or, the difference between syllables included in the words is greater than a preset difference. The length of the word can be embodied as the number of characters included in the word and/or the audio time length of the voice data corresponding to the word; the difference between syllables can be represented as whether the adjacent syllables are the same or not, and then the number of different adjacent syllables is counted and compared with the preset difference.
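The rule-based evaluation described above can be sketched as follows. This is only an illustrative sketch: the function name, the default thresholds, and the choice to require both conditions (the text allows "and/or") are assumptions, and the word length is taken as the syllable count.

```python
def passes_preset_rules(syllables, preset_length=3, preset_difference=2):
    """Rule-based check: the word passes only if both presets are exceeded.

    All names and default thresholds are hypothetical, chosen for illustration.
    """
    # Length of the word, taken here as its syllable count (an assumption;
    # character count or audio duration could be used instead).
    length_ok = len(syllables) > preset_length
    # Count the adjacent syllable pairs whose two syllables differ.
    differing_pairs = sum(1 for a, b in zip(syllables, syllables[1:]) if a != b)
    return length_ok and differing_pairs > preset_difference


# Hypothetical syllable sequences, not taken from the source.
varied_word = ["hel", "lo", "ro", "bot"]
repetitive_word = ["ba", "ba", "ba", "ba", "ba"]
varied_ok = passes_preset_rules(varied_word)
repetitive_ok = passes_preset_rules(repetitive_word)
```

The thresholds in such a check are chosen by hand, which is the source of the subjectivity discussed in the next paragraph.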

In wake-up word evaluation based on experience or rules, the rule settings are to some degree subjective, so the accuracy of the evaluation result is low, which in turn degrades the wake-up effect of the wake-up word set by the user.

Disclosure of Invention

The main purpose of the present disclosure is to provide a wake-up word evaluation method and apparatus, a storage medium, and an electronic device, which help improve the accuracy of the wake-up word evaluation result and thereby improve the wake-up effect of the wake-up word set by a user.

In order to achieve the above object, the present disclosure provides a wake word evaluation method, including:

acquiring a word to be evaluated input by a user;

extracting evaluation features of the words to be evaluated, wherein the evaluation features are used for expressing the distinguishability of the words to be evaluated at an acoustic level and/or a semantic level;

and taking the evaluation characteristics of the word to be evaluated as input, and determining whether the word to be evaluated is suitable to be used as a wake-up word after being processed by a pre-constructed wake-up word evaluation model.

Optionally, the evaluation feature for representing the distinctiveness of the term to be evaluated at an acoustic level includes a distribution feature of a phonetic unit, and then the extracting the evaluation feature of the term to be evaluated includes: analyzing the voice units included in the word to be evaluated, and counting at least one of the total number of the voice units, the number of different voice units, the occurrence frequency of each different voice unit, the number of specified voice units and the occurrence frequency of each specified voice unit as the distribution characteristic of the voice units;

and/or,

the evaluation feature for representing the distinctiveness of the term to be evaluated at the acoustic level includes a recognition probability of the term to be evaluated, and then the extracting the evaluation feature of the term to be evaluated includes: acquiring the recognition probability of a voice unit included in the word to be evaluated; taking the average value of the recognition probability of each voice unit as the recognition probability of the word to be evaluated, wherein the recognition probability comprises the accuracy and/or the false alarm rate;

and/or,

the evaluation features used for representing the distinguishability of the word to be evaluated at the acoustic level comprise the duration of the word to be evaluated, and then the extracting the evaluation features of the word to be evaluated comprises the following steps: acquiring the time length of a voice unit included in the word to be evaluated; taking the sum of the time lengths of all the voice units as the time length of the word to be evaluated;

and/or,

the evaluation feature for representing the distinguishability of the word to be evaluated at the acoustic level comprises a tone feature of the word to be evaluated, and the extracting the evaluation feature of the word to be evaluated comprises the following steps: acquiring the tone of the single words included in the word to be evaluated, and calculating the tone variance between the adjacent single words; performing mathematical operation by using the tone variance between the adjacent single words to obtain the tone characteristics of the word to be evaluated;

and/or,

the evaluation features for representing the distinctiveness of the word to be evaluated at the semantic level include scores of a language model, and then the extracting the evaluation features of the word to be evaluated includes: taking the word to be evaluated as input, and outputting a score of the word to be evaluated after the word to be evaluated is processed by a pre-established language model, wherein the score is used for representing the occurrence frequency of the word to be evaluated;

and/or,

the evaluation features used for representing the distinctiveness of the to-be-evaluated word at the semantic level include part-of-speech features of the to-be-evaluated word, and then the extracting the evaluation features of the to-be-evaluated word includes: acquiring the part of speech of a word included in the word to be evaluated; counting the number of different parts of speech and the occurrence frequency of each different part of speech as the part of speech characteristics of the word to be evaluated;

and/or,

the evaluation features used for representing the distinguishability of the word to be evaluated at the semantic level include smoothness features of the word to be evaluated, and then the extracting the evaluation features of the word to be evaluated includes: calculating the forward semantic smoothness and the reverse semantic smoothness of the word to be evaluated by using the word included in the word to be evaluated; and performing mathematical operation by using the forward semantic smoothness and the reverse semantic smoothness to obtain the smoothness characteristics of the words to be evaluated.

Optionally, when it is determined that the word to be evaluated is not suitable as a wake word, the method further includes:

extracting problem features of the words to be evaluated;

and determining the problem type of the word to be evaluated according to the problem characteristics, wherein the problem type is used for representing the reason why the word to be evaluated is not suitable for being used as a wake-up word.

Optionally, the problem feature includes a score of a language model, and the determining the problem type of the word to be evaluated includes: taking the word to be evaluated as input, and outputting a score of the word to be evaluated after the word to be evaluated is processed by a pre-established language model, wherein the score is used for representing the occurrence frequency of the word to be evaluated; when the score of the word to be evaluated exceeds a preset score, judging that the problem type of the word to be evaluated is a high-frequency word;

and/or,

the problem feature includes a duration of a word to be evaluated, and the determining of the problem type of the word to be evaluated includes: acquiring the time length of a voice unit included in the word to be evaluated; taking the sum of the time lengths of all the voice units as the time length of the word to be evaluated; when the duration of the word to be evaluated is less than the preset duration, judging that the problem type of the word to be evaluated is too short duration;

and/or,

the problem feature includes a soft-phoneme feature of the word to be evaluated, and the determining the problem type of the word to be evaluated includes: counting the number of soft phonemes included in the word to be evaluated; and when the number of soft phonemes exceeds a preset number, judging that the problem type of the word to be evaluated is excessive soft phonemes.

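Optionally, when it is determined that the word to be evaluated is not suitable as a wake-up word, the method further includes:

obtaining replaceable words corresponding to the words to be evaluated according to a pre-constructed semantic similar word knowledge graph;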

extracting evaluation features of the replaceable words, wherein the evaluation features are used for representing the distinguishability of the replaceable words at an acoustic level and/or a semantic level;

taking the evaluation characteristics of the replaceable words as input, and determining whether the replaceable words are suitable to be used as wake-up words after the wake-up word evaluation model processes the evaluation characteristics;

recommending the replaceable word to the user if the replaceable word is suitable as a wake-up word.

The present disclosure provides a wake-up word evaluation device, the device comprising:

the evaluation word acquisition module is used for acquiring the word to be evaluated input by the user;

the evaluation feature extraction module is used for extracting evaluation features of the words to be evaluated, and the evaluation features are used for representing the distinguishability of the words to be evaluated at an acoustic level and/or a semantic level;

and the awakening word determining module is used for taking the evaluation characteristics of the word to be evaluated as input, processing the input by a pre-established awakening word evaluation model, and determining whether the word to be evaluated is suitable to be used as the awakening word.

Optionally, the evaluation feature extraction module is configured to analyze the phonetic units included in the word to be evaluated, and count at least one of a total number of the phonetic units, a number of different phonetic units, a number of times of occurrence of each different phonetic unit, a number of designated phonetic units, and a number of times of occurrence of each designated phonetic unit as the distribution feature of the phonetic units;

and/or,

the evaluation feature extraction module is used for obtaining the recognition probability of the voice unit included by the word to be evaluated; taking the average value of the recognition probability of each voice unit as the recognition probability of the word to be evaluated, wherein the recognition probability comprises the accuracy and/or the false alarm rate;

and/or,

the evaluation feature extraction module is used for obtaining the time length of a voice unit included in the word to be evaluated; taking the sum of the time lengths of all the voice units as the time length of the word to be evaluated;

and/or,

if the evaluation feature used for representing the distinguishability of the word to be evaluated at the acoustic level comprises the tone feature of the word to be evaluated, the evaluation feature extraction module is used for obtaining the tone of the single words included in the word to be evaluated and calculating the tone variance between the adjacent single words; and performing mathematical operation by using the tone variance between the adjacent single words to obtain the tone feature of the word to be evaluated;

and/or,

the evaluation feature extraction module is used for taking the word to be evaluated as input, outputting the score of the word to be evaluated after the word to be evaluated is processed by the pre-constructed language model, and the score is used for representing the occurrence frequency of the word to be evaluated;

and/or,

the evaluation feature extraction module is used for acquiring the part of speech of a word included in the word to be evaluated; counting the number of different parts of speech and the occurrence frequency of each different part of speech as the part of speech characteristics of the word to be evaluated;

and/or,

the evaluation feature extraction module is used for calculating the forward semantic smoothness and the reverse semantic smoothness of the words to be evaluated by utilizing the words included by the words to be evaluated; and performing mathematical operation by using the forward semantic smoothness and the reverse semantic smoothness to obtain the smoothness characteristics of the words to be evaluated.

Optionally, the apparatus further comprises:

the problem feature extraction module is used for extracting the problem features of the words to be evaluated when the words to be evaluated are determined not to be suitable for being used as the awakening words;

and the problem type determining module is used for determining the problem type of the word to be evaluated according to the problem characteristics, wherein the problem type is used for representing the reason why the word to be evaluated is not suitable for being used as the awakening word.

Optionally, the problem feature includes a score of a language model, and the problem type determining module is configured to take the term to be evaluated as an input, process the term to be evaluated through a pre-established language model, and output the score of the term to be evaluated, where the score is used to indicate a frequency of occurrence of the term to be evaluated; when the score of the word to be evaluated exceeds a preset score, judging that the problem type of the word to be evaluated is a high-frequency word;

and/or,

if the problem feature comprises the duration of the word to be evaluated, the problem type determining module is used for acquiring the duration of each voice unit included in the word to be evaluated; taking the sum of the durations of all the voice units as the duration of the word to be evaluated; and when the duration of the word to be evaluated is less than the preset duration, judging that the problem type of the word to be evaluated is too short a duration;

and/or,

the problem type determining module is used for counting the number of soft phonemes included in the word to be evaluated; and when the number of soft phonemes exceeds the preset number, judging that the problem type of the word to be evaluated is excessive soft phonemes.

Optionally, the apparatus further comprises:

the replaceable word obtaining module is used for obtaining replaceable words corresponding to the words to be evaluated according to a pre-constructed semantic similar word knowledge graph when the words to be evaluated are determined not to be suitable as the awakening words;

the evaluation feature extraction module is used for extracting evaluation features of the replaceable words, and the evaluation features are used for representing the distinguishability of the replaceable words at an acoustic level and/or a semantic level;

the awakening word determining module is used for taking the evaluation characteristics of the replaceable words as input, and determining whether the replaceable words are suitable for being used as awakening words after the awakening word evaluation model processes the evaluation characteristics;

and the replaceable word recommending module is used for recommending the replaceable words to the user when the replaceable words are suitable to be used as the awakening words.

The present disclosure provides a storage medium having stored therein a plurality of instructions, the instructions being loaded by a processor, for performing the steps of the above-described wake-up word evaluation method.

The present disclosure provides an electronic device, comprising:

the storage medium described above; and

a processor to execute the instructions in the storage medium.

In this scheme, wake-up word evaluation is performed based on the evaluation features of the word to be evaluated. Because the evaluation features objectively reflect the distinguishability of the word to be evaluated at an acoustic level and/or a semantic level, the scheme of the present disclosure, compared with the prior art in which wake-up word evaluation relies on subjectively set rules, helps improve the accuracy of the evaluation result and thereby the wake-up effect of the wake-up word set by a user.

Additional features and advantages of the disclosure will be set forth in the detailed description which follows.

Drawings

The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:

FIG. 1 is a schematic flow chart of a wake word evaluation method according to the present disclosure;

FIG. 2 is a schematic diagram of a wake word evaluation apparatus according to the present disclosure;

FIG. 3 is a schematic structural diagram of an electronic device for wake word evaluation according to the present disclosure.

Detailed Description

The following detailed description of specific embodiments of the present disclosure is provided in connection with the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present disclosure, are given by way of illustration and explanation only, not limitation.

Referring to FIG. 1, a flow chart of the wake-up word evaluation method of the present disclosure is shown. The method may include the following steps:

S101, acquiring a word to be evaluated input by a user.

In this scheme, a user can set a word to be evaluated, intended for use as a wake-up word, according to the user's own needs. The scheme does not specifically limit the composition of the word to be evaluated: it may use a single language or mix several languages, for example 'hello news flight', 'hello iflytek', 'hello news flight', and the like, as set by the user according to requirements.

As an example, a user may input a word to be evaluated in a voice manner, and in response to this, the word to be evaluated input by the user may be acquired through a microphone; or, the user may input the word to be evaluated in a text manner, and accordingly, the word to be evaluated input by the user may be acquired through input and output devices such as a keyboard. The specific manner of obtaining the word to be evaluated may not be limited by the disclosed scheme.

In the practical application process, the evaluation process of the scheme can be realized by an intelligent device with a voice awakening function, and then the word to be evaluated is determined as the awakening word corresponding to the intelligent device according to the evaluation result; or, the evaluation process of the scheme of the present disclosure may be implemented by other dedicated devices, and then the word to be evaluated is configured to the corresponding intelligent device according to the evaluation result, so as to wake up the corresponding intelligent device. The subject of the evaluation process may not be particularly limited in the present disclosure.

S102, extracting the evaluation features of the words to be evaluated, wherein the evaluation features are used for representing the distinguishability of the words to be evaluated at an acoustic level and/or a semantic level.

After the words to be evaluated input by the user are obtained, evaluation features representing the distinctiveness of the words to be evaluated on an acoustic level and/or a semantic level can be extracted for processing and using by the awakening word evaluation model.

As an example, the evaluation feature for representing the distinctiveness of the word to be evaluated at the acoustic level may include at least one of the following features: the distribution characteristics of the voice units, the recognition probability of the words to be evaluated, the time length of the words to be evaluated and the tone characteristics of the words to be evaluated.

As an example, the evaluation feature for representing the distinctiveness of the word to be evaluated at the semantic level may include at least one of the following features: the score of the language model, the part of speech characteristics of the word to be evaluated and the smoothness characteristics of the word to be evaluated.

For the meaning of each feature and the specific extraction process, see the description below; the details are not repeated here.

S103, the evaluation characteristics of the word to be evaluated are used as input, and whether the word to be evaluated is suitable to be used as a wake-up word is determined after the evaluation characteristics are processed by a pre-constructed wake-up word evaluation model.

After the evaluation features are extracted from the words to be evaluated, model processing can be performed by utilizing a pre-constructed awakening word evaluation model, and whether the words to be evaluated are suitable to be used as awakening words is determined.

As an example, the output of the wakeup word evaluation model may include 2 output nodes, which respectively represent that the word to be evaluated is suitable as the wakeup word and the word to be evaluated is not suitable as the wakeup word; or, the output of the awakening word evaluation model may include 1 output node, which is used to represent the evaluation score of the word to be evaluated, and if the evaluation score is smaller than a preset value, it is determined that the word to be evaluated is not suitable for being used as the awakening word; otherwise, judging that the word to be evaluated is suitable to be used as the awakening word. The output form of the awakening word evaluation model in the scheme of the disclosure may not be specifically limited.
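The two output forms described above can be sketched as follows. The function names and the default preset value are hypothetical, and deciding the two-node case by the larger node value is an assumption, since the text does not state how the two nodes are compared.

```python
def decide_from_two_nodes(suitable_value, unsuitable_value):
    """Two output nodes: pick the class whose node value is larger.

    Taking the larger value is an assumption made for this sketch.
    """
    return suitable_value >= unsuitable_value


def decide_from_score(evaluation_score, preset_value=0.5):
    """One output node: unsuitable if the score is below the preset value,
    suitable otherwise. The default preset value is hypothetical.
    """
    return not (evaluation_score < preset_value)


two_node_result = decide_from_two_nodes(0.8, 0.2)
single_node_result = decide_from_score(0.3)
```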

In summary, after the word to be evaluated is obtained, the awakening word evaluation can be performed according to the distinctiveness of the word to be evaluated at the acoustic level and/or the semantic level. Generally, the better the distinctiveness of the word to be evaluated, the better the awakening effect when it is used as an awakening word. Compared with the prior art that awakening words are evaluated through subjectively set rules, the scheme disclosed by the invention has higher objectivity, is beneficial to improving the accuracy of evaluation results, and further improves the awakening effect of the awakening words set by a user.

As an example, if the awakening word evaluation model determines that the word to be evaluated currently input by the user is not suitable as the awakening word, the present disclosure further provides the following preferred schemes to improve the success rate of setting the awakening word by the user.

According to the first preferred scheme, problem features of the words to be evaluated can be extracted; and determining the problem type of the word to be evaluated according to the problem characteristics, namely analyzing the reason why the word to be evaluated is not suitable to be used as the awakening word.

As an example, the problem feature of the word to be evaluated may be embodied as a score of the language model. Correspondingly, the word to be evaluated can be taken as input and processed by the pre-constructed language model, which outputs the score of the word to be evaluated. The score represents the occurrence frequency of the word to be evaluated; generally, the higher the score, the higher the occurrence frequency. The score of the word to be evaluated is then compared with a preset score. When the score exceeds the preset score, the word to be evaluated occurs frequently and is likely to appear in daily conversation, which increases the possibility that the intelligent device is woken by mistake. The problem type of the word to be evaluated is therefore judged to be a high-frequency word; that is, the word is unsuitable as a wake-up word because it is a high-frequency word.

For example, the language model may calculate the score of the term to be evaluated by:

word segmentation processing can be carried out on the word to be evaluated to obtain a word sequence { w₁，w₂，…，w_k，…，w_fIn which w_kThe kth word representing the word to be evaluated; then, the probability P (w) that f words appear in order of the word sequence is calculated₁，w₂，…，w_f) The frequency of occurrence of the word to be evaluated, i.e., the score of the word to be evaluated, is used.

In the scheme of the disclosure, the word to be evaluated is preferably utilized from w₁To w_fProbability of direction P (w)₁，w₂，…，w_f) The score representing the word to be evaluated may be embodied as the following formula:

wherein, P (w)_k|w_k-1) May be obtained by general corpus statistics.
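As a rough illustration of such a language model score, the sketch below assumes a first-order (bigram) factorization, in which the probability of the word sequence is the probability of the first word multiplied by the conditional probability of each following word given its predecessor. The toy corpus, the function names, and the maximum-likelihood count estimates are all assumptions made for this sketch.

```python
from collections import Counter


def count_ngrams(corpus):
    """Collect unigram and bigram counts from a list of segmented sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    return unigrams, bigrams


def language_model_score(words, unigrams, bigrams):
    """Score = P(w1) * product of P(wk | wk-1), estimated from raw counts."""
    total = sum(unigrams.values())
    if not words or total == 0:
        return 0.0
    score = unigrams[words[0]] / total
    for prev, cur in zip(words, words[1:]):
        if unigrams[prev] == 0:
            return 0.0  # unseen history: no estimate available in this sketch
        # Simple count ratio; a real model would normally add smoothing.
        score *= bigrams[(prev, cur)] / unigrams[prev]
    return score


def is_high_frequency(score, preset_score):
    """The problem type is 'high-frequency word' when the score exceeds the preset."""
    return score > preset_score


# Hypothetical toy corpus of already segmented sentences.
toy_corpus = [["hello", "robot"], ["hello", "world"], ["small", "robot"]]
uni, bi = count_ngrams(toy_corpus)
toy_score = language_model_score(["hello", "robot"], uni, bi)
```

A production language model would be trained on a large general corpus and smoothed so that unseen word pairs do not collapse the score to zero.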

As an example, the problem feature of the word to be evaluated may be embodied as the duration of the word to be evaluated. Correspondingly, the duration of each voice unit included in the word to be evaluated can be obtained, and the sum of the durations of all the voice units is taken as the duration of the word to be evaluated. The duration of the word to be evaluated is then compared with a preset duration. When it is less than the preset duration, the word to be evaluated is very short, and in practical application the intelligent device may have difficulty capturing it and extracting from it the useful information needed to wake the device. The problem type of the word to be evaluated is therefore judged to be too short a duration; that is, the word is unsuitable as a wake-up word because its duration is too short.

For example, the duration of the word to be evaluated may be calculated by:

first, the duration of each voice unit can be obtained through statistics. Specifically, for each voice unit, the pronunciation durations of that voice unit for a plurality of speakers can be collected in advance, and the mean of these pronunciation durations is determined as the duration of the voice unit. Then, the voice units included in the word to be evaluated can be analyzed, and the sum of the durations of those voice units is determined as the duration of the word to be evaluated. For example, the voice unit may be embodied as a phoneme, a syllable, etc., which is not specifically limited in the present disclosure.
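The duration calculation can be sketched as follows. The sample durations, the unit labels, and the function names are hypothetical and serve only to illustrate averaging over speakers and summing over voice units.

```python
from statistics import mean


def unit_durations(speaker_samples):
    """Map each voice unit to the mean pronunciation duration over speakers."""
    return {unit: mean(samples) for unit, samples in speaker_samples.items()}


def word_duration(units, durations):
    """Duration of the word = sum of the durations of its voice units."""
    return sum(durations[unit] for unit in units)


def is_too_short(duration, preset_duration):
    """The problem type is 'too short a duration' below the preset duration."""
    return duration < preset_duration


# Hypothetical per-speaker durations in seconds for two voice units.
samples = {"n": [0.05, 0.07], "i": [0.10, 0.14]}
durations = unit_durations(samples)
total = word_duration(["n", "i"], durations)
```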

As an example, the problem feature of the word to be evaluated may be embodied as a soft-phoneme feature of the word to be evaluated. Correspondingly, the number of soft phonemes included in the word to be evaluated can be counted and compared with a preset number. When the number of soft phonemes exceeds the preset number, the word to be evaluated includes many soft phonemes, which have poor distinctiveness and may lower the wake-up success rate of the intelligent device. The problem type of the word to be evaluated can therefore be judged to be excessive soft phonemes; that is, the word is unsuitable as a wake-up word because it includes too many soft phonemes. For example, in the word to be evaluated "bodhi", the first character includes the soft phoneme p and the second character includes the soft phoneme t.

It is understood that the preset number in the present disclosure may be a preset fixed number; or, the variable value may be calculated according to the total number of phonemes included in the word to be evaluated and a preset fixed ratio, which is not specifically limited in the present disclosure.
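The two ways of setting the preset number, a fixed number or a value derived from the total phoneme count and a fixed ratio, can be sketched as follows. The set of soft phonemes, the phoneme sequence, and the parameter names are hypothetical.

```python
def has_excessive_soft_phonemes(phonemes, soft_set, fixed_number=None, fixed_ratio=None):
    """Compare the soft-phoneme count with a fixed or ratio-based preset number."""
    soft_count = sum(1 for p in phonemes if p in soft_set)
    if fixed_number is not None:
        preset_number = fixed_number
    elif fixed_ratio is not None:
        # Variable preset: total number of phonemes times a fixed ratio.
        preset_number = fixed_ratio * len(phonemes)
    else:
        raise ValueError("either fixed_number or fixed_ratio must be given")
    return soft_count > preset_number


# Hypothetical soft-phoneme set and phoneme sequence, for illustration only.
soft_set = {"p", "t"}
phonemes = ["p", "u", "t", "i"]
by_fixed_number = has_excessive_soft_phonemes(phonemes, soft_set, fixed_number=1)
by_fixed_ratio = has_excessive_soft_phonemes(phonemes, soft_set, fixed_ratio=0.5)
```

With the ratio-based preset, lengthening a word with non-soft phonemes raises the preset number without raising the soft-phoneme count.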

In practical application, the word to be evaluated may be unsuitable as a wake-up word for a single reason or for multiple reasons, which is not specifically limited in the present disclosure.

In the second preferred scheme, a wake-up word can be recommended to the user based on the word to be evaluated input by the user, while keeping the semantics the same or as similar as possible.

Specifically, replaceable words corresponding to the word to be evaluated can be obtained according to a pre-constructed semantic similar word knowledge graph. Then, following the scheme shown in FIG. 1, whether a replaceable word is suitable as a wake-up word may be determined as follows: extracting evaluation features of the replaceable word, wherein the evaluation features represent the distinguishability of the replaceable word at an acoustic level and/or a semantic level; and taking the evaluation features of the replaceable word as input, and determining whether the replaceable word is suitable as a wake-up word after processing by the wake-up word evaluation model. If the replaceable word is suitable as a wake-up word, it may be recommended to the user.
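The recommendation flow can be sketched as follows. The knowledge graph is modeled here as a plain dictionary, and the feature extractor and evaluation model are toy stand-ins invented for this sketch; none of these names or rules come from the disclosure.

```python
def recommend_replacements(word, similar_words, extract_features, is_suitable):
    """Return the replaceable words that the evaluation model accepts."""
    recommended = []
    for candidate in similar_words.get(word, []):
        features = extract_features(candidate)
        if is_suitable(features):
            recommended.append(candidate)
    return recommended


def toy_features(word):
    """Stand-in feature extractor: a single length feature (assumption)."""
    return {"num_chars": len(word)}


def toy_model(features):
    """Stand-in for the wake-up word evaluation model (assumption)."""
    return features["num_chars"] >= 8


# Hypothetical graph of semantically similar words.
toy_graph = {"robot": ["little robot", "bot"]}
suggestions = recommend_replacements("robot", toy_graph, toy_features, toy_model)
```

Passing the feature extractor and the model in as parameters mirrors the text: replaceable words go through the same evaluation path as the original word.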

As an example, replaceable words may also be determined for the word to be evaluated in combination with the problem type of the word to be evaluated. For example, if the problem type of the word to be evaluated "robot" is a high-frequency word, it can be recommended that the word be modified to the replaceable word "robot in the form of a small man" so as to reduce the score of the language model; if the problem type of the word to be evaluated "starting up" is too short a duration, it can be recommended that the word be modified to a longer replaceable word so as to increase the pronunciation duration; if the problem type of the word to be evaluated "bodhi" is excessive soft sounds, it can be recommended that the word be modified to the replaceable word "hello bodhi" so as to reduce the number of soft sounds.

In summary, the user can learn why the word to be evaluated is not suitable as a wake-up word and then modify it in a targeted manner. In addition, to improve the success rate of the user's modification, wake-up words can also be recommended for the user to select and confirm. This improves both the success rate of setting a wake-up word and the user experience.

The evaluation features in the present disclosure are explained below.

1. Evaluation feature representing the distinctiveness of a word to be evaluated at the acoustic level

(1) Distribution characteristics of phonetic units

As an example, the phonetic units included in the word to be evaluated can be analyzed, and the distribution characteristics of the phonetic units can be counted. For example, the distribution of phonetic units may be characterized by at least one of the following: the total number of phonetic units, the number of different phonetic units, the number of occurrences of each different phonetic unit, the number of designated phonetic units, the number of occurrences of each designated phonetic unit. The speech unit may be embodied as a phoneme, a syllable, and the like, which may not be specifically limited in this disclosure.

Generally, if the word to be evaluated contains too few phonetic units, for example only one or two, many pronunciations similar to the word to be evaluated may occur in daily conversation, so the pronunciation distinctiveness of the word to be evaluated is low and the smart device is more likely to be triggered by mistake. In addition, if the word to be evaluated contains more phonetic units but all of them are the same, for example the word to be evaluated is "kayine", the pronunciation distinctiveness of such a word with a single repeated pronunciation is also low, and false triggering easily occurs. In view of the above, the present disclosure may extract the total number of phonetic units, the number of different phonetic units, and the number of times each different phonetic unit appears as the evaluation features of the word to be evaluated.

Taking a syllable as the speech unit, for example, the word to be evaluated "ding dong ding dong" can be divided into 4 speech units: "ding", "dong", "ding" and "dong". In this example, the total number of speech units is 4; the number of different speech units is 2, namely "ding" and "dong"; the number of occurrences of the speech unit "ding" is 2, and the number of occurrences of the speech unit "dong" is 2.
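
The syllable statistics in this example can be reproduced with a few lines of Python. This is an illustrative sketch; the function name `unit_distribution` and the dictionary layout are assumptions, and the segmentation into units is taken as given.

```python
from collections import Counter

def unit_distribution(units):
    """Total units, number of different units, and occurrences of each unit."""
    counts = Counter(units)
    return {
        "total": len(units),
        "distinct": len(counts),
        "occurrences": dict(counts),
    }

# The four syllable units from the example in the text.
stats = unit_distribution(["ding", "dong", "ding", "dong"])
```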

Taking a phoneme as the speech unit, for example, and following the bag-of-words idea, considering that Chinese and English have 80 phonemes in total, the distribution feature of the speech units may be set as an 80-dimensional vector, in which each dimension represents a phoneme and the value of each dimension represents the number of times that phoneme appears in the word to be evaluated.
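
A bag-of-phonemes vector of this kind might be built as below. The sketch uses a shortened, hypothetical 4-phoneme inventory in place of the 80-phoneme inventory mentioned in the text, and the phoneme sequence is an assumed segmentation chosen only for illustration.

```python
def phoneme_count_vector(phonemes, inventory):
    """One dimension per phoneme in the inventory; each value is an occurrence count."""
    index = {p: i for i, p in enumerate(inventory)}
    vector = [0] * len(inventory)
    for p in phonemes:
        vector[index[p]] += 1  # raises KeyError for a phoneme outside the inventory
    return vector

# Hypothetical shortened inventory; the text describes 80 dimensions.
inventory = ["d", "i", "ng", "o"]
vec = phoneme_count_vector(["d", "i", "ng", "d", "o", "ng"], inventory)
```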

In addition, it should be noted that, in order to improve the acoustic distinctiveness of the word to be evaluated, some designated speech units may be predetermined in the scheme of the present disclosure. The more designated speech units the word to be evaluated includes, the better its acoustic distinctiveness and the more suitable it is as a wake-up word. Accordingly, the scheme of the present disclosure may further extract the number of designated speech units and the number of occurrences of each designated speech unit as evaluation features of the word to be evaluated.

For example, a speech unit that has a large mouth opening, high loudness and clear pronunciation and is easy to capture may be determined as a designated speech unit, such as the compound vowels ua, iao, ian and iong in Chinese and the vowels ai and ao in English. The designated speech units may be set according to practical application requirements, which is not limited by the present disclosure.
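
Counting designated speech units can be sketched as follows. The designated set reuses the examples given in the text; interpreting "the number of designated speech units" as the number of different designated units present is an assumption, and the unit sequence is hypothetical.

```python
from collections import Counter

# Designated units taken from the examples in the text.
DESIGNATED = {"ua", "iao", "ian", "iong", "ai", "ao"}

def designated_unit_features(units, designated=DESIGNATED):
    """Count designated units: how many different ones appear, and how often each."""
    counts = Counter(u for u in units if u in designated)
    return {"distinct_designated": len(counts), "occurrences": dict(counts)}

# Hypothetical unit sequence for illustration.
feats = designated_unit_features(["x", "iao", "ai", "x", "iao"])
```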

(2) Recognition probability of word to be evaluated

In the scheme of the present disclosure, the recognition probability of the word to be evaluated may be embodied as: the accuracy rate of the word to be evaluated and/or the false alarm rate of the word to be evaluated. Generally, the higher the accuracy rate and the lower the false alarm rate of the word to be evaluated, the better the acoustic distinction is, and the more suitable the word is as a wake-up word.

As an example, the recognition probability of the word to be evaluated may be obtained by an offline test. Taking the accuracy rate of the word to be evaluated as an example, N positive samples of the word to be evaluated can be collected in each of several different environments, the number M of correctly recognized samples among them is counted, and the accuracy rate in each environment is calculated as M/N; the average of the accuracy rates over all environments is then determined as the accuracy rate of the word to be evaluated. Taking the false alarm rate of the word to be evaluated as an example, the number of times the device is falsely woken up with the word to be evaluated as the wake-up word within a predetermined time period can be monitored in different environments, for example, 2 false wake-ups in 24 hours in a certain environment; the average of the false alarm rates over the different environments is then determined as the false alarm rate of the word to be evaluated.
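
The offline statistics described above amount to per-environment ratios followed by an average. The sketch below assumes the per-environment counts are already available; the sample counts and false wake-up counts are hypothetical.

```python
from statistics import mean

def word_accuracy(per_environment):
    """per_environment: list of (M correctly recognized, N positive samples)."""
    return mean(m / n for m, n in per_environment)

def word_false_alarm_rate(false_wakes_per_environment):
    """Average number of false wake-ups per monitoring period across environments."""
    return mean(false_wakes_per_environment)

# Hypothetical results from two environments.
acc = word_accuracy([(90, 100), (80, 100)])
far = word_false_alarm_rate([2, 4])  # false wake-ups per 24 hours
```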

As another example, the recognition probability of the word to be evaluated may be obtained based on the phonetic units included in the word to be evaluated. Specifically, the recognition probability of each phonetic unit included in the word to be evaluated can be obtained, and the average of the recognition probabilities of the phonetic units is taken as the recognition probability of the word to be evaluated. The accuracy rate and the false alarm rate of a phonetic unit can be obtained by offline statistics with reference to the above description, and are not described in detail here.
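
Averaging unit-level probabilities can be sketched as below. The per-unit accuracy table is hypothetical, and whether repeated units are averaged once or once per occurrence is not stated in the text; this sketch averages over every occurrence.

```python
from statistics import mean

def word_probability_from_units(units, unit_probability):
    """Average the recognition probability of each unit occurrence in the word."""
    return mean(unit_probability[u] for u in units)

# Hypothetical per-unit accuracy rates obtained offline.
unit_accuracy = {"ding": 0.9, "dong": 0.7}
word_acc = word_probability_from_units(["ding", "dong", "ding", "dong"], unit_accuracy)
```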

(3) Duration of word to be evaluated

Generally, the longer the duration of a word to be evaluated, the better its acoustic distinctiveness and the more suitable it is as a wake-up word. The process of obtaining the duration of the word to be evaluated is described in the above analysis of the problem types and is not repeated here.
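
As described for the apparatus and in the claims, the duration of the word is the sum of the durations of its speech units, and a word is flagged as too short when that sum is below a preset duration. A minimal sketch follows; the per-unit durations and the preset duration are hypothetical values in milliseconds.

```python
def word_duration(units, unit_duration):
    """Sum the durations of all speech units in the word."""
    return sum(unit_duration[u] for u in units)

def is_too_short(units, unit_duration, preset_duration):
    """True when the word's duration is less than the preset duration."""
    return word_duration(units, unit_duration) < preset_duration

# Hypothetical per-unit durations in milliseconds.
durations_ms = {"ding": 250, "dong": 300}
total_ms = word_duration(["ding", "dong"], durations_ms)
short = is_too_short(["ding", "dong"], durations_ms, preset_duration=800)
```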

(4) Tonal characteristics of words to be evaluated

As an example, the tone of each single word included in the word to be evaluated may be obtained, and the tone difference between adjacent single words may be calculated. For example, if the tones of two adjacent single words are the same, the tone difference is 0; otherwise, the tone difference is 1. The tone feature of the word to be evaluated is then calculated from the tone differences between adjacent single words; for example, the sum or the mean of the tone differences may be determined as the tone feature of the word to be evaluated.

For example, a pre-constructed tone classifier can be used to obtain the tone sequence {b₁, b₂, …, b_j, …, b_n} of the word to be evaluated, where b_j indicates the tone category corresponding to the j-th single word of the word to be evaluated. Taking Chinese as an example, the tone category of a single character can be embodied as the 4 common tones, and the identifiers "1", "2", "3" and "4" can be used to represent the different tones; alternatively, the tone category of a single word may be determined in combination with other languages, which is not specifically limited by the present disclosure.

Generally, the larger the tone feature value of the word to be evaluated, the more varied its pronunciation, the better its acoustic distinctiveness, and the more suitable it is as a wake-up word.
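
The tone feature can be sketched from a tone sequence as follows: a 0/1 comparison between each pair of adjacent single words, then the sum or the mean of those values. The tone sequence used here is hypothetical, and returning 0 for a word with fewer than two single words is an assumption.

```python
def adjacent_tone_differences(tones):
    """0 when two adjacent tones are the same, 1 otherwise."""
    return [0 if a == b else 1 for a, b in zip(tones, tones[1:])]

def tone_feature(tones, use_mean=False):
    """Sum (default) or mean of the adjacent 0/1 tone comparisons."""
    diffs = adjacent_tone_differences(tones)
    if not diffs:
        return 0  # assumption: a single-character word has no tone variation
    return sum(diffs) / len(diffs) if use_mean else sum(diffs)

# Hypothetical tone sequence b_1..b_4 using the identifiers "1" to "4".
tones = ["1", "1", "2", "4"]
feature_sum = tone_feature(tones)
feature_mean = tone_feature(tones, use_mean=True)
```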

2. Evaluation feature representing the distinctiveness of a word to be evaluated at the semantic level

(1) Score of language model

In general, the higher the language model score of the word to be evaluated, the higher the probability of false triggering and the less suitable the word is as a wake-up word. The process of obtaining the score of the word to be evaluated is described in the above analysis of the problem types and is not repeated here.

(2) Part-of-speech characteristics of a word to be evaluated

As an example, part of speech of a word included in a word to be evaluated may be obtained; and counting the number of different parts of speech and the occurrence frequency of each different part of speech to serve as the part of speech characteristics of the word to be evaluated. Generally, the richer part-of-speech characteristics contained in the word to be evaluated, the better the semantic distinction is, and the more suitable the word to be evaluated is as a wake-up word.

For example, the word to be evaluated may be segmented into words to obtain a part-of-speech sequence {q₁, q₂, …, q_k, …, q_f}, where q_k represents the part of speech of the k-th word of the word to be evaluated. As an example, for the following 11 parts of speech: nouns, verbs, adjectives, quantifiers, pronouns, adverbs, prepositions, conjunctions, auxiliary words, interjections and onomatopoeic words, the part-of-speech feature of the word to be evaluated can be set as an 11-dimensional vector, in which each dimension represents a part of speech and the value of each dimension represents the number of times that part of speech appears in the word to be evaluated.
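
The 11-dimensional part-of-speech vector can be sketched as below. The English tag labels are assumed renderings of the 11 categories listed in the text, the ordering of dimensions is arbitrary, and the part-of-speech sequence is hypothetical; word segmentation and tagging are taken as given.

```python
# Assumed English labels for the 11 parts of speech; the order is arbitrary.
POS_TAGS = [
    "noun", "verb", "adjective", "quantifier", "pronoun", "adverb",
    "preposition", "conjunction", "auxiliary", "interjection", "onomatopoeia",
]

def pos_feature_vector(pos_sequence, tags=POS_TAGS):
    """One dimension per part of speech; each value is an occurrence count."""
    vector = [0] * len(tags)
    for tag in pos_sequence:
        vector[tags.index(tag)] += 1
    return vector

# Hypothetical part-of-speech sequence q_1..q_3.
pos_vec = pos_feature_vector(["pronoun", "adjective", "noun"])
distinct_pos = sum(1 for count in pos_vec if count > 0)
```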

(3) Smoothness characteristics of words to be evaluated

As an example, forward semantic smoothness and reverse semantic smoothness of the word to be evaluated can be calculated by using the word included in the word to be evaluated; and performing mathematical operation by using the forward semantic smoothness and the reverse semantic smoothness to obtain the smoothness characteristics of the words to be evaluated.

The calculation of semantic smoothness is described in the above analysis of problem types and is not repeated here. The forward semantic smoothness can be embodied as the probability P(w₁, w₂, …, w_f) of the word to be evaluated in the direction from w₁ to w_f, and the reverse semantic smoothness can be embodied as the probability P(w_f, w_{f-1}, …, w₁) of the word to be evaluated in the direction from w_f to w₁.

For example, the mathematical operation performed on the forward semantic smoothness and the reverse semantic smoothness may be embodied as an absolute value of a difference between the forward semantic smoothness and the reverse semantic smoothness. Generally, the larger the smoothness characteristic value obtained based on the above, the more reasonable the forward direction is, the easier the word to be evaluated is to be expressed, and the more suitable the word is to be used as a wake-up word.

For example, the mathematical operation performed on the forward semantic smoothness and the reverse semantic smoothness may be expressed as a quotient of the forward semantic smoothness and the reverse semantic smoothness. Generally, the larger the smoothness characteristic value obtained based on the above, the more reasonable the forward direction is, the easier the word to be evaluated is to be expressed, and the more suitable the word is to be used as a wake-up word.
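
The two operations above can be written as small functions over the forward and reverse probabilities. How those probabilities are obtained is outside this sketch; the values used are hypothetical, and the quotient assumes a non-zero reverse probability.

```python
def smoothness_abs_difference(p_forward, p_reverse):
    """Absolute value of the difference between forward and reverse smoothness."""
    return abs(p_forward - p_reverse)

def smoothness_quotient(p_forward, p_reverse):
    """Quotient of forward and reverse smoothness; p_reverse must be non-zero."""
    return p_forward / p_reverse

# Hypothetical forward and reverse probabilities.
p_fwd, p_rev = 0.02, 0.005
diff_feature = smoothness_abs_difference(p_fwd, p_rev)
quotient_feature = smoothness_quotient(p_fwd, p_rev)
```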

As an example, a large number of sample wake-up words may be collected and used to train the wake-up word evaluation model of the present disclosure. The sample wake-up words may include positive sample wake-up words and negative sample wake-up words; the positive samples are pre-labeled as suitable as wake-up words, and the negative samples are pre-labeled as not suitable as wake-up words.

When performing model training, the topological structure of the wake-up word evaluation model may first be determined. For example, the topological structure may be embodied as a CNN (Convolutional Neural Network), an RNN (Recurrent Neural Network), a DNN (Deep Neural Network), and the like, which is not specifically limited in this disclosure. After the evaluation features are extracted from the sample wake-up words, the wake-up word evaluation model can be trained with the selected topological structure and the evaluation features of the sample wake-up words until the evaluation result output by the model is consistent with the evaluation result labeled for the sample wake-up words.
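
The training loop can be illustrated with a deliberately small stand-in. The sketch below uses a single-layer logistic model trained by stochastic gradient descent rather than the CNN, RNN or DNN topologies named above, and it stops once every prediction agrees with its label. The two-dimensional feature vectors, the labels, the learning rate and the epoch limit are all hypothetical.

```python
import math

def predict(weights, bias, features):
    """Probability that a word with these evaluation features is suitable."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def agrees(weights, bias, samples, labels):
    """True when every thresholded prediction matches its label."""
    return all((predict(weights, bias, x) >= 0.5) == bool(y)
               for x, y in zip(samples, labels))

def train(samples, labels, learning_rate=0.5, max_epochs=1000):
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_epochs):
        if agrees(weights, bias, samples, labels):
            break  # model output is consistent with the labeled results
        for x, y in zip(samples, labels):
            error = predict(weights, bias, x) - y
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias

# Hypothetical normalized evaluation features; 1 = suitable, 0 = not suitable.
samples = [[1.0, 0.8], [0.9, 0.6], [0.2, 0.1], [0.3, 0.0]]
labels = [1, 1, 0, 0]
weights, bias = train(samples, labels)
```

A real evaluation model would be trained on far more samples and features; the point here is only the stopping condition, which mirrors training until the model output matches the labels.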

Referring to fig. 2, a schematic diagram of the wake word evaluation apparatus of the present disclosure is shown. The apparatus may include:

a word to be evaluated obtaining module 201, configured to obtain a word to be evaluated input by a user;

an evaluation feature extraction module 202, configured to extract an evaluation feature of the term to be evaluated, where the evaluation feature is used to indicate a distinguishability of the term to be evaluated at an acoustic level and/or a semantic level;

and a wake-up word determining module 203, configured to take the evaluation features of the word to be evaluated as input, and determine whether the word to be evaluated is suitable as a wake-up word after processing by a pre-constructed wake-up word evaluation model.

and/or,

the evaluation feature extraction module is used for obtaining the recognition probability of the voice unit included in the word to be evaluated; taking the average value of the recognition probability of each voice unit as the recognition probability of the word to be evaluated, wherein the recognition probability comprises the accuracy and/or the false alarm rate;

and/or,

Optionally, the apparatus further comprises:

and/or,

the problem type determining module is used for acquiring the duration of each voice unit included in the word to be evaluated; taking the sum of the durations of all the voice units as the duration of the word to be evaluated; and when the duration of the word to be evaluated is less than a preset duration, judging that the problem type of the word to be evaluated is too short a duration;

and/or,

the problem type determining module is used for counting the number of soft phonemes included in the word to be evaluated; and when the number of soft phonemes exceeds a preset number, judging that the problem type of the word to be evaluated is excessive soft sounds.

Optionally, the apparatus further comprises:

the awakening word determining module is used for taking the evaluation characteristics of the replaceable words as input, and determining whether the replaceable words are suitable for being used as awakening words after the awakening word evaluation model processes the evaluation characteristics;

With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Referring to fig. 3, a schematic structural diagram of an electronic device 300 for wake word evaluation according to the present disclosure is shown. Referring to fig. 3, an electronic device 300 includes a processing component 301 that further includes one or more processors, and storage device resources, represented by storage media 302, for storing instructions, such as application programs, that are executable by the processing component 301. The application programs stored in the storage medium 302 may include one or more modules that each correspond to a set of instructions. Further, the processing component 301 is configured to execute instructions to perform the above-described wake word evaluation method.

Electronic device 300 may also include a power component 303 configured to perform power management of electronic device 300; a wired or wireless network interface 304 configured to connect the electronic device 300 to a network; and an input/output (I/O) interface 305. The electronic device 300 may operate based on an operating system stored on the storage medium 302, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.

The preferred embodiments of the present disclosure are described in detail with reference to the accompanying drawings, however, the present disclosure is not limited to the specific details of the above embodiments, and various simple modifications may be made to the technical solution of the present disclosure within the technical idea of the present disclosure, and these simple modifications all belong to the protection scope of the present disclosure.

It should be noted that, in the foregoing embodiments, various features described in the above embodiments may be combined in any suitable manner, and in order to avoid unnecessary repetition, various combinations that are possible in the present disclosure are not described again.

In addition, any combination of various embodiments of the present disclosure may be made, and the same should be considered as the disclosure of the present disclosure, as long as it does not depart from the spirit of the present disclosure.

Claims

1. A wake word evaluation method, the method comprising:

acquiring a word to be evaluated input by a user;

extracting the evaluation features of the words to be evaluated, wherein the evaluation features are used for representing the distinguishability of the words to be evaluated on an acoustic level and/or a semantic level, and the evaluation features used for representing the distinguishability of the words to be evaluated on the acoustic level comprise the distribution features of the voice units, the recognition probability of the words to be evaluated, the time length of the words to be evaluated and the tone features of the words to be evaluated;

and taking the evaluation characteristics of the word to be evaluated as input, and determining whether the word to be evaluated is suitable to be used as a wake-up word after the evaluation characteristics are processed by a pre-constructed wake-up word evaluation model.

2. The method of claim 1,

the evaluation feature for representing the distinctiveness of the word to be evaluated at an acoustic level includes a distribution feature of a speech unit, and then the extracting the evaluation feature of the word to be evaluated includes: analyzing the voice units included in the word to be evaluated, and counting at least one of the total number of the voice units, the number of different voice units, the occurrence frequency of each different voice unit, the number of specified voice units and the occurrence frequency of each specified voice unit as the distribution characteristic of the voice units;

and/or,

the evaluation feature for representing the distinctiveness of the term to be evaluated at the acoustic level includes a duration of the term to be evaluated, and then the extracting the evaluation feature of the term to be evaluated includes: acquiring the time length of a voice unit included in the word to be evaluated; taking the sum of the time lengths of all the voice units as the time length of the word to be evaluated;

and/or,

3. The method according to claim 1 or 2, wherein when it is determined that the word to be evaluated does not fit as a wake up word, the method further comprises:

extracting problem features of the words to be evaluated;

4. The method of claim 3,

the question features comprise scores of language models, and the determining of the question type of the word to be evaluated comprises: taking the word to be evaluated as input, and outputting a score of the word to be evaluated after the word to be evaluated is processed by a pre-constructed language model, wherein the score is used for expressing the occurrence frequency of the word to be evaluated; when the score of the word to be evaluated exceeds a preset score, judging that the problem type of the word to be evaluated is a high-frequency word;

and/or,

the question feature comprises a soft-sound feature of the word to be evaluated, and the determining of the question type of the word to be evaluated comprises: counting the number of soft phonemes included in the word to be evaluated; and when the number of soft phonemes exceeds a preset number, judging that the problem type of the word to be evaluated is excessive soft sounds.

5. The method according to claim 1 or 2, wherein when it is determined that the word to be evaluated does not fit as a wake up word, the method further comprises:

extracting evaluation features of the alternative words, wherein the evaluation features are used for representing the distinguishability of the alternative words at an acoustic level and/or a semantic level;

6. A wake word evaluation apparatus, the apparatus comprising:

the evaluation feature extraction module is used for extracting evaluation features of the words to be evaluated, wherein the evaluation features are used for representing the distinguishability of the words to be evaluated on an acoustic level and/or a semantic level, and the evaluation features used for representing the distinguishability of the words to be evaluated on the acoustic level comprise the distribution features of voice units, the recognition probability of the words to be evaluated, the time length of the words to be evaluated and the tone features of the words to be evaluated;

7. The apparatus of claim 6,

the evaluation feature extraction module is configured to analyze the voice units included in the word to be evaluated, and count at least one of the total number of the voice units, the number of different voice units, the number of times of occurrence of each different voice unit, the number of designated voice units, and the number of times of occurrence of each designated voice unit as the distribution feature of the voice units;

and/or,

and/or,

8. The apparatus of claim 6 or 7, further comprising:

the problem feature extraction module is used for extracting the problem features of the words to be evaluated when the words to be evaluated are determined to be unsuitable as the awakening words;

9. The apparatus of claim 8,

the problem type determining module is used for taking the word to be evaluated as input, outputting the score of the word to be evaluated after the word to be evaluated is processed by a pre-constructed language model, wherein the score is used for expressing the occurrence frequency of the word to be evaluated; when the score of the word to be evaluated exceeds a preset score, judging that the problem type of the word to be evaluated is a high-frequency word;

and/or,

and/or,

10. The apparatus of claim 6 or 7, further comprising:

the awakening word determining module is used for taking the evaluation characteristics of the replaceable words as input, and determining whether the replaceable words are suitable for being used as awakening words after the awakening word evaluation model processes the evaluation characteristics; and the replaceable word recommending module is used for recommending the replaceable words to the user when the replaceable words are suitable to be used as the awakening words.

11. A storage medium having stored thereon a plurality of instructions, wherein the instructions are loadable by a processor and adapted to cause execution of the steps of the method according to any of claims 1 to 5.

12. An electronic device, characterized in that the electronic device comprises:

the storage medium of claim 11; and

a processor to execute the instructions in the storage medium.