CN112634900A - Method and apparatus for detecting phonetics - Google Patents

Method and apparatus for detecting phonetics

Info

Publication number
CN112634900A
Authority
CN
China
Prior art keywords
pinyin
standard
acquiring
hot word
standard pinyin
Prior art date
Legal status
Pending
Application number
CN202110258035.5A
Other languages
Chinese (zh)
Inventor
邓玉龙
刘琼琼
丁文彪
刘子韬
Current Assignee
Beijing Century TAL Education Technology Co Ltd
Original Assignee
Beijing Century TAL Education Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Century TAL Education Technology Co Ltd
Priority to CN202110258035.5A
Publication of CN112634900A
Legal status: Pending

Classifications

    • G  PHYSICS
    • G10  MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L  SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00  Speech recognition
    • G10L 15/08  Speech classification or search
    • G10L 15/18  Speech classification or search using natural language modelling
    • G10L 15/1822  Parsing for meaning understanding
    • G10L 15/22  Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26  Speech to text systems
    • G10L 2015/225  Feedback of the input speech


Abstract

The invention relates to a speech-script detection method and a speech-script detection apparatus. The speech-script detection method comprises the following steps: acquiring at least one preset hot word; for each hot word, acquiring an extended pinyin set corresponding to the hot word, wherein the extended pinyin set comprises the full pinyin of the hot word and the fuzzy pinyins corresponding to the full pinyin; acquiring a first non-standard pinyin set corresponding to the extended pinyin set; acquiring a second non-standard pinyin set corresponding to the first non-standard pinyin set; acquiring a standard pinyin set; and acquiring, in the pinyin sequence of the speech text to be detected, the target pinyin that is the same as a pinyin in the standard pinyin set, and taking the hot word corresponding to the target pinyin as the speech-script detection result. The speech-script detection method can improve the accuracy of speech-script detection.

Description

Method and apparatus for detecting phonetics
Technical Field
The present disclosure relates to the field of speech recognition technologies, and in particular, to a speech-script detection method and a speech-script detection apparatus.
Background
Speech-script detection refers to detecting specific content in audio. The process takes an audio file and a related hot-word list as input and detects whether the text produced by automatic speech recognition contains forbidden words or specific hot words (such as names and nicknames), or contains praising, guiding or correcting content, and the like.
In the prior art, speech-script detection generally adopts simple keyword matching to detect the hot words in the audio, so the accuracy of speech-script detection is low.
Disclosure of Invention
In order to solve the above technical problem, or at least partially solve it, embodiments of the present invention provide a speech-script detection method and apparatus, which can improve the accuracy of speech-script detection.
In a first aspect, an embodiment of the present invention provides a speech-script detection method, including:
acquiring at least one preset hot word;
aiming at each hot word, acquiring an extended pinyin set corresponding to the hot word, wherein the extended pinyin set comprises a full pinyin of the hot word and a fuzzy pinyin corresponding to the full pinyin;
acquiring a first non-standard pinyin set corresponding to the extended pinyin set, wherein part of letters of the first non-standard pinyin in the first non-standard pinyin set and part of letters of the pinyin in the extended pinyin set meet a preset corresponding relationship;
acquiring a second non-standard pinyin set corresponding to the first non-standard pinyin set, wherein the edit distance between each second non-standard pinyin in the second non-standard pinyin set and the corresponding first non-standard pinyin is less than or equal to a preset threshold;
acquiring a standard pinyin set, wherein part of letters of the standard pinyin in the standard pinyin set and part of letters of the second non-standard pinyin in the second non-standard pinyin set meet the preset corresponding relationship;
and acquiring, in the pinyin sequence of the speech text to be detected, a target pinyin that is the same as a pinyin in the standard pinyin set, and taking the hot word corresponding to the target pinyin as a speech-script detection result.
Optionally, the obtaining of the extended pinyin set corresponding to the hotword includes:
acquiring a full spelling of the hotword;
acquiring fuzzy pinyin corresponding to the full pinyin according to the comparison relationship of the initial fuzzy sound, the comparison relationship of the final fuzzy sound and/or the comparison relationship of the letter combination fuzzy sound in the fuzzy sound comparison table;
and determining the extended pinyin set according to the full pinyin and the fuzzy pinyin corresponding to the full pinyin.
Optionally, the obtaining a first non-standard pinyin set corresponding to the extended pinyin set includes:
and acquiring the first non-standard pinyin corresponding to the full pinyin and the first non-standard pinyin corresponding to the fuzzy pinyin according to the letter comparison relationship and/or the letter combination comparison relationship in the non-standard pinyin comparison table.
Optionally, the obtaining a standard pinyin set includes:
and acquiring the standard pinyin corresponding to the second non-standard pinyin in the second non-standard pinyin set according to the letter comparison relationship and/or the letter combination comparison relationship in the non-standard pinyin comparison table.
Optionally, the acquiring, in the pinyin sequence of the speech text to be detected, of a target pinyin that is the same as a pinyin in the standard pinyin set includes:
deleting, according to the correspondence between standard pinyins and Chinese characters, the invalid standard pinyins in the standard pinyin set that have no corresponding Chinese character, and obtaining an effective standard pinyin set;
and traversing the pinyin sequence of the speech text to be detected according to the effective standard pinyins in the effective standard pinyin set, and obtaining a target pinyin in the pinyin sequence that is the same as an effective standard pinyin in the effective standard pinyin set.
Optionally, the speech-script detection method further comprises:
obtaining a classification result of the voice recognition text;
the taking of the hot word corresponding to the target pinyin as a speech-script detection result comprises:
taking the hot word corresponding to the target pinyin and the classification result corresponding to the speech text to be detected as the speech-script detection result.
Optionally, the obtaining a classification result of the speech recognition text includes:
replacing the at least one hot word in the speech recognition text with a uniform identifier;
and acquiring the classification result according to the replaced voice recognition text.
Optionally, before the acquiring, in the pinyin sequence of the speech text to be detected, of the target pinyin that is the same as a pinyin in the standard pinyin set, the method further includes:
if only one Chinese character is included before the first punctuation mark in the speech recognition text, correcting the Chinese character before the first punctuation mark into two identical Chinese characters;
if the speech recognition text comprises an English letter, correcting the English letter into a Chinese character with the same pronunciation as the English letter;
and acquiring the speech text to be detected according to the corrected speech recognition text.
Optionally, before the replacing of the at least one hot word in the speech recognition text with a uniform identifier, the method further comprises:
carrying out standardized processing on the audio to be detected;
and acquiring the voice recognition text according to the standardized audio to be detected.
In a second aspect, an embodiment of the present invention provides a speech-script detection apparatus, including:
the hot word acquisition module is used for acquiring at least one preset hot word;
the pinyin expansion module is used for acquiring an expanded pinyin set corresponding to each hot word, wherein the expanded pinyin set comprises a full pinyin of the hot word and a fuzzy pinyin corresponding to the full pinyin;
the first non-standard module is used for acquiring a first non-standard pinyin set corresponding to the extended pinyin set, wherein a part of letters of the first non-standard pinyin in the first non-standard pinyin set and a part of letters of the pinyin in the extended pinyin set meet a preset corresponding relationship;
the second non-standard module is used for acquiring a second non-standard pinyin set corresponding to the first non-standard pinyin set, wherein the edit distance between each second non-standard pinyin in the second non-standard pinyin set and the corresponding first non-standard pinyin is less than or equal to a preset threshold value;
the standardization module is used for acquiring a standard pinyin set, wherein part of letters of the standard pinyin in the standard pinyin set and part of letters of the second non-standard pinyin in the second non-standard pinyin set meet the preset corresponding relation;
and the detection module is used for acquiring, in the pinyin sequence of the speech text to be detected, a target pinyin that is the same as a pinyin in the standard pinyin set, and taking the hot word corresponding to the target pinyin as a speech-script detection result.
According to the technical scheme provided by the embodiment of the invention, at least one preset hot word is acquired; for each hot word, an extended pinyin set corresponding to the hot word is acquired, wherein the extended pinyin set comprises the full pinyin of the hot word and the fuzzy pinyins corresponding to the full pinyin; a first non-standard pinyin set corresponding to the extended pinyin set is acquired, wherein part of the letters of the first non-standard pinyins in the first non-standard pinyin set and part of the letters of the pinyins in the extended pinyin set meet a preset correspondence; a second non-standard pinyin set corresponding to the first non-standard pinyin set is acquired, wherein the edit distance between each second non-standard pinyin in the second non-standard pinyin set and the corresponding first non-standard pinyin is less than or equal to a preset threshold; a standard pinyin set is acquired, wherein part of the letters of the standard pinyins in the standard pinyin set and part of the letters of the second non-standard pinyins in the second non-standard pinyin set meet the preset correspondence; and a target pinyin that is the same as a pinyin in the standard pinyin set is acquired in the pinyin sequence of the speech text to be detected, and the hot word corresponding to the target pinyin is taken as the speech-script detection result. In this way, the non-standard pinyins corresponding to the full pinyin of the hot word are expanded and the number of standard pinyins corresponding to the full pinyin of the hot word is increased, so the accuracy of the speech-script detection result is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments or technical solutions in the prior art of the present disclosure, the drawings used in the description of the embodiments or prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is a schematic flow chart of a speech-script detection method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of another speech-script detection method according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of another speech-script detection method according to an embodiment of the present invention;
FIG. 4 is a schematic flow chart of another speech-script detection method according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a classification model according to an embodiment of the present invention;
FIG. 6 is a schematic flow chart of another speech-script detection method according to an embodiment of the present invention;
FIG. 7 is a schematic flow chart of another speech-script detection method according to an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of a speech-script detection apparatus according to an embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.
Fig. 1 is a schematic flow chart of a speech-script detection method according to an embodiment of the present invention. As shown in Fig. 1, the method specifically includes:
s110, at least one preset hot word is obtained.
Specifically, at least one hot word is obtained according to actual needs, and all the hot words form a hot word set. For example: in the teaching process, the names of the students can be used as hot words, and the names of all the students in the whole class form a hot word set.
S120, aiming at each hot word, obtaining an extended pinyin set corresponding to the hot word, wherein the extended pinyin set comprises the full pinyin of the hot word and the fuzzy pinyin corresponding to the full pinyin.
Specifically, the speech-script detection process mainly comprises the following steps: first, converting the audio into a speech recognition text; second, acquiring the speech text to be detected from the speech recognition text; and finally, detecting the hot words in the speech text to be detected. In the process of converting audio into the speech recognition text, because of the accent of the speaker, the fault tolerance of the speech recognition process, and the like, part of the characters in the speech recognition text may not correspond to part of the information in the audio. Therefore, for each hot word in the hot word set, the full pinyin of the hot word is expanded into fuzzy pinyins, and an extended pinyin set corresponding to the hot word is formed from the full pinyin and the corresponding fuzzy pinyins, which reduces the influence of errors introduced when converting the audio into the speech recognition text on the speech-script detection result.
Illustratively, the obtained hot word is "bent", the full pinyin corresponding to "bent" is "wanwan", and the fuzzy pinyins corresponding to the full pinyin are "wangwan", "wanwang" and "wangwang". The full pinyin "wanwan" and the corresponding fuzzy pinyins constitute the extended pinyin set [wanwan, wangwan, wanwang, wangwang] corresponding to the hot word "bent".
S130, a first non-standard pinyin set corresponding to the extended pinyin set is obtained.
And the partial letters of the first non-standard pinyin in the first non-standard pinyin set and the partial letters of the pinyins in the extended pinyin set meet a preset corresponding relationship.
Specifically, the standard pinyin scheme cannot well reflect the pronunciation similarity between Chinese characters. For example, the pronunciation of the full pinyin "wa" (the character meaning "dig") is similar to that of the full pinyin "hua"; under the standard pinyin scheme the edit distance between "wa" and "hua" is 2, whereas under the non-standard pinyin scheme "wa" becomes "ua" while "hua" remains "hua", and the edit distance between "ua" and "hua" is 1. The non-standard pinyin scheme can therefore describe the error pattern of the speech recognition text more accurately. The embodiment of the invention obtains the first non-standard pinyin set corresponding to the extended pinyin set of each hot word; compared with the pinyins in the extended pinyin set, the first non-standard pinyins in the first non-standard pinyin set describe the error pattern of the speech recognition text more accurately, which helps improve the accuracy of the speech-script detection result.
Illustratively, based on the above embodiment, the first non-standard pinyin set corresponding to the extended pinyin set [wanwan, wangwan, wanwang, wangwang] is [uanuan, uanguan, uanuang, uanguang].
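The following minimal sketch (not taken from the patent) only illustrates the edit-distance argument above: under standard pinyin, "wa" and "hua" are 2 edits apart, while the non-standard counterparts "ua" and "hua" are only 1 edit apart.

```python
# Hedged illustration: a plain Levenshtein distance, used only to show why the
# non-standard spelling brings similar-sounding syllables closer together.

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance computed with a one-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            cur = min(dp[j] + 1,          # delete ca
                      dp[j - 1] + 1,      # insert cb
                      prev + (ca != cb))  # substitute ca -> cb
            prev, dp[j] = dp[j], cur
    return dp[-1]

print(edit_distance("wa", "hua"))  # 2 under the standard pinyin scheme
print(edit_distance("ua", "hua"))  # 1 under the non-standard pinyin scheme
```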
S140, a second non-standard pinyin set corresponding to the first non-standard pinyin set is obtained.
And the editing distance between each second non-standard pinyin in the second non-standard pinyin set and the corresponding first non-standard pinyin is less than or equal to a preset threshold value.
Specifically, based on each first non-standard pinyin in the first non-standard pinyin set, the second non-standard pinyins whose edit distance from the first non-standard pinyin is equal to the preset threshold, and those whose edit distance is smaller than the preset threshold, are obtained. A second non-standard pinyin whose edit distance from the first non-standard pinyin is 0 is the first non-standard pinyin itself, so the second non-standard pinyin set obviously contains the first non-standard pinyins. Expanding the first non-standard pinyins into second non-standard pinyins therefore expands the non-standard pinyins corresponding to the full pinyin of the hot word, increases the number of standard pinyins corresponding to the full pinyin of the hot word, and improves the accuracy of the speech-script detection result.
Illustratively, based on the above embodiment, if the preset threshold is 1, the second non-standard pinyins whose edit distance from the first non-standard pinyin "uan" is 1 include "auan", "buan", "cuan", "an", and so on.
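A hedged sketch of this expansion step follows; it enumerates every string within edit distance 1 of a first non-standard pinyin. The threshold of 1 and the a-z alphabet are illustrative assumptions, not values fixed by the patent.

```python
# Generate all strings at Levenshtein distance <= 1 from a first non-standard
# pinyin (deletions, substitutions and insertions over the lowercase alphabet).

import string

def expand_within_distance_one(pinyin: str) -> set:
    alphabet = string.ascii_lowercase
    candidates = {pinyin}  # distance 0: the first non-standard pinyin itself
    for i in range(len(pinyin)):
        candidates.add(pinyin[:i] + pinyin[i + 1:])           # delete one letter
        for ch in alphabet:
            candidates.add(pinyin[:i] + ch + pinyin[i + 1:])  # substitute one letter
    for i in range(len(pinyin) + 1):
        for ch in alphabet:
            candidates.add(pinyin[:i] + ch + pinyin[i:])      # insert one letter
    return candidates

second_set = expand_within_distance_one("uan")
print(sorted(second_set)[:5])   # ['aan', 'an', 'auan', 'ban', 'buan'] ...
```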
S150, obtaining a standard pinyin set.
And the part of letters of the standard pinyin in the standard pinyin set and the part of letters of the second non-standard pinyin in the second non-standard pinyin set meet the preset corresponding relationship.
Specifically, the subsequent detection needs to be matched against the pinyin sequence of the speech text to be detected. Since the pinyin sequence of the speech text to be detected consists of standard pinyins, the second non-standard pinyins need to be converted back into standard pinyins, that is, the standard pinyin set corresponding to each hot word is obtained, so as to ensure that the speech-script detection proceeds smoothly. It should be noted that the conversion relationship between the second non-standard pinyins in the second non-standard pinyin set and the standard pinyins in the standard pinyin set is the same as the conversion relationship between the pinyins in the extended pinyin set and the first non-standard pinyins in the first non-standard pinyin set in step S130.
S160, acquiring, in the pinyin sequence of the speech text to be detected, the target pinyin that is the same as a pinyin in the standard pinyin set, and taking the hot word corresponding to the target pinyin as the speech-script detection result.
Specifically, the pinyin sequence of the speech text to be detected is obtained, and the pinyin sequence is matched against the standard pinyins in the standard pinyin set. If the pinyin sequence of the speech text to be detected contains a pinyin that is the same as a standard pinyin in the standard pinyin set, that pinyin is the target pinyin, and the hot word corresponding to the target pinyin is taken as the speech-script detection result. If the pinyin sequence of the speech text to be detected contains no pinyin that is the same as any standard pinyin in the standard pinyin set, the speech text to be detected does not contain the hot word.
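A minimal sketch of this matching step is shown below. It assumes the third-party pypinyin package for producing the pinyin sequence and treats a hot-word pinyin as a run of consecutive syllables; both choices, and the demo hot word, are illustrative, since the patent does not prescribe a particular library or data layout.

```python
# Assumption: pypinyin (pip install pypinyin) supplies the pinyin sequence of
# the speech text to be detected; hot words and their standard pinyin sets are
# stored as tuples of syllables for illustration only.

from pypinyin import lazy_pinyin

def find_hotwords(text: str, standard_sets: dict) -> list:
    """standard_sets maps each hot word to a set of standard pinyins,
    each stored as a tuple of syllables, e.g. ('wan', 'wang')."""
    sequence = lazy_pinyin(text)   # pinyin sequence of the text to detect
    hits = []
    for hotword, pinyins in standard_sets.items():
        for target in pinyins:
            n = len(target)
            for i in range(len(sequence) - n + 1):
                if tuple(sequence[i:i + n]) == target:   # target pinyin found
                    hits.append(hotword)
                    break
            else:
                continue
            break
    return hits

# Illustrative hot word written as the characters 弯弯 ("wanwan").
print(find_hotwords("弯弯今天表现很好", {"弯弯": {("wan", "wan"), ("wan", "wang")}}))
```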
According to the technical scheme provided by the embodiment of the invention, at least one preset hot word is acquired; for each hot word, an extended pinyin set corresponding to the hot word is acquired, wherein the extended pinyin set comprises the full pinyin of the hot word and the fuzzy pinyins corresponding to the full pinyin; a first non-standard pinyin set corresponding to the extended pinyin set is acquired, wherein part of the letters of the first non-standard pinyins in the first non-standard pinyin set and part of the letters of the pinyins in the extended pinyin set meet a preset correspondence; a second non-standard pinyin set corresponding to the first non-standard pinyin set is acquired, wherein the edit distance between each second non-standard pinyin in the second non-standard pinyin set and the corresponding first non-standard pinyin is less than or equal to a preset threshold; a standard pinyin set is acquired, wherein part of the letters of the standard pinyins in the standard pinyin set and part of the letters of the second non-standard pinyins in the second non-standard pinyin set meet the preset correspondence; and a target pinyin that is the same as a pinyin in the standard pinyin set is acquired in the pinyin sequence of the speech text to be detected, and the hot word corresponding to the target pinyin is taken as the speech-script detection result. In this way, the non-standard pinyins corresponding to the full pinyin of the hot word are expanded and the number of standard pinyins corresponding to the full pinyin of the hot word is increased, so the accuracy of the speech-script detection result is improved.
Optionally, Fig. 2 is a schematic flow chart of another speech-script detection method according to an embodiment of the present invention. As shown in Fig. 2, executing S120 shown in Fig. 1 includes:
and S121, acquiring the full spelling of the hotword.
Specifically, the full spelling of each hot word in the hot word set is obtained according to the Chinese pinyin dictionary.
And S122, acquiring the fuzzy pinyin corresponding to the full pinyin according to the comparison relationship of the initial fuzzy sound, the comparison relationship of the final fuzzy sound and/or the comparison relationship of the letter combination fuzzy sound in the fuzzy sound comparison table.
Specifically, Table 1 is a fuzzy-sound comparison table, which includes the initial (consonant) fuzzy-sound pairs, the final (vowel) fuzzy-sound pairs and the letter-combination fuzzy-sound pairs; the full pinyin of the hot word is expanded according to the initial fuzzy-sound pairs, the final fuzzy-sound pairs and/or the letter-combination fuzzy-sound pairs.
TABLE 1 Fuzzy-sound comparison table
Initial (consonant) fuzzy-sound pairs: s↔sh, c↔ch, z↔zh, l↔n, f↔h, r↔l
Final (vowel) fuzzy-sound pairs: an↔ang, en↔eng, in↔ing, ian↔iang, uan↔uang
Letter-combination fuzzy-sound pairs: fa↔hua, fan↔huan, fang↔huang, fei↔hui, fen↔hun, feng↔hong, fo↔huo, fu↔hu
Illustratively, based on the above embodiment, for the full pinyin "wanwan" of the hot word "bent", the corresponding fuzzy pinyins "wanwang", "wangwan" and "wangwang" can be obtained according to the final fuzzy-sound pairs in Table 1. For the full pinyin "xiaozheng" of the hot word "Xiaozheng", the corresponding fuzzy pinyin "xiaozeng" can be obtained according to the initial fuzzy-sound pairs in Table 1. These examples only show obtaining the fuzzy pinyin of a full pinyin according to the initial fuzzy-sound pairs or the final fuzzy-sound pairs of the fuzzy-sound comparison table; in other embodiments, the fuzzy pinyin may be obtained according to one or more of the initial fuzzy-sound pairs, the final fuzzy-sound pairs and the letter-combination fuzzy-sound pairs.
S123, determining the extended pinyin set according to the full pinyin and the fuzzy pinyin corresponding to the full pinyin.
Specifically, based on the above embodiment, the fuzzy pinyins corresponding to the full pinyin of the hot word are obtained, and the full pinyin of each hot word together with its corresponding fuzzy pinyins is determined as the extended pinyin set corresponding to the hot word. The extended pinyin set thus comprises the full pinyin of the hot word and the fuzzy pinyins corresponding to the full pinyin; that is, the pinyin corresponding to the hot word is expanded, the words into which the hot word may be mis-recognized in the speech recognition text can be covered, errors in the speech recognition text can be corrected, and the accuracy of the speech-script detection result is improved.
Illustratively, the fuzzy pinyins corresponding to the full pinyin "wanwan" of the hot word "bent" are "wanwang", "wangwan" and "wangwang", and the extended pinyin set corresponding to the hot word "bent" is [wanwan, wanwang, wangwan, wangwang].
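The sketch below illustrates S121-S123 under stated assumptions: only the an↔ang final pair from Table 1 is wired in, and the extended set is built as the Cartesian product of per-syllable variants; a full implementation would carry every row of Table 1.

```python
# Hedged sketch of the fuzzy-pinyin expansion: per-syllable variants from a
# (tiny, illustrative) fuzzy-sound table are combined into the extended set.

from itertools import product

FINAL_FUZZY = {"an": "ang", "ang": "an"}   # subset of Table 1 for illustration

def syllable_variants(syllable: str) -> set:
    variants = {syllable}
    for final, fuzzy in FINAL_FUZZY.items():
        if syllable.endswith(final):
            variants.add(syllable[: -len(final)] + fuzzy)
    return variants

def extended_pinyin_set(syllables) -> set:
    # Cartesian product of the per-syllable variants gives the extended set.
    return {"".join(combo) for combo in product(*map(syllable_variants, syllables))}

print(sorted(extended_pinyin_set(["wan", "wan"])))
# ['wanwan', 'wanwang', 'wangwan', 'wangwang']
```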
Optionally, with continuing reference to fig. 2, when performing step S130 shown in fig. 1, the method includes:
S131, obtaining, according to the letter pairs and/or letter-combination pairs in the non-standard pinyin comparison table, the first non-standard pinyins corresponding to the full pinyin and the first non-standard pinyins corresponding to the fuzzy pinyins.
Specifically, part of the standard pinyins are converted into the corresponding non-standard pinyins based on the actual pronunciation of the characters. Table 2 is the non-standard pinyin comparison table, which includes letter pairs and letter-combination pairs. The non-standard pinyin is closer to the actual pronunciation of the characters, which helps describe the error pattern of the speech recognition text more accurately and thus helps improve the accuracy of the speech-script detection result.
TABLE 2 Non-standard pinyin comparison table
u↔wu        ua↔wa       uo↔wo
uai↔wai     ui↔wei      uan↔wan
uang↔wang   un↔weng     ueng↔weng
i↔yi        ia↔ya       ie↔ye
iao↔yao     iu↔you      ian↔yan
iang↔yang   in↔yin      ing↔ying
iong↔yong   ü↔yu        üe↔yue
üan↔yuan    ün↔yun
Illustratively, based on the above-described embodiment, for the extended pinyin set [wanwan, wanwang, wangwan, wangwang], according to the letter-combination pairs in Table 2, the first non-standard pinyin corresponding to the full pinyin "wanwan" of the hot word "bent" is "uanuan", the first non-standard pinyin corresponding to the fuzzy pinyin "wanwang" is "uanuang", the first non-standard pinyin corresponding to the fuzzy pinyin "wangwan" is "uanguan", and the first non-standard pinyin corresponding to the fuzzy pinyin "wangwang" is "uanguang", so the first non-standard pinyin set [uanuan, uanuang, uanguan, uanguang] is obtained.
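A minimal sketch of S131 (and, through the inverse lookup, of S151) follows. The table excerpt and the syllable-level representation are assumptions for illustration; the patent only specifies that the letter pairs and letter-combination pairs of Table 2 drive the conversion.

```python
# Hedged sketch: map each syllable through an excerpt of Table 2 to obtain the
# first non-standard pinyin; the inverse dictionary is what S151 would use to
# convert second non-standard pinyins back to standard pinyins.

TO_NON_STANDARD = {"wan": "uan", "wang": "uang", "wu": "u", "wa": "ua"}  # excerpt of Table 2
TO_STANDARD = {v: k for k, v in TO_NON_STANDARD.items()}                # inverse direction

def convert(syllables, table) -> str:
    """Map every syllable through the comparison table; unknown syllables pass through."""
    return "".join(table.get(s, s) for s in syllables)

extended = [["wan", "wan"], ["wan", "wang"], ["wang", "wan"], ["wang", "wang"]]
first_non_standard = [convert(s, TO_NON_STANDARD) for s in extended]
print(first_non_standard)   # ['uanuan', 'uanuang', 'uanguan', 'uanguang']
```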
Optionally, with continuing reference to fig. 2, when performing S150 as shown in fig. 1, the method includes:
S151, obtaining, according to the letter pairs and/or letter-combination pairs in the non-standard pinyin comparison table, the standard pinyin corresponding to each second non-standard pinyin in the second non-standard pinyin set.
Specifically, according to the letter comparison relationship and/or the letter combination comparison relationship in table 2, the second non-standard pinyin in the second non-standard pinyin set is converted into the corresponding standard pinyin, so as to ensure that the second non-standard pinyin can be matched with the pinyin sequence of the voice text to be detected.
Fig. 3 is a schematic flow chart of another speech-script detection method according to an embodiment of the present invention. As shown in Fig. 3, executing S160 includes:
S161, deleting, according to the correspondence between standard pinyins and Chinese characters, the invalid standard pinyins in the standard pinyin set that have no corresponding Chinese character, and obtaining an effective standard pinyin set.
Specifically, the pinyin dictionary gives the correspondence between Chinese characters and standard pinyins, and a standard pinyin that has no corresponding Chinese character is defined as an invalid standard pinyin. The invalid standard pinyins in the standard pinyin set are deleted to obtain the effective standard pinyin set corresponding to the hot word.
S162, traversing the pinyin sequence of the speech text to be detected according to the effective standard pinyins in the effective standard pinyin set, and acquiring the target pinyin in the pinyin sequence that is the same as an effective standard pinyin in the effective standard pinyin set.
Specifically, the pinyin sequence of the speech text to be detected is obtained according to the pinyin dictionary, and the pinyin sequence is traversed according to the effective standard pinyins in the effective standard pinyin set. If a target pinyin that is the same as an effective standard pinyin in the effective standard pinyin set exists in the pinyin sequence of the speech text to be detected, the hot word exists in the speech text to be detected, and the hot word corresponding to the target pinyin is taken as the speech-script detection result. Since an invalid standard pinyin does not exist in the pinyin dictionary, it cannot appear in the pinyin sequence of the speech text to be detected; the embodiment of the invention therefore eliminates the invalid standard pinyins among the standard pinyins corresponding to the hot word, improves the effectiveness of the standard pinyins corresponding to the hot word, and further improves the efficiency of the speech-script detection.
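A hedged sketch of the filtering in S161 is given below; the small VALID_SYLLABLES set stands in for the pinyin dictionary the patent refers to, and a candidate is kept only when every syllable has at least one corresponding Chinese character.

```python
# Hedged sketch: drop standard pinyin candidates containing any syllable that
# the (illustrative) pinyin dictionary does not know, keeping the effective set.

VALID_SYLLABLES = {"wan", "wang", "ban", "man", "an"}   # illustrative subset of the dictionary

def is_valid(pinyin_syllables) -> bool:
    return all(s in VALID_SYLLABLES for s in pinyin_syllables)

candidates = [("wan", "wan"), ("wang", "wan"), ("wvn", "wan"), ("ban", "wang")]
effective = [c for c in candidates if is_valid(c)]
print(effective)   # [('wan', 'wan'), ('wang', 'wan'), ('ban', 'wang')]
```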
Optionally, Fig. 4 is a schematic flow chart of another speech-script detection method provided in an embodiment of the present invention. As shown in Fig. 4, the method specifically includes:
s210, obtaining a classification result of the voice recognition text.
Specifically, the above embodiments can detect hot words in audio, and in some embodiments it is also desirable to detect the semantics expressed by the audio. For example, in the teaching process the audio may be "Xiaohua's composition is very well written"; on the basis of detecting the hot word "Xiaohua", it is also necessary to detect whether the semantics expressed by the sentence are praise or criticism of "Xiaohua". The speech recognition text is input into the classification model, and the semantic classification result of the speech recognition text is obtained.
Fig. 5 is a schematic structural diagram of a classification model according to an embodiment of the present invention. As shown in Fig. 5, the classification model includes a pre-training model 10 and a classifier 20. The speech recognition text is input into the pre-training model, which may be, for example, a Bidirectional Encoder Representations from Transformers (BERT) model. The BERT model predicts its output text from the speech recognition text, and the output of the BERT model is input to a classifier, which may be, for example, a softmax classifier. The classifier classifies the output of the BERT model according to the semantics, thereby realizing the semantic classification of the speech recognition text.
It should be noted that, in the embodiment of the present invention, it is only exemplarily shown that the pre-training model is a BERT model, and the classifier is a softmax classifier, and in practical application, the types of the pre-training model and the classifier may be flexibly selected according to actual requirements.
Before the classification model is used for semantic classification of the speech recognition text, it needs to be trained. Specifically, a plurality of speech recognition text training samples are input into the pre-training model, and the trained pre-training model is obtained after multiple rounds of training on a given task converge. Then a plurality of labeled speech recognition text training samples and their classification labels are input into the classification model, and the trained classification model is obtained after multiple rounds of training converge.
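A hedged sketch of such a classification model is shown below. The Hugging Face transformers package and the bert-base-chinese checkpoint are assumptions for illustration; the patent only specifies a BERT-style pre-training model followed by a softmax classifier, and the classification head here would still need the task-specific fine-tuning described above.

```python
# Hedged sketch of Fig. 5: BERT encoder + classification head + softmax.
# Checkpoint name, label count and the input sentence are illustrative only.

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)
model.eval()

text = "小明特别聪明"                      # an illustrative speech recognition text
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits        # BERT output fed to the classification head
probs = torch.softmax(logits, dim=-1)      # softmax over the semantic classes
print(probs)                               # e.g. praise vs. criticism probabilities
```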
S220, taking the hot word corresponding to the target pinyin and the classification result corresponding to the speech recognition text as the speech-script detection result.
Specifically, in the above embodiment the category of the speech recognition text is obtained through the classification model, and the corresponding speech text to be detected can be obtained from the speech recognition text, so that the target pinyin that is the same as a pinyin in the standard pinyin set can be acquired in the pinyin sequence of the speech text to be detected. The classification result of the speech recognition text and the hot word detected in the speech text to be detected corresponding to the speech recognition text are taken together as the final speech-script detection result.
Exemplarily, the speech recognition text is "Nini is smart". It is input into the classification model to obtain the classification result "praise"; the hot word detected in the corresponding speech text to be detected is "Nini", and "Nini + praise" is taken as the final speech-script detection result.
According to the embodiment of the invention, by obtaining the classification result of the speech recognition text and taking the hot word corresponding to the target pinyin together with the classification result corresponding to the speech text to be detected as the speech-script detection result, both the hot-word detection result of the audio and the classification result of the audio can be obtained.
Optionally, Fig. 6 is a schematic flow chart of another speech-script detection method provided in an embodiment of the present invention. As shown in Fig. 6, executing S210 shown in Fig. 4 specifically includes:
S213, replacing the at least one hot word in the speech recognition text with a uniform identifier.
Specifically, according to the hot words detected in the speech text to be detected corresponding to the speech recognition text, all the hot words in the speech recognition text are replaced with the uniform identifier, so that the replaced speech recognition text no longer includes the hot words.
S214, obtaining the classification result according to the replaced voice recognition text.
Specifically, the speech recognition text without the hot words is input into the classification model, and the classification result is obtained. Because the hot words, which are irrelevant to the semantic classification, are removed from the input of the classification model, i.e., the interfering words in the classification process are eliminated, the efficiency of the speech-script detection can be improved.
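A minimal sketch of the replacement in S213 follows; the "[HOTWORD]" token and the demo text are assumed placeholders, since the patent does not fix the form of the uniform identifier.

```python
# Hedged sketch: replace every detected hot word with one uniform identifier
# before the text is passed to the classification model.

def mask_hotwords(text: str, hotwords, token: str = "[HOTWORD]") -> str:
    for word in hotwords:
        text = text.replace(word, token)
    return text

print(mask_hotwords("小明特别聪明", ["小明"]))   # [HOTWORD]特别聪明
```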
Optionally, Fig. 7 is a schematic flow chart of another speech-script detection method according to an embodiment of the present invention. As shown in Fig. 7, before executing S160, the method further includes:
S310, if only one Chinese character is included before the first punctuation mark in the speech recognition text, correcting the Chinese character before the first punctuation mark into two identical Chinese characters.
Specifically, in the process of converting audio into the speech recognition text, characters at the beginning of a sentence are easily swallowed (dropped) during recognition. If only one Chinese character appears before the first punctuation mark in the speech recognition text, a character before the first punctuation mark is considered to have been swallowed. For this situation, the single Chinese character is duplicated before the first punctuation mark, so that two identical Chinese characters precede it. This improves the recall of swallowed characters, corrects errors in the speech recognition text, and helps improve the accuracy of the speech-script detection.
For example, suppose the audio content is "Nini, you did great": the beginning of the sentence is a person's name, and in normal speech there is a pause between the name and the following content, which easily causes swallowing, so the obtained speech recognition text is "Ni, you did great". For this situation, the Chinese character "Ni" before the first punctuation mark is corrected into the two identical characters "Nini", i.e., the corrected speech recognition text is "Nini, you did great"; the swallowed character is thus recalled and the error in the speech recognition text is corrected.
S320, if the speech recognition text comprises an English letter, correcting the English letter into a Chinese character with the same pronunciation as the English letter.
Specifically, in the process of recognizing Chinese speech, English letters may appear in the speech recognition text, which indicates that a speech recognition error has occurred. For this situation, the English letters in the speech recognition text are replaced with Chinese characters that have the same pronunciation as the English letters, so that the error in the speech recognition text is corrected.
S330, acquiring the speech text to be detected according to the corrected speech recognition text.
Specifically, based on the above embodiment, the swallowed characters and the English letters in the speech recognition text can be corrected, so that a corrected speech recognition text is obtained, and this corrected text is the speech text to be detected. The speech text to be detected is therefore the error-corrected version of the speech recognition text and is closer to the content expressed by the audio, which improves the accuracy of the speech-script detection result.
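A hedged sketch of the two corrections in S310-S320 follows; the punctuation set and the letter-to-character table are illustrative assumptions, not values given by the patent.

```python
# Hedged sketch of S310-S320: double a lone character before the first
# punctuation mark, then map English letters to same-pronunciation characters.

import re

LETTER_TO_HANZI = {"A": "诶", "B": "比", "C": "西"}   # assumed homophone table
PUNCTUATION = "，。！？、"                              # assumed punctuation set

def correct_text(text: str) -> str:
    # S310: double a lone character preceding the first punctuation mark.
    match = re.search(f"[{PUNCTUATION}]", text)
    if match and match.start() == 1:
        text = text[0] * 2 + text[1:]
    # S320: replace English letters with same-pronunciation Chinese characters.
    for letter, hanzi in LETTER_TO_HANZI.items():
        text = text.replace(letter, hanzi)
    return text

print(correct_text("妮，你很棒"))   # 妮妮，你很棒  (illustrative input)
```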
Optionally, with continued reference to fig. 6, before performing S213, the method further includes:
S211, carrying out standardization processing on the audio to be detected.
Illustratively, the audio to be detected is converted into standardized audio to be detected, where the standardized audio is in a 16 kHz sample-rate, single-channel pulse code modulation (PCM) audio format. In practical applications, the standardized audio format may be selected flexibly, which is not specifically limited in the embodiment of the present invention.
S212, acquiring the voice recognition text according to the standardized audio to be detected.
Specifically, the standardized audio to be detected is input into an automatic speech recognition model, and the automatic speech recognition model outputs the corresponding speech recognition text according to the input standardized audio. The standardized audio preserves the content of the audio to be detected and, in addition, improves the efficiency of speech recognition.
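A hedged sketch of the normalization in S211 is shown below; invoking ffmpeg through subprocess is an implementation assumption, and any resampling tool that produces 16 kHz mono PCM would serve.

```python
# Hedged sketch: convert the audio to be detected into 16 kHz, single-channel,
# raw 16-bit PCM before it is fed to the automatic speech recognition model.

import subprocess

def normalize_audio(src: str, dst: str = "normalized.pcm") -> str:
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-ar", "16000",   # 16 kHz sample rate
         "-ac", "1",       # single channel
         "-f", "s16le",    # raw 16-bit little-endian PCM
         dst],
        check=True,
    )
    return dst

# normalize_audio("lesson_recording.mp3")  # hypothetical input file
```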
The embodiment of the invention further provides a speech-script detection apparatus, which is used for executing any speech-script detection method provided by the above embodiments and has the corresponding beneficial effects of the speech-script detection method.
Fig. 8 is a schematic structural diagram of a speech-script detection apparatus according to an embodiment of the present invention. As shown in Fig. 8, the speech-script detection apparatus includes:
the hotword obtaining module 110 is configured to obtain at least one preset hotword.
The pinyin expansion module 120 is configured to, for each hot word, obtain an expanded pinyin set corresponding to the hot word, where the expanded pinyin set includes a full pinyin of the hot word and a fuzzy pinyin corresponding to the full pinyin.
A first non-standardized module 130, configured to obtain a first non-standard pinyin set corresponding to the extended pinyin set, where a part of letters of the first non-standard pinyin in the first non-standard pinyin set and a part of letters of the pinyin in the extended pinyin set satisfy a preset correspondence.
A second non-standard module 140, configured to obtain a second non-standard pinyin set corresponding to the first non-standard pinyin set, where an edit distance between each second non-standard pinyin in the second non-standard pinyin set and the corresponding first non-standard pinyin is less than or equal to a preset threshold.
The normalizing module 150 is configured to obtain a standard pinyin set, where a part of letters of the standard pinyin in the standard pinyin set and a part of letters of the second non-standard pinyin in the second non-standard pinyin set satisfy the preset correspondence.
The detection module 160 is configured to acquire, in the pinyin sequence of the speech text to be detected, a target pinyin that is the same as a pinyin in the standard pinyin set, and to take the hot word corresponding to the target pinyin as the speech-script detection result.
In the technical solution provided by the embodiment of the present invention, the hot word acquisition module 110 acquires at least one preset hot word; the pinyin expansion module 120 acquires, for each hot word, an extended pinyin set corresponding to the hot word, wherein the extended pinyin set comprises the full pinyin of the hot word and the fuzzy pinyins corresponding to the full pinyin; the first non-standard module 130 acquires a first non-standard pinyin set corresponding to the extended pinyin set, wherein part of the letters of the first non-standard pinyins in the first non-standard pinyin set and part of the letters of the pinyins in the extended pinyin set meet a preset correspondence; the second non-standard module 140 acquires a second non-standard pinyin set corresponding to the first non-standard pinyin set, wherein the edit distance between each second non-standard pinyin in the second non-standard pinyin set and the corresponding first non-standard pinyin is less than or equal to a preset threshold; the standardization module 150 acquires a standard pinyin set, wherein part of the letters of the standard pinyins in the standard pinyin set and part of the letters of the second non-standard pinyins in the second non-standard pinyin set meet the preset correspondence; and the detection module 160 acquires, in the pinyin sequence of the speech text to be detected, the target pinyin that is the same as a pinyin in the standard pinyin set and takes the hot word corresponding to the target pinyin as the speech-script detection result. In this way, the non-standard pinyins corresponding to the full pinyin of the hot word are expanded and the number of standard pinyins corresponding to the full pinyin of the hot word is increased, so the accuracy of the speech-script detection result is improved.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A speech-script detection method, comprising:
acquiring at least one preset hot word;
aiming at each hot word, acquiring an extended pinyin set corresponding to the hot word, wherein the extended pinyin set comprises a full pinyin of the hot word and a fuzzy pinyin corresponding to the full pinyin;
acquiring a first non-standard pinyin set corresponding to the extended pinyin set, wherein part of letters of the first non-standard pinyin in the first non-standard pinyin set and part of letters of the pinyin in the extended pinyin set meet a preset corresponding relationship;
acquiring a second non-standard pinyin set corresponding to the first non-standard pinyin set, wherein the edit distance between each second non-standard pinyin in the second non-standard pinyin set and the corresponding first non-standard pinyin is less than or equal to a preset threshold;
acquiring a standard pinyin set, wherein part of letters of the standard pinyin in the standard pinyin set and part of letters of the second non-standard pinyin in the second non-standard pinyin set meet the preset corresponding relationship;
and acquiring, in the pinyin sequence of the speech text to be detected, a target pinyin that is the same as a pinyin in the standard pinyin set, and taking the hot word corresponding to the target pinyin as a speech-script detection result.
2. The speech-script detection method of claim 1, wherein the obtaining of the extended pinyin set corresponding to the hot word comprises:
acquiring a full spelling of the hotword;
acquiring fuzzy pinyin corresponding to the full pinyin according to the comparison relationship of the initial fuzzy sound, the comparison relationship of the final fuzzy sound and/or the comparison relationship of the letter combination fuzzy sound in the fuzzy sound comparison table;
and determining the extended pinyin set according to the full pinyin and the fuzzy pinyin corresponding to the full pinyin.
3. The speech-script detection method of claim 1 or 2, wherein the obtaining of the first non-standard pinyin set corresponding to the extended pinyin set includes:
and acquiring the first non-standard pinyin corresponding to the full pinyin and the first non-standard pinyin corresponding to the fuzzy pinyin according to the letter comparison relationship and/or the letter combination comparison relationship in the non-standard pinyin comparison table.
4. The speech-script detection method of claim 3, wherein the obtaining of the standard pinyin set comprises:
and acquiring the standard pinyin corresponding to the second non-standard pinyin in the second non-standard pinyin set according to the letter comparison relationship and/or the letter combination comparison relationship in the non-standard pinyin comparison table.
5. The speech-script detection method of claim 1, wherein the acquiring, in the pinyin sequence of the speech text to be detected, of the target pinyin that is the same as a pinyin in the standard pinyin set comprises:
deleting, according to the correspondence between standard pinyins and Chinese characters, the invalid standard pinyins in the standard pinyin set that have no corresponding Chinese character, and obtaining an effective standard pinyin set;
and traversing the pinyin sequence of the speech text to be detected according to the effective standard pinyins in the effective standard pinyin set, and obtaining the target pinyin in the pinyin sequence that is the same as an effective standard pinyin in the effective standard pinyin set.
6. The speech-script detection method of claim 1, further comprising:
obtaining a classification result of the voice recognition text;
wherein the taking of the hot word corresponding to the target pinyin as a speech-script detection result comprises:
taking the hot word corresponding to the target pinyin and the classification result corresponding to the speech text to be detected as the speech-script detection result.
7. The speech-script detection method of claim 6, wherein the obtaining of the classification result of the speech recognition text comprises:
replacing the at least one hot word in the speech recognition text with a uniform identifier;
and acquiring the classification result according to the replaced voice recognition text.
8. The speech-script detection method of claim 1, wherein before the acquiring, in the pinyin sequence of the speech text to be detected, of the target pinyin that is the same as a pinyin in the standard pinyin set, the method further comprises:
if only one Chinese character is included before the first punctuation mark in the speech recognition text, correcting the Chinese character before the first punctuation mark into two identical Chinese characters;
if the speech recognition text comprises an English letter, correcting the English letter into a Chinese character with the same pronunciation as the English letter;
and acquiring the speech text to be detected according to the corrected speech recognition text.
9. The speech-script detection method of claim 7, wherein before the replacing of the at least one hot word in the speech recognition text with the uniform identifier, the method further comprises:
carrying out standardized processing on the audio to be detected;
and acquiring the voice recognition text according to the standardized audio to be detected.
10. A speech-script detection apparatus, comprising:
the hot word acquisition module is used for acquiring at least one preset hot word;
the pinyin expansion module is used for acquiring an expanded pinyin set corresponding to each hot word, wherein the expanded pinyin set comprises a full pinyin of the hot word and a fuzzy pinyin corresponding to the full pinyin;
the first non-standard module is used for acquiring a first non-standard pinyin set corresponding to the extended pinyin set, wherein a part of letters of the first non-standard pinyin in the first non-standard pinyin set and a part of letters of the pinyin in the extended pinyin set meet a preset corresponding relationship;
the second non-standard module is used for acquiring a second non-standard pinyin set corresponding to the first non-standard pinyin set, wherein the edit distance between each second non-standard pinyin in the second non-standard pinyin set and the corresponding first non-standard pinyin is less than or equal to a preset threshold value;
the standardization module is used for acquiring a standard pinyin set, wherein part of letters of the standard pinyin in the standard pinyin set and part of letters of the second non-standard pinyin in the second non-standard pinyin set meet the preset corresponding relation;
and the detection module is used for acquiring, in the pinyin sequence of the speech text to be detected, a target pinyin that is the same as a pinyin in the standard pinyin set, and taking the hot word corresponding to the target pinyin as a speech-script detection result.
CN202110258035.5A 2021-03-10 2021-03-10 Method and apparatus for detecting phonetics Pending CN112634900A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110258035.5A CN112634900A (en) 2021-03-10 2021-03-10 Method and apparatus for detecting phonetics

Publications (1)

Publication Number Publication Date
CN112634900A true CN112634900A (en) 2021-04-09

Family

ID=75297806

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110258035.5A Pending CN112634900A (en) 2021-03-10 2021-03-10 Method and apparatus for detecting phonetics

Country Status (1)

Country Link
CN (1) CN112634900A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6581034B1 (en) * 1999-10-01 2003-06-17 Korea Advanced Institute Of Science And Technology Phonetic distance calculation method for similarity comparison between phonetic transcriptions of foreign words
CN104679276A (en) * 2013-12-02 2015-06-03 余泽栋 Phonetic reading code
CN107657471A (en) * 2016-09-22 2018-02-02 腾讯科技(北京)有限公司 A kind of methods of exhibiting of virtual resource, client and plug-in unit
CN106570180A (en) * 2016-11-10 2017-04-19 北京百度网讯科技有限公司 Artificial intelligence based voice searching method and device
WO2019062112A1 (en) * 2017-09-30 2019-04-04 广东美的制冷设备有限公司 Method and device for controlling air conditioner, air conditioner, and computer readable storage medium
CN109101604A (en) * 2018-08-01 2018-12-28 深圳市元征科技股份有限公司 Vehicle brand knows method for distinguishing and vehicle brand identification device
CN109710929A (en) * 2018-12-18 2019-05-03 金蝶软件(中国)有限公司 A kind of bearing calibration, device, computer equipment and the storage medium of speech recognition text
CN112100332A (en) * 2020-09-14 2020-12-18 腾讯科技(深圳)有限公司 Word embedding expression learning method and device and text recall method and device
CN112153206A (en) * 2020-09-23 2020-12-29 北京百度网讯科技有限公司 Contact person matching method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN112397091B (en) Chinese speech comprehensive scoring and diagnosing system and method
JP6251958B2 (en) Utterance analysis device, voice dialogue control device, method, and program
US9502036B2 (en) Correcting text with voice processing
US11043213B2 (en) System and method for detection and correction of incorrectly pronounced words
CN107729313B (en) Deep neural network-based polyphone pronunciation distinguishing method and device
CN101133411B (en) Fault-tolerant romanized input method for non-roman characters
US6910012B2 (en) Method and system for speech recognition using phonetically similar word alternatives
US11810471B2 (en) Computer implemented method and apparatus for recognition of speech patterns and feedback
WO2022105235A1 (en) Information recognition method and apparatus, and storage medium
CN105404621A (en) Method and system for blind people to read Chinese character
Qian et al. Capturing L2 segmental mispronunciations with joint-sequence models in computer-aided pronunciation training (CAPT)
US20150179169A1 (en) Speech Recognition By Post Processing Using Phonetic and Semantic Information
US20180089176A1 (en) Method of translating speech signal and electronic device employing the same
US11817079B1 (en) GAN-based speech synthesis model and training method
CN112634900A (en) Method and apparatus for detecting phonetics
JP4878220B2 (en) Model learning method, information extraction method, model learning device, information extraction device, model learning program, information extraction program, and recording medium recording these programs
Adda-Decker et al. A first LVCSR system for luxembourgish, a low-resourced european language
RU2421809C2 (en) Dictionaries coded with code combinations
CN110399608A (en) A kind of conversational system text error correction system and method based on phonetic
Lee et al. A data-driven grapheme-to-phoneme conversion method using dynamic contextual converting rules for Korean TTS systems
CN114023327B (en) Text correction method, device, equipment and medium based on speech recognition
Pakoci et al. Overcoming data sparsity in automatic transcription of dictated medical findings
CN114398876B (en) Text error correction method and device based on finite state converter
CN113035237B (en) Voice evaluation method and device and computer equipment
Proszeky et al. Recognition assistance-treating errors in texts acquired from various recognition processes

Legal Events

PB01  Publication
SE01  Entry into force of request for substantive examination
RJ01  Rejection of invention patent application after publication

Application publication date: 20210409