CN110853669A - Audio identification method, device and equipment - Google Patents

Audio identification method, device and equipment

Info

Publication number
CN110853669A
Authority
CN
China
Prior art keywords
pronunciation
acoustic
target
pronunciation unit
unit
Prior art date
Legal status
Granted
Application number
CN201911093916.5A
Other languages
Chinese (zh)
Other versions
CN110853669B (en)
Inventor
贺利强
苏丹
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201911093916.5A
Publication of CN110853669A
Application granted
Publication of CN110853669B
Legal status: Active (Current)
Anticipated expiration


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Abstract

The embodiment of the application discloses an audio recognition method, apparatus, and device, belonging to the field of artificial intelligence speech technology. The method includes: acquiring pronunciation data to be recognized and extracting an acoustic feature set of the pronunciation data, where the acoustic feature set includes a plurality of acoustic features; performing acoustic recognition processing on the acoustic feature set to obtain a target pronunciation unit set corresponding to the pronunciation data, where the target pronunciation unit set includes a plurality of pronunciation units and an acoustic score for each pronunciation unit; performing acoustic compensation processing on the acoustic scores of the pronunciation units in the target pronunciation unit set; and performing text recognition on the target pronunciation unit set after the acoustic compensation processing to obtain text information corresponding to the pronunciation data. Performing acoustic compensation processing on the acoustic scores of the pronunciation unit set can improve the accuracy of audio recognition.

Description

Audio identification method, device and equipment
Technical Field
The present application relates to the field of artificial intelligence speech technology, in particular to the field of speech processing, and more particularly to an audio recognition method, an audio recognition apparatus, and an audio recognition device.
Background
The artificial intelligence software technology mainly includes computer vision technology, speech recognition technology, natural language processing technology, machine learning/deep learning, and the like. Among them, speech recognition technology (also referred to as audio recognition technology) is a technology for converting pronunciation data into corresponding text information or operation instructions, and is widely applied in fields such as machine translation, speech search, speech input, speech dialogue, and intelligent question answering. The decoder is one of the core modules of speech recognition technology. It is a recognition network established on the basis of an optimally trained acoustic model, pronunciation dictionary, and language model, and comprises a plurality of paths, each path corresponding to a piece of text information and the pronunciation of that text information. The recognition network searches for the path with the largest decoding score for the pronunciation data to be recognized and outputs the text content corresponding to that pronunciation data based on the path, thereby completing audio recognition. In practice, it is found that, under the influence of factors such as region, users pronounce some words or phrases inaccurately, so that the recognition network cannot accurately recognize the text information corresponding to the pronunciation data to be recognized and cannot achieve the expected audio recognition effect.
Content of application
The technical problem to be solved by the embodiments of the present application is to provide an audio recognition method, apparatus, and device, which can improve the accuracy of audio recognition by performing acoustic compensation processing on the acoustic scores of a set of pronunciation units.
in one aspect, an embodiment of the present application provides an audio identification method, where the method includes:
acquiring pronunciation data to be recognized, and extracting an acoustic feature set of the pronunciation data, wherein the acoustic feature set comprises a plurality of acoustic features;
performing acoustic recognition processing on the acoustic feature set of the pronunciation data to obtain a target pronunciation unit set corresponding to the pronunciation data, wherein the target pronunciation unit set comprises a plurality of pronunciation units and acoustic scores of each pronunciation unit;
performing acoustic compensation processing on the acoustic scores of all the pronunciation units in the target pronunciation unit set;
and performing text recognition on the target pronunciation unit set subjected to the acoustic compensation processing to obtain text information corresponding to the pronunciation data.
In one aspect, an embodiment of the present application provides an audio recognition apparatus, where the apparatus includes:
the acoustic feature set extraction device comprises an acquisition unit, a recognition unit and a processing unit, wherein the acquisition unit is used for acquiring pronunciation data to be recognized and extracting an acoustic feature set of the pronunciation data, and the acoustic feature set comprises a plurality of acoustic features;
a recognition unit, configured to perform acoustic recognition processing on the acoustic feature set of the pronunciation data to obtain a target pronunciation unit set corresponding to the pronunciation data, the target pronunciation unit set comprising a plurality of pronunciation units and an acoustic score of each pronunciation unit;
a compensation unit, configured to perform acoustic compensation processing on the acoustic scores of the pronunciation units in the target pronunciation unit set;
the recognition unit is further configured to perform text recognition on the target pronunciation unit set after the acoustic compensation processing to obtain text information corresponding to the pronunciation data.
In another aspect, an embodiment of the present application provides an audio recognition apparatus, including:
a processor, adapted to implement one or more instructions; and
a computer storage medium storing one or more instructions adapted to be loaded by the processor and to perform the steps of:
acquiring pronunciation data to be recognized, and extracting an acoustic feature set of the pronunciation data, wherein the acoustic feature set comprises a plurality of acoustic features;
performing acoustic recognition processing on the acoustic feature set of the pronunciation data to obtain a target pronunciation unit set corresponding to the pronunciation data, wherein the target pronunciation unit set comprises a plurality of pronunciation units and acoustic scores of each pronunciation unit;
performing acoustic compensation processing on the acoustic scores of all the pronunciation units in the target pronunciation unit set;
and performing text recognition on the target pronunciation unit set subjected to the acoustic compensation processing to obtain text information corresponding to the pronunciation data.
In yet another aspect, embodiments of the present application provide a computer storage medium having one or more instructions stored thereon, the one or more instructions adapted to be loaded by a processor and to perform the following steps:
acquiring pronunciation data to be recognized, and extracting an acoustic feature set of the pronunciation data, wherein the acoustic feature set comprises a plurality of acoustic features;
performing acoustic recognition processing on the acoustic feature set of the pronunciation data to obtain a target pronunciation unit set corresponding to the pronunciation data, wherein the target pronunciation unit set comprises a plurality of pronunciation units and acoustic scores of each pronunciation unit;
performing acoustic compensation processing on the acoustic scores of all the pronunciation units in the target pronunciation unit set;
and performing text recognition on the target pronunciation unit set subjected to the acoustic compensation processing to obtain text information corresponding to the pronunciation data.
In the embodiment of the application, performing acoustic compensation processing on the acoustic scores of the pronunciation units in the target pronunciation unit set can raise those acoustic scores, thereby solving the problem that the acoustic scores of pronunciation units are low due to inaccurate or insufficient pronunciation. In addition, performing text recognition on the target pronunciation unit set after the acoustic compensation processing to obtain the text information corresponding to the pronunciation data can improve the accuracy of pronunciation data recognition; meanwhile, the recognition of pronunciation data of other audio words is not affected.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present application, and those skilled in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic diagram of an architecture of an audio recognition system according to an embodiment of the present application;
Fig. 2 is a schematic structural diagram of a decoder according to an embodiment of the present application;
Fig. 3 is a diagram illustrating an acoustic model and pronunciation dictionary provided by an embodiment of the present application;
Fig. 4 is a schematic flowchart of an audio recognition method according to an embodiment of the present application;
Fig. 5 is a schematic flowchart of another audio recognition method provided by an embodiment of the present application;
Fig. 6 is a schematic flowchart of another audio recognition method provided by an embodiment of the present application;
Fig. 7 is a schematic structural diagram of an audio recognition apparatus according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of an audio recognition device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of them. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
The artificial intelligence technology is a comprehensive discipline involving a wide range of fields, covering both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operating/interactive systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning. The key technologies of speech technology include automatic speech recognition (ASR), speech synthesis (Text To Speech, TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and speech is regarded as one of the most promising human-computer interaction modes of the future.
The speech recognition technology involved in artificial intelligence refers to a technology for converting pronunciation data into corresponding text information or operation instructions by using a voiceprint recognition algorithm, a voice conversion algorithm, and the like. The pronunciation data can be input by a user or downloaded from a network, and its language may include but is not limited to Chinese, English, French, etc.; the pronunciation data may specifically be pronunciation data corresponding to a word (e.g. an English word), a character (e.g. a Chinese character), or a plurality of words or phrases. The audio recognition process may specifically include the following three stages: 1. a feature extraction stage for the pronunciation data to be recognized; 2. a stage of obtaining the pronunciation information corresponding to the pronunciation data; 3. a stage of determining the text information of the pronunciation data according to the pronunciation information. The three stages are described in detail below with reference to fig. 1.
Fig. 1 is a schematic structural diagram illustrating an audio recognition system according to an exemplary embodiment of the present application; the audio recognition system includes a server and at least one terminal. The terminal is a user-facing terminal, and may specifically be an intelligent device such as a smartphone, a tablet computer, a portable personal computer, a smart watch, a smart band, or a smart television. The server may be an independent server, a server cluster composed of several servers, or a cloud computing center. In an exemplary embodiment of the present application, the terminal may be used to collect pronunciation data; the server can serve as the audio recognition device, that is, the server can include a decoder for audio recognition, and the server uses the built-in decoder to perform recognition processing on the pronunciation data collected by the terminal to obtain a recognition result. In another exemplary embodiment of the present application, the server may send the decoder to the terminal; the terminal then collects the pronunciation data and can also serve as the audio recognition device, directly performing recognition processing on the pronunciation data with the decoder to obtain a recognition result. The following embodiments of the present application are described by taking as an example that the terminal collects the pronunciation data and the server serves as the audio recognition device performing audio recognition on it.
The decoder is a tool for performing audio recognition, and can be referred to as fig. 2, the decoder is a recognition network established based on an acoustic model, a pronunciation dictionary and a language model, the recognition network comprises a plurality of paths, and each path corresponds to each text information and pronunciation information of pronunciation data; the recognition network is used for searching a path with the highest decoding score for the pronunciation data to be recognized, outputting text information corresponding to the pronunciation data to be recognized based on the path, and completing audio recognition.
The acoustic model is a model for forming a large number of acoustic decoding paths corresponding to the pronunciation information of the pronunciation data. The pronunciation information corresponding to the pronunciation data includes at least one candidate pronunciation unit set, and each candidate pronunciation unit set includes a plurality of pronunciation units and an acoustic score of each pronunciation unit, where the acoustic score may be equal to the difference between the posterior probability and the prior probability of the pronunciation unit. One acoustic decoding path corresponds to one candidate pronunciation unit set, and each acoustic decoding path indicates the pronunciation order of the pronunciation units in the corresponding candidate pronunciation unit set. The acoustic score indicates the degree of matching between the pronunciation data and the pronunciation unit: the greater the degree of matching, the higher the acoustic score; the smaller the degree of matching, the lower the acoustic score. Accordingly, the higher the acoustic score of each pronunciation unit in a candidate pronunciation unit set, the higher the degree of matching between the pronunciation units in that set and the pronunciation data, that is, the closer the standard pronunciation of each pronunciation unit in the set is to the pronunciation data, and the higher the accuracy of the candidate pronunciation unit set. The standard pronunciation of a pronunciation unit can be obtained through statistics over a large amount of pronunciation data. Conversely, the lower the acoustic score of each pronunciation unit in a candidate pronunciation unit set, the lower the degree of matching between the pronunciation units in that set and the pronunciation data, that is, the greater the difference between the standard pronunciation of each pronunciation unit and the pronunciation data, and the lower the accuracy of the candidate pronunciation unit set. A pronunciation unit refers to a pronunciation unit of the candidate text information corresponding to the pronunciation data: when the language of the pronunciation data is Chinese, the pronunciation unit may specifically be a phoneme, an initial, a final, or a syllable; when the language of the pronunciation data is English, the pronunciation unit may specifically be a phoneme (phone) or a word-piece; and each pronunciation unit can be represented by a plurality of pronunciation states. For example, fig. 3 shows an acoustic model in which the language of the pronunciation data is English and each pronunciation unit has three states. The model includes three acoustic decoding paths, namely acoustic decoding path 1, acoustic decoding path 2, and acoustic decoding path 3; a circle on an acoustic decoding path represents one pronunciation state of a pronunciation unit, and an arrow indicates the pronunciation order. Taking acoustic decoding path 1 as an example, the candidate pronunciation unit set corresponding to acoustic decoding path 1 includes the pronunciation units w, ah, and n, and the pronunciation order of the pronunciation units is w, ah, n in sequence.
Here the pronunciation states of each pronunciation unit (w, ah, and n respectively) are represented by the corresponding pronunciation unit itself; of course, the pronunciation states of a pronunciation unit can also be represented by other information, such as s1, s2, s3, etc. The sil in the acoustic decoding paths in fig. 3 represents silence, indicating that the acoustic recognition processing of the pronunciation data has been completed.
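To make the scoring concrete, the following is a minimal sketch of computing an acoustic score as the difference between the posterior and prior probability of a pronunciation unit, as described above; the function, the log-domain option, and the example probabilities are illustrative assumptions rather than details from the application.

```python
import math

def acoustic_score(posterior: float, prior: float, log_domain: bool = True) -> float:
    """Acoustic score of a pronunciation unit: the difference between its
    posterior and prior probability. The log-domain variant is common in
    practice; the plain difference follows the literal wording above."""
    if log_domain:
        return math.log(posterior) - math.log(prior)
    return posterior - prior

# Illustrative values for the unit "ah" on acoustic decoding path 1 of fig. 3
print(round(acoustic_score(posterior=0.82, prior=0.10), 3))
```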
The pronunciation dictionary comprises a word set that the decoder can process and a pronunciation unit set of each word in the word set, and can be used for mapping pronunciation unit sets to words. The word set may include English words, Chinese characters, etc. For example, as shown in fig. 3, the pronunciation dictionary 11 includes English words and the pronunciation units corresponding to each English word; it can be known from the pronunciation dictionary that the pronunciation units of the word "one" include w, ah, and n, and the pronunciation units of the English word "two" include t and uw.
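A pronunciation dictionary of this kind can be sketched as a plain mapping from words to ordered pronunciation units; the entries below come from the "one"/"two" example above, while the reverse-lookup helper is a hypothetical illustration, not the decoder's actual data structure.

```python
# Pronunciation dictionary: word -> ordered pronunciation units (cf. fig. 3)
pronunciation_dict = {
    "one": ["w", "ah", "n"],
    "two": ["t", "uw"],
}

def words_for_units(units: list[str]) -> list[str]:
    """Map a recognized pronunciation unit sequence back to candidate words."""
    return [word for word, seq in pronunciation_dict.items() if seq == units]

print(words_for_units(["w", "ah", "n"]))  # ['one']
```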
The language model is a model for forming a large number of language decoding paths. The language decoding paths correspond to the text information corresponding to the pronunciation data, that is, one language decoding path corresponds to one piece of candidate text information of the pronunciation data, and the candidate text information is obtained by matching a candidate pronunciation unit set against the pronunciation dictionary. The candidate text information may be composed of a character, a word, or a plurality of word groups, and each piece of candidate text information has a linguistic score indicating the similarity between the pronunciation units in the candidate pronunciation unit set and the pronunciation units in the pronunciation dictionary. Optionally, the linguistic score may also indicate the degree of association between a word and its context.
Based on the above description, please refer to the processing flow of audio recognition shown in fig. 4, which may include the following steps S1-S6.
S1, the terminal acquires the pronunciation data to be recognized and sends the pronunciation data to the server. The pronunciation data can be collected by the terminal through a voice device, such as a microphone, or downloaded from a network.
S2, the server acquires an acoustic feature set corresponding to the pronunciation data, where the acoustic feature set comprises a plurality of acoustic features. In order to filter out noise in the pronunciation data, the pronunciation data can first be filtered to obtain processed pronunciation data, and the processed pronunciation data can then be framed to obtain multiple frames of pronunciation subdata. Furthermore, frequency-domain transformation is performed on each frame of pronunciation subdata to obtain frequency-domain pronunciation subdata, and feature extraction is performed on each frame of frequency-domain pronunciation subdata to obtain the acoustic feature set corresponding to the pronunciation data. Each acoustic feature set comprises a plurality of acoustic features arranged in sequence; each acoustic feature corresponds to one frame of pronunciation subdata, and the arrangement order of the acoustic features in the acoustic feature set corresponds to the time order in which the pronunciation subdata was collected. Here, the acoustic features are used to characterize the energy, amplitude, zero-crossing rate, linear prediction coefficients (LPC), etc. of the pronunciation data, and may specifically include filter bank (Fbank) features, Mel-Frequency Cepstral Coefficient (MFCC) features, Perceptual Linear Prediction (PLP) features, etc.
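The filtering, framing, frequency-domain transformation, and feature extraction of S2 can be sketched as follows; this is a minimal sketch assuming the librosa library, and the sampling rate, frame length, hop length, and 13 MFCC coefficients are illustrative choices, not values specified by the application.

```python
import librosa

def extract_acoustic_features(wav_path: str):
    """Return an ordered acoustic feature set: one MFCC vector per frame."""
    y, sr = librosa.load(wav_path, sr=16000)   # read the pronunciation data
    y, _ = librosa.effects.trim(y)             # crude silence/noise filtering
    # Framing + frequency-domain transform + per-frame feature extraction
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=400, hop_length=160)
    return mfcc.T  # shape (num_frames, 13), rows in time (collection) order
```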
S3, the server inputs the acoustic feature set into the acoustic model of the decoder for acoustic recognition processing to obtain the pronunciation information corresponding to the pronunciation data. The pronunciation information comprises a plurality of candidate pronunciation unit sets, and each candidate pronunciation unit set comprises a plurality of pronunciation units and the acoustic score of each pronunciation unit.
S4, the server can query, through the pronunciation dictionary of the decoder, the candidate text information corresponding to each candidate pronunciation unit set, and calculate the language score of each piece of candidate text information through the language model.
S5, the server may calculate the acoustic score of each candidate pronunciation unit set according to the acoustic scores of the pronunciation units in that set, where the acoustic score of a candidate pronunciation unit set may be the product of the acoustic scores of the pronunciation units in the set. Further, the sum of the acoustic score of each candidate pronunciation unit set and the language score of the corresponding candidate text information may be determined as the decoding score of that candidate text information, and the candidate text information with the highest decoding score may be selected from the plurality of candidate text information as the recognition result of the pronunciation data, thereby completing the audio recognition of the pronunciation data.
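Steps S4-S5 amount to combining each candidate set's acoustic score with its language score and taking the highest-scoring candidate; the sketch below follows the wording above (acoustic score of a set as the product of its units' scores, decoding score as the sum of acoustic and language scores), with hypothetical candidates and score values.

```python
import math

def decode(candidates):
    """candidates: list of (text, unit_acoustic_scores, language_score).
    The decoding score of a candidate is the acoustic score of its
    pronunciation unit set (product of unit scores) plus its language score."""
    best_text, best_score = None, -math.inf
    for text, unit_scores, language_score in candidates:
        decoding_score = math.prod(unit_scores) + language_score
        if decoding_score > best_score:
            best_text, best_score = text, decoding_score
    return best_text, best_score

text, score = decode([
    ("one", [0.90, 0.80, 0.85], 0.6),  # units w, ah, n
    ("won", [0.90, 0.80, 0.85], 0.4),
])
print(text, round(score, 3))  # one 1.212
```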
S6, the server returns the recognition result to the terminal.
The above steps S1-S2 correspond to the feature extraction stage for the pronunciation data to be recognized, step S3 corresponds to the stage of acquiring the pronunciation information corresponding to the pronunciation data, and steps S4-S6 correspond to the stage of determining the text information of the pronunciation data according to the pronunciation information. In practice, it is found that, under the influence of regional accents, pronunciation variants, and the like, some pronunciation units are easily pronounced inaccurately or insufficiently. For example: (1) pronunciation units with pronunciation variants are easily pronounced insufficiently, where a pronunciation variant refers to a difference in the pronunciation of the same pronunciation unit in different words. Pronunciation variants may include four cases, such as: ① the velarized alveolar lateral approximant, in which the pronunciation unit l is accompanied by a velar or pharyngeal articulation in words such as "all" and "little"; ② the alveolar flap, in which the pronunciation unit t is easily pronounced as a flap in certain positions within a word; ③ the unreleased plosive, in which a plosive pronunciation unit such as t is easily pronounced without plosion when it is followed by another consonant; ④ pronunciation units such as p within consonant clusters, which are easily under-pronounced. (2) Under the influence of regional accents, pronunciation units such as b and t in some words or word groups are easily pronounced insufficiently.
In order to improve the accuracy of audio recognition, the embodiment of the present application provides an audio recognition method that improves the basic processing flow shown in S1-S5 above in the following ways: (1) in the stage of acquiring the pronunciation information corresponding to the pronunciation data, acoustic compensation processing is performed on the acoustic scores of the pronunciation units in the pronunciation unit set of the pronunciation data; (2) in the stage of determining the text information of the pronunciation data according to the pronunciation information, the text information corresponding to the pronunciation data is obtained by performing text recognition on the pronunciation unit set after the acoustic compensation processing. Through these improvements, the problem that the acoustic score of a pronunciation unit is low due to factors such as regional accents or pronunciation variants can be solved: by performing acoustic compensation processing on the pronunciation units, an appropriate acoustic score can be compensated for each pronunciation unit, so that the pronunciation data can be correctly decoded and the accuracy of audio recognition of the pronunciation data is improved.
Based on the above description, the audio recognition method proposed in the embodiment of the present application may be performed by an audio recognition device, which may be, for example, the server or the terminal shown in fig. 1. As shown in fig. 5, the audio recognition method may include the following steps S101-S104:
S101, obtaining pronunciation data to be recognized, and extracting an acoustic feature set of the pronunciation data, wherein the acoustic feature set comprises a plurality of acoustic features.
The pronunciation data refers to pronunciation data to be converted into text information. In one embodiment, the pronunciation data may be input by a user. Specifically, the terminal may include an audio control operable to capture pronunciation data; if an operation on the audio control is detected, the audio data input by the user may be captured by a speech device of the terminal. The audio control can be a physical key or a virtual key, and the operation on the audio control can be a touch operation, a cursor operation, a key operation, or a voice operation. The touch operation can be a touch click operation, a touch press operation, or a touch slide operation, and can be a single-point or multi-point touch operation; the cursor operation can be an operation of controlling a cursor to click or to press; the key operation may be a virtual key operation or a physical key operation, etc. In another embodiment, the pronunciation data may be downloaded from a network; for example, in a voice conversation scenario, the pronunciation data may be downloaded from a conversation window. After the pronunciation data is obtained, when the audio recognition device is a server, the terminal can send the pronunciation data to the server, and the server can receive the pronunciation data and extract the acoustic feature set of the pronunciation data, the acoustic feature set comprising a plurality of acoustic features; when the audio recognition device is a terminal, the terminal can directly extract the acoustic feature set of the pronunciation data. For the manner of extracting the acoustic feature set of the pronunciation data, refer to step S2 described above.
S102, performing acoustic recognition processing on the acoustic feature set of the pronunciation data to obtain a target pronunciation unit set corresponding to the pronunciation data, wherein the target pronunciation unit set comprises a plurality of pronunciation units and the acoustic score of each pronunciation unit.
The audio recognition device may input the acoustic feature set of the pronunciation data into the acoustic model for acoustic recognition processing to obtain a plurality of candidate pronunciation unit sets corresponding to the pronunciation data, where the target pronunciation unit set may be any one of the plurality of candidate pronunciation unit sets. Here, the acoustic model may specifically include but is not limited to: acoustic models based on Hidden Markov Models (HMMs), such as Gaussian Mixture Model-Hidden Markov Models (GMM-HMMs) and Deep Neural Network-Hidden Markov Models (DNN-HMMs); and, of course, End-to-End acoustic models, such as the Connectionist Temporal Classification (CTC) model, the Long Short-Term Memory (LSTM) model, and the Attention model.
S103, performing acoustic compensation processing on the acoustic scores of the pronunciation units in the target pronunciation unit set.
In order to avoid the problem that the acoustic scores of pronunciation units are low due to inaccurate or insufficient pronunciation, the audio recognition device may perform acoustic compensation processing on the acoustic scores of the pronunciation units in the target pronunciation unit set to raise those acoustic scores. Specifically, the audio recognition device may determine whether the target pronunciation unit set satisfies an acoustic compensation condition; if not, no acoustic compensation processing is performed on the target pronunciation unit set; if so, acoustic compensation processing is performed on the target pronunciation unit set. One case in which the target pronunciation unit set does not satisfy the acoustic compensation condition is that every pronunciation unit in the set is pronounced fully and accurately, that is, the acoustic score of each pronunciation unit in the set is high; in this case, the standard pronunciations of the pronunciation units match the pronunciation data closely, the accuracy of the target pronunciation unit set is high, and the audio recognition device can directly perform text recognition on the target pronunciation unit set to obtain the recognition result of the pronunciation data. Optionally, another case in which the target pronunciation unit set does not satisfy the acoustic compensation condition is that the acoustic scores of most of the pronunciation units in the set are low, that is, the standard pronunciations of the pronunciation units match the pronunciation data poorly and the accuracy of the target pronunciation unit set is low; in this case, the target pronunciation unit set can be discarded. The target pronunciation unit set satisfies the acoustic compensation condition when the acoustic scores of only a few pronunciation units in the set are low, that is, only a few pronunciation units in the set are pronounced insufficiently or inaccurately.
In one embodiment, performing acoustic compensation processing on the acoustic scores of the pronunciation units in the target pronunciation unit set may specifically include: performing acoustic compensation processing on the acoustic scores of the pronunciation units in the set that are pronounced insufficiently or inaccurately. In another alternative embodiment, the acoustic score of the target pronunciation unit set may be calculated based on the acoustic scores of the pronunciation units in the set, and acoustic compensation processing may be performed on the acoustic score of the target pronunciation unit set.
S104, performing text recognition on the target pronunciation unit set after the acoustic compensation processing to obtain the text information corresponding to the pronunciation data.
The audio recognition device can determine, according to the pronunciation dictionary, the candidate text information corresponding to the target pronunciation unit set after the acoustic compensation processing, calculate the language score of the candidate text information through the language model, and calculate the acoustic score of the target pronunciation unit set according to the acoustic scores of the pronunciation units in the set after the acoustic compensation processing. Further, the sum of the language score of the candidate text information and the acoustic score of the target pronunciation unit set is used as the decoding score of the candidate text information; if the decoding score of the candidate text information is greater than a preset score threshold, the candidate text information is used as the recognition result of the pronunciation data. When the pronunciation data corresponds to a plurality of candidate pronunciation unit sets, each candidate pronunciation unit set is recognized to obtain a piece of candidate text information; the decoding score of each piece of candidate text information is calculated by the above method, and the candidate text information with the highest decoding score is selected as the recognition result of the pronunciation data.
In the embodiment of the application, performing acoustic compensation processing on the acoustic scores of the pronunciation units in the target pronunciation unit set can raise those acoustic scores, thereby solving the problem that the acoustic scores of pronunciation units are low due to inaccurate or insufficient pronunciation. In addition, performing text recognition on the target pronunciation unit set after the acoustic compensation processing to obtain the text information corresponding to the pronunciation data can improve the accuracy of pronunciation data recognition; meanwhile, the recognition of pronunciation data of other audio words is not affected. Moreover, the method does not need to optimize the acoustic model with a large amount of training data to improve audio recognition accuracy, that is, there is no need to obtain a large amount of training data or to perform extensive iterative training on the acoustic model, which reduces the difficulty of data acquisition and saves substantial resources.
In one embodiment, a pronunciation unit includes a plurality of pronunciation states, and each pronunciation state corresponds to an acoustic feature; the acoustic features in the acoustic feature set of the pronunciation data are arranged in sequence. Step S102 may include the following steps s11-s13.
s11, sequentially identifying each acoustic feature in the acoustic feature set according to the arrangement order of each acoustic feature in the acoustic feature set.
s12, each time one of the pronunciation units is identified, an acoustic score is calculated for that pronunciation unit.
s13, after every acoustic feature in the acoustic feature set has been recognized, obtaining the target pronunciation unit set.
Wherein the recognized order of each pronunciation unit in the target pronunciation unit set corresponds to the pronunciation order of each pronunciation unit.
In steps s11-s13, the audio recognition device may sequentially recognize the acoustic features of the acoustic feature set according to their arrangement order in the set; as known from S2 above, one acoustic feature corresponds to one frame of pronunciation subdata, and the arrangement order of the acoustic features corresponds to the order in which the pronunciation subdata was collected. That is, the audio recognition device may input the acoustic features into the acoustic model for recognition one by one according to their arrangement order in the acoustic feature set, and calculate the acoustic score of each pronunciation unit as it is recognized. After every acoustic feature in the acoustic feature set has been recognized, the target pronunciation unit set is obtained.
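Steps s11-s13 can be sketched as a frame-by-frame loop; the acoustic model is stubbed out as a hypothetical `model.step` interface that returns a (unit, score) pair whenever a pronunciation unit has just been recognized and None otherwise.

```python
def recognize_units(acoustic_features, model):
    """s11: feed acoustic features to the model in their arrangement order;
    s12: record each recognized pronunciation unit with its acoustic score;
    s13: the target pronunciation unit set is complete once all features
    have been consumed."""
    target_units = []  # [(pronunciation_unit, acoustic_score), ...] in order
    for feature in acoustic_features:   # order = collection order of frames
        result = model.step(feature)    # hypothetical acoustic-model interface
        if result is not None:          # a pronunciation unit was recognized
            target_units.append(result)
    return target_units
```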
Optionally, before step S103, the method further includes the following step s21.
s21, during the acoustic recognition processing, determining whether the target pronunciation unit set satisfies the acoustic compensation condition according to the acoustic scores of the pronunciation units in the target pronunciation unit set, and if so, executing step S103.
In step s21, if the target pronunciation unit set does not satisfy the acoustic compensation condition, the accuracy of the target pronunciation unit set is low; if acoustic compensation processing were still performed on it, the candidate text information corresponding to this low-accuracy set could easily be taken as the recognition result, reducing the accuracy of audio recognition. Accordingly, the audio recognition device performs acoustic compensation processing on the target pronunciation unit set only when the set satisfies the acoustic compensation condition. Specifically, the audio recognition device may determine in real time, during the acoustic recognition processing, whether the target pronunciation unit set satisfies the acoustic compensation condition according to the acoustic scores of the pronunciation units in the set: each time a pronunciation unit is recognized, it judges whether the target pronunciation unit set satisfies the acoustic compensation condition according to the acoustic score of the currently recognized pronunciation unit; if the condition is satisfied, indicating that the set contains pronunciation units that are pronounced insufficiently or inaccurately, step S103 is executed. Compensating the acoustic scores of the pronunciation units only when the target pronunciation unit set satisfies the acoustic compensation condition avoids the problem of low acoustic scores caused by inaccurate or insufficient pronunciation while also avoiding acoustic compensation processing on target pronunciation unit sets that do not satisfy the condition, thereby improving the accuracy and effectiveness of the acoustic compensation processing.
In this embodiment, step s21 includes the following steps s31-s34.
s31, each time a pronunciation unit is recognized, judging whether the currently recognized pronunciation unit is a first pronunciation unit, where the first pronunciation unit is a pronunciation unit to be compensated, obtained by statistics during historical audio recognition.
s32, if yes, verifying whether the acoustic score of the currently recognized pronunciation unit is less than the preset acoustic score threshold.
s33, if it is less, counting the number of pronunciation units in the target pronunciation unit set whose pronunciation order is before that of the currently recognized pronunciation unit, and comparing the acoustic score of each of those pronunciation units with the preset acoustic score threshold.
s34, if the counted number is greater than the first number threshold and the acoustic scores of all the pronunciation units in the target pronunciation unit set whose pronunciation sequence is before the pronunciation sequence of the currently recognized pronunciation unit are greater than or equal to the preset acoustic score threshold, determining that the target pronunciation unit set satisfies the acoustic compensation condition.
In steps s31-s34, the audio recognition device may detect in real time, in conjunction with historical empirical data, whether the target pronunciation unit set satisfies the acoustic compensation condition. Specifically, each time the audio recognition device recognizes a pronunciation unit, it judges whether the currently recognized pronunciation unit is a first pronunciation unit. A first pronunciation unit is a pronunciation unit to be compensated, obtained by statistics during historical audio recognition; that is, it is a pronunciation unit that is easily pronounced insufficiently or inaccurately, namely one whose acoustic score was smaller than the preset acoustic score threshold more times than a preset frequency threshold during historical audio recognition. For example, if in 10 historical audio recognition runs the acoustic score of the pronunciation unit t was smaller than the preset acoustic score threshold in 8 of them, the pronunciation unit t is called a first pronunciation unit. If the currently recognized pronunciation unit is a first pronunciation unit, i.e., a unit that is easily pronounced insufficiently or inaccurately, it is verified whether the acoustic score of the currently recognized pronunciation unit is smaller than the preset acoustic score threshold. If it is, the number of pronunciation units in the target pronunciation unit set whose pronunciation order is before that of the currently recognized pronunciation unit is counted, and the acoustic score of each of those units is compared with the preset acoustic score threshold. If the counted number is greater than the first number threshold and the acoustic scores of all those preceding pronunciation units are greater than or equal to the preset acoustic score threshold, it indicates that among the recognized pronunciation units only the currently recognized one has a low acoustic score, that is, only the currently recognized pronunciation unit was pronounced insufficiently or inaccurately; the accuracy of the target pronunciation unit set is therefore high, with only a few pronunciation units pronounced insufficiently or inaccurately, and the target pronunciation unit set is determined to satisfy the acoustic compensation condition. In this way, acoustic compensation processing on low-accuracy target pronunciation unit sets can be avoided, improving the accuracy of the acoustic compensation processing on the target pronunciation unit set.
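A minimal sketch of the check in s31-s34, assuming the set of first pronunciation units, the preset acoustic score threshold, and the first number threshold are already available (e.g. from historical statistics); all names and example values are illustrative.

```python
def satisfies_compensation_condition(prev_scores, current_unit, current_score,
                                     first_units, score_threshold,
                                     first_number_threshold):
    """s31: the current unit must be a first pronunciation unit;
    s32: its acoustic score must be below the preset threshold;
    s33/s34: enough preceding units, each scoring at or above the threshold."""
    if current_unit not in first_units:                    # s31
        return False
    if current_score >= score_threshold:                   # s32
        return False
    if len(prev_scores) <= first_number_threshold:         # s33
        return False
    return all(s >= score_threshold for s in prev_scores)  # s34

print(satisfies_compensation_condition(
    prev_scores=[0.90, 0.85, 0.88], current_unit="t", current_score=0.20,
    first_units={"t", "p"}, score_threshold=0.50, first_number_threshold=2))
# True: only the currently recognized unit scores low
```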
Optionally, step s21 includes the following steps s41-s43.
s41, each time a pronunciation unit is recognized, verifying whether the acoustic score of the currently recognized pronunciation unit is less than the preset acoustic score threshold.
s42, if it is less, counting the number of pronunciation units in the target pronunciation unit set whose pronunciation order is before that of the currently recognized pronunciation unit, and comparing the acoustic score of each of those pronunciation units with the preset acoustic score threshold.
s43, if the statistical number is greater than a second number threshold and the acoustic scores of all the pronunciation units in the target pronunciation unit set whose pronunciation sequence is before the pronunciation sequence of the currently recognized pronunciation unit are greater than or equal to the preset acoustic score threshold, determining that the target pronunciation unit set satisfies the acoustic compensation condition.
In steps s41-s43, the audio recognition device can detect in real time, during the acoustic recognition processing, whether the target pronunciation unit set satisfies the acoustic compensation condition. Specifically, each time a pronunciation unit is recognized, it is verified whether the acoustic score of the currently recognized pronunciation unit is smaller than the preset acoustic score threshold. If it is, the number of pronunciation units in the target pronunciation unit set whose pronunciation order is before that of the currently recognized pronunciation unit is counted, and the acoustic score of each of those units is compared with the preset acoustic score threshold. If the counted number is greater than the second number threshold and the acoustic scores of all those preceding pronunciation units are greater than or equal to the preset acoustic score threshold, it indicates that among the recognized pronunciation units only the currently recognized one has a low acoustic score, that is, only the currently recognized pronunciation unit was pronounced insufficiently or inaccurately; the accuracy of the target pronunciation unit set is therefore high, with only a few pronunciation units pronounced insufficiently or inaccurately, and the target pronunciation unit set is determined to satisfy the acoustic compensation condition. In this way, acoustic compensation processing on low-accuracy target pronunciation unit sets can be avoided, improving the accuracy of the acoustic compensation processing on the target pronunciation unit set.
In this embodiment, step S103 may include the following steps s51 and s52.
s51, performing acoustic compensation processing on the acoustic score of the currently recognized pronunciation unit by using the acoustic scores of the pronunciation units in the target pronunciation unit set whose pronunciation order is before that of the currently recognized pronunciation unit, to obtain the compensated acoustic score of the currently recognized pronunciation unit.
s52, updating the target pronunciation unit set with the compensated acoustic score of the currently recognized pronunciation unit to obtain the target pronunciation unit set after the acoustic compensation processing.
In steps s51 and s52, when it is detected in real time during the acoustic recognition processing that the target pronunciation unit set satisfies the acoustic compensation condition, acoustic compensation processing can be performed on the set in real time. Specifically, the audio recognition device may perform acoustic compensation processing on the acoustic score of the currently recognized pronunciation unit by using the acoustic scores of the pronunciation units whose pronunciation order precedes that of the currently recognized pronunciation unit, to obtain the compensated acoustic score of the currently recognized pronunciation unit. In one embodiment, the largest acoustic score or the average acoustic score among the acoustic scores of those preceding pronunciation units may be used to perform acoustic compensation processing on the acoustic score of the currently recognized pronunciation unit. Optionally, an acoustic score may instead be randomly selected from the acoustic scores of those preceding pronunciation units for the acoustic compensation processing. Further, the target pronunciation unit set is updated with the compensated acoustic score of the currently recognized pronunciation unit to obtain the target pronunciation unit set after the acoustic compensation processing. Since only the pronunciation units that are pronounced insufficiently or inaccurately are compensated, the acoustic score of the target pronunciation unit set can be raised and the accuracy of compensating the set improved; in addition, acoustic compensation processing on every pronunciation unit in the target pronunciation unit set is avoided, so the normal recognition of pronunciation data of other audio words is not affected.
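The three variants named above (largest preceding score, average preceding score, or a randomly selected preceding score) can be sketched as follows; taking the larger of the reference and the current score as the compensated score is one possible realization, not a rule stated in the application.

```python
import random
from statistics import mean

def compensate(prev_scores, current_score, strategy="max"):
    """Compensate the current unit's acoustic score using the scores of the
    pronunciation units that precede it in pronunciation order."""
    if strategy == "max":
        reference = max(prev_scores)
    elif strategy == "mean":
        reference = mean(prev_scores)
    else:  # "random": pick one preceding score at random
        reference = random.choice(prev_scores)
    return max(current_score, reference)  # compensated acoustic score
```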
In this embodiment, step s51 may include the following steps s61-s64.
s61, calculating a first average of the acoustic scores of all the pronunciation units in the target set of pronunciation units having a pronunciation order that precedes the pronunciation order of the currently recognized pronunciation unit.
s62, obtaining the probability that the acoustic score of the currently recognized pronunciation unit is less than the preset acoustic score threshold.
s63, determining a compensatory acoustic score for the currently identified pronunciation unit based on the first average and the probability.
s64, determining the sum of the acoustic score of the currently recognized pronunciation unit and the compensatory acoustic score as the compensated acoustic score of the currently recognized pronunciation unit.
In steps s61-s64, the audio recognition device performs acoustic compensation processing on the acoustic score of the currently recognized pronunciation unit based on the average of the acoustic scores of the preceding pronunciation units. Specifically, the audio recognition device may calculate, using a preset averaging algorithm (such as an arithmetic mean or a statistical mean), a first average value of the acoustic scores of the pronunciation units in the target pronunciation unit set whose pronunciation order is before that of the currently recognized pronunciation unit. Further, the probability that the acoustic score of the currently recognized pronunciation unit is smaller than the preset acoustic score threshold is acquired, this probability being obtained by statistics during historical audio recognition processing, and a compensatory acoustic score for the currently recognized pronunciation unit is determined based on the first average value and the probability. Then, the sum of the acoustic score of the currently recognized pronunciation unit and the compensatory acoustic score is determined as the compensated acoustic score of the currently recognized pronunciation unit, which can be expressed by the following formula (1).
$$P(x'_n) = P(x_n) + \alpha\,P_{prior}(x_n) + \beta\cdot\frac{1}{n-1}\sum_{i=1}^{n-1} P(x_i) \tag{1}$$

In formula (1), $x_n$ denotes the n-th pronunciation unit in the target pronunciation unit set, i.e. $x_n$ is the currently recognized pronunciation unit; $P(x_n)$ denotes the acoustic score of the currently recognized pronunciation unit; $P_{prior}(x_n)$ denotes the probability that the acoustic score of the currently recognized pronunciation unit is less than the preset acoustic score threshold; $\frac{1}{n-1}\sum_{i=1}^{n-1} P(x_i)$ denotes the first average value; and $\alpha$, $\beta$ denote weighting factors, which may be obtained statistically from historical audio recognition. The last two terms together form the compensatory acoustic score, and $P(x'_n)$ denotes the compensated acoustic score of the currently recognized pronunciation unit.
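As a minimal sketch, formula (1) can be implemented as below. The published equation survives only as an image, so the additive combination of α and β shown above, the function name, and the 1-based index convention are assumptions.

```python
def compensated_score_realtime(scores, n, p_prior, alpha, beta):
    """Formula (1): compensate the acoustic score of the n-th
    (currently recognized) pronunciation unit, with 1-based index n.

    scores  : acoustic scores P(x_1) ... P(x_n) seen so far
    p_prior : probability (from historical recognition statistics) that
              this unit scores below the preset acoustic score threshold
    """
    prior = scores[:n - 1]                    # units recognized before x_n
    if not prior:
        return scores[0]                      # nothing to average over yet
    first_average = sum(prior) / len(prior)   # the "first average"
    compensatory = alpha * p_prior + beta * first_average
    return scores[n - 1] + compensatory       # P(x'_n)
```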
Optionally, before step S103, the method further includes step s71 as follows.
s71, after the acoustic recognition processing is completed, determining whether the target pronunciation unit set satisfies the acoustic compensation condition according to the acoustic scores of the pronunciation units in the target pronunciation unit set, and if so, executing step S103.
In step s71, if the target pronunciation unit set does not satisfy the acoustic compensation condition, the accuracy of the target pronunciation unit set is low; if acoustic compensation processing were still performed on it, the candidate text information corresponding to this low-accuracy set could easily be taken as the recognition result, reducing the accuracy of audio recognition. Accordingly, the audio recognition device performs acoustic compensation processing only when the target pronunciation unit set satisfies the acoustic compensation condition. Specifically, after all the acoustic features have been recognized, the device judges whether the target pronunciation unit set satisfies the acoustic compensation condition according to the acoustic scores of the pronunciation units in the set. If it does, the set contains pronunciation units that are insufficiently or inaccurately pronounced, and step S103 is executed; if it does not, no acoustic compensation processing is performed on the set. Compensating the acoustic scores only when the condition is met alleviates the problem of low acoustic scores caused by inaccurate or insufficient pronunciation while avoiding acoustic compensation of sets that do not satisfy the condition, which improves the accuracy and effectiveness of the acoustic compensation processing.
In this embodiment, step s71 includes steps s 81-s 84 as follows.
s81, after the acoustic recognition process is completed, detecting whether the target pronunciation unit set has the same target pronunciation unit as the first pronunciation unit, wherein the first pronunciation unit is the pronunciation unit to be compensated obtained by statistics in the historical audio recognition process.
s82, if yes, verifying whether the acoustic score of the target pronunciation unit is less than the preset acoustic score threshold.
s83, if the acoustic score is smaller than the preset acoustic score threshold, counting the number of all the pronunciation units in the target pronunciation unit set with the acoustic score larger than the preset acoustic score threshold.
s84, if the number of statistics is greater than a third number threshold, determining that the set of target pronunciation units satisfies an acoustic compensation condition.
In steps s81 to s84, after the acoustic recognition processing is completed, the audio recognition device may combine historical empirical data to detect whether the target pronunciation unit set satisfies the acoustic compensation condition. Specifically, the device detects whether a target pronunciation unit identical to the first pronunciation unit exists in the target pronunciation unit set; if it exists, the target pronunciation unit is one that tends to be insufficiently or inaccurately pronounced, so the device verifies whether its acoustic score is smaller than the preset acoustic score threshold. If the acoustic score is smaller than the threshold, the device counts the number of pronunciation units in the target pronunciation unit set whose acoustic scores are larger than the threshold. If that number is greater than the third number threshold, most pronunciation units in the set score high and only a few score low, i.e., only the target pronunciation unit is insufficiently pronounced or pronounced with low accuracy; the overall accuracy of the set is therefore high, and the set is determined to satisfy the acoustic compensation condition. This avoids acoustic compensation processing of low-accuracy target pronunciation unit sets and improves the accuracy of the acoustic compensation processing of the target pronunciation unit set. It should be noted that the preset acoustic score threshold and the first, second, third, and fourth number thresholds may be obtained statistically from historical audio recognition, and the four number thresholds may be dynamically adjusted according to the number of pronunciation units in the target pronunciation unit set. A sketch of this check follows.
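This minimal sketch implements steps s81 to s84; the container types and the name `to_compensate` for the statistically derived set of first pronunciation units are assumptions.

```python
def satisfies_compensation_condition(units, scores, to_compensate,
                                     score_threshold, third_number_threshold):
    """Post-recognition check of the acoustic compensation condition.

    units         : recognized pronunciation units, in pronunciation order
    scores        : acoustic score of each unit, same order
    to_compensate : units that history shows tend to be under-pronounced
    """
    for unit, score in zip(units, scores):
        # s81/s82: a known-problematic unit that actually scored low
        if unit in to_compensate and score < score_threshold:
            # s83/s84: the rest of the set must be reliable enough
            high = sum(1 for s in scores if s > score_threshold)
            return high > third_number_threshold
    return False
```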
In this embodiment, step s71 includes steps s 91-s 93 as follows.
s91, after the acoustic recognition process is completed, determining whether there is a target pronunciation unit in the target pronunciation unit set with an acoustic score smaller than a preset acoustic score threshold.
s92, counting the number of all pronunciation units in the target pronunciation unit set whose acoustic score is larger than the preset acoustic score threshold value.
s93, if the number of statistics is greater than a fourth number threshold, determining that the set of target pronunciation units satisfies an acoustic compensation condition.
In steps s91 to s93, the audio recognition device may detect whether the target pronunciation unit set satisfies the acoustic compensation condition after the acoustic recognition processing is completed. Specifically, it judges whether a target pronunciation unit with an acoustic score smaller than the preset acoustic score threshold exists in the set; if so, the target pronunciation unit's acoustic score is low, and the device counts the number of pronunciation units in the set whose acoustic scores are larger than the threshold. If that number is greater than the fourth number threshold, most pronunciation units in the set score high and only a few score low, i.e., only the target pronunciation unit is insufficiently pronounced or pronounced with low accuracy; the overall accuracy of the set is therefore high, and the set is determined to satisfy the acoustic compensation condition. This variant differs from steps s81 to s84 only in that the target pronunciation unit need not match a statistically derived first pronunciation unit. As before, this avoids acoustic compensation processing of low-accuracy target pronunciation unit sets and improves the accuracy of the acoustic compensation processing of the target pronunciation unit set.
In this embodiment, step S103 includes the following steps s111 to s112.
s111, performing acoustic compensation processing on the acoustic score of the target pronunciation unit by using the acoustic scores of the other pronunciation units in the target pronunciation unit set except the target pronunciation unit, to obtain the compensated acoustic score of the target pronunciation unit.
s112, updating the target pronunciation unit set with the compensated acoustic score of the target pronunciation unit, to obtain a target pronunciation unit set after acoustic compensation processing.
In steps s111 to s112, when it is detected after the acoustic recognition processing that the target pronunciation unit set satisfies the acoustic compensation condition, acoustic compensation processing may be performed on the set. Specifically, the acoustic score of the target pronunciation unit may be compensated using the acoustic scores of the other pronunciation units in the set. In one embodiment, the average acoustic score or the maximum acoustic score of those other pronunciation units may be used to perform the acoustic compensation processing. In another embodiment, an acoustic score may be selected at random from the acoustic scores of the other pronunciation units to compensate the acoustic score of the target pronunciation unit. Further, the target pronunciation unit set is updated with the compensated acoustic score of the target pronunciation unit, yielding a target pronunciation unit set after acoustic compensation processing. Since only the insufficiently or inaccurately pronounced pronunciation units in the set are compensated, the acoustic scores of the pronunciation units in the set can be improved, as can the accuracy of the acoustic compensation of the target pronunciation unit set.
In this embodiment, step s111 includes steps s211 to s214 as follows.
s211, calculating a second average value of the acoustic scores of the other pronunciation units in the target pronunciation unit set except the target pronunciation unit.
s212, acquiring the probability that the acoustic score of the target pronunciation unit is smaller than the preset acoustic score threshold.
s213, determining a compensatory acoustic score for the target pronunciation unit according to the second average value and the probability.
s214, determining the sum of the acoustic score of the target pronunciation unit and the compensatory acoustic score as the compensated acoustic score of the target pronunciation unit.
In steps s211 to s214, the audio recognition device may perform acoustic compensation processing on the acoustic score of the target pronunciation unit by using the average of the acoustic scores of the other pronunciation units in the target pronunciation unit set, to obtain the compensated acoustic score of the target pronunciation unit. Specifically, the device may calculate a second average value of the acoustic scores of the other pronunciation units using a preset averaging algorithm. Further, it acquires the probability that the acoustic score of the target pronunciation unit is smaller than the preset acoustic score threshold and determines a compensatory acoustic score according to the second average value and that probability; the sum of the target pronunciation unit's acoustic score and the compensatory acoustic score is then taken as its compensated acoustic score, which can be expressed by the following formula (2).
$$P(x'_i) = P(x_i) + \alpha\,P_{prior}(x_i) + \beta\cdot\frac{1}{N-1}\sum_{j\neq i} P(x_j) \tag{2}$$

In formula (2), $x_i$ denotes the i-th pronunciation unit in the target pronunciation unit set, i.e. $x_i$ is the target pronunciation unit; $P(x_i)$ denotes the acoustic score of the target pronunciation unit; $P_{prior}(x_i)$ denotes the probability that the acoustic score of the target pronunciation unit is less than the preset acoustic score threshold; $\frac{1}{N-1}\sum_{j\neq i} P(x_j)$ denotes the second average value, where $N$ denotes the number of pronunciation units in the target pronunciation unit set. The last two terms together form the compensatory acoustic score, and $P(x'_i)$ denotes the compensated acoustic score of the target pronunciation unit.
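A minimal sketch of formula (2), under the same reconstruction assumptions as formula (1):

```python
def compensated_score_post(scores, i, p_prior, alpha, beta):
    """Formula (2): compensate the i-th (target) pronunciation unit
    after recognition has finished, with 1-based index i.

    scores : acoustic scores P(x_1) ... P(x_N) of the whole set
    """
    others = scores[:i - 1] + scores[i:]        # the other N-1 units
    if not others:
        return scores[i - 1]                    # degenerate one-unit set
    second_average = sum(others) / len(others)  # the "second average"
    compensatory = alpha * p_prior + beta * second_average
    return scores[i - 1] + compensatory         # P(x'_i)
```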
In one embodiment, step S104 includes the following steps s311 to s313.
s311, performing text recognition on the target pronunciation unit set after the acoustic compensation processing, to obtain candidate text information corresponding to the pronunciation data and a language score of the candidate text information.
s312, determining the acoustic score of the target pronunciation unit set according to the acoustic scores of the pronunciation units in the target pronunciation unit set after the acoustic compensation processing.
s313, if the sum of the acoustic score of the target pronunciation unit set and the language score of the candidate text information is greater than a preset score threshold, determining the candidate text information as the text information corresponding to the pronunciation data.
In steps s311 to s313, the audio recognition device may query the candidate text information corresponding to the target pronunciation unit set after the acoustic compensation processing through the pronunciation dictionary, and calculate the language score of the candidate text information with the language model. The product of the acoustic scores of all pronunciation units in the compensated set is taken as the acoustic score of the target pronunciation unit set. If the sum of the acoustic score of the target pronunciation unit set and the language score is greater than the preset score threshold, the candidate text information is determined as the text information corresponding to the pronunciation data.
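A minimal sketch of this decoding step, assuming acoustic scores are probabilities in (0, 1] (the log-space product guards against numerical underflow); the candidate tuple layout is an assumption.

```python
import math

def set_acoustic_score(unit_scores):
    """Acoustic score of a pronunciation unit set: the product of its
    units' compensated acoustic scores, computed in log space."""
    return math.exp(sum(math.log(s) for s in unit_scores))

def pick_text(candidates, score_threshold):
    """candidates: list of (text, unit_scores, language_score) tuples.
    Returns the text whose decoding score clears the preset threshold."""
    best_text, best_score = None, float("-inf")
    for text, unit_scores, language_score in candidates:
        decoding = set_acoustic_score(unit_scores) + language_score
        if decoding > score_threshold and decoding > best_score:
            best_text, best_score = text, decoding
    return best_text
```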
In one embodiment, step S104 is followed by steps s411 to s413.
s411, detecting whether the text information corresponding to the pronunciation data includes a field matched with an operation instruction.
s412, if so, generating a target operation instruction according to the text information corresponding to the pronunciation data.
s413, sending the target operation instruction to the terminal, which executes the target operation instruction.
In steps s411 to s413, the audio recognition apparatus may generate the operation instruction according to the text information corresponding to the pronunciation data. Specifically, it may be detected whether a field matching the operation instruction is included in the text information corresponding to the pronunciation data, for example, the field may include "open", "close", "start", and the like. If so, the audio recognition device can generate a target operation instruction according to the text information corresponding to the pronunciation data, and when the audio recognition device is a server, the server can send the target operation instruction to the terminal, and the terminal executes the target operation instruction; when the audio recognition device is a terminal, the terminal can execute the target operation instruction.
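A minimal sketch of this instruction matching; the trigger fields and the dictionary layout of the generated instruction are illustrative assumptions.

```python
COMMAND_FIELDS = ("open", "close", "start")   # illustrative trigger fields

def build_operation_instruction(text):
    """Generate a target operation instruction when the recognized text
    contains a field that matches a known operation; otherwise None."""
    for field in COMMAND_FIELDS:
        if field in text.lower():
            return {"action": field, "payload": text}
    return None

# e.g. build_operation_instruction("open the music player")
# -> {"action": "open", "payload": "open the music player"}
```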
The audio recognition method provided by the present application can be applied to scenes such as automatic translation, voice search, voice input, and voice conversation. The following describes the method in detail, taking a voice search scene as an example with a server acting as the audio recognition device. Referring to fig. 6, fig. 6 illustrates an audio recognition method provided by the present application.
As shown in fig. 6, the terminal includes a search interface 12 containing an audio control 13 and a text input box 14; the search interface may be a browser, the user interface of a social application, or the like, and the text input box allows the user to enter the text to be searched. When the terminal detects a click operation on the audio control 13, it can acquire the audio data input by the user through a voice device and send the audio data to the server.
As shown in fig. 6, the server may obtain the acoustic feature set corresponding to the pronunciation data. Specifically, the pronunciation data can be filtered to obtain processed pronunciation data, which is then divided into frames to obtain multiple frames of pronunciation subdata. Furthermore, each frame of pronunciation subdata is transformed into the frequency domain, and feature extraction is performed on each frequency-domain frame to obtain the acoustic feature set corresponding to the pronunciation data. The acoustic feature set comprises a plurality of acoustic features arranged in sequence, each acoustic feature corresponding to one frame of pronunciation subdata.
As shown in fig. 6, the server may perform acoustic recognition processing on the acoustic feature set of the pronunciation data to obtain a plurality of candidate pronunciation unit sets corresponding to the pronunciation data, each comprising a plurality of pronunciation units and an acoustic score for each pronunciation unit; here three candidate sets are taken as an example, namely candidate pronunciation unit set 1, candidate pronunciation unit set 2, and candidate pronunciation unit set 3. After the acoustic recognition processing is completed, whether each candidate pronunciation unit set satisfies the acoustic compensation condition is detected according to the acoustic scores of its pronunciation units. If the acoustic scores of the pronunciation units in candidate pronunciation unit set 2 are all smaller than the preset acoustic score threshold, set 2 is determined not to satisfy the acoustic compensation condition. If the acoustic scores of all pronunciation units in candidate pronunciation unit set 3 are greater than or equal to the preset acoustic score threshold, set 3 likewise does not satisfy the acoustic compensation condition. If a target pronunciation unit with an acoustic score smaller than the preset acoustic score threshold exists in candidate pronunciation unit set 1, and the number of pronunciation units in set 1 whose acoustic scores are larger than the threshold is greater than the fourth number threshold, set 1 is determined to satisfy the acoustic compensation condition. For example, candidate pronunciation unit set 1 includes the pronunciation units n, e, k, s, t; when the acoustic score of the pronunciation unit t is smaller than the preset acoustic score threshold while the acoustic scores of the other pronunciation units are all greater than or equal to it, set 1 satisfies the acoustic compensation condition. Further, the average of the acoustic scores of the pronunciation units n, e, k, and s can be calculated and used to perform acoustic compensation processing on the acoustic score of the pronunciation unit t. Finally, text recognition is performed on the compensated candidate pronunciation unit set 1 and on candidate pronunciation unit set 3, yielding candidate text information 1 and candidate text information 2 corresponding to the pronunciation data together with a decoding score for each; the candidate text information with the highest decoding score is selected as the text information of the pronunciation data. A worked example with hypothetical scores follows.
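The decision for the three candidate sets can be traced with the hypothetical scores below (threshold 0.5, fourth number threshold 3); all numbers are illustrative.

```python
candidates = {
    # spells "next": only "t" scores below the threshold -> compensate
    "set1": {"n": 0.8, "e": 0.9, "k": 0.7, "s": 0.8, "t": 0.2},
    "set2": {"n": 0.3, "e": 0.2, "k": 0.4, "s": 0.1, "t": 0.2},  # all low
    "set3": {"n": 0.9, "e": 0.8, "k": 0.9, "s": 0.7, "t": 0.6},  # all high
}
THRESHOLD, FOURTH_NUMBER_THRESHOLD = 0.5, 3

for name, units in candidates.items():
    low = [u for u, s in units.items() if s < THRESHOLD]
    high = [u for u, s in units.items() if s > THRESHOLD]
    if low and len(high) > FOURTH_NUMBER_THRESHOLD:
        print(name, "-> compensate the low-scoring unit(s)", low)
    elif not low:
        print(name, "-> accurate as-is, no compensation needed")
    else:
        print(name, "-> too unreliable, skip compensation")
```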
As shown in fig. 6, the server may transmit the text information corresponding to the pronunciation data, for example "next", to the terminal, and the terminal may display it in the text input box 14 of the search interface. Optionally, the server may further generate a search instruction from the text information and send it to the terminal, the search instruction instructing the terminal to search for entries associated with the text information. The terminal can receive and execute the search instruction and output a plurality of entries related to the text information.
The embodiment of the present application provides an audio recognition apparatus, which can be disposed in an audio recognition device; for example, the apparatus may be a decoder in the audio recognition device or an application program with a decoding function. Referring to fig. 7, the apparatus includes:
an obtaining unit 701, configured to obtain pronunciation data to be recognized, and extract an acoustic feature set of the pronunciation data, where the acoustic feature set includes a plurality of acoustic features;
a recognition unit 702, configured to perform acoustic recognition processing on the acoustic feature set of the pronunciation data to obtain a target pronunciation unit set corresponding to the pronunciation data, where the target pronunciation unit set includes a plurality of pronunciation units and an acoustic score of each pronunciation unit;
a compensation unit 703, configured to perform acoustic compensation processing on the acoustic score of each pronunciation unit in the target pronunciation unit set;
the recognition unit 702 is further configured to perform text recognition on the target pronunciation unit set after the acoustic compensation processing to obtain text information corresponding to the pronunciation data.
Optionally, the recognition unit 702 is specifically configured to sequentially identify each acoustic feature in the acoustic feature set according to the arrangement order of the acoustic features; calculate an acoustic score for each recognized pronunciation unit; and obtain the target pronunciation unit set after every acoustic feature in the acoustic feature set has been recognized; wherein the recognized order of each pronunciation unit in the target pronunciation unit set corresponds to its pronunciation order.
Optionally, the determining unit 704 is configured to determine, during the acoustic recognition processing, whether the target pronunciation unit set satisfies an acoustic compensation condition according to the acoustic score of each pronunciation unit in the target pronunciation unit set; or after the acoustic recognition processing is finished, judging whether the target pronunciation unit set meets an acoustic compensation condition according to the acoustic score of each pronunciation unit in the target pronunciation unit set; and if so, executing the step of performing acoustic compensation processing on the acoustic scores of all the pronunciation units in the target pronunciation unit set.
Optionally, the determining unit 704 is specifically configured to judge, every time a pronunciation unit is recognized, whether the currently recognized pronunciation unit is a first pronunciation unit, where the first pronunciation unit is a pronunciation unit to be compensated obtained by statistics in the historical audio recognition process; if so, verify whether the acoustic score of the currently recognized pronunciation unit is smaller than a preset acoustic score threshold; if it is smaller than the preset acoustic score threshold, count the number of all pronunciation units in the target pronunciation unit set whose pronunciation order precedes the pronunciation order of the currently recognized pronunciation unit, and compare the acoustic score of each such pronunciation unit with the preset acoustic score threshold; and if the counted number is greater than a first number threshold and the acoustic scores of all those preceding pronunciation units are greater than or equal to the preset acoustic score threshold, determine that the target pronunciation unit set satisfies the acoustic compensation condition.
Optionally, the determining unit 704 is specifically configured to verify, every time a pronunciation unit is recognized, whether the acoustic score of the currently recognized pronunciation unit is smaller than a preset acoustic score threshold; if it is smaller than the preset acoustic score threshold, count the number of all pronunciation units in the target pronunciation unit set whose pronunciation order precedes the pronunciation order of the currently recognized pronunciation unit, and compare the acoustic score of each such pronunciation unit with the preset acoustic score threshold; and if the counted number is greater than a second number threshold and the acoustic scores of all those preceding pronunciation units are greater than or equal to the preset acoustic score threshold, determine that the target pronunciation unit set satisfies the acoustic compensation condition.
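A minimal sketch of this real-time check, covering both variants (with and without the statistically derived set of first pronunciation units); the names and the default argument are assumptions.

```python
def realtime_compensation_condition(prior_scores, current_unit, current_score,
                                    score_threshold, number_threshold,
                                    to_compensate=None):
    """Evaluated each time a pronunciation unit is recognized.

    prior_scores  : acoustic scores of units recognized before this one
    to_compensate : optional set of historically under-pronounced units;
                    when given, only those units can trigger compensation
    """
    if to_compensate is not None and current_unit not in to_compensate:
        return False
    if current_score >= score_threshold:
        return False                  # current unit scored well enough
    # enough well-scored history, and *all* earlier units cleared the bar
    return (len(prior_scores) > number_threshold
            and all(s >= score_threshold for s in prior_scores))
```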
Optionally, the compensating unit 703 is specifically configured to perform acoustic compensation processing on the acoustic score of the currently recognized pronunciation unit by using the acoustic scores of all pronunciation units in the target pronunciation unit set, of which the pronunciation order is before the pronunciation order of the currently recognized pronunciation unit, to obtain a compensated acoustic score of the currently recognized pronunciation unit; and updating the target pronunciation unit set by adopting the compensated acoustic score of the currently recognized pronunciation unit to obtain a target pronunciation unit set subjected to acoustic compensation processing.
Optionally, the compensating unit 703 is specifically configured to calculate a first average value of the acoustic scores of all pronunciation units in the target pronunciation unit set whose pronunciation order precedes the pronunciation order of the currently recognized pronunciation unit; acquire the probability that the acoustic score of the currently recognized pronunciation unit is smaller than the preset acoustic score threshold; determine a compensatory acoustic score for the currently recognized pronunciation unit based on the first average value and the probability; and determine the sum of the acoustic score of the currently recognized pronunciation unit and the compensatory acoustic score as the compensated acoustic score of the currently recognized pronunciation unit.
Optionally, the audio recognition apparatus further includes: a determining unit 704, configured to detect whether a target pronunciation unit identical to a first pronunciation unit exists in the target pronunciation unit set after the acoustic recognition processing is completed, where the first pronunciation unit is a pronunciation unit to be compensated obtained by statistics in a historical audio recognition process; if yes, verifying whether the acoustic score of the target pronunciation unit is smaller than a preset acoustic score threshold value; if the acoustic score is smaller than the preset acoustic score threshold, counting the number of all the pronunciation units with the acoustic scores larger than or equal to the preset acoustic score threshold in the target pronunciation unit set; and if the counted number is greater than a third number threshold, determining that the target pronunciation unit set meets an acoustic compensation condition.
Optionally, the determining unit 704 is specifically configured to determine, after the acoustic recognition processing is completed, whether a target pronunciation unit with an acoustic score smaller than a preset acoustic score threshold exists in the target pronunciation unit set;
if yes, counting the number of all the pronunciation units with acoustic scores larger than or equal to the preset acoustic score threshold value in the target pronunciation unit set; and if the counted number is greater than a fourth number threshold, determining that the target pronunciation unit set meets an acoustic compensation condition.
Optionally, the compensating unit 703 is specifically configured to perform acoustic compensation processing on the acoustic score of the target pronunciation unit by using the acoustic scores of other pronunciation units in the target pronunciation unit set except for the target pronunciation unit, so as to obtain a compensated acoustic score of the target pronunciation unit; and updating the target pronunciation unit set by adopting the compensated acoustic score of the target pronunciation unit to obtain a target pronunciation unit set subjected to acoustic compensation processing.
Optionally, the compensating unit 703 is specifically configured to calculate a second average value of the acoustic scores of the other pronunciation units in the target pronunciation unit set except the target pronunciation unit; acquire the probability that the acoustic score of the target pronunciation unit is smaller than the preset acoustic score threshold; determine a compensatory acoustic score for the target pronunciation unit based on the second average value and the probability; and determine the sum of the acoustic score of the target pronunciation unit and the compensatory acoustic score as the compensated acoustic score of the target pronunciation unit.
Optionally, the recognition unit 702 is specifically configured to perform text recognition on the target pronunciation unit set after the acoustic compensation processing, so as to obtain candidate text information corresponding to the pronunciation data and a linguistic score of the candidate text information; determining the acoustic scores of the target pronunciation unit set according to the acoustic scores of all pronunciation units in the target pronunciation unit set after the acoustic compensation processing; and if the sum of the acoustic score and the language score of the target pronunciation unit set is greater than a preset score threshold value, determining the candidate text information as the text information corresponding to the pronunciation data.
Optionally, the audio recognition apparatus further includes: the generating unit 705 is configured to detect whether a field matched with an operation instruction is included in text information corresponding to the pronunciation data; if yes, generating a target operation instruction according to the text information corresponding to the pronunciation data, sending the target operation instruction to a terminal, and executing the target operation instruction by the terminal.
In the embodiment of the present application, performing acoustic compensation processing on the acoustic scores of the pronunciation units in the target pronunciation unit set raises those acoustic scores and alleviates the problem of low acoustic scores caused by inaccurate or insufficient pronunciation. In addition, performing text recognition on the target pronunciation unit set after the acoustic compensation processing improves the accuracy of recognizing the pronunciation data without affecting the recognition of the pronunciation data of other audio words. Moreover, the present application does not need to optimize the acoustic model with a large amount of training data to improve audio recognition accuracy; that is, neither large-scale training data acquisition nor extensive iterative training of the acoustic model is required, which reduces the difficulty of obtaining data and saves substantial resources.
An embodiment of the present application provides an audio recognition device; please refer to fig. 8. The audio recognition device includes: a processor 151, a user interface 152, a network interface 154, and a storage device 155, connected via a bus 153.
A user interface 152 for human-computer interaction, which may include a display screen, a keyboard, and the like. A network interface 154 for communication connection with external devices. A storage device 155, coupled to the processor 151, for storing various software programs and/or sets of instructions. In particular implementations, the storage device 155 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic disk storage devices, flash memory devices, or other non-volatile solid-state storage devices. The storage device 155 may store an operating system, such as an embedded operating system like ANDROID, IOS, WINDOWS, or LINUX. The storage device 155 may also store a network communication program that can be used to communicate with one or more additional devices and one or more audio recognition devices. The storage device 155 may further store a user interface program that can vividly display the content of an application program through a graphical interface and receive the user's control operations on the application program through input controls such as menus, dialog boxes, and buttons. The storage device 155 may also store an acoustic model, a language model, a pronunciation dictionary, and the like.
In one embodiment, the storage 155 may be used to store one or more instructions; the processor 151 may be capable of implementing an audio recognition method when invoking the one or more instructions, and specifically, the processor 151 invokes the one or more instructions to perform the following steps:
acquiring pronunciation data to be recognized, and extracting an acoustic feature set of the pronunciation data, wherein the acoustic feature set comprises a plurality of acoustic features;
performing acoustic recognition processing on the acoustic feature set of the pronunciation data to obtain a target pronunciation unit set corresponding to the pronunciation data, wherein the target pronunciation unit set comprises a plurality of pronunciation units and acoustic scores of each pronunciation unit;
performing acoustic compensation processing on the acoustic scores of all the pronunciation units in the target pronunciation unit set;
and performing text recognition on the target pronunciation unit set subjected to the acoustic compensation processing to obtain text information corresponding to the pronunciation data.
Optionally, the processor calls an instruction to perform the following steps:
the acoustic recognition processing on the acoustic feature set of the pronunciation data includes:
sequentially identifying each acoustic feature in the acoustic feature set according to the arrangement sequence of each acoustic feature in the acoustic feature set;
calculating an acoustic score for each of the pronunciation units identified;
obtaining the target pronunciation unit set after each acoustic feature in the acoustic feature set is recognized;
wherein the recognized order of each pronunciation unit in the target pronunciation unit set corresponds to the pronunciation order of each pronunciation unit.
Optionally, the processor calls an instruction to perform the following steps:
in the acoustic recognition processing process, judging whether the target pronunciation unit set meets an acoustic compensation condition according to the acoustic scores of all pronunciation units in the target pronunciation unit set; or after the acoustic recognition processing is finished, judging whether the target pronunciation unit set meets an acoustic compensation condition according to the acoustic score of each pronunciation unit in the target pronunciation unit set;
and if so, executing the step of performing acoustic compensation processing on the acoustic scores of all the pronunciation units in the target pronunciation unit set.
Optionally, the processor calls an instruction to perform the following steps:
judging whether the currently recognized pronunciation unit is a first pronunciation unit or not when one pronunciation unit is recognized, wherein the first pronunciation unit is a pronunciation unit to be compensated obtained through statistics in the historical audio recognition process;
if so, verifying whether the acoustic score of the currently recognized pronunciation unit is smaller than a preset acoustic score threshold value;
if it is smaller than the preset acoustic score threshold, counting the number of all pronunciation units in the target pronunciation unit set whose pronunciation order precedes the pronunciation order of the currently recognized pronunciation unit, and comparing the acoustic score of each such pronunciation unit with the preset acoustic score threshold;
and if the counted number is greater than a first number threshold value, and the acoustic scores of all the pronunciation units in the target pronunciation unit set, of which the pronunciation sequence is located before the pronunciation sequence of the currently recognized pronunciation unit, are greater than or equal to the preset acoustic score threshold value, determining that the target pronunciation unit set meets the acoustic compensation condition.
Optionally, the processor calls an instruction to perform the following steps:
every time a pronunciation unit is identified, verifying whether the acoustic score of the currently identified pronunciation unit is smaller than a preset acoustic score threshold value;
if it is smaller than the preset acoustic score threshold, counting the number of all pronunciation units in the target pronunciation unit set whose pronunciation order precedes the pronunciation order of the currently recognized pronunciation unit, and comparing the acoustic score of each such pronunciation unit with the preset acoustic score threshold;
and if the counted number is greater than a second number threshold, and the acoustic scores of all the pronunciation units in the target pronunciation unit set, of which the pronunciation sequence is located before the pronunciation sequence of the currently recognized pronunciation unit, are greater than or equal to the preset acoustic score threshold, determining that the target pronunciation unit set meets the acoustic compensation condition.
Optionally, the processor calls an instruction to perform the following steps:
performing acoustic compensation processing on the acoustic scores of the currently recognized pronunciation units by using the acoustic scores of all pronunciation units in the target pronunciation unit set, wherein the pronunciation sequence of all pronunciation units is before the pronunciation sequence of the currently recognized pronunciation unit, so as to obtain the compensated acoustic score of the currently recognized pronunciation unit;
and updating the target pronunciation unit set by adopting the compensated acoustic score of the currently recognized pronunciation unit to obtain a target pronunciation unit set subjected to acoustic compensation processing.
Optionally, the processor calls an instruction to perform the following steps:
calculating a first average of the acoustic scores of all the pronunciation units in the target set of pronunciation units whose pronunciation order precedes the pronunciation order of the currently recognized pronunciation unit;
acquiring the probability that the acoustic score of the currently recognized pronunciation unit is smaller than the preset acoustic score threshold;
determining a compensatory acoustic score for the currently identified pronunciation unit based on the first average and the probability;
determining the sum of the acoustic score of the currently recognized pronunciation unit and the compensatory acoustic score as the compensated acoustic score of the currently recognized pronunciation unit.
Optionally, the processor calls an instruction to perform the following steps:
after the acoustic recognition processing is finished, detecting whether a target pronunciation unit identical to a first pronunciation unit exists in the target pronunciation unit set, wherein the first pronunciation unit is a pronunciation unit to be compensated obtained through statistics in the historical audio recognition process;
if yes, verifying whether the acoustic score of the target pronunciation unit is smaller than a preset acoustic score threshold value;
if the acoustic score is smaller than the preset acoustic score threshold, counting the number of all the pronunciation units with the acoustic scores larger than or equal to the preset acoustic score threshold in the target pronunciation unit set;
and if the counted number is greater than a third number threshold, determining that the target pronunciation unit set meets an acoustic compensation condition.
Optionally, the processor calls an instruction to perform the following steps:
after the acoustic recognition processing is finished, judging whether a target pronunciation unit with an acoustic score smaller than a preset acoustic score threshold exists in the target pronunciation unit set or not;
if yes, counting the number of all the pronunciation units with acoustic scores larger than or equal to the preset acoustic score threshold value in the target pronunciation unit set;
and if the counted number is greater than a fourth number threshold, determining that the target pronunciation unit set meets an acoustic compensation condition.
Optionally, the processor calls an instruction to perform the following steps:
performing acoustic compensation processing on the acoustic scores of the target pronunciation units by adopting the acoustic scores of other pronunciation units in the target pronunciation unit set except the target pronunciation unit to obtain the compensated acoustic scores of the target pronunciation units;
and updating the target pronunciation unit set by adopting the compensated acoustic score of the target pronunciation unit to obtain a target pronunciation unit set subjected to acoustic compensation processing.
Optionally, the processor calls an instruction to perform the following steps:
calculating a second average value of the acoustic scores of other pronunciation units in the target pronunciation unit set except the target pronunciation unit;
acquiring the probability that the acoustic score of the target pronunciation unit is smaller than the preset acoustic score threshold;
determining a compensatory acoustic score for the target pronunciation unit based on the second average value and the probability;
determining the sum of the acoustic score of the target pronunciation unit and the compensatory acoustic score as the compensated acoustic score of the target pronunciation unit.
Optionally, the processor calls an instruction to perform the following steps:
performing text recognition on the target pronunciation unit set subjected to the acoustic compensation processing to obtain candidate text information corresponding to the pronunciation data and a linguistic score of the candidate text information;
determining the acoustic scores of the target pronunciation unit set according to the acoustic scores of all pronunciation units in the target pronunciation unit set after the acoustic compensation processing;
and if the sum of the acoustic score and the language score of the target pronunciation unit set is greater than a preset score threshold value, determining the candidate text information as the text information corresponding to the pronunciation data.
Optionally, the processor calls an instruction to perform the following steps:
detecting whether a field matched with an operation instruction is included in text information corresponding to the pronunciation data;
if yes, generating a target operation instruction according to the text information corresponding to the pronunciation data, sending the target operation instruction to a terminal, and executing the target operation instruction by the terminal.
In the embodiment of the present application, performing acoustic compensation processing on the acoustic scores of the pronunciation units in the target pronunciation unit set raises those acoustic scores and alleviates the problem of low acoustic scores caused by inaccurate or insufficient pronunciation. In addition, performing text recognition on the target pronunciation unit set after the acoustic compensation processing improves the accuracy of recognizing the pronunciation data without affecting the recognition of the pronunciation data of other audio words. Moreover, the present application does not need to optimize the acoustic model with a large amount of training data to improve audio recognition accuracy; that is, neither large-scale training data acquisition nor extensive iterative training of the acoustic model is required, which reduces the difficulty of obtaining data and saves substantial resources.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; for the implementation and beneficial effects of the program, reference may be made to the embodiments of the audio recognition method illustrated in fig. 2, and details are not repeated here.
The above disclosure describes only some examples of the present application and certainly should not be taken as limiting its scope; equivalent variations made according to the claims of the present application therefore still fall within the scope of the application.

Claims (15)

1. A method for audio recognition, the method comprising:
acquiring pronunciation data to be recognized, and extracting an acoustic feature set of the pronunciation data, wherein the acoustic feature set comprises a plurality of acoustic features;
performing acoustic recognition processing on the acoustic feature set of the pronunciation data to obtain a target pronunciation unit set corresponding to the pronunciation data, wherein the target pronunciation unit set comprises a plurality of pronunciation units and acoustic scores of each pronunciation unit;
performing acoustic compensation processing on the acoustic scores of all the pronunciation units in the target pronunciation unit set;
and performing text recognition on the target pronunciation unit set subjected to the acoustic compensation processing to obtain text information corresponding to the pronunciation data.
2. The method of claim 1, wherein the pronunciation unit includes a plurality of pronunciation states, each pronunciation state corresponding to an acoustic feature; arranging acoustic features in the acoustic feature set of the pronunciation data in sequence;
the acoustic recognition processing on the acoustic feature set of the pronunciation data includes:
sequentially identifying each acoustic feature in the acoustic feature set according to the arrangement sequence of each acoustic feature in the acoustic feature set;
calculating an acoustic score for each of the pronunciation units identified;
obtaining the target pronunciation unit set after each acoustic feature in the acoustic feature set is recognized;
wherein the recognized order of each pronunciation unit in the target pronunciation unit set corresponds to the pronunciation order of each pronunciation unit.
3. The method of claim 2, wherein the method further comprises:
in the acoustic recognition processing process, judging whether the target pronunciation unit set meets an acoustic compensation condition according to the acoustic scores of all pronunciation units in the target pronunciation unit set; or after the acoustic recognition processing is finished, judging whether the target pronunciation unit set meets an acoustic compensation condition according to the acoustic score of each pronunciation unit in the target pronunciation unit set;
and if so, executing the step of performing acoustic compensation processing on the acoustic scores of all the pronunciation units in the target pronunciation unit set.
4. The method as claimed in claim 3, wherein the determining whether the target pronunciation unit set satisfies an acoustic compensation condition according to the acoustic scores of the pronunciation units in the target pronunciation unit set during the acoustic recognition processing comprises:
judging whether the currently recognized pronunciation unit is a first pronunciation unit or not when one pronunciation unit is recognized, wherein the first pronunciation unit is a pronunciation unit to be compensated obtained through statistics in the historical audio recognition process;
if so, verifying whether the acoustic score of the currently recognized pronunciation unit is smaller than a preset acoustic score threshold value;
if it is smaller than the preset acoustic score threshold, counting the number of all pronunciation units in the target pronunciation unit set whose pronunciation order precedes the pronunciation order of the currently recognized pronunciation unit, and comparing the acoustic score of each such pronunciation unit with the preset acoustic score threshold;
and if the counted number is greater than a first number threshold value, and the acoustic scores of all the pronunciation units in the target pronunciation unit set, of which the pronunciation sequence is located before the pronunciation sequence of the currently recognized pronunciation unit, are greater than or equal to the preset acoustic score threshold value, determining that the target pronunciation unit set meets the acoustic compensation condition.
5. The method as claimed in claim 3, wherein the determining whether the target pronunciation unit set satisfies an acoustic compensation condition according to the acoustic scores of the pronunciation units in the target pronunciation unit set during the acoustic recognition processing comprises:
every time a pronunciation unit is identified, verifying whether the acoustic score of the currently identified pronunciation unit is smaller than a preset acoustic score threshold value;
if it is smaller than the preset acoustic score threshold, counting the number of all pronunciation units in the target pronunciation unit set whose pronunciation order precedes the pronunciation order of the currently recognized pronunciation unit, and comparing the acoustic score of each such pronunciation unit with the preset acoustic score threshold;
and if the counted number is greater than a second number threshold, and the acoustic scores of all the pronunciation units in the target pronunciation unit set, of which the pronunciation sequence is located before the pronunciation sequence of the currently recognized pronunciation unit, are greater than or equal to the preset acoustic score threshold, determining that the target pronunciation unit set meets the acoustic compensation condition.
6. The method of claim 4 or 5, wherein said acoustically compensating the acoustic scores of each pronunciation unit in the set of target pronunciation units comprises:
performing acoustic compensation processing on the acoustic scores of the currently recognized pronunciation units by using the acoustic scores of all pronunciation units in the target pronunciation unit set, wherein the pronunciation sequence of all pronunciation units is before the pronunciation sequence of the currently recognized pronunciation unit, so as to obtain the compensated acoustic score of the currently recognized pronunciation unit;
and updating the target pronunciation unit set by adopting the compensated acoustic score of the currently recognized pronunciation unit to obtain a target pronunciation unit set subjected to acoustic compensation processing.
7. The method of claim 6, wherein the performing acoustic compensation processing on the acoustic score of the currently recognized pronunciation unit by using the acoustic scores of all pronunciation units in the target pronunciation unit set whose pronunciation order precedes the pronunciation order of the currently recognized pronunciation unit, to obtain the compensated acoustic score of the currently recognized pronunciation unit, comprises:
calculating a first average of the acoustic scores of all pronunciation units in the target pronunciation unit set whose pronunciation order precedes the pronunciation order of the currently recognized pronunciation unit;
acquiring the probability that the acoustic score of the currently recognized pronunciation unit is smaller than the preset acoustic score threshold;
determining a compensatory acoustic score for the currently recognized pronunciation unit based on the first average and the probability;
and determining the sum of the acoustic score of the currently recognized pronunciation unit and the compensatory acoustic score as the compensated acoustic score of the currently recognized pronunciation unit.
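A sketch of claim 7's arithmetic follows. The claim does not say how the first average and the probability combine into the compensatory score; taking their product is one plausible reading and is an assumption here, as are all function and variable names.

```python
def compensate_current(units, current_index, prob_below_threshold):
    """prob_below_threshold: probability that the current unit's acoustic
    score falls below the preset threshold (e.g. estimated offline)."""
    # non-empty whenever the claim-4/5 condition held before compensation
    preceding_scores = [score for _, score in units[:current_index]]
    first_average = sum(preceding_scores) / len(preceding_scores)
    compensatory = first_average * prob_below_threshold  # assumed combination
    unit, score = units[current_index]
    compensated = score + compensatory  # the sum in the last step of claim 7
    units[current_index] = (unit, compensated)  # update the set, per claim 6
    return units
```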
8. The method of claim 3, wherein the determining whether the target pronunciation unit set satisfies the acoustic compensation condition according to the acoustic scores of the pronunciation units in the target pronunciation unit set after the acoustic recognition processing is finished comprises:
after the acoustic recognition processing is finished, detecting whether the target pronunciation unit set contains a target pronunciation unit identical to a first pronunciation unit, wherein the first pronunciation unit is a pronunciation unit to be compensated that was obtained statistically from historical audio recognition;
if so, verifying whether the acoustic score of the target pronunciation unit is smaller than a preset acoustic score threshold;
if the acoustic score is smaller than the preset acoustic score threshold, counting the number of pronunciation units in the target pronunciation unit set whose acoustic scores are greater than or equal to the preset acoustic score threshold;
and if the counted number is greater than a third number threshold, determining that the target pronunciation unit set satisfies an acoustic compensation condition.
9. The method of claim 3, wherein the determining whether the target pronunciation unit set satisfies the acoustic compensation condition according to the acoustic scores of the pronunciation units in the target pronunciation unit set after the acoustic recognition processing is finished comprises:
after the acoustic recognition processing is finished, determining whether the target pronunciation unit set contains a target pronunciation unit whose acoustic score is smaller than a preset acoustic score threshold;
if so, counting the number of pronunciation units in the target pronunciation unit set whose acoustic scores are greater than or equal to the preset acoustic score threshold;
and if the counted number is greater than a fourth number threshold, determining that the target pronunciation unit set satisfies an acoustic compensation condition.
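Claims 8 and 9 share the same post-recognition check, with claim 8 additionally restricting it to units flagged by historical statistics; a sketch covering both follows, where `watchlist` stands in for the historically flagged first pronunciation units and every threshold is an illustrative placeholder.

```python
def find_units_to_compensate(units, score_threshold, count_threshold, watchlist=None):
    """Return indices of units to compensate after recognition finishes."""
    targets = [
        i for i, (unit, score) in enumerate(units)
        if score < score_threshold and (watchlist is None or unit in watchlist)
    ]
    if not targets:
        return []
    # compensate only when enough of the utterance scored well overall
    well_scored = sum(1 for _, score in units if score >= score_threshold)
    return targets if well_scored > count_threshold else []
```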
10. The method of claim 8 or 9, wherein the performing acoustic compensation processing on the acoustic scores of the pronunciation units in the target pronunciation unit set comprises:
performing acoustic compensation processing on the acoustic score of the target pronunciation unit by using the acoustic scores of the pronunciation units in the target pronunciation unit set other than the target pronunciation unit, to obtain a compensated acoustic score of the target pronunciation unit;
and updating the target pronunciation unit set with the compensated acoustic score of the target pronunciation unit to obtain a target pronunciation unit set subjected to acoustic compensation processing.
11. The method of claim 10, wherein the performing acoustic compensation processing on the acoustic score of the target pronunciation unit by using the acoustic scores of the pronunciation units in the target pronunciation unit set other than the target pronunciation unit, to obtain the compensated acoustic score of the target pronunciation unit, comprises:
calculating a second average of the acoustic scores of the pronunciation units in the target pronunciation unit set other than the target pronunciation unit;
acquiring the probability that the acoustic score of the target pronunciation unit is smaller than the preset acoustic score threshold;
determining a compensatory acoustic score for the target pronunciation unit based on the second average and the probability;
and determining the sum of the acoustic score of the target pronunciation unit and the compensatory acoustic score as the compensated acoustic score of the target pronunciation unit.
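A sketch of claims 10 and 11 follows: a flagged unit is compensated from the average of the other units' scores. As in the claim-7 sketch, multiplying the average by the probability is an assumed combination, not spelled out in the claims.

```python
def compensate_target(units, target_index, prob_below_threshold):
    """Compensate one target unit in place from the second average."""
    other_scores = [s for i, (_, s) in enumerate(units) if i != target_index]
    second_average = sum(other_scores) / len(other_scores)  # second average
    unit, score = units[target_index]
    compensated = score + second_average * prob_below_threshold
    units[target_index] = (unit, compensated)  # update the set, per claim 10
    return units
```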
12. The method of claim 1, wherein the performing text recognition on the target pronunciation unit set after the acoustic compensation processing to obtain text information corresponding to the pronunciation data comprises:
performing text recognition on the target pronunciation unit set subjected to the acoustic compensation processing to obtain candidate text information corresponding to the pronunciation data and a linguistic score of the candidate text information;
determining an acoustic score of the target pronunciation unit set according to the acoustic scores of the pronunciation units in the target pronunciation unit set after the acoustic compensation processing;
and if the sum of the acoustic score of the target pronunciation unit set and the linguistic score is greater than a preset score threshold, determining the candidate text information as the text information corresponding to the pronunciation data.
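A sketch of claim 12's decision rule follows. The claim leaves open how the per-unit scores aggregate into the set-level acoustic score; summing them is an assumption here, as is every name.

```python
def accept_candidate(compensated_units, linguistic_score, total_score_threshold):
    """Accept the candidate text if acoustic + linguistic score clears the bar."""
    acoustic_score = sum(score for _, score in compensated_units)
    return (acoustic_score + linguistic_score) > total_score_threshold
```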
13. The method of claim 1, wherein the method further comprises:
detecting whether the text information corresponding to the pronunciation data includes a field that matches an operation instruction;
and if so, generating a target operation instruction according to the text information corresponding to the pronunciation data and sending the target operation instruction to a terminal, so that the terminal executes the target operation instruction.
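A sketch of claim 13's command dispatch follows; COMMAND_FIELDS, the instruction format, and send_to_terminal are all invented for illustration.

```python
# hypothetical mapping from matchable text fields to operation instructions
COMMAND_FIELDS = {"turn on the light": "LIGHT_ON", "pause": "PAUSE_MEDIA"}

def dispatch(text, send_to_terminal):
    for field, instruction in COMMAND_FIELDS.items():
        if field in text:  # the text contains a field matching an instruction
            # the terminal receives the target instruction and executes it
            send_to_terminal({"op": instruction, "source_text": text})
            return True
    return False
```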
14. An audio recognition apparatus, characterized in that the apparatus comprises:
an acquisition unit, configured to acquire pronunciation data to be recognized and extract an acoustic feature set of the pronunciation data, wherein the acoustic feature set comprises a plurality of acoustic features;
a recognition unit, configured to perform acoustic recognition processing on the acoustic feature set of the pronunciation data to obtain a target pronunciation unit set corresponding to the pronunciation data, wherein the target pronunciation unit set comprises a plurality of pronunciation units and an acoustic score of each pronunciation unit;
a compensation unit, configured to perform acoustic compensation processing on the acoustic scores of the pronunciation units in the target pronunciation unit set;
the recognition unit is further configured to perform text recognition on the target pronunciation unit set after the acoustic compensation processing to obtain text information corresponding to the pronunciation data.
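A sketch of how the units of claim 14 might compose into one pipeline; the class and the four injected callables are illustrative, not the patent's implementation.

```python
class AudioRecognizer:
    def __init__(self, acquire, recognize_acoustics, compensate, recognize_text):
        self.acquire = acquire                          # acquisition unit
        self.recognize_acoustics = recognize_acoustics  # recognition unit (acoustic)
        self.compensate = compensate                    # compensation unit
        self.recognize_text = recognize_text            # recognition unit (text)

    def run(self, audio):
        features = self.acquire(audio)              # acoustic feature set
        units = self.recognize_acoustics(features)  # [(unit, acoustic_score), ...]
        units = self.compensate(units)              # acoustic compensation
        return self.recognize_text(units)           # text information
```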
15. An audio recognition device, comprising:
a processor adapted to implement one or more instructions; and
a computer storage medium having stored thereon one or more instructions adapted to be loaded by the processor and to perform the method of any of claims 1-13.
CN201911093916.5A 2019-11-08 2019-11-08 Audio identification method, device and equipment Active CN110853669B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911093916.5A CN110853669B (en) 2019-11-08 2019-11-08 Audio identification method, device and equipment

Publications (2)

Publication Number Publication Date
CN110853669A (en) 2020-02-28
CN110853669B CN110853669B (en) 2023-05-16

Family

ID=69601088

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911093916.5A Active CN110853669B (en) 2019-11-08 2019-11-08 Audio identification method, device and equipment

Country Status (1)

Country Link
CN (1) CN110853669B (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5710866A (en) * 1995-05-26 1998-01-20 Microsoft Corporation System and method for speech recognition using dynamically adjusted confidence measure
JP2003044085A (en) * 2001-07-27 2003-02-14 Nec Corp Dictation device with command input function
US20050119885A1 (en) * 2003-11-28 2005-06-02 Axelrod Scott E. Speech recognition utilizing multitude of speech features
US20100094629A1 (en) * 2007-02-28 2010-04-15 Tadashi Emori Weight coefficient learning system and audio recognition system
JP2013117683A (en) * 2011-12-05 2013-06-13 Nippon Hoso Kyokai <Nhk> Voice recognizer, error tendency learning method and program
JP2014115499A (en) * 2012-12-11 2014-06-26 Nippon Hoso Kyokai <Nhk> Voice recognition device, error correction model learning method, and program
US20150039299A1 (en) * 2013-07-31 2015-02-05 Google Inc. Context-based speech recognition
CN104064184A (en) * 2014-06-24 2014-09-24 科大讯飞股份有限公司 Construction method of heterogeneous decoding network, system thereof, voice recognition method and system thereof
US20170084268A1 (en) * 2015-09-18 2017-03-23 Samsung Electronics Co., Ltd. Apparatus and method for speech recognition, and apparatus and method for training transformation parameter
CN106611597A (en) * 2016-12-02 2017-05-03 百度在线网络技术(北京)有限公司 Voice wakeup method and voice wakeup device based on artificial intelligence
CN110364142A (en) * 2019-06-28 2019-10-22 腾讯科技(深圳)有限公司 Phoneme of speech sound recognition methods and device, storage medium and electronic device
CN110428809A (en) * 2019-06-28 2019-11-08 腾讯科技(深圳)有限公司 Phoneme of speech sound recognition methods and device, storage medium and electronic device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SHOEI SATO et al.: "Robust speech recognition by using compensated acoustic scores" *
强晟: "Research on the application of tone information in Chinese speech recognition systems" *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111667821A (en) * 2020-05-27 2020-09-15 山西东易园智能家居科技有限公司 Voice recognition system and recognition method
CN111640456A (en) * 2020-06-04 2020-09-08 合肥讯飞数码科技有限公司 Overlapped sound detection method, device and equipment
CN111640456B (en) * 2020-06-04 2023-08-22 合肥讯飞数码科技有限公司 Method, device and equipment for detecting overlapping sound

Also Published As

Publication number Publication date
CN110853669B (en) 2023-05-16

Similar Documents

Publication Publication Date Title
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
CN110491382B (en) Speech recognition method and device based on artificial intelligence and speech interaction equipment
WO2021093449A1 (en) Wakeup word detection method and apparatus employing artificial intelligence, device, and medium
WO2021051544A1 (en) Voice recognition method and device
EP3690875A1 (en) Training and testing utterance-based frameworks
CN107972028B (en) Man-machine interaction method and device and electronic equipment
CN109686383B (en) Voice analysis method, device and storage medium
CN112102850B (en) Emotion recognition processing method and device, medium and electronic equipment
CN106875936A (en) Audio recognition method and device
CN112837401B (en) Information processing method, device, computer equipment and storage medium
CN102945673A (en) Continuous speech recognition method with speech command range changed dynamically
CN112825248A (en) Voice processing method, model training method, interface display method and equipment
Kumar et al. Machine learning based speech emotions recognition system
CN110853669B (en) Audio identification method, device and equipment
CN109065026B (en) Recording control method and device
CN114550706A (en) Smart campus voice recognition method based on deep learning
CN111145748B (en) Audio recognition confidence determining method, device, equipment and storage medium
CN112185357A (en) Device and method for simultaneously recognizing human voice and non-human voice
CN113393841A (en) Training method, device and equipment of speech recognition model and storage medium
CN115132170A (en) Language classification method and device and computer readable storage medium
CN114495981A (en) Method, device, equipment, storage medium and product for judging voice endpoint
CN113763992A (en) Voice evaluation method and device, computer equipment and storage medium
KR20210081166A (en) Spoken language identification apparatus and method in multilingual environment
CN112581937A (en) Method and device for acquiring voice instruction
CN116959421B (en) Method and device for processing audio data, audio data processing equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (ref country code: HK; ref legal event code: DE; ref document number: 40022454)
GR01 Patent grant