CN113593523A - Speech detection method and device based on artificial intelligence and electronic equipment - Google Patents

Speech detection method and device based on artificial intelligence and electronic equipment

Info

Publication number
CN113593523A
Authority
CN
China
Prior art keywords
classification
pronunciation
language
network
processing
Prior art date
Legal status
Pending
Application number
CN202110074985.2A
Other languages
Chinese (zh)
Inventor
林炳怀
王丽园
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110074985.2A
Publication of CN113593523A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/005 Language recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/04 Segmentation; Word boundary detection
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques in which the extracted parameters are the cepstrum
    • G10L 25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques using neural networks

Abstract

The application provides an artificial-intelligence-based speech detection method and apparatus, an electronic device, and a computer-readable storage medium. The method includes: dividing an audio signal into a plurality of pronunciation segments and acquiring the audio features of each pronunciation segment; performing human voice classification processing on each pronunciation segment based on its audio features to obtain a human voice classification result of each pronunciation segment; performing language classification processing on each pronunciation segment based on its audio features to obtain a language classification result of each pronunciation segment; and determining a human voice classification result of the audio signal based on the human voice classification results of the pronunciation segments, and determining a language classification result of the audio signal based on the language classification results of the pronunciation segments. The method and apparatus improve the real-time performance and accuracy of speech recognition.

Description

Speech detection method and device based on artificial intelligence and electronic equipment
Technical Field
The present application relates to artificial intelligence technology, and in particular, to a method and an apparatus for speech detection based on artificial intelligence, an electronic device, and a computer-readable storage medium.
Background
Artificial Intelligence (AI) refers to theories, methods, techniques, and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results.
More and more artificial intelligence products provide a voice interaction function. Voice interaction can be applied to various speech scoring systems, such as encyclopedia question-answering systems, language testing systems used in language education, spoken language examination systems, intelligent assistant control systems, and voice input or voice control systems embedded in a client.
Disclosure of Invention
The embodiment of the application provides a voice detection method and device based on artificial intelligence, an electronic device and a computer readable storage medium, which can improve the real-time performance and accuracy of voice recognition.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a voice detection method based on artificial intelligence, which comprises the following steps:
dividing an audio signal into a plurality of pronunciation segments, and acquiring the audio characteristics of each pronunciation segment;
based on the audio features of each pronunciation segment, carrying out human voice classification processing on each pronunciation segment to obtain a human voice classification result of each pronunciation segment;
performing language classification processing on each pronunciation segment based on the audio features of each pronunciation segment to obtain a language classification result of each pronunciation segment;
and determining a human voice classification result of the audio signal based on the human voice classification result of each pronunciation segment, and determining a language classification result of the audio signal based on the language classification result of each pronunciation segment.
The embodiment of the application provides a speech detection apparatus based on artificial intelligence, including:
the acquisition module is used for dividing the audio signal into a plurality of pronunciation segments and acquiring the audio characteristics of each pronunciation segment;
the human voice module is used for carrying out human voice classification processing on each pronunciation segment based on the audio features of each pronunciation segment to obtain a human voice classification result of each pronunciation segment;
a language module, configured to perform language classification processing on each pronunciation segment based on an audio feature of each pronunciation segment to obtain a language classification result of each pronunciation segment;
and the result module is used for determining the human voice classification result of the audio signal based on the human voice classification result of each pronunciation segment and determining the language classification result of the audio signal based on the language classification result of each pronunciation segment.
In the foregoing solution, the obtaining module is further configured to: determining a speech energy of each audio frame in the audio signal; and combining a plurality of continuous audio frames with the voice energy larger than the background noise energy in the audio signal into a pronunciation segment.
In the foregoing solution, the obtaining module is further configured to: performing framing processing on the audio signal to obtain a plurality of audio frames corresponding to the audio signal; performing feature extraction processing on each audio frame through an audio frame classification network to obtain audio frame classification features corresponding to each audio frame; wherein the audio frame classification features include at least one of: log frame energy features; a zero-crossing rate characteristic; normalizing the autocorrelation characteristics; performing classification processing based on the audio frame classification characteristics on each audio frame through the audio frame classification network, and combining a plurality of continuous audio frames of which the classification results are pronunciation data into pronunciation fragments; the training samples of the audio frame classification network comprise audio frame samples, and the labeling data of the training samples comprise pre-labeling classification results of the audio frame samples.
In the above solution, the human voice classification processing and the language classification processing are implemented by a multi-classification task model, where the multi-classification task model includes a human voice classification network and a language classification network; the human voice module is further configured to: carrying out forward transmission on the audio features of each pronunciation segment in the human voice classification network to obtain the human voice classification result of each pronunciation segment; the language module is further configured to: carrying out forward transmission on the audio features of each pronunciation segment in the language classification network to obtain the language classification result of each pronunciation segment.
In the above scheme, the human voice module is further configured to: performing first full-connection processing on each pronunciation segment through a shared full-connection layer of the human voice classification network and the language classification network to obtain a first full-connection processing result corresponding to each pronunciation segment; performing second full-connection processing on the first full-connection processing result of each pronunciation segment through a human voice full-connection layer of the human voice classification network to obtain a second full-connection processing result of each pronunciation segment; carrying out maximum likelihood processing on the second full-connection processing result of each pronunciation segment to obtain the probability corresponding to each human voice classification label; and determining the human voice classification label with the highest probability as the human voice classification result of each pronunciation segment; the language module is further configured to: performing third full-connection processing on each pronunciation segment through the shared full-connection layer of the human voice classification network and the language classification network to obtain a third full-connection processing result corresponding to each pronunciation segment; performing fourth full-connection processing on the third full-connection processing result of each pronunciation segment through a language full-connection layer of the language classification network to obtain a fourth full-connection processing result of each pronunciation segment; carrying out maximum likelihood processing on the fourth full-connection processing result of each pronunciation segment to obtain the probability corresponding to each language classification label; and determining the language classification label with the highest probability as the language classification result of each pronunciation segment.
In the above scheme, the audio features of each pronunciation segment are acquired through the shared feature network in the multi-classification task model; the obtaining module is further configured to: converting each pronunciation segment from a time-domain signal to a frequency-domain signal, and performing Mel calculation on each pronunciation segment converted to a frequency-domain signal to obtain the Mel-scale spectrum of each pronunciation segment; and forward transmitting the Mel-scale spectrum of each pronunciation segment in the shared feature network to obtain the audio features corresponding to each pronunciation segment.
In the above scheme, the shared feature network includes N cascaded feature extraction networks, where N is an integer greater than or equal to 2; the obtaining module is further configured to: performing feature extraction processing on the input of the nth feature extraction network through the nth feature extraction network of the N cascaded feature extraction networks; and transmitting the nth feature extraction result output by the nth feature extraction network to the (n+1)th feature extraction network to continue the feature extraction processing; wherein n is an integer whose value increases from 1 and satisfies 1 ≤ n ≤ N-1; when the value of n is 1, the input of the nth feature extraction network is the Mel-scale spectrum of each pronunciation segment, and when 2 ≤ n ≤ N-1, the input of the nth feature extraction network is the feature extraction result of the (n-1)th feature extraction network.
In the above scheme, the nth feature extraction network includes a convolutional layer, a normalization layer, a linear rectifying layer, and an average pooling layer; the obtaining module is further configured to: performing convolution processing on the input of the nth feature extraction network and convolution layer parameters of a convolution layer of the nth feature extraction network to obtain an nth convolution layer processing result; normalizing the nth convolution layer processing result through a normalization layer of the nth feature extraction network to obtain an nth normalization processing result; performing linear rectification processing on the nth normalization processing result through a linear rectification layer of the nth feature extraction network to obtain an nth linear rectification processing result; and carrying out average pooling on the nth linear rectification processing result through an average pooling layer of the nth feature extraction network to obtain an nth feature extraction result.
In the above scheme, the human voice module is further configured to: performing adaptation of a plurality of candidate classification processes based on an application scenario of the audio signal; and when the human voice classification processing is among the adapted candidate classification processes, performing human voice classification processing on each pronunciation segment to obtain a human voice classification result of each pronunciation segment; the language module is further configured to: performing adaptation of a plurality of candidate classification processes based on an application scenario of the audio signal; and when the language classification processing is among the adapted candidate classification processes, performing language classification processing on each pronunciation segment to obtain a language classification result of each pronunciation segment.
In the above scheme, the human voice module is further configured to: acquiring a limiting condition of the application scenario to determine a candidate classification process corresponding to the limiting condition among the plurality of candidate classification processes as a classification process adapted to the application scenario; wherein the limiting condition comprises at least one of: age; species; language type; gender.
In the above solution, the human voice classification processing and the language classification processing are implemented by a multi-classification task model, where the multi-classification task model includes a shared feature network, a human voice classification network, and a language classification network; the apparatus further comprises a training module configured to: performing forward propagation and backward propagation of the corpus samples in the training sample set through the shared feature network, the shared full-connection layer of the human voice classification network and the language classification network, and the full-connection layer corresponding to the shared feature network, so as to update the parameters of the shared feature network and the shared full-connection layer; and performing forward propagation and backward propagation of the corpus samples in the training sample set through the updated shared feature network, the updated shared full-connection layer, the human voice full-connection layer of the human voice classification network, and the language full-connection layer of the language classification network, so as to update the parameters of the multi-classification task model.
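For illustration only, the two training phases described above can be sketched as follows (a minimal PyTorch sketch; the optimizer, loss functions, data loaders, and module boundaries are assumptions and not part of the application):

```python
import torch
import torch.nn as nn

def train_two_phase(shared: nn.Module, base_head: nn.Module,
                    voice_head: nn.Module, language_head: nn.Module,
                    pretrain_loader, multitask_loader, epochs: int = 1):
    """Phase 1 updates the shared feature network + shared FC via an auxiliary head;
    phase 2 updates everything via the human-voice and language heads."""
    ce = nn.CrossEntropyLoss()

    # Phase 1: forward/backward through the shared parts and an auxiliary fully connected layer.
    opt1 = torch.optim.Adam(list(shared.parameters()) + list(base_head.parameters()))
    for _ in range(epochs):
        for mel, label in pretrain_loader:          # corpus samples with phase-1 labels (assumed format)
            opt1.zero_grad()
            loss = ce(base_head(shared(mel)), label)
            loss.backward()                          # forward propagation + backward propagation
            opt1.step()

    # Phase 2: forward/backward through the updated shared parts and both task heads.
    opt2 = torch.optim.Adam(list(shared.parameters()) +
                            list(voice_head.parameters()) +
                            list(language_head.parameters()))
    for _ in range(epochs):
        for mel, v_label, l_label in multitask_loader:
            opt2.zero_grad()
            emb = shared(mel)
            loss = ce(voice_head(emb), v_label) + ce(language_head(emb), l_label)
            loss.backward()
            opt2.step()
```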
In the foregoing solution, the result module is further configured to: acquiring a first number of pronunciation segments whose human voice classification result is non-human voice and a second number of pronunciation segments whose human voice classification result is human voice; determining the human voice classification result corresponding to the larger of the first number and the second number as the human voice classification result of the audio signal; acquiring, for each language, the number of pronunciation segments whose language classification result is that language; and determining the language corresponding to the largest number as the language classification result of the audio signal.
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the artificial intelligence-based voice detection method provided by the embodiment of the application when the executable instructions stored in the memory are executed.
The embodiment of the present application provides a computer-readable storage medium, which stores executable instructions and is used for implementing the artificial intelligence based voice detection method provided by the embodiment of the present application when being executed by a processor.
The embodiment of the application has the following beneficial effects:
By extracting the audio features of each pronunciation segment in the audio signal and performing human voice classification processing and language classification processing on the extracted audio features, various anomalies in the audio signal can be accurately detected, so that speech recognition is realized more accurately.
Drawings
FIG. 1 is a schematic structural diagram of an artificial intelligence-based speech detection system provided by an embodiment of the present application;
fig. 2 is a schematic structural diagram of an electronic device provided in an embodiment of the present application;
FIGS. 3A-3D are schematic flowcharts of the artificial intelligence based speech detection method provided by embodiments of the present application;
FIGS. 4A-4B are schematic interface diagrams of the artificial intelligence based speech detection method provided by embodiments of the present application;
FIG. 5 is a flowchart illustrating an artificial intelligence based speech detection method according to an embodiment of the present application;
FIG. 6A is a schematic structural diagram of a multi-classification task model of an artificial intelligence based speech detection method provided by an embodiment of the present application;
FIG. 6B is a schematic structural diagram of a basic classification model of an artificial intelligence-based speech detection method provided in an embodiment of the present application;
fig. 7 is a schematic data structure diagram of a speech detection method based on artificial intelligence according to an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application will be described in further detail with reference to the attached drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, the terms "first", "second", and "third" are used only to distinguish similar objects and do not denote a particular order; it is understood that these terms may be interchanged in a specific order or sequence where permitted, so that the embodiments of the application described herein can be implemented in an order other than that shown or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before further detailed description of the embodiments of the present application, terms and expressions referred to in the embodiments of the present application will be described, and the terms and expressions referred to in the embodiments of the present application will be used for the following explanation.
1) Speech recognition technology: Automatic Speech Recognition (ASR) aims to convert the vocabulary content of human speech into computer-readable input, such as keystrokes, binary codes, or character sequences.
2) Mel-Frequency Cepstral Coefficient (MFCC): a cepstral parameter extracted in the Mel-scale frequency domain. The Mel scale describes the nonlinear characteristic of human hearing with respect to frequency, and a Mel spectrogram is a spectrogram whose frequency axis has been converted to the Mel scale.
3) Identity authentication vector (I-Vector, Identity Vector): speech features are extracted into a low-dimensional vector used to represent differences in speaker information.
4) Voice endpoint detection (VAD, also known as voice activity detection): detecting the voiced segments and silent segments of an audio signal.
5) Full connection (FC): a fully connected layer integrates the class-discriminative local information produced by convolutional or pooling layers.
One typical application of the voice interaction function in the related art is the spoken language evaluation scenario. Spoken language evaluation is the process of evaluating a speaker's speech: speech recognition is performed first, and the evaluation is then made based on features, such as pronunciation confidence, extracted during speech recognition. To improve evaluation accuracy, the language used for speech recognition needs to be consistent with the language to be evaluated; for example, for spoken language evaluation of Chinese, the speech recognition engine should be a Chinese recognition engine. However, it is found in the embodiments of the present application that spoken language evaluation scenarios are diverse. For example, a speaker may not speak the language being evaluated (e.g., Chinese is evaluated but the speaker speaks English), or non-human audio, such as animal sounds, table tapping, or keyboard noise, may be recorded for evaluation. These abnormal situations reduce the robustness of spoken language evaluation, so the audio signal needs to undergo anomaly detection before evaluation in order to reduce the influence of abnormal audio signals on evaluation accuracy.
In the related art, language discrimination and non-human-voice discrimination are performed as independent processes. The applicant has found that, in scenarios applying a voice interaction function such as spoken language evaluation, both an audio signal whose language does not conform to the specification and an audio signal that is not human voice are abnormal situations that affect the accuracy and real-time performance of voice interaction. Moreover, the language to be discriminated belongs to human voice, which is itself one of the categories involved in non-human-voice discrimination, so the two tasks are related; performing only non-human-voice discrimination or only language discrimination cannot effectively detect all of these abnormal situations.
The embodiments of the application provide an artificial-intelligence-based speech detection method and apparatus, an electronic device, and a computer-readable storage medium, which combine the language classification task and the non-human-voice classification task: audio features effective for both tasks are extracted, the two tasks are optimized simultaneously based on multi-task learning, and the language classification result and the human voice classification result are output at the same time, thereby improving the accuracy and real-time performance of voice interaction. In the following, an exemplary application will be explained when the device is implemented as a terminal.
Referring to fig. 1, fig. 1 is a schematic structural diagram of an artificial-intelligence-based speech detection system provided in an embodiment of the present application. To support a spoken language evaluation application, a terminal 400 is connected to a server 200 through a network 300, where the network 300 may be a wide area network, a local area network, or a combination of the two. The server 200 receives the audio signal of the user answering a question sent by the terminal 400 and performs the human voice classification processing and the language classification processing on the audio signal at the same time; when at least one of the human voice classification result and the language classification result is abnormal, the server 200 returns the abnormal classification result to the terminal 400 for display.
In some embodiments, the server 200 may be an independent physical server, may also be a server cluster or a distributed system formed by a plurality of physical servers, and may also be a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform. The terminal 400 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the embodiment of the present application is not limited.
In some embodiments, in a spoken language evaluation scenario, the audio signal to be classified is the audio signal of a user answering a question. In response to a voice acquisition operation of the user, the terminal 400 receives the audio signal for a read-after-me question type whose set language is English, and sends the audio signal (i.e., the user's answer to the read-after-me question) to the server 200. The server 200 performs human voice classification processing and language classification processing on the audio signal; when the human voice classification result is non-human voice or the language classification result is not English, the classification result representing that the audio signal is abnormal (non-human voice, or human voice but non-English) is returned to the terminal 400 to prompt the user to answer again.
In some embodiments, in the intelligent voice assistant scenario, the audio signal to be classified is the audio signal with which a user wakes up the intelligent voice assistant. In response to a voice acquisition operation of the user, the terminal 400 receives the audio signal for waking up the intelligent voice assistant and sends it to the server 200. The server 200 performs human voice classification processing and language classification processing on the audio signal; when the human voice classification result is human voice and the language classification result is English, the avatar of the intelligent voice assistant corresponding to the classification result is returned and presented on the terminal 400, and the intelligent voice assistant is controlled to interact with the user by means of English speech.
In some embodiments, in a speech input scenario, the audio signal to be classified is the audio signal input by the user, and the speech input language set by the terminal 400 is Chinese. In response to a voice acquisition operation of the user, the terminal 400 receives the audio signal input by the user and sends it to the server 200. The server 200 performs human voice classification processing and language classification processing on the audio signal; when the human voice classification result is non-human voice or the language classification result is not Chinese, the classification result representing that the audio signal is abnormal (non-human voice, or human voice but non-Chinese) is returned to the terminal 400 to prompt the user to perform speech input again, so as to complete the speech input process.
Referring to fig. 2, fig. 2 is a schematic structural diagram of an electronic device provided in an embodiment of the present application. Taking the electronic device as the server 200 as an example, the server 200 shown in fig. 2 includes: at least one processor 210, a memory 250, at least one network interface 220, and a user interface 230. The various components in the server 200 are coupled together by a bus system 240. It is understood that the bus system 240 is used to enable communications among the components. The bus system 240 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 240 in fig. 2.
The processor 210 may be an integrated circuit chip having signal processing capabilities, such as a general-purpose processor, a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, where the general-purpose processor may be a microprocessor or any conventional processor.
The user interface 230 includes one or more output devices 231, including one or more speakers and/or one or more visual display screens, that enable the presentation of media content. The user interface 230 also includes one or more input devices 232, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 250 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 250 optionally includes one or more storage devices physically located remotely from processor 210.
The memory 250 includes volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 250 described in embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 250 is capable of storing data, examples of which include programs, modules, and data structures, or a subset or superset thereof, to support various operations, as exemplified below.
An operating system 251 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
a network communication module 252 for communicating with other computing devices via one or more (wired or wireless) network interfaces 220, exemplary network interfaces 220 including: Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), and the like;
a presentation module 253 to enable presentation of information (e.g., a user interface for operating peripherals and displaying content and information) via one or more output devices 231 (e.g., a display screen, speakers, etc.) associated with the user interface 230;
in some embodiments, the artificial intelligence based speech detection apparatus provided by the embodiments of the present application can be implemented in software, and fig. 2 shows an artificial intelligence based speech detection apparatus 255 stored in a memory 250, which can be software in the form of programs and plug-ins, etc., and includes the following software modules: an acquisition module 2551, a voice module 2552, a language module 2553, a result module 2554 and a training module 2555, which are logical and thus can be arbitrarily combined or further split according to the implemented functions, which will be described below.
The artificial intelligence based speech detection method provided by the embodiment of the present application will be described in conjunction with an exemplary application and implementation of the server 200 provided by the embodiment of the present application.
Referring to fig. 6A, fig. 6A is a schematic structural diagram of the multi-classification task model of the artificial-intelligence-based speech detection method provided in an embodiment of the present application. The multi-classification task model includes a shared feature network, a human voice classification network, and a language classification network. The shared feature network is used for feature extraction: its input is a Mel spectrum obtained from the audio signal, and its output is the audio feature of the audio signal. The audio feature is first fully connected through the shared fully connected layer of the human voice classification network and the language classification network, and then fully connected through the fully connected layers corresponding to the human voice classification network and the language classification network respectively, so as to obtain the human voice classification result and the language classification result. The human voice classification network thus includes the shared fully connected layer and the human voice fully connected layer corresponding to the human voice classification network, and the language classification network includes the shared fully connected layer and the language fully connected layer corresponding to the language classification network.
Referring to fig. 6B, fig. 6B is a schematic structural diagram of the basic classification model of the artificial-intelligence-based speech detection method provided in an embodiment of the present application. The basic classification model includes a plurality of feature extraction networks, a shared fully connected layer (FC2048 with a linear rectification function), and a fully connected layer corresponding to 527 classes (FC527 with a sigmoid activation function). Each feature extraction network includes a convolutional layer (e.g., a 3 x 3@64 convolutional layer), a normalization layer, a linear rectification layer, and an average pooling layer. The shared fully connected layer is the fully connected layer shared by the human voice classification network and the language classification network, and the plurality of feature extraction networks are combined into the shared feature network. The fully connected layer corresponding to the 527 classes can directly output 527 classification results, which is used for training the basic classification model.
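For illustration only, the wiring of the multi-classification task model in fig. 6A and fig. 6B can be sketched as follows (a minimal PyTorch sketch; the backbone is assumed to output a fixed-size embedding, and all sizes other than the 2048-node shared fully connected layer are assumptions):

```python
import torch
import torch.nn as nn

class MultiTaskSpeechClassifier(nn.Module):
    """Sketch of the fig. 6A layout: shared feature network -> shared FC -> two task heads."""
    def __init__(self, backbone: nn.Module, feat_dim: int = 2048, num_languages: int = 3):
        super().__init__()
        self.backbone = backbone                   # shared feature network (cascaded conv blocks)
        self.shared_fc = nn.Sequential(            # shared fully connected layer (FC2048 + ReLU)
            nn.Linear(feat_dim, 2048), nn.ReLU())
        self.voice_fc = nn.Linear(2048, 2)         # human voice / non-human voice head
        self.language_fc = nn.Linear(2048, num_languages)  # language head

    def forward(self, mel_spectrum: torch.Tensor):
        feat = self.backbone(mel_spectrum)         # audio feature of a pronunciation segment
        shared = self.shared_fc(feat)              # first / third full-connection processing
        voice_logits = self.voice_fc(shared)       # second full-connection processing
        language_logits = self.language_fc(shared) # fourth full-connection processing
        return voice_logits, language_logits
```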
Referring to fig. 3A, fig. 3A is a schematic flowchart of the speech detection method based on artificial intelligence according to an embodiment of the present application, which will be described with reference to steps 101 to 104 shown in fig. 3A.
In step 101, an audio signal is divided into a plurality of pronunciation segments, and an audio feature of each pronunciation segment is obtained.
As an example, in a spoken language evaluation scenario, the audio signal is obtained by collecting audio content of a user answering a question, in an intelligent assistant scenario, the audio signal is obtained by collecting audio content carrying a user instruction, and in a speech input scenario, the audio signal is obtained by collecting audio content carrying a user input text.
In some embodiments, the audio signal is divided into a plurality of pronunciation segments in step 101, which may be implemented by the following technical solutions: determining a speech energy of each audio frame in the audio signal; and combining a plurality of continuous audio frames with the speech energy larger than the background noise energy in the audio signal into a pronunciation segment.
As an example, the strength of the audio signal is detected based on an energy criterion, when the speech energy of an audio frame in the audio signal is greater than the background noise energy, the audio frame is determined to be speech-present, and when the speech energy of the audio frame in the audio signal is not greater than the background noise energy, the audio frame is determined to be speech-absent, e.g., the audio frame is background noise.
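As a rough illustration of this energy criterion, a minimal NumPy sketch is given below; the frame length, hop size, and the percentile-based estimate of the background noise energy are assumptions, not values from the application:

```python
import numpy as np

def voiced_segments(signal: np.ndarray, frame_len: int = 400, hop: int = 160):
    """Group consecutive frames whose speech energy exceeds an estimated noise floor."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    energy = np.array([np.sum(f.astype(np.float64) ** 2) for f in frames])
    noise_energy = np.percentile(energy, 10)       # assumed background-noise energy estimate
    voiced = energy > noise_energy
    segments, start = [], None
    for i, is_voiced in enumerate(voiced):
        if is_voiced and start is None:
            start = i
        elif not is_voiced and start is not None:
            segments.append((start, i))            # frame-index range of one pronunciation segment
            start = None
    if start is not None:
        segments.append((start, len(voiced)))
    return segments
```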
In some embodiments, referring to fig. 3B, fig. 3B is a flowchart illustrating a speech detection method based on artificial intelligence provided in an embodiment of the present application, and the dividing of the audio signal into a plurality of pronunciation segments in step 101 can be implemented in steps 1011 to 1013.
In step 1011, the audio signal is subjected to framing processing to obtain a plurality of audio frames corresponding to the audio signal.
In step 1012, feature extraction is performed on each audio frame through the audio frame classification network to obtain an audio frame classification feature corresponding to each audio frame.
In step 1013, performing a classification process based on audio frame classification features on each audio frame through an audio frame classification network, and combining a plurality of consecutive audio frames of which the classification results are pronunciation data into pronunciation segments;
as an example, the audio frame classification features include at least one of: log frame energy features; a zero-crossing rate characteristic; normalized auto-correlation features. The training samples of the audio frame classification network comprise audio frame samples, and the labeling data of the training samples comprise pre-labeling classification results of the audio frame samples.
As an example, the audio signal is subjected to framing processing and audio frame classification features are extracted from the data of each audio frame. The audio frame classification network is trained on a set of audio frames whose speech signal regions and silence signal regions are known, and an unknown audio frame is then classified by the trained audio frame classification network to determine whether it belongs to a speech signal or a silence signal; in this way the audio frame classification network divides the audio signal into pronunciation segments and unvoiced segments. The audio signal is first passed through a high-pass filter to remove the direct-current offset component and low-frequency noise. Before feature extraction, the audio signal is divided into frames of 20-40 milliseconds (ms), with an overlap of 10 ms between adjacent frames. After framing, at least one of the following three features is extracted for each audio frame: the log frame energy feature; the zero-crossing rate feature; the normalized autocorrelation feature. By combining multiple features, the probability of misclassifying an audio frame can be effectively reduced, which further improves the accuracy of speech recognition.
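A per-frame computation of the three features named above might look like the following sketch; the exact normalization choices are assumptions:

```python
import numpy as np

def frame_features(frame: np.ndarray, eps: float = 1e-10) -> np.ndarray:
    """Log frame energy, zero-crossing rate, and normalized first-lag autocorrelation."""
    frame = frame.astype(np.float64)
    log_energy = np.log(np.sum(frame ** 2) + eps)                 # log frame energy
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0          # zero-crossing rate
    autocorr = np.dot(frame[:-1], frame[1:]) / (np.dot(frame, frame) + eps)  # normalized autocorrelation
    return np.array([log_energy, zcr, autocorr])
```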
In some embodiments, referring to fig. 3C, fig. 3C is a flowchart illustrating a speech detection method based on artificial intelligence provided in an embodiment of the present application, and the obtaining of the audio feature of each pronunciation segment in step 101 can be implemented in steps 1014 to 1015.
In step 1014, each pronunciation segment is converted from a time-domain signal to a frequency-domain signal, and Mel calculation is performed on each pronunciation segment converted to a frequency-domain signal, so as to obtain the Mel-scale spectrum of each pronunciation segment.
In step 1015, the mel-scale frequency spectrum of each pronunciation segment is transmitted forward in the shared feature network to obtain the audio features corresponding to each pronunciation segment.
As an example, the audio features of each pronunciation segment are obtained through the shared feature network in the multi-classification task model. The original audio signal is a waveform that changes over time and cannot be directly decomposed into basic signals, so it is converted from the time domain to the frequency domain to obtain a spectrogram: the audio signal is converted from the time domain to the frequency domain through the Fourier transform, with time on the horizontal axis of the spectrogram and frequency on the vertical axis. Humans do not perceive frequency on a linear scale and perceive differences at low frequencies better than at high frequencies. To account for this, Mel calculation is performed on the frequencies: each pronunciation segment converted to a frequency-domain signal undergoes Mel calculation to obtain the Mel scale, and the original audio signal is finally converted into a Mel-scale spectrum, with time on the horizontal axis and Mel-scale frequency on the vertical axis. The Mel-scale spectrum is used as the input of the multi-classification task model.
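A conventional way to obtain such a Mel-scale spectrum is sketched below using librosa; the sampling rate, window, hop, and number of Mel bins are assumptions rather than values from the application:

```python
import librosa
import numpy as np

def mel_spectrum(segment: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Time-domain pronunciation segment -> log Mel-scale spectrogram (n_mels x frames)."""
    mel = librosa.feature.melspectrogram(
        y=segment, sr=sr, n_fft=400, hop_length=160, n_mels=64)  # STFT followed by Mel filter bank
    return librosa.power_to_db(mel)  # log compression, closer to human loudness perception
```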
In some embodiments, the shared feature network comprises N cascaded feature extraction networks, where N is an integer greater than or equal to 2. In step 1015, forward transmitting the Mel-scale spectrum of each pronunciation segment in the shared feature network to obtain the audio features corresponding to each pronunciation segment can be implemented by the following technical solution: performing feature extraction processing on the input of the nth feature extraction network through the nth feature extraction network of the N cascaded feature extraction networks; and transmitting the nth feature extraction result output by the nth feature extraction network to the (n+1)th feature extraction network to continue the feature extraction processing; wherein n is an integer whose value increases from 1 and satisfies 1 ≤ n ≤ N-1; when the value of n is 1, the input of the nth feature extraction network is the Mel-scale spectrum of each pronunciation segment, and when 2 ≤ n ≤ N-1, the input of the nth feature extraction network is the feature extraction result of the (n-1)th feature extraction network.
As an example, referring to fig. 6B, the basic classification model includes a plurality of feature extraction networks, a shared full-link layer (FC2048 and linear rectification function), which is a full-link layer shared between the human voice classification network and the language classification network, and a full-link layer (FC527 and sigmoid activation function) corresponding to 527 classes, and the plurality of feature extraction networks constitute a shared feature network. The input of the shared feature network is the frequency spectrum of the Mel-scale of each pronunciation segment, and the output of the shared feature network is the audio feature.
In some embodiments, the nth feature extraction network comprises a convolutional layer, a normalization layer, a linear rectification layer, and an average pooling layer; the method for extracting the features of the input of the nth feature extraction network through the nth feature extraction network in the N cascaded feature extraction networks can be realized by the following technical scheme that the input of the nth feature extraction network and the parameter of the convolutional layer of the nth feature extraction network are subjected to convolution processing to obtain an nth convolutional layer processing result; normalizing the nth convolution layer processing result through a normalization layer of the nth feature extraction network to obtain an nth normalization processing result; performing linear rectification processing on the nth normalization processing result through a linear rectification layer of the nth feature extraction network to obtain an nth linear rectification processing result; and carrying out average pooling on the nth linear rectification processing result through an average pooling layer of the nth feature extraction network to obtain an nth feature extraction result.
As an example, each feature extraction network includes a convolutional layer, a normalization layer, a linear rectification layer, and an average pooling layer; and performing convolution processing, normalization processing, linear rectification processing and average pooling processing on the input of the feature extraction network through the feature extraction network to obtain a feature extraction result output by the feature extraction network, and finally outputting the audio features of the pronunciation segments by the feature extraction network.
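One cascaded feature extraction network (convolution, normalization, linear rectification, average pooling) could be sketched in PyTorch as follows; the channel counts and pooling size are assumptions apart from the 3 x 3@64 example in fig. 6B:

```python
import torch.nn as nn

class FeatureExtractionBlock(nn.Module):
    """Convolution -> batch normalization -> ReLU (linear rectification) -> average pooling."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),  # e.g. 3 x 3 @ 64
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
            nn.AvgPool2d(kernel_size=2))

    def forward(self, x):
        return self.block(x)

# N cascaded blocks form the shared feature network, e.g.:
# shared_feature_net = nn.Sequential(FeatureExtractionBlock(1, 64),
#                                    FeatureExtractionBlock(64, 128))
```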
In step 102, based on the audio features of each pronunciation segment, a human voice classification process is performed on each pronunciation segment to obtain a human voice classification result of each pronunciation segment.
In some embodiments, referring to fig. 3D, fig. 3D is a flowchart of the speech detection method based on artificial intelligence provided in an embodiment of the present application, and in step 102, performing human voice classification processing on each pronunciation segment based on the audio feature of each pronunciation segment to obtain a human voice classification result of each pronunciation segment can be implemented in steps 1021 and 1022.
In step 1021, adaptation of a plurality of candidate classification processes is performed based on the application scenario of the audio signal.
In step 1022, when the human voice classification processing is among the adapted candidate classification processes, human voice classification processing is performed on each pronunciation segment to obtain the human voice classification result of each pronunciation segment.
As an example, a plurality of candidate classification processes are adapted based on the application scenario of the audio signal. For example, when the application scenario is a spoken language evaluation scenario, the adapted candidate classification processes may include human voice classification processing, age classification processing, gender classification processing, language classification processing, and the like. In a spoken language evaluation scenario the audio signal is required to be human voice, and in an intelligent assistant scenario the audio signal is also required to be human voice; therefore, when the human voice classification processing is among the adapted candidate classification processes, human voice classification processing is performed on each pronunciation segment to obtain the human voice classification result of each pronunciation segment.
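For illustration only, scenario-based adaptation can be viewed as a lookup from the application scenario to the classification processes to run; the scenario names and the mapping below are hypothetical:

```python
# Hypothetical mapping from application scenario to adapted classification processes.
CANDIDATE_PROCESSES = {"human_voice", "language", "age", "gender"}

SCENARIO_PROCESSES = {
    "spoken_evaluation": {"human_voice", "language"},    # audio must be human voice in the set language
    "intelligent_assistant": {"human_voice", "language"},
    "speech_input": {"human_voice", "language"},
}

def adapted_processes(scenario: str) -> set:
    """Return the candidate classification processes adapted to the given scenario."""
    return SCENARIO_PROCESSES.get(scenario, set()) & CANDIDATE_PROCESSES
```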
In some embodiments, in step 102, based on the audio characteristic of each pronunciation segment, a human voice classification process is performed on each pronunciation segment to obtain a human voice classification result of each pronunciation segment, which may be implemented by performing forward transmission on the audio characteristic of each pronunciation segment in a human voice classification network to obtain a human voice classification result of each pronunciation segment.
In some embodiments, the foregoing forward transmission of the audio features of each pronunciation segment in the human voice classification network to obtain the human voice classification result of each pronunciation segment may be implemented by performing a first full-link processing on each pronunciation segment through a shared full-link layer of the human voice classification network and the language classification network to obtain a first full-link processing result corresponding to each pronunciation segment; performing second full-connection processing on the first full-connection processing result of each pronunciation segment through a voice full-connection layer of a voice classification network to obtain a second full-connection processing result of each pronunciation segment; carrying out maximum likelihood processing on the second full-connection processing result of each pronunciation fragment to obtain the probability corresponding to each human voice classification label; and determining the voice classification label with the highest probability as the voice classification result of each pronunciation segment.
As an example, the human voice classification processing and the language classification processing are implemented by a multi-classification task model that includes a human voice classification network and a language classification network. Referring to fig. 6A, the multi-classification task model includes a shared feature network, a human voice classification network, and a language classification network. The shared feature network performs feature extraction: its input is the Mel spectrum obtained from the audio signal, and its output is the audio feature of each pronunciation segment. The audio feature undergoes the first full-connection processing through the shared fully connected layer of the human voice classification network and the language classification network (the shared fully connected layer shown in fig. 6B, which also applies a linear rectification function), and then undergoes the second full-connection processing and maximum likelihood processing through the human voice fully connected layer corresponding to the human voice classification network. The maximum likelihood processing yields the probability of each human voice classification label; there are two human voice classification labels (human voice and non-human voice), and the label with the highest probability is determined as the human voice classification result of each pronunciation segment. For example, if the probability of non-human voice is 0.9 and the probability of human voice is 0.1, the human voice classification result is non-human voice.
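Interpreting the maximum likelihood processing as a softmax over the head's outputs (an assumption on our part), the segment-level decision can be sketched as:

```python
import torch

def voice_label(voice_logits: torch.Tensor) -> str:
    """Pick the human voice classification label with the highest probability."""
    probs = torch.softmax(voice_logits, dim=-1)        # e.g. [0.9, 0.1]
    labels = ["non-human voice", "human voice"]        # assumed label order
    return labels[int(torch.argmax(probs))]
```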
In step 103, language classification processing is performed on each pronunciation segment based on the audio features of each pronunciation segment, so as to obtain a language classification result of each pronunciation segment.
In some embodiments, in step 103, performing language classification processing on each pronunciation segment based on the audio feature of each pronunciation segment to obtain a language classification result of each pronunciation segment may be implemented by the following technical solution: performing adaptation of a plurality of candidate classification processes based on the application scenario of the audio signal; and when the language classification processing is among the adapted candidate classification processes, performing language classification processing on each pronunciation segment to obtain the language classification result of each pronunciation segment.
As an example, a plurality of candidate classification processes are adapted based on the application scenario of the audio signal. For example, when the application scenario is a spoken language evaluation scenario, the adapted candidate classification processes may include human voice classification processing, age classification processing, gender classification processing, language classification processing, and the like. In a spoken language evaluation scenario the language of the audio signal is required to be English, and in an intelligent assistant scenario the audio signal is required to be Chinese; therefore, when the language classification processing is among the adapted candidate classification processes, language classification processing is performed on each pronunciation segment to obtain the language classification result of each pronunciation segment.
In some embodiments, the adaptation of the multiple candidate classification processes based on the application scenario of the audio signal may be implemented by obtaining a limiting condition of the application scenario, so as to determine a candidate classification process corresponding to the limiting condition among the multiple candidate classification processes as a classification process adapted to the application scenario; wherein the limiting condition includes at least one of: age; species; language type; gender.
As an example, different application scenarios have different limiting conditions. For example, the spoken language evaluation application scenario may restrict the age of the user, e.g., requiring that the users participating in the spoken language evaluation be children; among the candidate classification processes (such as language classification processing, age classification processing, and gender classification processing), the age classification processing corresponding to the limiting condition of age is then used as a classification process adapted to the application scenario.
In some embodiments, in step 103, based on the audio feature of each pronunciation segment, language classification processing is performed on each pronunciation segment to obtain a language classification result of each pronunciation segment, which may be implemented by performing forward transmission on the audio feature of each pronunciation segment in a language classification network to obtain a language classification result of each pronunciation segment.
In some embodiments, the foregoing forward transmission of the audio features of each pronunciation segment in the language classification network to obtain the language classification result of each pronunciation segment may be implemented by the following technical solution: performing third full-connection processing on each pronunciation segment through the shared full-connection layer of the human voice classification network and the language classification network to obtain a third full-connection processing result corresponding to each pronunciation segment; performing fourth full-connection processing on the third full-connection processing result of each pronunciation segment through the language full-connection layer of the language classification network to obtain a fourth full-connection processing result of each pronunciation segment; carrying out maximum likelihood processing on the fourth full-connection processing result of each pronunciation segment to obtain the probability corresponding to each language classification label; and determining the language classification label with the highest probability as the language classification result of each pronunciation segment.
As an example, the human voice classification processing and the language classification processing are realized by a multi-classification task model, and the multi-classification task model comprises a human voice classification network and a language classification network. Referring to fig. 6A, the multi-classification task model includes a shared feature network, a human voice classification network, and a language classification network. The shared feature network is used to perform feature extraction: its input is the Mel spectrum obtained based on the audio signal, and its output is the audio feature of each pronunciation segment. The audio feature is subjected to a third full-connection processing through the shared full-connection layer of the human voice classification network and the language classification network (the full-connection layer with 2048 nodes in fig. 6B), in which processing based on a linear rectification activation function is further performed; a fourth full-connection processing and a maximum likelihood processing are then performed through the language full-connection layer of the language classification network. The maximum likelihood processing yields the probability of each language classification label, of which there are several (for example, English, Chinese and Japanese), and the language classification label with the highest probability is determined as the language classification result of each pronunciation segment. For example, assuming the probability of English is 0.8, the probability of Chinese is 0.1 and the probability of Japanese is 0.1, the language classification result is English.
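For illustration only, the forward pass of the two classification heads described above could be sketched as follows. This is a minimal sketch assuming PyTorch; the layer sizes, the label sets and the use of softmax in place of the maximum likelihood processing are illustrative assumptions rather than the claimed implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualHeadClassifier(nn.Module):
    """Shared fully connected layer followed by the human voice head and the language head."""
    def __init__(self, feature_dim=2048, num_voice_labels=2, num_language_labels=3):
        super().__init__()
        # Shared full-connection layer used by both heads (first/third full-connection processing).
        self.shared_fc = nn.Linear(feature_dim, 2048)
        # Head-specific layers (second/fourth full-connection processing).
        self.voice_fc = nn.Linear(2048, num_voice_labels)        # human voice / non-human voice
        self.language_fc = nn.Linear(2048, num_language_labels)  # e.g. English / Chinese / Japanese

    def forward(self, segment_features):
        shared = F.relu(self.shared_fc(segment_features))        # linear rectification
        voice_prob = F.softmax(self.voice_fc(shared), dim=-1)
        language_prob = F.softmax(self.language_fc(shared), dim=-1)
        # The label with the highest probability becomes the result of the pronunciation segment.
        return voice_prob.argmax(dim=-1), language_prob.argmax(dim=-1)
```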
In step 104, a human voice classification result of the audio signal is determined based on the human voice classification result of each pronunciation section, and a language classification result of the audio signal is determined based on the language classification result of each pronunciation section.
In some embodiments, in step 104, determining the human voice classification result of the audio signal based on the human voice classification result of each pronunciation segment may be implemented by obtaining a first number of pronunciation segments whose human voice classification result is non-human voice and a second number of pronunciation segments whose human voice classification result is human voice, and determining the human voice classification result corresponding to the larger of the first number and the second number as the human voice classification result of the audio signal. In step 104, determining the language classification result of the audio signal based on the language classification result of each pronunciation segment may be implemented by the following technical solution: acquiring the number of pronunciation segments whose language classification result is each language, and determining the language corresponding to the maximum number as the language classification result of the audio signal.
As an example, if the audio signal is divided into 10 pronunciation segments and 8 of them are classified as human voice while 2 are classified as non-human voice, the human voice classification result of the audio signal is human voice. Similarly, if the audio signal is divided into 10 pronunciation segments and 8 of them are classified as English while 2 are classified as Chinese, the language classification result of the audio signal is English.
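A minimal sketch of this segment-level majority vote is given below; the label strings and the example counts follow the text above, while the function name is an assumption.

```python
from collections import Counter

def aggregate_segments(voice_labels, language_labels):
    """voice_labels / language_labels: per-segment classification results."""
    # Human voice result: whichever of the labels covers more pronunciation segments.
    voice_result = Counter(voice_labels).most_common(1)[0][0]
    # Language result: the language with the largest number of pronunciation segments.
    language_result = Counter(language_labels).most_common(1)[0][0]
    return voice_result, language_result

# Example from the text: 8 of 10 segments are human voice and English.
print(aggregate_segments(
    ["human"] * 8 + ["non-human"] * 2,
    ["english"] * 8 + ["chinese"] * 2,
))  # -> ('human', 'english')
```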
As an example, when the human voice classification result is non-human voice, prompt information is displayed to prompt that the audio signal belongs to an abnormal signal. When the human voice classification result is human voice, the following processing is executed: when the client receiving the audio signal belongs to an intelligent voice control scenario, an intelligent voice assistant corresponding to the language classification result is displayed; and when the client receiving the audio signal belongs to a spoken-language test scenario and the language classification result does not conform to the set language, prompt information is displayed to prompt that the audio signal belongs to an abnormal signal.
In some embodiments, the human voice classification processing and the language classification processing are realized by a multi-classification task model, and the multi-classification task model comprises a shared feature network, a human voice classification network and a language classification network. The corpus samples in the training sample set are first propagated forward and backward through the shared feature network, the shared full-connection layer of the human voice classification network and the language classification network, and the full-connection layer corresponding to the shared feature network, so as to update the parameters of the shared feature network and the shared full-connection layer; the corpus samples in the training sample set are then propagated forward and backward through the updated shared feature network, the updated shared full-connection layer, the human voice full-connection layer of the human voice classification network and the language full-connection layer of the language classification network, so as to update the parameters of the multi-classification task model.
As an example, referring to fig. 6B, a basic classification model is trained first. The basic classification model includes a plurality of feature extraction networks (the shared feature network) and two full-connection layers: the first full-connection layer is the shared full-connection layer with 2048 nodes (the shared full-connection layer corresponding to the human voice classification network and the language classification network), in which linear rectification processing can be implemented; the second full-connection layer is a full-connection layer for 527 audio-type classifications (the full-connection layer corresponding to the shared feature network), in which maximum likelihood processing can be implemented, and the basic classification model can be updated through this second full-connection layer. After the training of the basic classification model is completed, the shared feature network and the shared full-connection layer of the basic classification model are retained, and the human voice full-connection layer of the human voice classification network and the language full-connection layer of the language classification network are added on the basis of the retained networks to obtain the multi-classification task model, which is then trained further.
By way of example, the following processing is performed during each iteration of training the multi-classification task model: forward propagating each corpus sample through the shared feature network and the human voice classification network of the multi-classification task model to obtain the predicted human voice classification category of the corpus sample; forward propagating each corpus sample through the shared feature network and the language classification network of the multi-classification task model to obtain the predicted language classification category of the corpus sample; determining the human voice error between the predicted human voice classification category and the pre-labeled real human voice category, and the language error between the predicted language classification category and the pre-labeled real language category; aggregating the language error and the human voice error according to the loss function to obtain an aggregation error; back-propagating the aggregation error through the multi-classification task model to determine the parameter change values of the multi-classification task model that drive the loss function toward its minimum; and updating the parameters of the multi-classification task model based on the parameter change values.
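A single training iteration of the kind described above could be sketched as follows. This assumes PyTorch, cross-entropy as the per-task error, and a multi-classification task model that returns human voice logits and language logits for a batch of Mel spectra; all of these, including the weight values, are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, mel_batch, voice_targets, language_targets,
               w_voice=1.0, w_language=1.0):
    optimizer.zero_grad()
    voice_logits, language_logits = model(mel_batch)                   # forward propagation
    voice_loss = F.cross_entropy(voice_logits, voice_targets)          # human voice error
    language_loss = F.cross_entropy(language_logits, language_targets) # language error
    total_loss = w_voice * voice_loss + w_language * language_loss     # aggregation error
    total_loss.backward()   # backward propagation through the whole multi-classification task model
    optimizer.step()        # update parameters toward the minimum of the loss
    return total_loss.item()
```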
In the following, an exemplary application of the artificial intelligence based speech detection method provided by the embodiment of the present application is described, taking a spoken-language test scenario as the application scenario.
For the language classification processing, the following two schemes can be adopted: 1. based on a plurality of speech recognition engines, selecting the language corresponding to the speech recognition engine with the maximum output probability as the recognized language; 2. extracting effective pronunciation features and constructing a language classifier to judge the language, where the effective pronunciation features can be extracted based on professional knowledge, or effective audio features can be extracted based on a neural network, for example, extracting features such as Mel-frequency cepstral coefficients and identity vectors (i-vectors) for language classification, inputting the original audio waveform signal into a deep neural network and outputting the language classification result, or extracting the original spectrogram corresponding to the speech, inputting it into a deep neural network and outputting the language classification result. For the human voice classification processing, classifiers of various sounds can be constructed, and the various sounds are classified based on the extracted spectrogram.
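As an illustration of the feature-extraction step in scheme 2, Mel-frequency cepstral coefficients could be extracted as follows; librosa, the sampling rate and the time-averaging are assumptions for illustration, not requirements of this application.

```python
import librosa
import numpy as np

def mfcc_features(wav_path, n_mfcc=13):
    signal, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, frames)
    # Average over time to obtain one fixed-length vector per utterance for a classical classifier.
    return np.mean(mfcc, axis=1)
```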
In the scene of the oral test, the voice interaction function is mainly applied to the reading following question type or the open expression question type of the oral test.
For example, referring to fig. 4A, fig. 4A is an interface schematic diagram of the artificial intelligence based speech detection method provided in an embodiment of the present application. A reading-following text "I know the truth, do you know?" is displayed; in response to a trigger operation (for example, a click operation) on the start-reading button 502A in the human-computer interaction interface 501A, the audio signal of the user reading the text is received, and in response to a trigger operation (for example, a click operation) on the end-reading button 503A in the human-computer interaction interface 501A, receiving of the audio signal is stopped. Referring to fig. 4B, fig. 4B is an interface schematic diagram of the artificial intelligence based speech detection method according to an embodiment of the present application; an anomaly detection result of the audio signal, for example, a non-English anomaly detection result, is presented on the human-computer interaction interface 501B.
Referring to fig. 5, fig. 5 is a schematic flow chart of the speech detection method based on artificial intelligence according to an embodiment of the present application. In response to initialization of the client, a reading-following text is displayed in the human-computer interaction interface of the client. In response to a recording start operation on the start-reading button in the client, the audio signal produced while the user reads the text aloud is collected, and the client sends the collected audio signal to the server. The server sends the audio signal to the abnormality detection module, which outputs the human voice classification result (whether the signal is non-human voice) and the language classification result, and returns them to the server. When the result is a non-human-voice classification result or a language classification result unrelated to the current spoken-language evaluation, the server returns the abnormality detection result to the client to remind the user; otherwise, the server returns the evaluation result of the spoken-language evaluation to the client.
In some embodiments, for the language classification processing, the audio signal may include at least one language. Dividing a segment of the audio signal into a plurality of pronunciation segments effectively handles the situation in which one audio signal includes more than one language. A speech endpoint detection technique may be used to detect the audio signal, for example, to determine whether each audio frame in the audio signal is silent, that is, whether any audio frame belongs to speech or to silence.
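A minimal sketch of such energy-based endpoint detection is given below; the frame length, hop size and noise-energy threshold are illustrative assumptions.

```python
import numpy as np

def split_into_segments(signal, frame_len=400, hop=160, noise_energy=1e-4):
    """Split a waveform into pronunciation segments of consecutive speech frames."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    # A frame is treated as speech when its mean energy exceeds the background noise energy.
    is_speech = [np.mean(np.asarray(f, dtype=np.float64) ** 2) > noise_energy for f in frames]

    segments, start = [], None
    for idx, speech in enumerate(is_speech):
        if speech and start is None:
            start = idx                       # a run of speech frames begins
        elif not speech and start is not None:
            segments.append((start * hop, idx * hop + frame_len))
            start = None
    if start is not None:
        segments.append((start * hop, len(signal)))
    return segments                           # list of (sample_start, sample_end) pairs
```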
In some embodiments, since the original audio signal is a time-varying waveform that cannot easily be decomposed into a plurality of basic signals in the time domain, the signal is transformed from the time domain to the frequency domain by Fourier transform, yielding a spectrogram whose horizontal axis is time and whose vertical axis is frequency. Because human perception of frequency is not linear, and the ability to perceive low-frequency differences is stronger than the ability to perceive high-frequency differences, Mel calculation can further be performed on the frequency axis to convert it to the Mel scale. The original signal is thus transformed into a Mel spectrogram, whose horizontal axis is time and whose vertical axis is frequency on the Mel scale, and this Mel spectrum is used as the input of the multi-classification task model.
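The conversion of a pronunciation segment into a Mel spectrum could be sketched as follows; librosa and the concrete STFT/Mel parameters are assumptions for illustration.

```python
import librosa

def mel_spectrogram(segment, sr=16000, n_fft=512, hop_length=160, n_mels=64):
    """Time-domain segment -> Mel-scale spectrogram (frequency on the Mel scale vs. time)."""
    mel = librosa.feature.melspectrogram(
        y=segment, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    # Log compression is commonly applied because perceived loudness is also non-linear.
    return librosa.power_to_db(mel)           # shape: (n_mels, time frames)
```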
In some embodiments, non-human-voice detection and language detection are performed separately by a (pre-trained) basic classification model. The pre-trained basic classification model is an audio classification network, which may be a convolutional neural network trained on audio data and capable of classifying 527 audio types. The basic structure of the basic classification model is shown in fig. 6B, with the network laid out from input to output from top to bottom. Each unit of the basic classification model is composed of a convolutional neural network, Batch Normalization (BN), a linear rectification function (ReLU) and average Pooling (Pooling), and the 527 audio types are finally classified through Global Pooling and two full-connection transforms (FC2048 and FC527).
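One feature-extraction unit and the classification tail of such a basic classification model could be sketched as follows in PyTorch; the channel counts and kernel sizes are illustrative assumptions, while the unit composition (convolution, batch normalization, linear rectification, average pooling) and the FC2048/FC527 tail follow the description above.

```python
import torch
import torch.nn as nn

class ConvUnit(nn.Module):
    """Convolution + batch normalization + linear rectification + average pooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.AvgPool2d(kernel_size=2),
        )

    def forward(self, x):
        return self.block(x)

class BasicClassificationModel(nn.Module):
    def __init__(self, num_audio_types=527):
        super().__init__()
        self.features = nn.Sequential(
            ConvUnit(1, 64), ConvUnit(64, 128), ConvUnit(128, 256), ConvUnit(256, 512))
        self.fc_shared = nn.Linear(512, 2048)             # FC2048
        self.fc_out = nn.Linear(2048, num_audio_types)    # FC527

    def forward(self, mel):                               # mel: (batch, 1, n_mels, frames)
        x = self.features(mel)
        x = torch.mean(x, dim=(2, 3))                     # global pooling
        x = torch.relu(self.fc_shared(x))
        return torch.sigmoid(self.fc_out(x))              # scores over 527 audio types
```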
In some embodiments, referring to fig. 6A, a multi-classification task model is obtained by performing transfer learning on the basis of the trained basic classification model, so that it can perform human voice classification processing and language classification processing. Specifically, the last full-connection layer in the basic classification model (including FC527 and the sigmoid activation function) is replaced with the two required independent full-connection layers (each including an FC layer and a maximum likelihood function), and two classification results are finally output, namely the human voice classification result of whether the signal is human voice and the language classification result.
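The transfer-learning step could then be sketched as follows, reusing the BasicClassificationModel sketch above: the shared feature network and the shared FC2048 layer are retained and two independent heads are attached. The head sizes and the returned logits are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    def __init__(self, base_model, num_languages=3):
        super().__init__()
        self.features = base_model.features       # retained shared feature network
        self.fc_shared = base_model.fc_shared     # retained shared FC2048 layer
        self.voice_head = nn.Linear(2048, 2)      # human voice / non-human voice
        self.language_head = nn.Linear(2048, num_languages)

    def forward(self, mel):
        x = self.features(mel)
        x = torch.mean(x, dim=(2, 3))
        x = torch.relu(self.fc_shared(x))
        return self.voice_head(x), self.language_head(x)   # two sets of logits
```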
In some embodiments, the loss function of the multi-classification task model is divided into two parts, a loss part for the human voice classification and a loss part for the language classification, and the overall loss function is obtained by a weighted summation of the two parts, see formula (1):

L_total = w1 * L_voice + w2 * L_language    (1);

where L_total is the loss function of the multi-classification task model, w1 is a preset parameter for the human voice classification loss, w2 is a preset parameter for the language classification loss, w1 and w2 are used for balancing the two loss parts, L_voice is the loss for the human voice classification, and L_language is the loss for the language classification.

The loss of the human voice classification processing and of the language classification processing is shown in formula (2), where L is the loss of the human voice classification or of the language classification, y is the real label of whether the audio signal is human voice or the real label of the language, and p is the prediction probability of the human voice classification result or of the language classification result output by the multi-classification task model:

L = -y * log(p)    (2);
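A small numeric sketch of formulas (1) and (2) follows; the probabilities and weight values are illustrative assumptions.

```python
import math

def single_loss(p_true):                 # formula (2): L = -y * log(p), with y = 1 for the true label
    return -math.log(p_true)

w1, w2 = 0.5, 0.5                        # preset balancing parameters
l_voice = single_loss(0.9)               # the model assigns 0.9 to the true human voice label
l_language = single_loss(0.8)            # the model assigns 0.8 to the true language label
l_total = w1 * l_voice + w2 * l_language # formula (1)
print(round(l_total, 4))                 # ~0.1643
```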
Referring to fig. 7, fig. 7 is a schematic data structure diagram of the speech detection method based on artificial intelligence provided in the embodiment of the present application. The input of the multi-classification task model is a Mel spectrum; the multi-classification task model includes a pre-trained audio neural network (PANN) and the full-connection layers (FC and maximum likelihood function) of the human voice classification network and the language classification network, and outputs two anomaly detection results, including the human voice classification result (0 for human voice with probability 0.9, 1 for non-human voice with probability 0.1) and the language classification result (0 for English with probability 0.2, 1 for non-English with probability 0.8).
The data test of the artificial intelligence based speech detection method provided by the embodiment of the present application mainly covers the human voice classification processing and the language classification processing: the language classification result is reported as English or non-English, and the human voice classification result is reported as human voice or non-human voice. The test mainly covers word, sentence and paragraph scenarios; the test data come from spoken-test data of a certain field, with 1000 items of each type, including: 1000 sentences (500 English, 500 non-English), 1000 words (500 English, 500 non-English), 1000 paragraphs (500 English, 500 non-English), and 1000 non-human-voice items. The accuracy of the classification results is shown in Table 1 below.
TABLE 1 Test accuracy

                     Language detection result    Non-human-voice detection result
Word                 89%                          99%
Sentence             99%                          99%
Paragraph            92%                          99%
Non-human voice      --                           99%
In some embodiments, the language classification network and the human voice classification network may be implemented based on various neural network structures, and further related abnormality detection tasks, such as a child-voice versus adult-voice classification task, may also be added, so as to achieve the technical effect of discriminating multi-dimensional abnormal conditions based on one model.
Continuing with the exemplary structure of the artificial intelligence based speech detection apparatus 255 provided by the embodiments of the present application as software modules, in some embodiments, as shown in fig. 2, the software modules stored in the artificial intelligence based speech detection apparatus 255 of the memory 250 may include: an obtaining module 2551, configured to divide the audio signal into a plurality of pronunciation segments, and obtain an audio feature of each pronunciation segment; a voice module 2552, configured to perform voice classification processing on each pronunciation segment based on the audio features of each pronunciation segment, to obtain a voice classification result of each pronunciation segment; a language module 2553, configured to perform language classification processing on each pronunciation segment based on the audio feature of each pronunciation segment, to obtain a language classification result of each pronunciation segment; a result module 2554, configured to determine a human voice classification result of the audio signal based on the human voice classification result of each pronunciation section, and determine a language classification result of the audio signal based on the language classification result of each pronunciation section.
In some embodiments, the obtaining module 2551 is further configured to: determining a speech energy of each audio frame in the audio signal; and combining a plurality of continuous audio frames with the speech energy larger than the background noise energy in the audio signal into a pronunciation segment.
In some embodiments, the obtaining module 2551 is further configured to: performing framing processing on the audio signal to obtain a plurality of audio frames corresponding to the audio signal; performing feature extraction processing on each audio frame through an audio frame classification network to obtain audio frame classification features corresponding to each audio frame; wherein the audio frame classification features include at least one of: log frame energy features; a zero-crossing rate characteristic; normalizing the autocorrelation characteristics; carrying out classification processing based on audio frame classification characteristics on each audio frame through an audio frame classification network, and combining a plurality of continuous audio frames of which the classification results are pronunciation data into pronunciation fragments; the training samples of the audio frame classification network comprise audio frame samples, and the labeling data of the training samples comprise pre-labeling classification results of the audio frame samples.
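The three audio-frame classification features named above could be computed as follows; the frame is assumed to be a one-dimensional NumPy array of samples, and the lag-1 normalized autocorrelation is one possible reading of the normalized autocorrelation characteristic.

```python
import numpy as np

def frame_features(frame, eps=1e-10):
    frame = np.asarray(frame, dtype=np.float64)
    energy = np.sum(frame ** 2)
    log_energy = np.log(energy + eps)                               # log frame energy feature
    zero_crossing_rate = np.mean(np.abs(np.diff(np.sign(frame))) > 0)  # zero-crossing rate feature
    # Lag-1 autocorrelation normalized by the frame energy.
    normalized_autocorr = np.sum(frame[1:] * frame[:-1]) / (energy + eps)
    return np.array([log_energy, zero_crossing_rate, normalized_autocorr])
```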
In some embodiments, the human voice classification processing and the language classification processing are realized by a multi-classification task model, and the multi-classification task model comprises a human voice classification network and a language classification network; the human voice module 2552 is further configured to: carrying out forward transmission on the audio features of each pronunciation segment in a voice classification network to obtain a voice classification result of each pronunciation segment; language module 2553, further configured to: and forward transmitting the audio features of each pronunciation segment in a language classification network to obtain a language classification result of each pronunciation segment.
In some embodiments, the human voice module 2552 is further configured to: performing first full-connection processing on each pronunciation segment through the shared full-connection layer of the human voice classification network and the language classification network to obtain a first full-connection processing result corresponding to each pronunciation segment; performing second full-connection processing on the first full-connection processing result of each pronunciation segment through the human voice full-connection layer of the human voice classification network to obtain a second full-connection processing result of each pronunciation segment; carrying out maximum likelihood processing on the second full-connection processing result of each pronunciation segment to obtain the probability corresponding to each human voice classification label; determining the human voice classification label with the maximum probability as the human voice classification result of each pronunciation segment; the language module 2553 is further configured to: performing third full-connection processing on each pronunciation segment through the shared full-connection layer of the human voice classification network and the language classification network to obtain a third full-connection processing result corresponding to each pronunciation segment; performing fourth full-connection processing on the third full-connection processing result of each pronunciation segment through the language full-connection layer of the language classification network to obtain a fourth full-connection processing result of each pronunciation segment; carrying out maximum likelihood processing on the fourth full-connection processing result of each pronunciation segment to obtain the probability corresponding to each language classification label; and determining the language classification label with the highest probability as the language classification result of each pronunciation segment.
In some embodiments, the audio features of each pronunciation segment are obtained through a shared feature network in a multi-classification task model; an obtaining module 2551, further configured to: converting the type of each pronunciation segment from a time domain signal to a frequency domain signal, and carrying out Mel calculation on each pronunciation segment converted to the frequency domain signal to obtain a Mel scale frequency spectrum of each pronunciation segment; and forward transmitting the frequency spectrum of the Mel scale of each pronunciation segment in a shared feature network to obtain the audio features corresponding to each pronunciation segment.
In some embodiments, the shared feature network comprises N cascaded feature extraction networks, N being an integer greater than or equal to 2; an obtaining module 2551, further configured to: performing feature extraction processing on the input of an nth feature extraction network in the N cascaded feature extraction networks; transmitting the nth feature extraction result output by the nth feature extraction network to the (n + 1) th feature extraction network to continue feature extraction processing; wherein N is an integer with the value increasing from 1, and the value range of N satisfies that N is more than or equal to 1 and less than or equal to N-1; and when the value of N is 1, the input of the nth feature extraction network is the frequency spectrum of the Mel scale of each pronunciation segment, and when the value of N is more than or equal to 2 and less than or equal to N-1, the input of the nth feature extraction network is the feature extraction result of the nth-1 feature extraction network.
In some embodiments, the nth feature extraction network comprises a convolutional layer, a normalization layer, a linear rectification layer, and an average pooling layer; an obtaining module 2551, further configured to: performing convolution processing on the input of the nth feature extraction network and the convolution layer parameters of the convolution layer of the nth feature extraction network to obtain an nth convolution layer processing result; normalizing the nth convolution layer processing result through a normalization layer of the nth feature extraction network to obtain an nth normalization processing result; performing linear rectification processing on the nth normalization processing result through a linear rectification layer of the nth feature extraction network to obtain an nth linear rectification processing result; and carrying out average pooling on the nth linear rectification processing result through an average pooling layer of the nth feature extraction network to obtain an nth feature extraction result.
In some embodiments, the human voice module 2552 is further configured to: performing adaptation of a plurality of candidate classification processes based on the application scenario of the audio signal; and when the human voice classification processing among the plurality of candidate classification processes is adapted, performing human voice classification processing on each pronunciation segment to obtain the human voice classification result of each pronunciation segment; the language module 2553 is further configured to: performing adaptation of a plurality of candidate classification processes based on the application scenario of the audio signal; and when the language classification processing among the plurality of candidate classification processes is adapted, performing language classification processing on each pronunciation segment to obtain the language classification result of each pronunciation segment.
In some embodiments, vocal module 2552 is further configured to: acquiring a limiting condition of an application scene to determine a candidate classification process corresponding to the limiting condition in a plurality of candidate classification processes as a classification process adapted to the application scene; wherein the defined conditions include at least one of: age; a species; the language type; sex.
In some embodiments, the voice classification processing and the language classification processing are realized by a multi-classification task model, and the multi-classification task model comprises a shared feature network, a voice classification network and a language classification network; the device still includes: a training module 2555 to: performing forward propagation and backward propagation on the corpus samples in the training sample set in a shared full-connection layer of a shared characteristic network, a human voice classification network and a language classification network and a full-connection layer corresponding to the shared characteristic network so as to update parameters of the shared characteristic network and the shared full-connection layer; and performing forward propagation and backward propagation on the corpus samples in the training sample set in the updated shared feature network, the updated shared full-link layer, the voice full-link layer of the voice classification network and the language full-link layer of the language classification network to update the parameters of the multi-classification task model.
In some embodiments, results module 2554 is further configured to: acquiring a first number of pronunciation segments of which the human voice classification result is non-human voice and a second number of pronunciation segments of which the human voice classification result is human voice; determining the voice classification result corresponding to the larger number of the first number and the second number as the voice classification result of the audio signal; acquiring language classification results as the number of pronunciation fragments of each language; and determining the language corresponding to the maximum number as the language classification result of the audio signal.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the artificial intelligence based voice detection method according to the embodiment of the present application.
Embodiments of the present application provide a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform an artificial intelligence based speech detection method provided by embodiments of the present application, for example, the artificial intelligence based speech detection method as shown in fig. 3A-3D.
In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a hypertext Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
In summary, by performing feature extraction on each pronunciation segment in the audio signal and performing human voice classification processing and language classification processing on the extracted audio features respectively, the abnormality of the audio signal is accurately detected, and speech recognition is more accurately realized.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (15)

1. An artificial intelligence based speech detection method, the method comprising:
dividing an audio signal into a plurality of pronunciation segments, and acquiring the audio characteristics of each pronunciation segment;
based on the audio features of each pronunciation segment, carrying out voice classification processing on each pronunciation segment to obtain a voice classification result of each pronunciation segment;
performing language classification processing on each pronunciation segment based on the audio features of each pronunciation segment to obtain a language classification result of each pronunciation segment;
and determining a human voice classification result of the audio signal based on the human voice classification result of each pronunciation segment, and determining a language classification result of the audio signal based on the language classification result of each pronunciation segment.
2. The method of claim 1, wherein the dividing the audio signal into a plurality of pronunciation segments comprises:
determining a speech energy of each audio frame in the audio signal;
and combining a plurality of continuous audio frames with the voice energy larger than the background noise energy in the audio signal into a pronunciation segment.
3. The method of claim 1, wherein the dividing the audio signal into a plurality of pronunciation segments comprises:
performing framing processing on the audio signal to obtain a plurality of audio frames corresponding to the audio signal;
performing feature extraction processing on each audio frame through an audio frame classification network to obtain audio frame classification features corresponding to each audio frame;
wherein the audio frame classification features include at least one of: log frame energy features; a zero-crossing rate characteristic; normalizing the autocorrelation characteristics;
performing classification processing based on the audio frame classification characteristics on each audio frame through the audio frame classification network, and combining a plurality of continuous audio frames of which the classification results are pronunciation data into pronunciation fragments;
the training samples of the audio frame classification network comprise audio frame samples, and the labeling data of the training samples comprise pre-labeling classification results of the audio frame samples.
4. The method of claim 1,
the human voice classification processing and the language classification processing are realized through a multi-classification task model, and the multi-classification task model comprises a human voice classification network and a language classification network;
the method for classifying the human voice of each pronunciation segment based on the audio features of each pronunciation segment to obtain the human voice classification result of each pronunciation segment includes:
carrying out forward transmission on the audio features of each pronunciation segment in the voice classification network to obtain a voice classification result of each pronunciation segment;
the method for classifying the languages of the pronunciation segments based on the audio features of the pronunciation segments to obtain the language classification result of each pronunciation segment includes:
and carrying out forward transmission on the audio features of each pronunciation segment in the language classification network to obtain a language classification result of each pronunciation segment.
5. The method of claim 4,
the forward transmission of the audio features of each pronunciation segment in the human voice classification network to obtain the human voice classification result of each pronunciation segment includes:
performing first full-connection processing on each pronunciation segment through a shared full-connection layer of the human voice classification network and the language classification network to obtain a first full-connection processing result corresponding to each pronunciation segment;
performing second full-connection processing on the first full-connection processing result of each pronunciation segment through a voice full-connection layer of the voice classification network to obtain a second full-connection processing result of each pronunciation segment;
carrying out maximum likelihood processing on the second full-connection processing result of each pronunciation fragment to obtain the probability corresponding to each human voice classification label;
determining the voice classification label with the highest probability as the voice classification result of each pronunciation segment;
the forward transmission of the audio features of each pronunciation segment in the language classification network to obtain the language classification result of each pronunciation segment includes:
performing third full-connection processing on each pronunciation segment through a shared full-connection layer of the human voice classification network and the language classification network to obtain a third full-connection processing result corresponding to each pronunciation segment;
performing fourth full-link processing on the third full-link processing result of each pronunciation segment through a language full-link layer of the language classification network to obtain a fourth full-link processing result of each pronunciation segment;
carrying out maximum likelihood processing on the fourth full-connection processing result of each pronunciation fragment to obtain the probability corresponding to each language classification label;
and determining the language classification label with the maximum probability as the language classification result of each pronunciation segment.
6. The method of claim 1,
the audio features of each pronunciation segment are obtained through a shared feature network in a multi-classification task model;
the obtaining of the audio features of each pronunciation segment includes:
converting the type of each pronunciation segment from a time domain signal to a frequency domain signal, and performing Mel calculation on each pronunciation segment converted to the frequency domain signal to obtain a frequency spectrum of Mel scales of each pronunciation segment;
and forward transmitting the frequency spectrum of the Mel scale of each pronunciation segment in the shared feature network to obtain the audio features corresponding to each pronunciation segment.
7. The method of claim 6,
the shared feature network comprises N cascaded feature extraction networks, wherein N is an integer greater than or equal to 2;
the forward transmission of the spectrum of the mel scale of each pronunciation segment in the shared feature network to obtain the audio features corresponding to each pronunciation segment includes:
performing feature extraction processing on input of an nth feature extraction network in N cascaded feature extraction networks;
transmitting the nth feature extraction result output by the nth feature extraction network to an n +1 th feature extraction network to continue feature extraction processing;
wherein N is an integer with the value increasing from 1, and the value range of N satisfies that N is more than or equal to 1 and less than or equal to N-1; and when the value of N is 1, the input of the nth feature extraction network is the frequency spectrum of the Mel scale of each pronunciation segment, and when the value of N is not less than 2 and not more than N-1, the input of the nth feature extraction network is the feature extraction result of the nth-1 feature extraction network.
8. The method of claim 7,
the nth feature extraction network comprises a convolution layer, a normalization layer, a linear rectification layer and an average pooling layer;
the feature extraction processing of the input of the nth feature extraction network is performed through the nth feature extraction network of the N cascaded feature extraction networks, and includes:
performing convolution processing on the input of the nth feature extraction network and convolution layer parameters of a convolution layer of the nth feature extraction network to obtain an nth convolution layer processing result;
normalizing the nth convolution layer processing result through a normalization layer of the nth feature extraction network to obtain an nth normalization processing result;
performing linear rectification processing on the nth normalization processing result through a linear rectification layer of the nth feature extraction network to obtain an nth linear rectification processing result;
and carrying out average pooling on the nth linear rectification processing result through an average pooling layer of the nth feature extraction network to obtain an nth feature extraction result.
9. The method of claim 1,
the method for classifying the human voice of each pronunciation segment based on the audio features of each pronunciation segment to obtain the human voice classification result of each pronunciation segment includes:
performing adaptation of a plurality of candidate classification processes based on an application scenario of the audio signal;
when the voice classification processing is adapted to the multiple candidate classification processing, performing voice classification processing on each pronunciation segment to obtain a voice classification result of each pronunciation segment;
the method for classifying the languages of the pronunciation segments based on the audio features of the pronunciation segments to obtain the language classification result of each pronunciation segment includes:
performing adaptation of a plurality of candidate classification processes based on an application scenario of the audio signal;
and when the language classification processing is adapted to the language classification processing in the candidate classification processing, performing language classification processing on each pronunciation segment to obtain a language classification result of each pronunciation segment.
10. The method of claim 9, wherein the adapting a plurality of candidate classification processes based on the application scenario of the audio signal comprises:
acquiring a limiting condition of the application scene to determine a candidate classification process corresponding to the limiting condition in the plurality of candidate classification processes as a classification process adapted to the application scene;
wherein the defined condition comprises at least one of: age; a species; the language type; sex.
11. The method of claim 1,
the voice classification processing and the language classification processing are realized through a multi-classification task model, and the multi-classification task model comprises a shared feature network, a voice classification network and a language classification network;
the method further comprises the following steps:
performing forward propagation and backward propagation on the corpus samples in the training sample set in a shared full-connection layer of the shared feature network, the human voice classification network and the language classification network and a full-connection layer corresponding to the shared feature network so as to update parameters of the shared feature network and the shared full-connection layer;
and carrying out forward propagation and backward propagation on the corpus samples in the training sample set in the updated shared feature network, the updated shared full-connection layer, the voice full-connection layer of the voice classification network and the language full-connection layer of the language classification network so as to update the parameters of the multi-classification task model.
12. The method of claim 1,
the determining the human voice classification result of the audio signal based on the human voice classification result of each pronunciation section comprises:
acquiring a first number of pronunciation segments of which the human voice classification result is non-human voice and a second number of pronunciation segments of which the human voice classification result is human voice;
determining a human voice classification result corresponding to the larger number of the first number and the second number as a human voice classification result of the audio signal;
the determining a language classification result of the audio signal based on the language classification result of each pronunciation section includes:
acquiring the language classification result as the number of pronunciation fragments of each language;
and determining the language corresponding to the maximum number as the language classification result of the audio signal.
13. A speech detection device based on artificial intelligence, comprising:
the acquisition module is used for dividing the audio signal into a plurality of pronunciation segments and acquiring the audio characteristics of each pronunciation segment;
the voice module is used for carrying out voice classification processing on each pronunciation segment based on the audio frequency characteristics of each pronunciation segment to obtain a voice classification result of each pronunciation segment;
a language module, configured to perform language classification processing on each pronunciation segment based on an audio feature of each pronunciation segment to obtain a language classification result of each pronunciation segment;
and the result module is used for determining the human voice classification result of the audio signal based on the human voice classification result of each pronunciation segment and determining the language classification result of the audio signal based on the language classification result of each pronunciation segment.
14. An electronic device, comprising:
a memory for storing executable instructions;
a processor for implementing the artificial intelligence based speech detection method of any one of claims 1 to 12 when executing executable instructions stored in the memory.
15. A computer-readable storage medium storing executable instructions for implementing the artificial intelligence based speech detection method of any one of claims 1 to 12 when executed by a processor.
CN202110074985.2A 2021-01-20 2021-01-20 Speech detection method and device based on artificial intelligence and electronic equipment Pending CN113593523A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110074985.2A CN113593523A (en) 2021-01-20 2021-01-20 Speech detection method and device based on artificial intelligence and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110074985.2A CN113593523A (en) 2021-01-20 2021-01-20 Speech detection method and device based on artificial intelligence and electronic equipment

Publications (1)

Publication Number Publication Date
CN113593523A true CN113593523A (en) 2021-11-02

Family

ID=78238105

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110074985.2A Pending CN113593523A (en) 2021-01-20 2021-01-20 Speech detection method and device based on artificial intelligence and electronic equipment

Country Status (1)

Country Link
CN (1) CN113593523A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115129923A (en) * 2022-05-17 2022-09-30 荣耀终端有限公司 Voice search method, device and storage medium
CN115129923B (en) * 2022-05-17 2023-10-20 荣耀终端有限公司 Voice searching method, device and storage medium


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40056143

Country of ref document: HK

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination