CN116913266B - Voice detection method, device, equipment and storage medium

Voice detection method, device, equipment and storage medium

Info

Publication number
CN116913266B
CN116913266B (application CN202311179043.6A)
Authority
CN
China
Prior art keywords: voice, layer, feature, historical, features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311179043.6A
Other languages
Chinese (zh)
Other versions
CN116913266A (en)
Inventor
王雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202311179043.6A
Publication of CN116913266A
Application granted
Publication of CN116913266B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 15/18 Speech classification or search using natural language modelling

Abstract

The embodiments of the present application provide a voice detection method, apparatus, device and storage medium, relate to the technical field of artificial intelligence, and can be applied to scenarios such as cloud technology, artificial intelligence, intelligent transportation and assisted driving. The method includes the following steps: performing feature extraction on voice data to be detected, and storing the obtained initial voice feature in a designated storage area; extracting N historical voice features obtained in a specified historical stage, where N is a positive integer; sorting the initial voice feature and the N historical voice features according to the acquisition time order of the corresponding voice data, and extracting, from the resulting voice feature sequence, a target context feature of the initial voice feature, where the target context feature characterizes the semantic relationship between the initial voice feature and the N historical voice features; and obtaining, based on the target context feature, the target keyword of the target classification of the voice data to be detected. The method is used to reduce the memory occupancy of the voice detection module and the computing resources it consumes.

Description

Voice detection method, device, equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a storage medium for detecting voice.
Background
Keyword detection (Spoken Term Detection) is a sub-field of speech recognition that detects specific keywords or phrases in a continuous speech signal. It is widely used in intelligent devices such as smart homes, smart speakers and smart phones, so that after receiving speech, the device can analyze it, detect the keywords it contains, activate voice interaction with the user object accordingly, and execute the flow corresponding to the voice instruction. A voice detection method running on an intelligent device therefore needs high accuracy and real-time performance, to ensure that the device is activated and executes voice instructions promptly and accurately. In practical applications, however, voice data is transmitted in units of frames, and each voice frame carries only a small amount of data; to make the intelligent device respond as quickly as possible, voice detection has to be performed while voice frames are still being received, so as to guarantee the real-time performance of voice detection.
In the related art, the general detection flow of a voice detection method is as follows: each time several current voice frames are received, several stored historical voice frames are fetched, and both the current voice frames and the historical voice frames are fed into a voice detection model, so that the model detects keywords using the context information carried by the historical voice frames, and the intelligent device then executes the flow associated with the voice instruction. However, with this method, every time the voice detection model receives current voice frames it must compute not only the current voice frames but also the historical voice frames, even though the historical voice frames were already computed as current voice frames when the intelligent device first received them. This repeated analysis and computation of voice frames consumes extra computing resources; and because the memory of the voice detection module in the intelligent device is small, the memory occupied by the historical voice frames is relatively large, which can affect the performance of the voice detection module.
For example, transmitting 1 voice frame generally takes 10 ms to 30 ms, while transmitting 1 keyword takes 1 s to 2 s, so obtaining relatively complete semantic information of a keyword may require about 20 consecutive voice frames. If voice detection is performed once every 10 received voice frames, the 10 newly received current voice frames and 10 historical voice frames together serve as the input of the voice detection model, from which the keyword is detected. It can be seen that the voice detection module has to store at least 10 voice frames as historical voice frames each time, and those 10 historical voice frames are computed repeatedly in two consecutive voice detection rounds, occupying the memory of the voice detection module and consuming additional computing resources.
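For illustration only (this sketch is not part of the patent), the redundancy described above can be made concrete by counting how many times each voice frame enters the detection model under the sliding-window scheme of the example, where every 10 newly received frames are detected together with the 10 most recent historical frames; the function names are hypothetical:

```python
from collections import Counter

def sliding_window_encoding_counts(total_frames=40, hop=10, history=10):
    """Count how often each frame enters the detection model in the related art."""
    counts = Counter()
    for end in range(hop, total_frames + 1, hop):       # one detection round per hop
        window_start = max(0, end - hop - history)      # history frames + current frames
        for frame_idx in range(window_start, end):
            counts[frame_idx] += 1                      # this frame is (re-)computed
    return counts

counts = sliding_window_encoding_counts()
# Most frames are processed twice: once as current frames and once more as
# historical frames in the following round, which is the repeated computation
# the application seeks to avoid by caching encoded features instead.
print(sum(counts.values()), "frame computations for", len(counts), "frames")  # 70 for 40
```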
Therefore, the voice detection method needs to be redesigned to overcome the above drawbacks.
Disclosure of Invention
The embodiment of the application provides a voice detection method, a device, equipment and a storage medium, which are used for reducing the memory occupancy rate and consumed computing resources of a voice detection module.
In a first aspect, an embodiment of the present application provides a method for detecting voice, including:
extracting features of voice data to be detected to obtain initial voice features;
extracting, from a designated storage area, N historical voice features obtained in a specified historical stage, wherein the time interval between the specified historical stage and the current moment meets a preset interval condition, and N is a positive integer;
sequencing the initial voice features and the N historical voice features according to the acquisition time sequence of the corresponding voice data to obtain a corresponding voice feature sequence;
extracting target context features of the initial voice features contained in the voice feature sequence; wherein the target context feature characterizes: semantic relationships between the initial speech feature and the N historical speech features;
and based on the target context characteristics, obtaining target classification of the voice data to be detected, obtaining target keywords corresponding to the target classification, and storing the initial voice characteristics into a designated storage area to serve as historical voice characteristics in the next voice detection.
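As a reading aid only, the first-aspect flow can be sketched as follows; this is a minimal sketch in which `encoder`, `context_extractor` and `classifier` are generic placeholders rather than the patent's concrete models:

```python
from collections import deque

class StreamingKeywordDetector:
    """One round of detection: encode only the new voice data, reuse cached history."""

    def __init__(self, encoder, context_extractor, classifier, n_history):
        self.encoder = encoder
        self.context_extractor = context_extractor
        self.classifier = classifier
        self.history = deque(maxlen=n_history)              # designated storage area for features

    def detect(self, voice_data):
        feature = self.encoder(voice_data)                  # initial voice feature
        sequence = list(self.history) + [feature]           # ordered by acquisition time
        target_context = self.context_extractor(sequence)   # target context feature
        target_class = self.classifier(target_context)      # target classification -> keyword
        self.history.append(feature)                        # stored for the next round
        return target_class
```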
In a second aspect, an embodiment of the present application provides a voice detection apparatus, including:
the first feature extraction unit is used for extracting features of the voice data to be detected to obtain initial voice features;
The data access unit is used for extracting N historical voice features obtained in a specified historical stage from the specified storage area, wherein the time interval between the specified historical stage and the current moment accords with a preset interval condition, and N is a positive integer; sequencing the initial voice features and the N historical voice features according to the acquisition time sequence of the corresponding voice data to obtain a corresponding voice feature sequence;
a second feature extraction unit, configured to extract a target context feature of the initial speech feature included in the speech feature sequence; wherein the target context feature characterizes: semantic relationships between the initial speech feature and the N historical speech features;
the classification unit is used for obtaining target classification of the voice data to be detected based on the target context characteristics, obtaining target keywords corresponding to the target classification, and storing the initial voice characteristics into a designated storage area to serve as historical voice characteristics in the next voice detection.
Optionally, the second feature extraction unit is specifically configured to,
inputting the voice feature sequence into a time-delay neural network having M layers, where for the i-th layer, 1 ≤ i ≤ M, the following operations are performed:
when i = 1, performing feature extraction on the initial voice feature and the N historical voice features based on the layer-1 network, to obtain a first-layer context feature of the initial voice feature;
when 2 ≤ i ≤ M, extracting N (i-1)-th layer history output features stored by the (i-1)-th layer network in the specified historical stage, and performing feature extraction, based on the i-th layer network and in combination with the N (i-1)-th layer history output features, on the (i-1)-th layer context feature output by the (i-1)-th layer network, to obtain an i-th layer context feature; the N (i-1)-th layer history output features are obtained by passing the N historical voice features through the (i-1)-th layer of the M-layer network;
and taking the finally output M-th layer context feature as the target context feature.
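A hedged sketch of the layer-wise streaming forward pass just described: each layer keeps a queue of the features it consumed in previous rounds, so only the newest feature has to propagate through the network in the current round. The `layers[i]` callables and the queue handling are illustrative assumptions, not the patent's concrete implementation:

```python
from collections import deque

def tdnn_streaming_forward(layers, layer_caches, initial_feature):
    """layers[0..M-1]: per-layer callables that take a feature sequence.
    layer_caches[i]: deque(maxlen=N) holding this layer's inputs from earlier rounds
    (for i >= 1 these are the (i-1)-th layer history output features)."""
    current = initial_feature
    for layer, cache in zip(layers, layer_caches):
        context = layer(list(cache) + [current])   # N cached features + the new one
        cache.append(current)                      # becomes a history feature next round
        current = context
    return current                                 # M-th layer context = target context feature

def make_layer_caches(num_layers, n_history):
    return [deque(maxlen=n_history) for _ in range(num_layers)]
```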
Optionally, the data access unit is further configured to,
and storing the i-th layer context feature in the storage area corresponding to the i-th layer network, to serve as an i-th layer history output feature in the next round of voice detection, and deleting the i-th layer context feature that was stored earliest in that storage area, so that the number of features in the feature queue of the storage area corresponding to the i-th layer network remains N.
Optionally, the second feature extraction unit is further configured to,
if the initial voice feature and the N historical voice features are represented in a floating-point data type, deriving an initial scaling coefficient from the largest floating-point number among the initial voice feature and the N historical voice features, for conversion to an integer data type;
performing forward quantization scaling on the initial voice feature and the N historical voice features according to the initial scaling coefficient, to obtain the initial voice feature and the N historical voice features in the integer data type;
and after the first-layer context feature of the initial voice feature is obtained, the method further includes:
performing inverse quantization scaling on the first-layer context feature of the integer data type based on the initial scaling coefficient, to obtain the first-layer context feature in the floating-point data type.
Optionally, the second feature extraction unit is further configured to,
when 2 ≤ i ≤ M, if the (i-1)-th layer context feature and the N (i-1)-th layer history output features are represented in a floating-point data type, deriving an i-th layer scaling coefficient from the largest floating-point number among the (i-1)-th layer context feature and the N (i-1)-th layer history output features, for conversion to an integer data type;
performing forward quantization scaling, according to the i-th layer scaling coefficient, on the (i-1)-th layer context feature and the N (i-1)-th layer history output features of the floating-point data type, to obtain the (i-1)-th layer context feature and the N (i-1)-th layer history output features in the integer data type;
and after the i-th layer context feature is obtained, the method further includes:
performing inverse quantization scaling on the i-th layer context feature of the integer data type based on the i-th layer scaling coefficient, to obtain the i-th layer context feature in the floating-point data type.
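The quantization steps above can be sketched as follows, under the assumption of a symmetric scale derived from the largest absolute floating-point value and an 8-bit integer representation; the bit width, rounding mode and the `layer_int` integer-domain layer are assumptions added for illustration, not details taken from the patent:

```python
import numpy as np

def forward_quantize(features, bits=8):
    """Forward quantization scaling: map float features to integers with a max-based scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = float(np.abs(features).max()) / qmax
    scale = scale if scale > 0 else 1.0
    return np.round(features / scale).astype(np.int32), scale

def inverse_quantize(q_features, scale):
    """Inverse quantization scaling: back to floating point after the layer runs."""
    return q_features.astype(np.float32) * scale

def quantized_layer_step(layer_int, prev_context, history_outputs):
    stacked = np.vstack([history_outputs, prev_context])  # (i-1)-layer history + context
    q_inputs, scale = forward_quantize(stacked)            # forward quantization scaling
    q_context = layer_int(q_inputs)                        # integer-domain i-th layer
    return inverse_quantize(q_context, scale)              # i-th layer context, float type
```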
Optionally, the second feature extraction unit is further configured to,
performing quantization shifting on the model parameters of the layer-1 network based on the initial voice feature and the N historical voice features of the integer data type;
and before the feature extraction is performed, based on the i-th layer network and in combination with the N (i-1)-th layer history output features, on the (i-1)-th layer context feature output by the (i-1)-th layer network, the method further includes:
performing quantization shifting on the model parameters of the i-th layer network based on the (i-1)-th layer context feature and the N (i-1)-th layer history output features of the integer data type.
Optionally, the classifying unit is specifically configured to,
and mapping the target context characteristics by adopting a fully-connected neural network to obtain target classification of the voice data to be detected.
Optionally, the method is performed by a target voice detection model, and a training unit is specifically configured to train it; the training process of the target voice detection model is as follows:
performing multiple rounds of iterative training on the voice detection model to be trained based on a preset training sample set, where each training sample includes a sample voice to be detected, N historical sample voices and a classification label, and the sample voice to be detected and the N historical sample voices are sorted by acquisition time; in one round of the iterative process, the following operations are performed:
respectively performing feature extraction on the sample voice to be detected and the N historical sample voices in a training sample, to obtain a to-be-detected voice feature and N sample voice features;
extracting sample context characteristics of the voice characteristics to be detected from the voice characteristics to be detected and the N sample voice characteristics; wherein the sample context feature characterizes: semantic relationships between the to-be-detected voice features and the N sample voice features;
based on the sample context characteristics, a classification result is obtained, and parameters of the voice detection model are adjusted according to the classification result and the difference of the classification labels of the training sample.
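The training loop can be sketched as below; the cross-entropy loss and the PyTorch-style update are assumptions added for illustration, since the embodiment only states that the parameters are adjusted according to the difference between the classification result and the classification label:

```python
import torch
import torch.nn.functional as F

def train_one_sample(encoder, context_extractor, classifier, optimizer, sample):
    speech_to_detect, history_speech, label = sample              # label: class-index tensor
    feature = encoder(speech_to_detect)                           # to-be-detected voice feature
    history_features = [encoder(h) for h in history_speech]       # N sample voice features
    context = context_extractor(history_features + [feature])     # sample context feature
    logits = classifier(context)                                  # classification result
    loss = F.cross_entropy(logits.unsqueeze(0), label.view(1))    # compare with the label
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```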
In a third aspect, an embodiment of the present application provides a computer device, including a processor and a memory, where the memory stores a computer program, and when the computer program is executed by the processor, causes the processor to execute any one of the voice detection methods in the first aspect.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium including a computer program, where the computer program is configured to cause a computer device to perform any one of the above-mentioned methods for detecting speech when the computer program is run on the computer device.
In a fifth aspect, embodiments of the present application provide a computer program product comprising a computer program stored in a computer readable storage medium; when a processor of a computer device reads the computer program from a computer-readable storage medium, the processor executes the computer program so that the computer device performs any one of the above-described voice detection methods of the first aspect.
The beneficial effects of the application are as follows:
in this embodiment, in at least one round of voice detection before the current round, each round of voice detection uses a voice encoder to encode the received voice data into historical voice features, and stores the obtained historical voice features in a designated storage area. Therefore, when the current round of voice detection needs to combine historical voice data for detection and analysis, the voice encoder can directly encode the voice data to be detected to obtain the initial voice feature, and the required historical voice features can then be fetched from the designated storage area and used in the detection and analysis of the voice data to be detected, so as to obtain the target keyword corresponding to the target classification of the voice data to be detected.
Compared with the related art, in which the current voice frames and a number of historical voice frames are all fed into the voice detection model to obtain the detection result, so that the historical voice frames occupy a large amount of memory and the historical voice data has to be encoded repeatedly, the present application stores the already encoded historical voice features; during voice detection the historical voice features can be fetched directly, without re-encoding the historical voice data, which effectively reduces the memory occupancy and the computing resources consumed, improves the performance of the voice detection device, and speeds up detection.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the related art, the drawings that are required to be used in the embodiments or the related technical descriptions will be briefly described below, and it is apparent that the drawings in the following description are only embodiments of the present application, and other drawings may be obtained according to the provided drawings without inventive effort for a person having ordinary skill in the art.
Fig. 1 is a schematic diagram of sliding window type flow type reasoning in a voice detection process of each voice frame of voice data according to an embodiment of the present application;
fig. 2 is an optional schematic diagram of an application scenario provided in an embodiment of the present application;
fig. 3 is a schematic flow chart of a voice detection method according to an embodiment of the present application;
fig. 4 is a schematic diagram of a voice data conversion method according to an embodiment of the present application;
fig. 5 is a schematic diagram of correspondence between voice data and voice features according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a method for feature extraction using a pre-training model according to an embodiment of the present application;
fig. 7 is a schematic diagram of a method for extracting features by using a long-short-term memory network according to an embodiment of the present application;
fig. 8 is a schematic diagram of a method for extracting features by using a time delay neural network model according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a classification method for classifying a target context feature using a fully connected neural network according to an embodiment of the present application;
fig. 10 is a schematic diagram of a voice detection method according to an embodiment of the present application;
FIG. 11 is a schematic diagram of a method for extracting a target context feature according to an embodiment of the present application;
Fig. 12 is a schematic structural diagram of a delay neural network with an M-layer network according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of a delay neural network with an M-layer network according to an embodiment of the present application;
fig. 14 is a schematic structural diagram of a time delay neural network according to an embodiment of the present application;
fig. 15 is a schematic diagram of a quantization scaling method according to an embodiment of the present application;
fig. 16 is a schematic diagram of a quantization scaling method according to an embodiment of the present application;
fig. 17 is a schematic diagram of a quantization scaling method according to an embodiment of the present application;
FIG. 18 is a schematic diagram of a model parameter quantization scaling method according to an embodiment of the present application;
fig. 19 is a schematic diagram of a voice detection apparatus according to an embodiment of the present application;
FIG. 20 is a schematic diagram of a hardware configuration of a computer device to which embodiments of the present application are applied;
fig. 21 is a schematic diagram of a hardware composition structure of another computer device to which the embodiments of the present application are applied.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present application more apparent, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure. Embodiments and features of embodiments in this application may be combined with each other arbitrarily without conflict. Also, while a logical order is depicted in the flowchart, in some cases, the steps depicted or described may be performed in a different order than presented herein.
It will be appreciated that in the following detailed description of the present application, related data such as voice data, sample voice to be detected, and historical sample voice will be referred to, when the embodiments of the present application are applied to specific products or technologies, related permissions or consents will be obtained, and the collection, use and processing of related data will be required to comply with the relevant laws and regulations and standards of the relevant country and region. For example, where relevant data is required, this may be implemented by recruiting relevant volunteers and signing the relevant agreement of volunteer authorisation data, and then using the data of these volunteers; alternatively, by implementing within the scope of the authorized allowed organization, relevant recommendations are made to the organization's internal members by implementing the following embodiments using the organization's internal member's data; alternatively, the relevant data used in the implementation may be analog data, for example, analog data generated in a virtual scene.
In order to facilitate understanding of the technical solutions provided in the embodiments of the present application, some key terms used in the embodiments of the present application are explained here:
A time delay neural network (Time Delay Neural Network, TDNN) is an artificial neural network structure for processing sequence data and was one of the earliest convolutional neural network models applied to speech recognition. The model applies convolution along the time axis and the frequency axis, can adapt to dynamic changes of time-domain features, and has relatively few parameters. The hidden-layer features of a TDNN are related to the input at the current moment as well as the inputs at past and future moments, and the input of each TDNN layer is obtained through a context window over the previous layer, so the temporal relationship between nodes of adjacent layers can be described.
Speech recognition technology (Automatic Speech Recognition, ASR), also known as automatic speech recognition, aims to convert the lexical content of human speech into computer-readable input, such as key presses, binary codes or character sequences. Speech recognition is an important branch of artificial intelligence and involves many disciplines, such as signal processing, computer science, linguistics, acoustics, physiology and psychology, and it is a key link in natural human-computer interaction. Speech recognition is technically more complex than speech synthesis but is more widely used. The biggest advantage of ASR is to make human-machine user interfaces more natural and easier to use.
A frame is a small unit of data transmitted over a network and consists of several parts with different functions. In Ethernet transmission, frames are assembled by software called the network driver, sent through the network card onto the network cable and carried to the destination machine, where the reverse process takes place: the Ethernet card of the receiving machine captures the frames, notifies the operating system that they have arrived, and the frames are then stored. Frame data consists of two parts: the frame header and the frame data area. The header contains the physical address of the receiving host and other network information; the frame data area contains the data body. To ensure that the machines can interpret the data in a frame, the two machines use a common communication protocol. The protocol used by the Internet is IP, the Internet Protocol. An IP data body also consists of two parts: a header and a data area. The header contains the IP source address, the IP destination address and other information; the data area contains User Datagram Protocol (UDP) or Transmission Control Protocol (TCP) packets and other information. These packets contain additional process information as well as the actual data.
Fully connected layer: every node is connected to all nodes of the previous layer and is used to integrate the features extracted by the preceding layers. Because of this fully connected property, the fully connected layer usually has the most parameters; it reduces the influence of feature position on the classification result and improves the robustness of the whole deep neural network.
Regularization layer: includes LN (layer normalization), a method proposed for the natural language processing domain that converts the input into data with a mean of 0 and a variance of 1. Normalization is typically performed before the data is fed into the activation function, so that the input does not fall into the saturation region of the activation function. It alleviates gradient vanishing and gradient explosion in DNN training and speeds up model training.
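For reference, the standard layer normalization computation over a feature vector of dimension H (a textbook formula, not spelled out in the patent) is:

```latex
\mu = \frac{1}{H}\sum_{i=1}^{H} x_i,\qquad
\sigma^2 = \frac{1}{H}\sum_{i=1}^{H}\left(x_i-\mu\right)^2,\qquad
\mathrm{LN}(x_i) = \gamma\,\frac{x_i-\mu}{\sqrt{\sigma^2+\epsilon}}+\beta
```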
An activation function (Activation Function) is a function running on a neuron of an artificial neural network and is responsible for mapping the neuron's input to its output. The sigmoid function, a double-side saturating activation function that maps variables into the interval (0, 1), is often used as a neural network activation function. When the ReLU function is used as the activation function, the gradient is constant for the part greater than 0, and the derivative is 0 for the part less than 0; therefore, once a neuron's activation value enters the negative half-axis its gradient is 0 and the neuron is no longer trained, and only when the activation value enters the positive half-axis does a gradient exist and the neuron get trained.
Quantization: converting a vector or matrix of a floating-point data type into a vector or matrix of an integer data type with a given bit width.
Inverse quantization: converting a vector or matrix of an integer data type with a given bit width into a vector or matrix of a floating-point data type.
Quantization scaling: the scaling factor required, in the quantization process, to convert floating-point data into integer data.
A floating-point number is a digital representation of a number belonging to a particular subset of the rational numbers and is used in a computer to approximate an arbitrary real number. Specifically, the real number is obtained by multiplying an integer or fixed-point number (the mantissa) by an integer power of some radix (usually 2 in a computer), a representation similar to scientific notation with radix 10. A floating-point number a is represented by two numbers m and e: a = m × b^e.
Model quantization: converting a floating-point model into an integer model with a given bit width, to reduce the space needed to store the model in ROM and the space needed to load it in RAM.
The technical solution of the embodiments of the present application relates to artificial intelligence, speech recognition technology and machine learning technology, and artificial intelligence (Artificial Intelligence, AI) is a theory, method, technology and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, sense environment, acquire knowledge and use knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
The artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning, automatic driving, intelligent traffic and other directions.
Key technologies to the speech technology (Speech Technology) are automatic speech recognition technology (ASR) and speech synthesis technology (TTS) and voiceprint recognition technology. The method can enable the computer to listen, watch, say and feel, is the development direction of human-computer interaction in the future, and voice becomes one of the best human-computer interaction modes in the future.
Machine learning is a multi-domain interdisciplinary, involving multiple disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and the like. It is specially studied how a computer simulates or implements learning behavior of a human to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve own performance. Machine learning is the core of artificial intelligence, a fundamental approach to letting computers have intelligence, which is applied throughout various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, confidence networks, reinforcement learning, transfer learning, induction learning, teaching learning, and the like.
With the research and progress of artificial intelligence technology, it has been researched and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, robots, smart medical care, smart customer service, the Internet of Vehicles and smart transportation, and it is believed that with the development of technology, artificial intelligence will be applied in more fields and become increasingly important.
The solution provided by the embodiments of the present application relates to artificial-intelligence speech recognition technology and machine learning technology. The voice detection model provided by the embodiments of the present application is mainly used to analyze and detect a sample voice to be detected in combination with N historical sample voices, obtain a classification result, and thereby obtain the keyword corresponding to the classification result. The training and use of the voice detection model can be divided into two parts: a training part and an application part. The training part relates to speech recognition and machine learning: the voice detection model performs speech recognition processing, encoding processing and feature extraction on the sample voice to be detected and the historical sample voices through speech recognition technology; the model is trained by machine learning, and the model parameters are continuously adjusted by an optimization algorithm until the model converges, so that the corresponding in-model parameters are obtained after the training samples pass through the voice detection model. The application part also relates to speech recognition and machine learning: the voice detection model encodes the voice data to be detected through speech recognition technology, extracts the initial voice feature of the voice data to be detected, and, in combination with the N historical voice features obtained in the specified historical stage, analyzes and detects the initial voice feature to obtain the target classification of the voice data to be detected and determine the target keyword of the voice data to be detected. Using the voice detection model trained in the training part together with the in-model parameters obtained by machine learning, the corresponding target keyword is obtained after the voice data to be detected is input into the voice detection model based on those parameters. In addition, it should be noted that the artificial neural network model in the embodiments of the present application may be trained online or offline, which is not specifically limited here; offline training is taken as an example for illustration.
The following briefly describes the design concept of the embodiment of the present application:
A smart device refers to any device, appliance or machine with computing and processing capability. A fully functional intelligent device must have sensitive and accurate sensing, correct thinking and judgment, and effective execution. With the development of voice technology, many intelligent devices have a speech recognition function, so that corresponding operations of the intelligent device can be controlled by voice, for example smart phones, smart homes and smart speakers; a smart air conditioner in a smart home, for instance, can recognize power-on or power-off expressed in voice data, obtain temperature information in the voice data, and set the air conditioner to the specified temperature. The intelligent device therefore needs to receive, analyze and detect voice data in real time, quickly obtain the corresponding keywords, determine the voice instruction corresponding to the keywords in the voice data, and execute the program corresponding to the voice instruction, so as to respond to the user's needs in time. In practice, however, voice data is transmitted in small amounts, and a piece of voice data may be transmitted as multiple voice frames; if the intelligent device analyzes and detects a piece of voice data only after receiving all voice frames contained in it, the response delay may be long.
In the related art, in order to ensure timely response of the intelligent device, a streaming inference method is generally used for voice detection. During voice detection, every time a preset number of current voice frames are received, they are analyzed and detected in combination with a fixed number of the most recently received historical voice frames, to obtain the keywords corresponding to the current voice frames and thus the voice instruction. On the time axis of the transmission of the voice frames, this appears as sliding-window streaming inference, as shown in fig. 1, which is a schematic diagram of sliding-window streaming inference over the voice frames of voice data in the voice detection process provided by the embodiment of the present application: whenever 8 current voice frames are received, the 8 most recent historical voice frames are taken from the stored historical voice frames, and the 8 current voice frames are analyzed and detected together with those 8 historical voice frames, ensuring the timeliness of detecting the keyword and voice instruction. In practical applications, although this guarantees the timeliness of voice detection, when the 8 historical voice frames were first received they were already computed, analyzed and detected for keywords; after the 8 current voice frames are received, the 8 historical voice frames must be computed again, which means every voice frame in a piece of voice data is computed at least twice, consuming extra computing resources. In addition, the voice detection module that executes the voice detection method is usually an edge computing device in the intelligent device, or an embedded edge device providing the voice detection service in a service platform. Such devices are designed for real-time computation on small amounts of data and have little memory, so storing historical voice frames raises the memory occupancy of the voice detection module and affects its running performance.
In view of this, embodiments of the present application provide a voice detection method, apparatus, computer device and storage medium. In at least one round of voice detection before the current round, each round uses a voice encoder to encode the received voice data into historical voice features and stores the obtained historical voice features in a designated storage area. Thus, when the current round of voice detection needs to combine historical voice data for detection and analysis, the voice encoder can directly encode the voice data to be detected to obtain the initial voice feature; the required N historical voice features obtained in the specified historical stage are then extracted from the designated storage area, and the detection and analysis of the voice data to be detected combines the contextual semantic relationship between the N historical voice features and the initial voice feature of the voice data to be detected, so as to obtain the target keyword corresponding to the target classification of the voice data to be detected.
Compared with the related art, in which the current voice frames and a number of historical voice frames are all fed into the voice detection model to obtain the detection result, so that the historical voice frames occupy a large amount of memory and the historical voice data has to be encoded repeatedly, the present application stores the already encoded historical voice features; during voice detection the historical voice features can be fetched directly, without re-encoding the historical voice data, which effectively reduces the memory occupancy and the computing resources consumed, improves the performance of the voice detection device, and speeds up detection.
The preferred embodiments of the present application will be described below with reference to the accompanying drawings of the specification, it being understood that the preferred embodiments described herein are for illustration and explanation only, and are not intended to limit the present application, and embodiments and features of embodiments of the present application may be combined with each other without conflict.
Fig. 2 is a schematic view of an application scenario in an embodiment of the present application. The application scenario diagram includes any one of a plurality of terminal devices 210 and any one of a plurality of servers 220.
In the embodiment of the present application, the terminal device 210 includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a desktop computer, an electronic book reader, an intelligent voice interaction device, an intelligent home appliance, a vehicle-mounted terminal, and the like; the terminal device may be provided with a client related to the voice detection service, where the client may be software (such as a browser, communication software, etc.), or may be a web page, an applet, etc., and the server 220 is a background server corresponding to the software or the web page, the applet, etc., or is a background server specifically configured to provide the voice detection service to the client, which is not specifically limited in this application. The server 220 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), basic cloud computing services such as big data and artificial intelligence platforms, and the like.
It should be noted that, the voice detection method in the embodiment of the present application may be performed by a computer device, which may be the server 220 or the terminal device 210, that is, the method may be performed by the server 220 or the terminal device 210 alone, or may be performed by the server 220 and the terminal device 210 together.
For example, when the terminal device 210 and the server 220 perform the method together, the client in the terminal device 210 continuously collects the voice data of the user object and continuously transmits it to the server 220. The server 220 continuously receives the voice data sent by the terminal device 210 and performs multiple rounds of voice detection; in each round it performs feature extraction on the voice data to be detected, obtains the initial voice feature, and stores the initial voice feature in the designated storage area. It extracts, from the designated storage area, N historical voice features obtained in the specified historical stage; sorts the initial voice feature and the N historical voice features according to the acquisition time order of the corresponding voice data to obtain the corresponding voice feature sequence; extracts the target context feature of the initial voice feature contained in the voice feature sequence; obtains, based on the target context feature, the target classification of the voice data to be detected and the target keyword corresponding to the target classification; determines the corresponding voice instruction according to the target keyword; and sends it to the terminal device 210 so that the terminal device executes the corresponding program.
For example, when the terminal device 210 or the server 220 performs the method alone, the terminal device 210 or the server 220 continuously collects the voice data of the user object and performs multiple rounds of voice detection; in each round it performs feature extraction on the voice data to be detected, obtains the initial voice feature, and stores the initial voice feature in the designated storage area. Likewise, it extracts from the designated storage area the N historical voice features obtained in the specified historical stage, detects the initial voice feature in combination with the N historical voice features to obtain the target context feature and then the target keyword, and determines the corresponding voice instruction according to the target keyword, so that the terminal device 210 or the server 220 executes the corresponding program.
It should be noted that the number of terminal devices and servers and the communication manner are not limited in practice and are not specifically limited in the embodiments of the present application; fig. 2 is only an illustration.
In addition, the voice detection method and apparatus of the present application can be applied to various scenarios, such as switching on and parameter-setting control of intelligent devices (for example smart lamps and refrigerators), intelligent voice-controlled driving, intelligent voice assistants of smart terminals, voice control of smart medical devices, intelligent robots, and so on. When applied to the technical solution of a cloud intelligent voice assistant, the method is mainly applicable to low-resource offline voice wake-up and command-word recognition algorithms and is mainly intended for RTOS (real-time operating system) and DSP (digital signal processor) platforms, to recognize fixed wake-up words and a limited number of command words.
The speech detection method provided by the exemplary embodiments of the present application will be described below with reference to the accompanying drawings in conjunction with the application scenario described above, and it should be noted that the application scenario described above is only shown for the convenience of understanding the spirit and principles of the present application, and the embodiments of the present application are not limited in any way in this respect.
Referring to fig. 3, a flowchart of a voice detection method provided in the embodiment of the present application is illustrated by taking a server as an execution body, and a specific implementation flow of the method is as follows:
step 301, extracting features of the voice data to be detected, and obtaining initial voice features.
In one embodiment, voice data may be speech sent by the user object to the intelligent device; the keywords it contains correspond to voice instructions in the intelligent device, so that a preset program corresponding to the voice instruction can be executed, the corresponding function realized, and a voice control service provided. For example, the user object says "turn off the air conditioner" to a smart air conditioner; the keyword "turn off the air conditioner" corresponds to the voice instruction for turning off the air conditioner, and after obtaining the keyword from the voice data, the smart air conditioner obtains the corresponding voice instruction and executes the preset shutdown program corresponding to it. As another example, the user object says "Small Z, what time is it?" to a smart phone; the keyword "Small Z, what time is it" corresponds to the voice instruction for obtaining the current time, and after obtaining the keyword from the voice data, the smart phone obtains the corresponding voice instruction, executes the preset program corresponding to it and announces the current time.
In one embodiment, the voice data to be detected may be 1 voice frame, 2 voice frames, 3 voice frames, etc., where the data amount contained in the voice data to be detected is not specifically limited, and may be set according to the real-time requirement.
In one embodiment, the voice data to be detected is 1 voice frame, and voice detection is performed each time 1 voice frame is received, which maximizes the voice detection speed and quickly yields the target classification, so that the voice data to be detected is responded to quickly. Suppose a piece of voice data contains 40 voice frames and a round of voice detection is performed for every voice frame received, i.e. the voice data to be detected is 1 voice frame. Assume the analysis and detection of each voice frame takes 10 ms; then after the 40th voice frame is received, only 10 ms is needed to complete detection and obtain the target classification, and the response can be made quickly. To the user object it feels as if the intelligent device responds the moment the voice is uttered, which gives a good user experience. In the related art, when 10 voice frames are received, 10 voice frames are taken from the stored historical voice frames and voice detection is performed, so one round of detection takes 200 ms; to the user object it feels as if the response comes some time after the voice is uttered, giving a poorer experience, and the detection of the 40 voice frames takes 8000 ms, consuming more time for voice detection.
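The latency figures in this example can be checked with simple arithmetic, under the stated assumption of 10 ms of analysis per voice frame; these mirror the example's numbers rather than measurements, and the 8000 ms total corresponds to one possible reading in which the related-art 20-frame window is evaluated once per received frame:

```python
PER_FRAME_ANALYSIS_MS = 10
TOTAL_FRAMES = 40

# Proposed scheme: 1 new frame analysed per round, so the round triggered by the
# 40th frame finishes 10 ms after that frame arrives.
proposed_final_round_ms = 1 * PER_FRAME_ANALYSIS_MS            # 10 ms

# Related art: every round re-analyses 10 current + 10 historical frames.
related_art_round_ms = (10 + 10) * PER_FRAME_ANALYSIS_MS       # 200 ms per round
related_art_total_ms = TOTAL_FRAMES * related_art_round_ms     # 8000 ms if run per frame

print(proposed_final_round_ms, related_art_round_ms, related_art_total_ms)
```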
In an embodiment, feature extraction is performed on the voice data to be detected, which may be that after the voice data to be detected is received, a method such as a voice encoder or a voice converter is adopted to convert the voice data to be detected into initial voice features, as shown in fig. 4, which is a voice data conversion method provided in the embodiment of the present application.
In one embodiment, when the initial speech feature is stored in the designated storage area, the earliest historical speech feature in the designated storage area may be deleted, ensuring that only the required historical speech feature is stored in the designated storage area.
In one embodiment, the designated storage area may be a memory or cache in the voice detection module, or the like. The specific storage area is not limited here, and may be set as needed.
Step 302, extracting N historical voice features obtained in a specified historical stage from a specified storage area; the time interval between the appointed historical stage and the current moment accords with a preset interval condition, and N is a positive integer.
In one embodiment, the current time may be the time of receiving the voice data to be detected, or may be the time of obtaining the initial voice feature of the voice data to be detected, or may be the time of needing to extract the historical voice feature from the designated storage area, where the current time is not limited and may be set as required.
In one embodiment, the preset interval condition may be that enough historical voice features are acquired, in which case the specified historical stage is the time interval between the acquisition time of the earliest of the N (a positive integer) historical voice features and the current moment. Alternatively, the preset interval condition may be that the historical voice features are acquired within a fixed time window: the N historical voice features closest to the current moment are acquired within the fixed time window ending at the current moment (if fewer than N historical voice features are available, default values may be used for padding so that N historical voice features are always provided), and the specified historical stage is then the time interval between the acquisition time of the earliest acquired historical voice feature and the current moment. Alternatively, the preset interval condition may be that the historical voice features of a fixed number of previous rounds of voice detection are acquired, and the specified historical stage refers to the time interval corresponding to those rounds of voice detection. The preset interval condition is not specifically limited here and may be set as needed.
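A minimal sketch of the fixed-time-window variant of the interval condition described above; the window length, default padding value and feature dimension are illustrative assumptions:

```python
import numpy as np

def fetch_history_features(feature_store, now, window_s=2.0, n=3, feature_dim=64):
    """feature_store: chronologically ordered list of (timestamp, feature) pairs."""
    in_window = [f for (t, f) in feature_store if now - window_s <= t <= now]
    recent = in_window[-n:]                                   # the N most recent features
    padding = [np.zeros(feature_dim)] * (n - len(recent))     # pad when fewer than N exist
    return padding + recent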
Step 303, sorting the initial voice feature and the N historical voice features according to the acquisition time sequence of the corresponding voice data, so as to obtain a corresponding voice feature sequence.
In one embodiment, as shown in fig. 5, which is a schematic diagram of the correspondence between voice data and voice features provided by this embodiment, in the fourth round of voice detection currently in progress, the voice data to be detected is received and feature extraction is performed on it (tools such as a voice encoder or voice converter may be used) to obtain the corresponding initial voice feature; in the previous three rounds of voice detection, the historical voice data of the first round corresponds to its own historical voice feature, the historical voice data of the second round corresponds to its own historical voice feature, and the historical voice data of the third round corresponds to its own historical voice feature.
In an embodiment, based on the foregoing embodiment, the initial voice feature and the N historical voice features are sorted according to the acquisition time order of the corresponding voice data, where the corresponding voice data includes the voice data to be detected of the initial voice feature and the historical voice data of each historical voice feature; the voice feature sequence may be X1, X2, X3, X4, or X4, X3, X2, X1.
Step 304, extracting target context characteristics of initial voice characteristics contained in the voice characteristic sequence; wherein, the target context feature characterizes: semantic relationships between the initial speech feature and the N historical speech features.
In one embodiment, as shown in fig. 6, which is a schematic diagram of extracting features with a pre-training model according to an embodiment of the present application, a pre-training model such as BERT (Bidirectional Encoder Representations from Transformers, a pre-trained language representation model based on a bidirectional Transformer encoder) may be used to extract the semantic relationships between the initial voice feature and the N historical voice features in the voice feature sequence, so as to obtain the target context feature of the initial voice feature.
In one embodiment, as shown in fig. 7, a long short-term memory network (Long Short-Term Memory, LSTM, an improved recurrent neural network that addresses the difficulty ordinary RNNs have with long-distance dependencies) may be used to extract the semantic relationships between the initial voice feature and the N historical voice features in the voice feature sequence and obtain the target context feature of the initial voice feature.
In one embodiment, as shown in fig. 8, a time-delay neural network model may be used to extract semantic relationships between the initial speech feature and N historical speech features in the speech feature sequence, and obtain target context features of the initial speech feature.
In one embodiment, the target context feature includes both semantic information included in the initial speech feature and semantic relationships between the initial speech feature and the N historical speech features.
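As a non-limiting illustration of the LSTM-based option above, the following sketch (assuming PyTorch) builds a voice feature sequence from N historical features plus the initial feature and takes the output at the position of the initial feature as its context feature; layer sizes and variable names are illustrative assumptions rather than the patented configuration.

```python
import torch
import torch.nn as nn

feature_dim, hidden_dim, n_history = 64, 128, 4

lstm = nn.LSTM(input_size=feature_dim, hidden_size=hidden_dim, batch_first=True)

# voice feature sequence: N historical features followed by the initial feature
sequence = torch.randn(1, n_history + 1, feature_dim)    # (batch, time, feature)
outputs, _ = lstm(sequence)

# the output at the position of the initial (latest) feature mixes the initial
# feature with the history and serves as the target context feature
target_context_feature = outputs[:, -1, :]               # shape (1, hidden_dim)
```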
Step 305, obtaining the target classification of the voice data to be detected based on the target context feature, obtaining the target keyword set corresponding to the target classification, and storing the initial voice feature in the designated storage area as a historical voice feature for the next round of voice detection.
In one embodiment, a decision tree method is adopted to classify the target context features, so as to obtain target classification of the target context features, i.e. obtain target classification of the voice data to be detected corresponding to the target context features.
In one embodiment, a random forest method is used to classify the target contextual features to obtain a target classification of the target contextual features. Here, the classification method for classifying the target context features is not particularly limited, and may be set as required, for example, a support vector machine, a logistic regression algorithm, and the like.
Based on the above-mentioned method flow in fig. 3, the embodiment of the present application provides a classification method for classifying target context features, in step 305, obtaining target classification of voice data to be detected based on the target context features, including:
And mapping the target context characteristics by adopting a fully-connected neural network to obtain target classification of the voice data to be detected.
In one embodiment, as shown in fig. 9, a schematic diagram of a classification method for classifying a target context feature by using a fully connected neural network is provided in an embodiment of the present application.
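For illustration, the following is a minimal numpy sketch of such a fully-connected mapping from the target context feature to a target classification; the weight shapes, class count and function names are assumptions used only to show the idea.

```python
import numpy as np

hidden_dim, num_classes = 128, 10                 # assumed sizes
W = np.random.randn(num_classes, hidden_dim) * 0.01
b = np.zeros(num_classes)

def classify(context_feature):
    """Map a context feature to class posteriors with one fully-connected layer."""
    logits = W @ context_feature + b
    probs = np.exp(logits - logits.max())          # numerically stable softmax
    probs /= probs.sum()
    return int(np.argmax(probs)), probs            # target classification + posteriors

target_class, posteriors = classify(np.random.randn(hidden_dim))
```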
In an embodiment, as shown in fig. 10, a schematic diagram of a voice detection method according to an embodiment of the present application is provided. Assume that the speech uttered by the object contains 4 pieces of voice data, each corresponding to one round of the voice detection method, and that 3 rounds of voice detection have already been executed using a voice encoder, a target context feature extraction model and a classification model, yielding the historical voice features Y1, Y2 and Y3 of the historical voice data X1, X2 and X3, and the historical classifications 1, 2 and 3 of the historical voice data X1, X2 and X3 respectively; the flow of execution is as follows,
specifically, in the first round of speech detection T1, the history speech data X1 is transmitted to the speech encoder to obtain the history speech feature Y1, the history speech feature Y1 is stored in the designated storage area, the history speech feature Y1 is transmitted to the target context feature extraction model to obtain the target context feature of the history speech feature Y1, and the target context feature of the history speech feature Y1 is transmitted to the classification model to obtain the target class 1.
In the second-round voice detection T2, the history voice data X2 is transmitted to the voice encoder to obtain the history voice feature Y2, the history voice feature Y2 is stored in a designated storage area, the history voice feature Y1 is extracted from the designated storage area, the history voice feature Y1 and the history voice feature Y2 are transmitted to the target context feature extraction model to obtain the target context feature of the history voice feature Y2, and the target context feature of the history voice feature Y2 is transmitted to the classification model to obtain the target classification 2.
In the third-round voice detection T3, the history voice data X3 is transmitted to the voice encoder to obtain the history voice feature Y3, the history voice feature Y3 is stored in the designated storage area, the history voice feature Y2 is extracted from the designated storage area, the history voice feature Y2 and the history voice feature Y3 are transmitted to the target context feature extraction model to obtain the target context feature of the history voice feature Y3, and the target context feature of the history voice feature Y3 is transmitted to the classification model to obtain the target classification 3.
After the voice data X4 to be detected is received, the fourth round of voice detection is performed: in the fourth-round voice detection T4, the voice data X4 to be detected is transmitted to the voice encoder to obtain the initial voice feature Y4, the initial voice feature Y4 is stored in the designated storage area, the historical voice feature Y3 is extracted from the designated storage area, the initial voice feature Y4 and the historical voice feature Y3 are transmitted to the target context feature extraction model to obtain the target context feature of the initial voice feature Y4, and the target context feature of the initial voice feature Y4 is transmitted to the classification model, thereby obtaining the target classification.
In this flow, the number of the history speech features obtained from the designated storage area may be plural at a time, for example, in the third-round speech detection, the history speech features Y1 and Y2 are obtained from the designated storage area, and transmitted to the target context feature extraction model together with the history speech features Y3, so that the target context features of the history speech features Y3 are obtained. That is, the positive integer N of the extracted history speech feature is not particularly limited here, and may be set as needed.
The method stores the coded historical voice characteristics, can directly extract the historical voice characteristics in the voice detection process, and can analyze and detect the voice data to be detected by combining the historical voice characteristics without repeated coding of the historical voice data, thereby effectively reducing the memory occupancy rate and the consumed computing resources, improving the performance of voice detection equipment and accelerating the detection speed.
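The round-by-round reuse of cached features described above can be sketched as follows; encode, extract_context and classify_feature stand in for the voice encoder, the target context feature extraction model and the classification model, and are assumptions rather than the patented components.

```python
from collections import deque

N = 3
feature_cache = deque(maxlen=N)            # designated storage area

def detect_round(voice_data, encode, extract_context, classify_feature):
    initial_feature = encode(voice_data)               # encode only the new voice data
    sequence = list(feature_cache) + [initial_feature] # history is reused, not re-encoded
    context_feature = extract_context(sequence)
    target_class = classify_feature(context_feature)
    feature_cache.append(initial_feature)              # becomes history for the next round
    return target_class
```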
Based on the above-mentioned method flow of fig. 3, an embodiment of the present application provides a method for extracting a target context feature, as shown in fig. 11, in step 304, extracting a target context feature of an initial voice feature included in a voice feature sequence, including:
Inputting the voice feature sequence into a delay neural network with an M-layer network, where for the i-th layer network, with i greater than or equal to 1 and less than or equal to M, the following operations are performed:
step 1101, when i=1, performing feature extraction on the initial voice feature and N historical voice features based on the layer 1 network to obtain a first layer context feature of the initial voice feature;
Step 1102, when i is greater than or equal to 2 and less than or equal to M, extracting the N (i-1)-layer historical output features stored by the (i-1)-th layer network in the specified historical stage, and, based on the i-th layer network and in combination with the N (i-1)-layer historical output features, performing feature extraction on the (i-1)-layer context features output by the (i-1)-th layer network to obtain the i-layer context features; the N (i-1)-layer historical output features are obtained by passing the N historical voice features through the (i-1)-th layer network of the M-layer network; and taking the finally output M-layer context features as the target context features.
In an embodiment, as shown in fig. 12, which is a schematic structural diagram of a delay neural network with an M-layer network according to an embodiment of the present application, assuming that N=4 and M=3, the following steps are performed:
when i=1, based on the layer 1 network, feature extraction is performed on the voice feature sequence consisting of the initial voice feature X(t) and the 4 historical voice features X(t-1), X(t-2), X(t-3) and X(t-4), obtaining the first-layer context feature of the initial voice feature.
When i=2, based on the layer 2 network, the first layer context feature is extracted by combining the history output feature of the layer 1 network output obtained in the history stage, and the layer 2 context feature of the initial voice feature is obtained. In the history stage, the layer 1 network stores the output layer 1 history output characteristics, and can be used for the current round of voice detection.
When i=3, based on the layer 3 network, in combination with the history output feature of the layer 2 network output obtained in the history stage, feature extraction is performed on the layer 2 context feature, and the layer 3 context feature of the initial speech feature is obtained. In the history stage, the layer 2 network stores the output layer 2 history output characteristics, and can be used for the current round of voice detection.
When i=4, based on the layer 4 network, in combination with the history output feature of the layer 3 network output obtained in the history stage, feature extraction is performed on the layer 3 context feature, and a layer 4 context feature of the initial speech feature is obtained, where the layer 4 context feature is the target context feature. In the history stage, the layer 3 network stores the output layer 3 history output characteristics, and can be used for the current round of voice detection.
Note that, the N, M may be any positive integer, and may be set according to specific needs during application, which is not limited herein.
In an embodiment, based on the foregoing embodiment, a schematic structural diagram of a delay neural network with an M-layer network is provided in this embodiment, as shown in fig. 13. Each delay neural network layer consists of a one-dimensional causal hole convolution, a batch normalization layer and a ReLU activation function; the output of the last delay neural network layer passes through a classification layer and yields the target classification of the voice data to be detected, namely the posterior probability of the target keyword, and the model is regarded as having detected the target keyword if this posterior probability exceeds a predetermined threshold. Assuming N=4 and M=3, where the shaded squares indicate the hole positions in a delay neural network layer (here the 2nd and 4th feature positions) and the features at the hole positions do not participate in the convolution, the following steps are executed:
when i=1, based on the layer 1 network, feature extraction is performed on the voice feature sequence consisting of the initial voice feature X1(t) and the 4 historical voice features X1(t-1), X1(t-2), X1(t-3) and X1(t-4), where the historical voice features X1(t-1) and X1(t-3) do not participate in the convolution; the features obtained by the convolution are normalized and activated to obtain the first-layer context feature of the initial voice feature.

When i=2, based on the layer 2 network, feature extraction is performed on the first-layer context feature in combination with the 4 layer-1 historical output features of the layer 1 network obtained in the historical stage, X2(t-1), X2(t-2), X2(t-3) and X2(t-4), where the layer-1 historical output features X2(t-1) and X2(t-3) do not participate in the convolution; the features obtained by the convolution are normalized and activated to obtain the layer-2 context feature of the initial voice feature. In the historical stage, the layer 1 network stored its output layer-1 historical output features, which can be used in the current round of voice detection.

When i=3, based on the layer 3 network, feature extraction is performed on the layer-2 context feature in combination with the 4 layer-2 historical output features of the layer 2 network obtained in the historical stage, X3(t-1), X3(t-2), X3(t-3) and X3(t-4), where the layer-2 historical output features X3(t-1) and X3(t-3) do not participate in the convolution; the features obtained by the convolution are normalized and activated to obtain the layer-3 context feature of the initial voice feature. In the historical stage, the layer 2 network stored its output layer-2 historical output features, which can be used in the current round of voice detection.

When i=4, based on the layer 4 network, feature extraction is performed on the layer-3 context feature in combination with the 4 layer-3 historical output features of the layer 3 network obtained in the historical stage, X4(t-1), X4(t-2), X4(t-3) and X4(t-4), where the layer-3 historical output features X4(t-1) and X4(t-3) do not participate in the convolution; the features obtained by the convolution are normalized and activated to obtain the layer-4 context feature of the initial voice feature. In the historical stage, the layer 3 network stored its output layer-3 historical output features, which can be used in the current round of voice detection.
The one-dimensional causal hole convolution is a special one-dimensional convolution layer whose computation proceeds along the time axis of the input features; causality means that the output at each moment depends only on the current input and the historical input voice features (corresponding to historical voice data). In the figure above, the convolution kernel size of each delay neural network layer is 3 and the number of holes is 2. It should be noted that N and M may be any positive integers and may be set according to specific needs, and each neural network layer may use a one-dimensional causal hole convolution or a multidimensional causal hole convolution, which is not limited here.
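For illustration, the following numpy sketch implements one such delay neural network layer: a one-dimensional causal hole (dilated) convolution with kernel size 3 and dilation 2 (so positions t-1 and t-3 are skipped), followed by a simplified batch normalization and a ReLU activation; all shapes and parameter values are assumptions.

```python
import numpy as np

def causal_dilated_conv1d(x, weight, bias, dilation=2):
    """x: (time, channels_in); weight: (kernel, channels_in, channels_out)."""
    kernel, _, c_out = weight.shape
    pad = dilation * (kernel - 1)
    x_pad = np.concatenate([np.zeros((pad, x.shape[1])), x], axis=0)  # causal left padding
    y = np.zeros((x.shape[0], c_out))
    for t in range(x.shape[0]):
        for k in range(kernel):
            # taps at t, t - dilation, t - 2*dilation: positions t-1 and t-3 are holes
            y[t] += x_pad[t + pad - (kernel - 1 - k) * dilation] @ weight[k]
    return y + bias

def batch_norm(x, eps=1e-5):
    # simplified normalization over the time axis, for illustration only
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def relu(x):
    return np.maximum(x, 0.0)

x = np.random.randn(5, 8)                  # 1 initial + 4 historical features, 8-dim each
w = np.random.randn(3, 8, 16) * 0.1        # kernel size 3, 8 in-channels, 16 out-channels
layer_out = relu(batch_norm(causal_dilated_conv1d(x, w, np.zeros(16))))
```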
Based on the above method flow in fig. 11, the embodiment of the present application provides a feature storage method in voice detection, which further includes:
and storing the i-layer context features in the storage area corresponding to the i-th layer network, to serve as i-layer historical output features in the next round of voice detection, and deleting the i-layer context features that were stored in that storage area during the earliest round of voice detection, so that the number of features in the feature queue of the storage area corresponding to the i-th layer network remains N.
In one embodiment, based on the structure of the delay neural network in fig. 12 and fig. 13, the embodiment of the present application further provides a schematic structural diagram of the delay neural network, as shown in fig. 14. Still assume that N=4 and M=3 (N and M are used only to illustrate the present application clearly and do not limit its specific implementation; they may be any positive integers). The shaded squares indicate the hole positions in a delay neural network layer, here the 2nd and 4th feature positions (the number of holes in this embodiment is likewise only illustrative), and the features at the hole positions do not participate in the convolution. When performing the steps shown in fig. 12 and fig. 13, the following steps are further executed:
After the initial voice feature X1(t) is obtained, the initial voice feature X1(t) is stored in the designated storage area (in subsequent voice detection, X1(t) can then be used as a historical voice feature). The designated storage area already stores the historical voice features obtained in the historical stage: X1(t-1), X1(t-2), X1(t-3) and X1(t-4).

When i=1, the historical voice features X1(t-1), X1(t-2), X1(t-3) and X1(t-4) obtained in the historical stage are extracted from the designated storage area, and, based on the layer 1 network and in combination with these historical voice features, feature extraction is performed on the initial voice feature X1(t) to obtain the first-layer context feature of the initial voice feature. The first-layer context feature X2(t) is stored in the storage area corresponding to the layer 1 delay neural network (in subsequent voice detection, X2(t) can be used as a historical output feature). The feature queue in this storage area holds N features; when X2(t) is stored, it is placed at the end of the queue closest to the current round of detection, and, to keep the number of features in the queue fixed, the historical first-layer context feature (the historical output feature of the layer 1 delay neural network) at the end of the queue farthest from the current round of detection is deleted. The storage area corresponding to the layer 1 delay neural network already stores the layer-1 historical output features obtained in the historical stage: X2(t-1), X2(t-2), X2(t-3) and X2(t-4).

When i=2, the layer-1 historical output features X2(t-1), X2(t-2), X2(t-3) and X2(t-4) obtained in the historical stage are extracted from the storage area corresponding to the layer 1 delay neural network, and, based on the layer 2 network and in combination with these layer-1 historical output features, feature extraction is performed on the first-layer context feature X2(t) to obtain its layer-2 context feature X3(t). The layer-2 context feature X3(t) is stored in the storage area corresponding to the layer 2 delay neural network (in subsequent voice detection, X3(t) can be used as a historical output feature). The feature queue in this storage area holds N features; when X3(t) is stored, it is placed at the end of the queue closest to the current round of detection, and, to keep the number of features in the queue fixed, the historical layer-2 context feature (the historical output feature of the layer 2 delay neural network) at the end of the queue farthest from the current round of detection may be deleted. The storage area corresponding to the layer 2 delay neural network already stores the layer-2 historical output features obtained in the historical stage: X3(t-1), X3(t-2), X3(t-3) and X3(t-4).

When i=3, the layer-2 historical output features X3(t-1), X3(t-2), X3(t-3) and X3(t-4) obtained in the historical stage are extracted from the storage area corresponding to the layer 2 delay neural network, and, based on the layer 3 network and in combination with these layer-2 historical output features, feature extraction is performed on the layer-2 context feature X3(t) to obtain its layer-3 context feature X4(t). The layer-3 context feature X4(t) is stored in the storage area corresponding to the layer 3 delay neural network (in subsequent voice detection, X4(t) can be used as a historical output feature). The feature queue in this storage area holds N features; when X4(t) is stored, it is placed at the end of the queue closest to the current round of detection, and, to keep the number of features in the queue fixed, the historical layer-3 context feature (the historical output feature of the layer 3 delay neural network) at the end of the queue farthest from the current round of detection may be deleted. The storage area corresponding to the layer 3 delay neural network already stores the layer-3 historical output features obtained in the historical stage: X4(t-1), X4(t-2), X4(t-3) and X4(t-4).

When i=4, the layer-3 historical output features X4(t-1), X4(t-2), X4(t-3) and X4(t-4) obtained in the historical stage are extracted from the storage area corresponding to the layer 3 delay neural network, and, based on the layer 4 network and in combination with these layer-3 historical output features, feature extraction is performed on the layer-3 context feature X4(t) to obtain the target context feature of the layer-3 context feature X4(t).
Based on the above-mentioned method flows in fig. 11 and fig. 3, the embodiment of the present application provides a quantization scaling method, as shown in fig. 15, before extracting the target context feature of the initial speech feature included in the speech feature sequence in step 304, further including:
step 1501, if the initial speech feature and the N historical speech features are represented by floating point data types, converting the maximum floating point number in the initial speech feature and the N historical speech features into integer data types to obtain initial scaling coefficients;
Step 1502, forward quantization scaling is performed on the initial voice feature and the N historical voice features according to the initial scaling coefficient, so as to obtain the initial voice feature and the N historical voice features of the integer data type;
then, in step 1101, after obtaining the first-layer context feature of the initial speech feature, further comprises:
Step 1503, performing inverse quantization scaling on the first-layer context features of the integer data type based on the initial scaling coefficient to obtain the first-layer context features of the floating-point data type.
In one embodiment, if the initial speech feature and the N historical speech features are both represented by floating point data types, the maximum floating point number is selected from the initial speech feature and the N historical speech features, and the maximum floating point number is converted into integer data types, and in the conversion process, a converted initial scaling factor can be obtained.
For example, assume that the integer data type is 8-bit integer, and that the initial voice feature and the N historical voice features are X1_float. The initial scaling coefficient is Q1_scale = 127 ÷ max(abs(X1_float)); that is, the maximum floating point number is selected from the absolute values of the initial voice feature and the N historical voice features, it is converted into 8-bit integer data, and the initial scaling coefficient required in this process is determined.

Correspondingly, forward quantization scaling is performed on the initial voice feature and the N historical voice features with the initial scaling coefficient: X1_quant = round(X1_float × Q1_scale), where round denotes rounding to the integer data type, obtaining the initial voice feature and the N historical voice features X1_quant of the integer data type.

After the initial scaling coefficient is obtained, in order to ensure that the layer 1 delay neural network can correspondingly process features of the integer data type, the model parameters of the layer 1 network are quantization-shifted based on the initial voice feature and the N historical voice features X1_quant of the integer data type. For example, assume the layer 1 delay neural network contains a one-dimensional convolution whose model parameters are the weight W1_float and the bias b1_float; these model parameters are quantization-shifted to the same integer data type as X1_quant. In this example, the weight W1_float and the bias b1_float are quantization-shifted to the 8-bit integer data type, yielding the quantized parameters W1_quant and b1_quant and the quantization shift parameters W1_shift and b1_shift, computed as follows:

W1_shift = 7 − ceil(log2(max(abs(W1_float))));

W1_quant = round(W1_float × 2^W1_shift);

b1_shift = 7 − ceil(log2(max(abs(b1_float))));

b1_quant = round(b1_float × 2^b1_shift);

where ceil denotes rounding up to the nearest integer.

Thereafter, X1_quant is input into the layer 1 delay neural network to obtain the first-layer context feature of the integer data type, Y1_quant = W1_quant × X1_quant. The initial scaling coefficient Q1_scale and the quantization shift parameters (the weight quantization shift parameter W1_shift and the bias quantization shift parameter b1_shift) are then used to perform inverse quantization scaling on the first-layer context feature of the integer data type: Y1_float = Y1_quant × 2^(−W1_shift) ÷ Q1_scale + b1_quant × 2^(−b1_shift), obtaining the first-layer context feature of the floating point data type.
In this method, the input initial voice feature and historical voice features are quantized and scaled, yielding initial and historical voice features of the integer data type, so that the layer 1 delay neural network can compute on integer features. This addresses the problem in the related art that floating point feature data occupy more bits, which leads to a large model computation scale and a large memory footprint.
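The quantization chain described above (input scaling, parameter quantization shift, integer-domain computation, inverse quantization scaling) can be sketched in numpy as follows; the tensor shapes are assumptions, and the formulas mirror those given above.

```python
import numpy as np

def quantize_layer(x_float, w_float, b_float):
    # forward quantization of the input features to the 8-bit range
    q_scale = 127.0 / np.max(np.abs(x_float))
    x_quant = np.round(x_float * q_scale)

    # quantization shift of the model parameters (weight and bias)
    w_shift = 7 - np.ceil(np.log2(np.max(np.abs(w_float))))
    b_shift = 7 - np.ceil(np.log2(np.max(np.abs(b_float))))
    w_quant = np.round(w_float * (2.0 ** w_shift))
    b_quant = np.round(b_float * (2.0 ** b_shift))

    # integer-domain computation followed by inverse quantization scaling
    y_quant = x_quant @ w_quant
    y_float = y_quant * (2.0 ** -w_shift) / q_scale + b_quant * (2.0 ** -b_shift)
    return y_float

x = np.random.randn(5, 8).astype(np.float32)        # 1 initial + 4 historical features
w = np.random.randn(8, 16).astype(np.float32) * 0.1  # assumed convolution/linear weight
b = np.random.randn(16).astype(np.float32) * 0.01    # assumed bias
y = quantize_layer(x, w, b)
```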
Based on the above method flow in fig. 11, the embodiment of the present application provides a quantization scaling method, as shown in fig. 16, in step 1102, further including:
Step 1601, when i is greater than or equal to 2 and less than or equal to M, if the (i-1)-layer context feature and the N (i-1)-layer historical output features are represented by floating point data types, converting the maximum floating point number in the (i-1)-layer context feature and the N (i-1)-layer historical output features into the integer data type to obtain the i-layer scaling coefficient;
Step 1602, performing forward quantization scaling on the i-1 layer context characteristics and N i-1 layer history output characteristics of the floating point type data type according to the i-layer scaling coefficient to obtain i-1 layer context characteristics and N i-1 layer history output characteristics of the integer type data type;
then in step 1102, after obtaining the i-layer context feature, further comprises:
Step 1603, performing inverse quantization scaling on the i-layer context features of the integer data type based on the i-layer scaling coefficient to obtain the i-layer context features of the floating-point data type.
In one embodiment, as shown in fig. 17, the embodiment of the present application provides a schematic diagram of a quantization scaling method. In a delay neural network with an M-layer network, before the input data of each delay neural network layer enters that layer, the input data of the floating point data type is first forward quantization scaled to obtain input data of the integer data type, and the scaling coefficient of that layer used in the conversion is obtained; the input data then passes through the delay neural network layer to obtain output data of the integer data type, and the scaling coefficient of that layer is used to convert the integer output data back into output data of the floating point data type. Therefore, the delay neural network with the M-layer network processes data of integer data types; compared with a model that processes floating point data, it is smaller in scale, occupies less memory and consumes fewer computing resources, which effectively saves resources and improves device performance.
In addition, in the related art, after a model is put into use, the input data specification may differ across usage scenarios, that is, the floating point input data fluctuates greatly between scenarios. To ensure detection accuracy, calibration parameters must be obtained manually and the model parameters recalibrated each time the usage scenario changes; the operation is complex, the model cannot adapt to the application scenario automatically, and the applicability is poor. For example, the floating point values of the features corresponding to voice data often differ greatly between a quiet scene and a noisy scene: the calibration data may be 10 in the quiet scene and 300 in the noisy scene, both obtained by manual calculation, and the model parameters are calibrated according to this calibration data to ensure detection accuracy in the different scenes.
In the application, by adopting the method, the maximum floating point number is selected from the input data, the scaling coefficient is obtained according to the maximum floating point number, the input data of the model is converted into the integer data type, and the model parameters are converted into the same integer data type, so that the adjustment of the input data and the model parameters based on the input data is realized, the input data and the model parameters are matched, the problem of poor scene applicability is solved, and the model applicability is effectively improved.
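A small illustration of this point, under assumed feature magnitudes for the quiet and noisy scenes: because the scaling coefficient is recomputed from each input's maximum absolute value, both kinds of input are mapped onto the same 8-bit range without any manual calibration.

```python
import numpy as np

def dynamic_scale(x_float):
    return 127.0 / np.max(np.abs(x_float))

quiet_features = np.random.uniform(-10, 10, size=(5, 8))     # assumed quiet-scene range
noisy_features = np.random.uniform(-300, 300, size=(5, 8))   # assumed noisy-scene range

for features in (quiet_features, noisy_features):
    scale = dynamic_scale(features)
    quantized = np.round(features * scale)
    print(scale, quantized.min(), quantized.max())            # both stay within [-127, 127]
```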
In one embodiment, assume that the integer data type is 8-bit integer (the integer data type referred to in the embodiments of the present application may be 8-bit integer, 16-bit integer, etc.; the specific integer data type is not limited here and may be set as needed), and that the (i-1)-layer context feature and the N (i-1)-layer historical output features are Xi_float. The i-layer scaling coefficient is Qi_scale = 127 ÷ max(abs(Xi_float)); that is, the maximum floating point number is selected from the absolute values of the (i-1)-layer context feature and the N (i-1)-layer historical output features, it is converted into 8-bit integer data, and the i-layer scaling coefficient required in this process is determined.

Correspondingly, forward quantization scaling is performed on the (i-1)-layer context feature and the N (i-1)-layer historical output features with the i-layer scaling coefficient: Xi_quant = round(Xi_float × Qi_scale), where round denotes rounding to the integer data type, obtaining the (i-1)-layer context feature and the N (i-1)-layer historical output features Xi_quant of the integer data type.

After the i-layer scaling coefficient is obtained, in order to ensure that the i-th layer delay neural network can correspondingly process features of the integer data type, the model parameters of the i-th layer network are quantization-shifted based on the (i-1)-layer context feature and the N (i-1)-layer historical output features Xi_quant of the integer data type. For example, assume the i-th layer delay neural network contains a one-dimensional convolution whose model parameters are the weight Wi_float and the bias bi_float; these model parameters are quantization-shifted to the same integer data type as Xi_quant. In this example, the weight Wi_float and the bias bi_float are quantization-shifted to the 8-bit integer data type, yielding the quantized parameters Wi_quant and bi_quant and the quantization shift parameters Wi_shift and bi_shift, computed as follows:

Wi_shift = 7 − ceil(log2(max(abs(Wi_float))));

Wi_quant = round(Wi_float × 2^Wi_shift);

bi_shift = 7 − ceil(log2(max(abs(bi_float))));

bi_quant = round(bi_float × 2^bi_shift);

where ceil denotes rounding up to the nearest integer.

Thereafter, Xi_quant is input into the i-th layer delay neural network to obtain the i-layer context feature of the integer data type, Yi_quant = Wi_quant × Xi_quant. The i-layer scaling coefficient Qi_scale and the quantization shift parameters (the weight quantization shift parameter Wi_shift and the bias quantization shift parameter bi_shift) are then used to perform inverse quantization scaling on the i-layer context feature of the integer data type: Yi_float = Yi_quant × 2^(−Wi_shift) ÷ Qi_scale + bi_quant × 2^(−bi_shift), obtaining the i-layer context feature of the floating point data type.
In this method, the input (i-1)-layer context feature and the N (i-1)-layer historical output features are quantized and scaled, yielding an (i-1)-layer context feature and N (i-1)-layer historical output features of the integer data type, so that the i-th layer delay neural network can compute on integer features. This addresses the problem in the related art that floating point feature data occupy more bits, which leads to a large model computation scale and a large memory footprint.
Based on the above-mentioned method flows in fig. 11, fig. 15, and fig. 16, the embodiment of the present application provides a model parameter quantization scaling method, as shown in fig. 18, in step 1101, before feature extraction is performed on initial speech features and N historical speech features based on a layer 1 network, the method further includes:
step 1801, quantitatively shifting model parameters of the layer 1 network based on initial voice features and N historical voice features of integer data types;
based on the i-layer network, combining N i-1 layer historical output characteristics, and before extracting the characteristics of the i-1 layer context characteristics output by the i-1 layer network, further comprising:
step 1802, performing quantization shift on the model parameters of the i-th layer network based on the i-1 layer context feature and the N i-1 layer history output features of the integer data type.
In one embodiment, in a model of a delay neural network with M layers of networks, when input data of any layer of delay neural network is a floating point type data type, the input data is converted into input data of an integer type data type, and model parameters of the layer of delay neural network are quantitatively shifted to obtain model parameters of the integer type data type. Therefore, the model of the delay neural network with the M-layer network in the method can be suitable for detection and analysis of voice data in various scenes, such as voice data in a quiet scene or voice data in a noisy scene, even if the floating point data types of the two voice data have very different values, the problem that model parameters cannot be simultaneously suitable for various voice data with relatively large differences can be solved, and the model robustness is improved.
Based on the above-mentioned method flows in fig. 3, 11, 15, 16, and 18, the embodiment of the present application provides a training method for a speech detection model, and if the above-mentioned method is performed by a target speech detection model, the training process of the target speech detection model is as follows:
performing multiple rounds of iterative training on a voice detection model to be trained based on a preset training sample set, wherein each training sample comprises sample voices to be detected, N historical sample voices and a classification label, and the sample voices to be detected and the N historical sample voices are ordered according to acquisition time; wherein, in a round of iterative process, the following operations are performed:
step 1, respectively extracting characteristics of sample voices to be detected and N historical sample voices in a training sample to obtain characteristics of the voices to be detected and N sample voices;
step 2, extracting sample context characteristics of the voice characteristics to be detected from the voice characteristics to be detected and N sample voice characteristics; wherein, sample context feature characterization: semantic relationships between the voice features to be detected and the N sample voice features;
and step 3, obtaining a classification result based on the sample context characteristics, and adjusting parameters of the voice detection model according to the classification result and the difference of the classification labels of one training sample.
In one embodiment, during the training of the voice detection model, caching data such as the historical voice features and the i-layer historical output features used when the model is applied as in figs. 3, 11, 16 and 18 may be unnecessary: the devices generally used for training have sufficient memory, and, to ensure training efficiency, the sample voice to be detected and the N historical sample voices in a training sample are obtained at the same time, so the device does not need to receive the voice data to be detected round by round and combine it with historical voice data for analysis and detection, as it does at application time.
In one embodiment, the sample voice to be detected and the N historical sample voices in the training sample may be a plurality of voice data of a section of voice, for example, "XX air conditioner, please adjust the temperature to 29 °", the classification label may be 29 °, and the corresponding voice command may cause the preset program to adjust the air conditioner temperature to 29 °.
In one embodiment, the voice detection model may be a voice detection model including an M-layer delay neural network and a classification layer, which are involved in the above method flows and embodiments, or may also be a voice detection model of other support vector machines and pre-training models, where the specific structure of the voice detection model is not limited, and may be set as required.
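For illustration only, one training iteration as outlined above might look as follows in PyTorch; the model here is a stand-in (an LSTM context extractor plus a linear classification layer), and the sample shapes, optimizer and loss function are assumptions rather than the patented training setup.

```python
import torch
import torch.nn as nn

feature_dim, hidden_dim, num_classes, n_history = 64, 128, 10, 4

class SpeechDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.context = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, sequence):                     # (batch, N + 1, feature_dim)
        outputs, _ = self.context(sequence)
        return self.classifier(outputs[:, -1, :])    # sample context feature -> logits

model = SpeechDetector()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# one training sample: N historical sample voices plus the sample voice to detect,
# here assumed to be already encoded into feature vectors
sample_features = torch.randn(1, n_history + 1, feature_dim)
label = torch.tensor([3])                            # classification label

logits = model(sample_features)
loss = criterion(logits, label)                      # difference from the classification label
optimizer.zero_grad()
loss.backward()
optimizer.step()                                     # adjust model parameters
```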
Based on the above methods and embodiments, the present application provides experimental data for a voice detection model based on a delay neural network. Table 1 below shows the changes in the keyword detection rate (the proportion of keywords correctly detected by the voice detection model) and in the CPU and memory occupancy when the dynamic quantization of the method flows of figs. 11, 15 and 16 is applied, under various scenarios and in particular under complex acoustic scenarios such as noisy far-field conditions; the CPU occupancy was measured on a single-core 240 MHz real-time system, a platform with low resource requirements. The dynamic quantization in the method flows of figs. 11, 15 and 16 can effectively improve model performance under these scenarios.
Based on the same concept, the embodiment of the present application provides a voice detection apparatus 1900, as shown in fig. 19, including:
a first feature extraction unit 1901, configured to perform feature extraction on voice data to be detected, so as to obtain an initial voice feature;
A data access unit 1902, configured to extract N historical speech features obtained in a specified historical stage from the specified storage area, where a time interval between the specified historical stage and a current time accords with a preset interval condition, and N is a positive integer; sequencing the initial voice features and the N historical voice features according to the acquisition time sequence of the corresponding voice data to obtain a corresponding voice feature sequence;
a second feature extraction unit 1903, configured to extract a target context feature of the initial speech feature included in the speech feature sequence; wherein the target context feature characterizes: semantic relationships between the initial speech feature and the N historical speech features;
and a classification unit 1904, configured to obtain, based on the target context feature, a target classification of the voice data to be detected, obtain a target keyword set corresponding to the target classification, and store the initial voice feature in a designated storage area as a historical voice feature in the next round of voice detection.
Optionally, the second feature extraction unit 1903 is specifically configured to,
inputting the voice characteristic sequence into a delay neural network with an M-layer network, wherein in the i-layer network, i is more than or equal to 1 and less than or equal to M, and executing the following operations:
When i=1, performing feature extraction on the initial voice feature and the N historical voice features based on a layer 1 network to obtain a first-layer context feature of the initial voice feature;
when i is greater than or equal to 2 and less than or equal to M, extracting the N (i-1)-layer historical output features stored by the (i-1)-th layer network in the specified historical stage, and, based on the i-th layer network and in combination with the N (i-1)-layer historical output features, performing feature extraction on the (i-1)-layer context features output by the (i-1)-th layer network to obtain the i-layer context features; the N (i-1)-layer historical output features are obtained by passing the N historical voice features through the (i-1)-th layer network of the M-layer network;
and taking the finally output M-layer context characteristics as the target context characteristics.
Optionally, the data access unit 1902 is further configured to,
and storing the i-layer context characteristics in a storage area corresponding to the i-layer network as i-layer history output characteristics in the next voice detection, and deleting the i-layer context characteristics of the history stored in the storage area corresponding to the i-layer network during the earliest voice detection, so that the number of the characteristics in the characteristic queues in the storage area corresponding to the i-layer network is N.
Optionally, the second feature extraction unit 1903 is further configured to,
if the initial voice feature and the N historical voice features are represented by floating point data types, converting the maximum floating point number in the initial voice feature and the N historical voice features into integer data types to obtain initial scaling coefficients;
forward quantization scaling is carried out on the initial voice characteristics and the N historical voice characteristics according to the initial scaling coefficient, so that the initial voice characteristics and the N historical voice characteristics of the integer data type are obtained;
then after the obtaining the first-layer context feature of the initial speech feature, further comprising:
and carrying out inverse quantization scaling on the first-layer context features of the integer data type based on the initial scaling coefficient to obtain the first-layer context features of the floating-point data type.
Optionally, the second feature extraction unit 1903 is further configured to,
when i is more than or equal to 2 and less than or equal to M, if the i-1 layer context feature and the N i-1 layer history output features are represented by floating point data types, converting the maximum floating point number in the i-1 layer context feature and the N i-1 layer history output features into integer data types to obtain i layer scaling coefficients;
According to the i-layer scaling coefficient, forward quantization scaling is carried out on the i-1 layer context characteristics and the N i-1 layer history output characteristics of the floating point type data type, so that i-1 layer context characteristics and N i-1 layer history output characteristics of the integer type data type are obtained;
then after the i-layer context feature is obtained, further comprising:
and carrying out inverse quantization scaling on the i-layer context features of the integer data type based on the i-layer scaling coefficient to obtain the i-layer context features of the floating-point data type.
Optionally, the second feature extraction unit 1903 is further configured to,
based on the initial voice feature and N historical voice features of the integer data type, carrying out quantization shift on model parameters of the layer 1 network;
the method further comprises the steps of, before the i-1 layer context characteristics output by the i-1 layer network are extracted by combining the N i-1 layer history output characteristics based on the i-1 layer network:
and carrying out quantization shift on the model parameters of the i-layer network based on the i-1 layer context characteristics and N i-1 layer historical output characteristics of the integer data type.
Optionally, the classifying unit 1904 is specifically configured to,
and mapping the target context characteristics by adopting a fully-connected neural network to obtain target classification of the voice data to be detected.
Optionally, the training unit 1905 is specifically configured to: the method is executed through a target voice detection model, and the training process of the target voice detection model is as follows:
performing multiple rounds of iterative training on a voice detection model to be trained based on a preset training sample set, wherein each training sample comprises sample voices to be detected, N historical sample voices and a classification label, and the sample voices to be detected and the N historical sample voices are ordered according to acquisition time; wherein, in a round of iterative process, the following operations are performed:
respectively extracting characteristics of sample voices to be detected and N historical sample voices in a training sample to obtain characteristics of the voices to be detected and N sample voices;
extracting sample context characteristics of the voice characteristics to be detected from the voice characteristics to be detected and the N sample voice characteristics; wherein the sample context feature characterizes: semantic relationships between the to-be-detected voice features and the N sample voice features;
based on the sample context characteristics, a classification result is obtained, and parameters of the voice detection model are adjusted according to the classification result and the difference of the classification labels of the training sample.
Based on the same inventive concept as the above-mentioned method embodiments, a computer device is also provided in the embodiments of the present application. In one embodiment, the computer device may be a server, such as server 220 shown in FIG. 2. In this embodiment, the computer device may be configured as shown in FIG. 20, including a memory 2001, a communication module 2003, and one or more processors 2002.
A memory 2001 for storing a computer program for execution by the processor 2002. The memory 2001 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, a program required for running an instant messaging function, and the like; the storage data area can store various instant messaging information, operation instruction sets and the like.
The memory 2001 may be a volatile memory such as a random-access memory (RAM); the memory 2001 may also be a non-volatile memory such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); or the memory 2001 may be any other medium that can be used to carry or store a desired computer program in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 2001 may also be a combination of the above.
The processor 2002 may include one or more central processing units (central processing unit, CPU) or digital processing units, or the like. The processor 2002 is used to implement the above-described voice detection method when calling the computer program stored in the memory 2001.
The communication module 2003 is used for communication with the terminal device and other servers.
The specific connection medium between the memory 2001, the communication module 2003 and the processor 2002 is not limited in the embodiments of the present application. In the embodiment of the present application, the memory 2001 and the processor 2002 are connected by the bus 2004 in fig. 20, where the bus 2004 is depicted by a thick line; the connection manner between other components is only schematically illustrated and is not limited thereto. The bus 2004 may be divided into an address bus, a data bus, a control bus, and the like. For ease of description, only one thick line is depicted in fig. 20, but this does not mean that there is only one bus or only one type of bus.
The memory 2001 stores therein a computer storage medium in which computer executable instructions for implementing the voice detection method of the embodiment of the present application are stored. The processor 2002 is configured to perform the above-described voice detection method, as shown in fig. 3, 11, 15, 16, and 18.
In another embodiment, the computer device may also be other computer devices, such as the terminal device 210 shown in FIG. 2. In this embodiment, the structure of the computer device may include, as shown in fig. 21: communication component 2110, memory 2120, display unit 2130, camera 2140, sensor 2150, audio circuitry 2160, bluetooth module 2170, processor 2180, and the like.
The communication component 2110 is for communicating with a server. In some embodiments, it may include a wireless fidelity (Wireless Fidelity, WiFi) module; WiFi is a short-range wireless transmission technology, and the computer device can help the user send and receive information through the WiFi module.
Memory 2120 may be used to store software programs and data. The processor 2180 executes various functions and data processing of the terminal device 210 by executing software programs or data stored in the memory 2120. Memory 2120 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. The memory 2120 stores an operating system that enables the terminal device 210 to operate. The memory 2120 in the present application may store an operating system and various application programs, and may also store a computer program for executing the voice detection method in the embodiment of the present application.
The display unit 2130 may also be used to display information input by a user or information provided to the user and a graphical user interface (graphical user interface, GUI) of various menus of the terminal device 210. In particular, the display unit 2130 may include a display screen 2132 disposed on a front side of the terminal device 210. The display 2132 may be configured in the form of a liquid crystal display, light emitting diodes, or the like. The display unit 2130 may be used to display a voice detection method user interface or the like in the embodiment of the present application.
The display unit 2130 may also be used to receive input numeric or character information and to generate signal inputs related to user settings and function control of the terminal device 210. Specifically, the display unit 2130 may include a touch screen 2131 disposed on the front of the terminal device 210, which may collect the user's touch operations on or near it, such as clicking buttons and dragging scroll boxes.
The touch screen 2131 may cover the display screen 2132, or the touch screen 2131 may be integrated with the display screen 2132 to implement input and output functions of the terminal device 210, and after integration, the touch screen may be simply referred to as a touch screen. The display unit 2130 in the present application may display an application program and corresponding operation steps.
The camera 2140 may be used to capture still images, and a user may post comments on the image captured by the camera 2140 through an application. The camera 2140 may be one or more. The object generates an optical image through the lens and projects the optical image onto the photosensitive element. The photosensitive element may be a charge coupled device (charge coupled device, CCD) or a Complementary Metal Oxide Semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, which is then transferred to the processor 2180 for conversion into a digital image signal.
The terminal device may further comprise at least one sensor 2150, such as an acceleration sensor 2151, a distance sensor 2152, a fingerprint sensor 2153, and a temperature sensor 2154. The terminal device may also be configured with other sensors, such as a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, a light sensor, and a motion sensor.
The audio circuitry 2160, speaker 2161 and microphone 2162 may provide an audio interface between the user and the terminal device 210. The audio circuit 2160 may transmit the electrical signal converted from the received audio data to the speaker 2161, and the speaker 2161 converts the electrical signal into a sound signal for output. The terminal device 210 may also be configured with a volume button for adjusting the volume of the sound signal. In the other direction, the microphone 2162 converts collected sound signals into electrical signals, which are received by the audio circuit 2160 and converted into audio data; the audio data are then output to the communication component 2110 for transmission to, for example, another terminal device 210, or output to the memory 2120 for further processing.
The bluetooth module 2170 is used for exchanging information with other bluetooth devices having bluetooth modules through bluetooth protocol. For example, the terminal device may establish a bluetooth connection with a wearable computer device (e.g., a smart watch) also provided with a bluetooth module through the bluetooth module 2170, thereby performing data interaction.
The processor 2180 is the control center of the terminal device: it connects the various parts of the entire terminal using various interfaces and lines, and performs the various functions of the terminal device and processes data by running or executing software programs stored in the memory 2120 and calling data stored in the memory 2120. In some embodiments, the processor 2180 may include one or more processing units; the processor 2180 may also integrate an application processor, which mainly handles the operating system, user interface, applications and the like, and a baseband processor, which mainly handles wireless communications. It will be appreciated that the baseband processor may not be integrated into the processor 2180. The processor 2180 may run the operating system, application programs, user interface displays and touch responses, as well as the voice detection method of the embodiments of the present application. In addition, the processor 2180 is coupled to the display unit 2130.
In some possible embodiments, aspects of the voice detection method provided herein may also be implemented in the form of a program product comprising a computer program which, when the program product is run on a computer device, causes the computer device to perform the steps of the voice detection method according to the various exemplary embodiments of the present application described above; for example, the computer device may perform the steps shown in fig. 3, 11, 15, 16 and 18.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product of embodiments of the present application may employ a portable compact disc read only memory (CD-ROM) and comprise a computer program and may run on a computer device. However, the program product of the present application is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with a command execution system, apparatus, or device.
The readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave in which a readable computer program is embodied. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with a command execution system, apparatus, or device.
A computer program embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer programs for performing the operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer program may execute entirely on the user's computer device, partly on the user's device, as a stand-alone software package, partly on the user's computer device and partly on a remote computer device or entirely on the remote computer device or server. In the case of remote computer devices, the remote computer device may be connected to the user computer device through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer device (for example, through the Internet using an Internet service provider).
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, the features and functions of two or more of the elements described above may be embodied in one element in accordance with embodiments of the present application. Conversely, the features and functions of one unit described above may be further divided into a plurality of units to be embodied.
Furthermore, although the operations of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart(s) and/or block diagram(s).
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart(s) and/or block diagram(s).
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart(s) and/or block diagram(s).
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (14)

1. A method of voice detection, the method comprising:
extracting features of voice data to be detected to obtain initial voice features, wherein the voice data comprises at least one voice frame;
extracting N historical voice features obtained in a specified historical stage from a specified storage area, wherein the time interval between the specified historical stage and the current moment accords with a preset interval condition, and N is a positive integer;
sequencing the initial voice features and the N historical voice features according to the acquisition time sequence of the corresponding voice data to obtain a corresponding voice feature sequence;
extracting target context characteristics of the initial voice characteristics contained in the voice characteristic sequence based on M layers of time delay neural networks which are sequentially connected in series; wherein the target context feature characterizes: semantic relationships between the initial speech feature and the N historical speech features;
when the input data of any layer of the M-layer delay neural network is of a floating point data type, obtaining a scaling coefficient according to the maximum floating point number in the floating-point input data, so as to convert the floating-point input data into input data of an integer data type; based on the integer input data, performing quantization shift on the model parameters of that layer of the delay neural network to obtain model parameters of the integer data type; and, based on the scaling coefficient, converting the integer output data of that layer of the delay neural network into output data of the floating point data type;
and based on the target context features, obtaining a target classification of the voice data to be detected, obtaining target keywords corresponding to the target classification, and storing the initial voice features into a designated storage area to serve as historical voice features in the next voice detection.
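As an illustrative aid (not part of the claims), the per-layer quantization flow of claim 1 can be sketched as follows: derive a scaling coefficient from the largest floating point value of the layer input, quantize the input and (via a power-of-two shift) the weights to integers, run the layer in integer arithmetic, and rescale the output back to floating point. The sketch below is a hedged NumPy approximation; the integer width, the rounding mode and the name tdnn_layer_int are assumptions, not the patented implementation.

    import numpy as np

    INT_BITS = 8                      # assumed integer width; the claim does not fix one
    QMAX = 2 ** (INT_BITS - 1) - 1    # 127 for 8-bit integers

    def tdnn_layer_int(x_float, weight_float, bias_float=None):
        # 1. scaling coefficient from the maximum absolute floating point value of the input
        scale_in = QMAX / max(float(np.abs(x_float).max()), 1e-8)
        x_int = np.clip(np.round(x_float * scale_in), -QMAX, QMAX).astype(np.int64)

        # 2. quantization shift of the model parameters: a power-of-two factor
        #    (realisable as a bit shift) that keeps the weights inside the integer range
        shift = max(0, int(np.floor(np.log2(QMAX / max(float(np.abs(weight_float).max()), 1e-8)))))
        w_int = np.round(weight_float * (1 << shift)).astype(np.int64)

        # 3. integer-only matrix product (the layer's core computation)
        y_int = x_int @ w_int.T

        # 4. convert the integer output back to floating point using both factors
        y_float = y_int.astype(np.float32) / (scale_in * (1 << shift))
        if bias_float is not None:
            y_float = y_float + bias_float
        return y_float

    # tiny usage example with made-up shapes
    x = np.random.randn(1, 40).astype(np.float32)            # one frame of speech features
    w = (np.random.randn(64, 40) * 0.1).astype(np.float32)   # layer weights
    print(tdnn_layer_int(x, w).shape)                         # (1, 64)

The inverse scaling at the end mirrors the claim's conversion of the layer's integer output back to floating point before it is cached or passed on.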
2. The method of claim 1, wherein said extracting target contextual features of the initial speech feature contained in the sequence of speech features comprises:
inputting the voice feature sequence into the delay neural network having an M-layer network, and, for the i-layer network, where 1 ≤ i ≤ M, executing the following operations:
when i=1, performing feature extraction on the initial voice feature and the N historical voice features based on a layer 1 network to obtain a first-layer context feature of the initial voice feature;
when 2 ≤ i ≤ M, extracting N (i-1)-layer historical output features stored by the (i-1)-layer network in the specified historical stage, and performing, based on the i-layer network, feature extraction on the (i-1)-layer context features output by the (i-1)-layer network in combination with the N (i-1)-layer historical output features, to obtain i-layer context features; the N (i-1)-layer historical output features are obtained from the N historical voice features through the (i-1)-layer network of the M-layer network;
and taking the finally output M-layer context features as the target context features.
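For orientation only, the layer-by-layer flow of claim 2 can be pictured with the hypothetical sketch below: layer 1 consumes the initial feature together with the N historical voice features, and every deeper layer consumes the previous layer's current output together with the N outputs that the previous layer cached in the specified historical stage. The class name LayeredTDNN, the callable layers and the caching details are assumptions made for illustration.

    import numpy as np

    class LayeredTDNN:
        def __init__(self, layers, n_history):
            self.layers = layers                    # one callable per layer (M in total)
            self.n_history = n_history              # N in the claims
            self.history = [[] for _ in layers]     # history[k]: last N outputs of layer k (0-based)

        def forward(self, init_feat, hist_feats):
            # layer 1 sees the initial feature plus the N historical voice features
            context = self.layers[0](np.stack(hist_feats + [init_feat]))
            for i in range(1, len(self.layers)):
                prev_hist = self.history[i - 1]          # cached outputs of the previous layer
                new_context = self.layers[i](np.stack(prev_hist + [context]))
                # cache the previous layer's current output for the next detection (see claim 3)
                self.history[i - 1] = (prev_hist + [context])[-self.n_history:]
                context = new_context
            return context                               # finally output M-layer context feature

    # toy usage: three layers that average their input frames and project them
    rng = np.random.default_rng(0)
    mats = [rng.standard_normal((32, 32)) * 0.1 for _ in range(3)]
    layers = [lambda frames, m=m: m @ frames.mean(axis=0) for m in mats]
    net = LayeredTDNN(layers, n_history=4)
    out = net.forward(rng.standard_normal(32), [rng.standard_normal(32) for _ in range(4)])
    print(out.shape)    # (32,)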
3. The method as recited in claim 2, further comprising:
storing the i-layer context features in a storage area corresponding to the i-layer network, to serve as i-layer historical output features in the next voice detection, and deleting the i-layer context feature that was stored in the storage area corresponding to the i-layer network during the earliest voice detection, so that the number of features in the feature queue of the storage area corresponding to the i-layer network remains N.
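In implementation terms, claim 3 amounts to a fixed-length first-in-first-out queue per network layer. A collections.deque with maxlen=N, as in the assumed sketch below, gives exactly this behaviour: appending the newly computed i-layer context feature silently evicts the one stored during the earliest detection, so the queue always holds N entries. The value N = 4 is an arbitrary example.

    from collections import deque

    N = 4  # assumed history length; the claims only require a positive integer

    layer_cache = deque(maxlen=N)     # one such feature queue per network layer

    for feature in ["f0", "f1", "f2", "f3", "f4", "f5"]:
        layer_cache.append(feature)   # oldest entry drops out automatically once full

    print(list(layer_cache))          # ['f2', 'f3', 'f4', 'f5']  -> always N entries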
4. The method of claim 2, wherein prior to extracting the target contextual features of the initial speech feature contained in the sequence of speech features, further comprising:
if the initial voice feature and the N historical voice features are represented by a floating point data type, converting the maximum floating point number among the initial voice feature and the N historical voice features into an integer data type to obtain an initial scaling coefficient;
performing forward quantization scaling on the initial voice feature and the N historical voice features according to the initial scaling coefficient, so as to obtain the initial voice feature and the N historical voice features of the integer data type;
then, after obtaining the first-layer context feature of the initial voice feature, the method further comprises:
performing inverse quantization scaling on the first-layer context feature of the integer data type based on the initial scaling coefficient, to obtain a first-layer context feature of the floating point data type.
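Claims 4 and 5 describe the same symmetric scaling step at different depths: the largest absolute floating point value among the features to be quantized fixes the scaling coefficient, forward scaling maps the features into the integer range, and inverse scaling divides the layer output by the same coefficient. A hedged sketch follows; the 8-bit range, the rounding and the function names are assumptions.

    import numpy as np

    QMAX = 127  # assumed signed 8-bit range

    def forward_quantize(*feature_groups):
        # one scaling coefficient derived from the maximum floating point value of all inputs
        max_val = max(float(np.abs(f).max()) for f in feature_groups)
        scale = QMAX / max(max_val, 1e-8)              # initial / i-layer scaling coefficient
        quantized = [np.round(f * scale).astype(np.int32) for f in feature_groups]
        return quantized, scale

    def inverse_quantize(int_output, scale):
        # map the integer-domain layer output back to floating point
        return int_output.astype(np.float32) / scale

    # usage: the initial voice feature and the N historical features share one coefficient
    init_feat = np.random.randn(64).astype(np.float32)
    hist_feats = [np.random.randn(64).astype(np.float32) for _ in range(4)]
    quantized_feats, scale = forward_quantize(init_feat, *hist_feats)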
5. The method as recited in claim 2, further comprising:
when 2 ≤ i ≤ M, if the (i-1)-layer context feature and the N (i-1)-layer historical output features are represented by a floating point data type, converting the maximum floating point number among the (i-1)-layer context feature and the N (i-1)-layer historical output features into an integer data type to obtain an i-layer scaling coefficient;
performing forward quantization scaling on the (i-1)-layer context features and the N (i-1)-layer historical output features of the floating point data type according to the i-layer scaling coefficient, so as to obtain (i-1)-layer context features and N (i-1)-layer historical output features of the integer data type;
then, after the i-layer context feature is obtained, the method further comprises:
and carrying out inverse quantization scaling on the i-layer context characteristics of the integral data type based on the i-layer scaling coefficient to obtain i-layer context characteristics of the floating point data type.
6. The method of claim 5, wherein prior to feature extraction of the initial speech feature and the N historical speech features based on the layer 1 network, further comprising:
based on the initial voice feature and N historical voice features of the integer data type, carrying out quantization shift on model parameters of the layer 1 network;
the method further comprises the steps of, before the i-1 layer context characteristics output by the i-1 layer network are extracted by combining the N i-1 layer history output characteristics based on the i-1 layer network:
and carrying out quantization shift on the model parameters of the i-layer network based on the i-1 layer context characteristics and N i-1 layer historical output characteristics of the integer data type.
7. The method of claim 1, wherein the obtaining the target classification of the speech data to be detected based on the target context features comprises:
mapping the target context features by a fully-connected neural network to obtain the target classification of the voice data to be detected.
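Claim 7 maps the target context feature to a class, from which the target keyword is read off, with a fully connected layer. A bare-bones illustration with assumed dimensions and keyword names:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def classify(context_feature, w_fc, b_fc, keywords):
        # fully connected mapping plus softmax over the keyword classes (illustrative)
        probs = softmax(w_fc @ context_feature + b_fc)
        target_class = int(np.argmax(probs))
        return keywords[target_class], float(probs[target_class])

    keywords = ["wake_up", "stop", "no_keyword"]        # assumed classes
    ctx = np.random.randn(128).astype(np.float32)       # target context feature
    w_fc = np.random.randn(len(keywords), 128) * 0.01
    b_fc = np.zeros(len(keywords))
    print(classify(ctx, w_fc, b_fc, keywords))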
8. The method according to any of claims 1-7, wherein the method is performed by a target speech detection model, the training process of which is as follows:
performing multiple rounds of iterative training on a voice detection model to be trained based on a preset training sample set, wherein each training sample comprises a sample voice to be detected, N historical sample voices and a classification label, and the sample voice to be detected and the N historical sample voices are ordered according to acquisition time; in one round of the iterative process, the following operations are performed:
respectively performing feature extraction on the sample voice to be detected and the N historical sample voices in a training sample to obtain a to-be-detected voice feature and N sample voice features;
extracting a sample context feature of the to-be-detected voice feature from the to-be-detected voice feature and the N sample voice features; wherein the sample context feature characterizes: semantic relationships between the to-be-detected voice feature and the N sample voice features;
based on the sample context feature, obtaining a classification result, and adjusting parameters of the voice detection model according to the difference between the classification result and the classification label of the training sample.
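Structurally, claim 8 is an ordinary supervised loop: each sample bundles a to-be-detected voice, its N historical voices and a class label, and the parameters are nudged by the gap between prediction and label. The toy sketch below keeps only that loop structure; the stand-in feature extractor, the simple softmax classifier and the sample values are assumptions, not the claimed TDNN model.

    import numpy as np

    def extract_features(voice):
        # stand-in feature extractor (assumption): a pseudo-random 16-dim vector per voice id
        rng = np.random.default_rng(abs(hash(voice)) % (2 ** 32))
        return rng.standard_normal(16).astype(np.float32)

    def train(samples, n_classes, epochs=5, lr=0.1):
        dim = 16 * 3                          # to-be-detected feature + N(=2) historical features
        W = np.zeros((n_classes, dim))
        for _ in range(epochs):               # multiple rounds of iterative training
            for voice, history, label in samples:
                feats = [extract_features(voice)] + [extract_features(h) for h in history]
                x = np.concatenate(feats)     # crude stand-in for the sample context feature
                logits = W @ x
                p = np.exp(logits - logits.max())
                p /= p.sum()
                # the difference between the classification result and the label drives the update
                W -= lr * np.outer(p - np.eye(n_classes)[label], x)
        return W

    samples = [("hey_assistant", ["h1", "h2"], 0), ("turn_off", ["h3", "h4"], 1)]
    W = train(samples, n_classes=2)
    print(W.shape)   # (2, 48)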
9. A voice detection apparatus, the apparatus comprising:
the first feature extraction unit is used for carrying out feature extraction on voice data to be detected to obtain initial voice features, wherein the voice data comprises at least one voice frame;
the data access unit is used for extracting N historical voice features obtained in a specified historical stage from a specified storage area, wherein the time interval between the specified historical stage and the current moment accords with a preset interval condition, and N is a positive integer; sequencing the initial voice features and the N historical voice features according to the acquisition time sequence of the corresponding voice data to obtain a corresponding voice feature sequence;
the second feature extraction unit is used for extracting target context features of the initial voice features contained in the voice feature sequence based on M layers of delay neural networks which are sequentially connected in series; wherein the target context feature characterizes: semantic relationships between the initial speech feature and the N historical speech features;
when the input data of any layer of the M-layer delay neural network is of a floating point data type, obtaining a scaling coefficient according to the maximum floating point number in the floating-point input data, so as to convert the floating-point input data into input data of an integer data type; based on the integer input data, performing quantization shift on the model parameters of that layer of the delay neural network to obtain model parameters of the integer data type; and, based on the scaling coefficient, converting the integer output data of that layer of the delay neural network into output data of the floating point data type;
the classification unit is used for obtaining a target classification of the voice data to be detected based on the target context features, obtaining target keywords corresponding to the target classification, and storing the initial voice features into a designated storage area to serve as historical voice features in the next voice detection.
10. The apparatus of claim 9, wherein the second feature extraction unit is configured to,
inputting the voice feature sequence into the delay neural network having an M-layer network, and, for the i-layer network, where 1 ≤ i ≤ M, executing the following operations:
when i=1, performing feature extraction on the initial voice feature and the N historical voice features based on a layer 1 network to obtain a first-layer context feature of the initial voice feature;
when 2 ≤ i ≤ M, extracting N (i-1)-layer historical output features stored by the (i-1)-layer network in the specified historical stage, and performing, based on the i-layer network, feature extraction on the (i-1)-layer context features output by the (i-1)-layer network in combination with the N (i-1)-layer historical output features, to obtain i-layer context features; the N (i-1)-layer historical output features are obtained from the N historical voice features through the (i-1)-layer network of the M-layer network;
and taking the finally output M-layer context features as the target context features.
11. The apparatus of claim 10, wherein the data access unit is further configured to,
storing the i-layer context features in a storage area corresponding to the i-layer network, to serve as i-layer historical output features in the next voice detection, and deleting the i-layer context feature that was stored in the storage area corresponding to the i-layer network during the earliest voice detection, so that the number of features in the feature queue of the storage area corresponding to the i-layer network remains N.
12. The apparatus of claim 10, wherein the second feature extraction unit is further configured to,
if the initial voice feature and the N historical voice features are represented by a floating point data type, converting the maximum floating point number among the initial voice feature and the N historical voice features into an integer data type to obtain an initial scaling coefficient;
performing forward quantization scaling on the initial voice feature and the N historical voice features according to the initial scaling coefficient, so as to obtain the initial voice feature and the N historical voice features of the integer data type;
then, after obtaining the first-layer context feature of the initial voice feature, the second feature extraction unit is further configured to:
performing inverse quantization scaling on the first-layer context feature of the integer data type based on the initial scaling coefficient, to obtain a first-layer context feature of the floating point data type.
13. A computer readable non-volatile storage medium, characterized in that the computer readable non-volatile storage medium stores a program which, when run on a computer, causes the computer to implement the method of any one of claims 1 to 8.
14. A computer device, comprising:
a memory for storing a computer program;
a processor for invoking a computer program stored in said memory, performing the method according to any of claims 1 to 8 in accordance with the obtained program.
CN202311179043.6A 2023-09-13 2023-09-13 Voice detection method, device, equipment and storage medium Active CN116913266B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311179043.6A CN116913266B (en) 2023-09-13 2023-09-13 Voice detection method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116913266A CN116913266A (en) 2023-10-20
CN116913266B true CN116913266B (en) 2024-01-05

Family

ID=88355082

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311179043.6A Active CN116913266B (en) 2023-09-13 2023-09-13 Voice detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116913266B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117524228A (en) * 2024-01-08 2024-02-06 腾讯科技(深圳)有限公司 Voice data processing method, device, equipment and medium

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007226253A (en) * 2007-03-27 2007-09-06 Matsushita Electric Ind Co Ltd Voice coding device and voice decoding device
CN110211593A (en) * 2019-06-03 2019-09-06 北京达佳互联信息技术有限公司 Audio recognition method, device, electronic equipment and storage medium
CN110660385A (en) * 2019-09-30 2020-01-07 出门问问信息科技有限公司 Command word detection method and electronic equipment
CN111859954A (en) * 2020-07-01 2020-10-30 腾讯科技(深圳)有限公司 Target object identification method, device, equipment and computer readable storage medium
CN112748899A (en) * 2020-06-08 2021-05-04 腾讯科技(深圳)有限公司 Data processing method and related equipment
CN112990438A (en) * 2021-03-24 2021-06-18 中国科学院自动化研究所 Full-fixed-point convolution calculation method, system and equipment based on shift quantization operation
CN113066508A (en) * 2021-03-15 2021-07-02 腾讯科技(深圳)有限公司 Voice content processing method, device and equipment and readable storage medium
CN113409792A (en) * 2021-06-22 2021-09-17 科大讯飞股份有限公司 Voice recognition method and related equipment thereof
CN114596841A (en) * 2022-03-15 2022-06-07 腾讯科技(深圳)有限公司 Real-time voice recognition method, model training method, device and equipment
CN114692824A (en) * 2020-12-31 2022-07-01 安徽寒武纪信息科技有限公司 Quantitative training method, device and equipment of neural network model
CN115101075A (en) * 2022-05-05 2022-09-23 腾讯科技(深圳)有限公司 Voice recognition method and related device
CN115273815A (en) * 2022-07-29 2022-11-01 平安科技(深圳)有限公司 Method, device and equipment for detecting voice keywords and storage medium
CN115376495A (en) * 2022-08-03 2022-11-22 腾讯科技(深圳)有限公司 Speech recognition model training method, speech recognition method and device
CN115623126A (en) * 2021-07-13 2023-01-17 腾讯科技(深圳)有限公司 Voice call method, system, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN116913266A (en) 2023-10-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant