CN111243609A - Method and device for intelligently detecting effective voice and computer readable storage medium - Google Patents

Method and device for intelligently detecting effective voice and computer readable storage medium Download PDF

Info

Publication number
CN111243609A
Authority
CN
China
Prior art keywords
voice
speech
noise
training
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010029673.5A
Other languages
Chinese (zh)
Other versions
CN111243609B (en)
Inventor
Ma Kun
Liu Weiwei
Zhao Zhiyan
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202010029673.5A priority Critical patent/CN111243609B/en
Publication of CN111243609A publication Critical patent/CN111243609A/en
Priority to PCT/CN2020/112351 priority patent/WO2021139182A1/en
Application granted granted Critical
Publication of CN111243609B publication Critical patent/CN111243609B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16 - Vocoder architecture
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/70 - Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention relates to artificial intelligence technology and discloses an intelligent effective voice detection method comprising the following steps: receiving a noise set and a pure human voice set; performing a voice fusion operation on the noise set and the pure human voice set according to a speech autocorrelation function to obtain a human voice set and a label set; inputting the human voice set into a pre-constructed voice coding network for an encoding operation to obtain an encoded human voice set; inputting the encoded human voice set into a voice attention network for training to obtain a trained voice attention network; and receiving a voice set input by a user and inputting it sequentially into the voice coding network and the trained voice attention network to obtain an effective voice detection result for that voice set. The invention also provides an intelligent effective voice detection device and a computer readable storage medium. The invention can realize an accurate and efficient intelligent effective voice detection function.

Description

Method and device for intelligently detecting effective voice and computer readable storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a method and device for intelligently detecting effective voice and a computer readable storage medium.
Background
Voiceprint recognition, an important application in the field of artificial intelligence, refers to identifying a speaker from his or her voice. Voiceprint recognition mainly comprises three steps: voice preprocessing, effective voice detection, and speaker recognition. The effective voice detection stage is particularly important because the audio must be judged to determine whether it is effective voice or ineffective noise. Most traditional voice activity detection (VAD) algorithms are based on zero-crossing-rate features, energy features, or combinations of various acoustic features, and adopt statistical models such as GMMs and HMMs or support vector machine (SVM) algorithms to distinguish effective voice from ineffective audio (such as noise).
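As an illustration of the feature-based approach described above, the following is a minimal Python sketch (not taken from the patent) that computes the two classic frame-level features, short-time energy and zero-crossing rate, which such VAD systems feed to a GMM/HMM or SVM classifier; the frame length and hop size are illustrative assumptions corresponding to 25 ms / 10 ms frames at a 16 kHz sampling rate.

```python
import numpy as np

def frame_features(signal, frame_len=400, hop=160):
    """Short-time energy and zero-crossing rate per frame.

    frame_len=400 and hop=160 correspond to 25 ms / 10 ms frames at a
    16 kHz sampling rate; both values are illustrative assumptions.
    """
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    energy = np.empty(n_frames)
    zcr = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len].astype(np.float64)
        energy[i] = np.sum(frame ** 2)                         # short-time energy
        zcr[i] = np.mean(np.abs(np.diff(np.sign(frame))) > 0)  # sign changes
    return energy, zcr
```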
Disclosure of Invention
The invention provides a method and device for intelligently detecting effective voice and a computer readable storage medium, with the main aim of performing intelligent effective voice detection on voice data input by a user.
In order to achieve the above object, the present invention provides an effective voice intelligent detection method, which comprises:
receiving a noise set and a pure human voice set, and performing a voice fusion operation on the noise set and the pure human voice set according to a speech autocorrelation function to obtain a human voice set and a tag set;
inputting the voice set into a pre-constructed voice coding network to carry out coding operation to obtain a coded voice set;
inputting the coded voice set into a voice attention network for training to obtain a training value, inputting the training value and the label set into a pre-constructed objective function for calculation to obtain a target value, if the target value is greater than a preset threshold value, continuing training by the voice attention network, and if the target value is less than the preset threshold value, quitting training by the voice attention network to obtain a trained voice attention network;
and receiving a voice set input by a user, and sequentially inputting the voice set input by the user to the voice coding network and the trained voice attention network to obtain an effective voice detection result of the voice set.
Optionally, performing the fusion operation on the noise set and the pure human voice set to obtain the human voice set comprises:
extracting pitch periods of the noise set and the pure human voice set based on an autocorrelation function;
and modifying the voice signals of the noise set and the pure voice set according to a pitch modification algorithm and the pitch period to obtain a short-time signal, and synthesizing the short-time signal into new voice to obtain the voice set.
Optionally, the autocorrelation function is:
R_m(k) = Σ_{n=0}^{N-1-k} x_ω(n) · x_ω(n+k)
where R_m(k) is the autocorrelation function, x_ω(n) denotes the speech signals of the noise set and the pure human voice set, n denotes a speech signal segment, ω denotes the truncation period of the speech signal segment, m denotes the window function, N denotes the speech signal length of the noise set and the pure human voice set, and k denotes the pitch period, measured as the similarity between a speech signal delayed by k points and the original signal.
Modifying the speech signals of the noise set and the pure human voice set according to the pitch modification algorithm and the pitch period to obtain the short-time signals comprises:
presetting a modification factor α for modifying the speech signal;
changing the fundamental frequency of the speech signal in accordance with the modification factor α while keeping the pitch period unchanged;
the change of the fundamental frequency obtains a plurality of voice signals in adjacent time intervals, and the voice signals in the adjacent time intervals are the short-time signals.
Optionally, inputting the encoded human voice set into a voice attention network for training to obtain a training value comprises:
the input layer of the voice attention network receives the coded voice set and decodes the coded voice set to obtain a decoded voice set;
inputting the decoded voice set into a hidden layer of the voice attention network for perception calculation and inputting the decoded voice set into an output layer of the voice attention network;
and the output layer carries out attention mechanism calculation to obtain the training value.
In addition, in order to achieve the above object, the present invention further provides an apparatus for intelligently detecting valid voices, including a memory and a processor, where the memory stores a valid voice intelligent detection program operable on the processor, and the processor executes the valid voice intelligent detection program to implement the following steps:
receiving a noise set and a pure human voice set, and performing a voice fusion operation on the noise set and the pure human voice set according to a speech autocorrelation function to obtain a human voice set and a tag set;
inputting the voice set into a pre-constructed voice coding network to carry out coding operation to obtain a coded voice set;
inputting the coded voice set into a voice attention network for training to obtain a training value, inputting the training value and the label set into a pre-constructed objective function for calculation to obtain a target value, if the target value is greater than a preset threshold value, continuing training by the voice attention network, and if the target value is less than the preset threshold value, quitting training by the voice attention network to obtain a trained voice attention network;
and receiving a voice set input by a user, and sequentially inputting the voice set input by the user to the voice coding network and the trained voice attention network to obtain an effective voice detection result of the voice set.
Optionally, performing the fusion operation on the noise set and the pure human voice set to obtain the human voice set comprises:
extracting pitch periods of the noise set and the pure human voice set based on an autocorrelation function;
and modifying the voice signals of the noise set and the pure voice set according to a pitch modification algorithm and the pitch period to obtain a short-time signal, and synthesizing the short-time signal into new voice to obtain the voice set.
Optionally, the autocorrelation function is:
R_m(k) = Σ_{n=0}^{N-1-k} x_ω(n) · x_ω(n+k)
where R_m(k) is the autocorrelation function, x_ω(n) denotes the speech signals of the noise set and the pure human voice set, n denotes a speech signal segment, ω denotes the truncation period of the speech signal segment, m denotes the window function, N denotes the speech signal length of the noise set and the pure human voice set, and k denotes the pitch period, measured as the similarity between a speech signal delayed by k points and the original signal.
Modifying the speech signals of the noise set and the pure human voice set according to the pitch modification algorithm and the pitch period to obtain the short-time signals comprises:
presetting a modification factor α for modifying the speech signal;
changing the fundamental frequency of the speech signal in accordance with the modification factor α while keeping the pitch period unchanged;
the change of the fundamental frequency obtains a plurality of voice signals in adjacent time intervals, and the voice signals in the adjacent time intervals are the short-time signals.
In addition, to achieve the above object, the present invention also provides a computer readable storage medium, which stores thereon an effective voice smart detection program, which is executable by one or more processors to implement the steps of the effective voice smart detection method as described above.
By performing the voice fusion operation on the noise set and the pure human voice set through the speech autocorrelation function to obtain the human voice set and the label set, the invention enriches the voice background and improves the learning capability of the subsequent voice attention network. The encoding operation based on the pre-constructed voice coding network effectively improves the feature extraction capability for voice, and inputting the encoded human voice set into the voice attention network for training improves the network's discrimination capability. Therefore, the method and device for intelligently detecting effective voice and the computer readable storage medium provided by the invention can realize an accurate and efficient effective voice detection function.
Drawings
Fig. 1 is a schematic flow chart of an effective voice intelligent detection method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of an internal structure of an effective voice intelligent detection apparatus according to an embodiment of the present invention;
fig. 3 is a schematic block diagram of an effective speech intelligent detection program in the effective speech intelligent detection apparatus according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides an effective voice intelligent detection method. Fig. 1 is a schematic flow chart of an effective voice intelligent detection method according to an embodiment of the present invention. The method may be performed by an apparatus, which may be implemented by software and/or hardware.
In this embodiment, the effective voice intelligent detection method includes:
and S1, receiving a noise set and a pure human voice set, and performing voice fusion operation on the noise set and the pure human voice set according to a language autocorrelation function to obtain a human voice set and a label set.
Preferably, the noise set consists of background noise collected in advance from different real scenes (such as public transport venues, rooms, meeting rooms, and stadiums). The pure human voice set is obtained by constructing a noise-free background environment in advance and recording different voice data in that environment (such as the everyday speech of Zhang San and Li Si).
The label set records whether the sound in each period of the human voice set is human voice (labeled speech) or noise (labeled noise). For example, for a human voice set containing voice A, the label set describes voice A as speech in the period a-b and as noise in the period b-c.
Preferably, fusing the noise set and the pure human voice set to obtain the human voice set comprises: extracting pitch periods of the noise set and the pure human voice set based on an autocorrelation function; modifying the speech signals of the noise set and the pure human voice set according to a pitch modification algorithm and the pitch periods to obtain short-time signals; and synthesizing the short-time signals into new voice to obtain the human voice set.
Further, the autocorrelation function is:
R_m(k) = Σ_{n=0}^{N-1-k} x_ω(n) · x_ω(n+k)
where R_m(k) represents the autocorrelation function, x_ω(n) represents the speech signals of the noise set and the pure human voice set, n denotes a speech signal segment, ω denotes the truncation period of the speech signal segment, m denotes the window function, N denotes the speech signal length of the noise set and the pure human voice set, and k denotes the pitch period, measured as the similarity between a speech signal delayed by k points and the original signal.
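For concreteness, the pitch-period extraction step can be sketched in Python as below. This is a minimal illustration of the short-time autocorrelation reconstructed above, not the patent's implementation; the lag search range (roughly 50-500 Hz at a 16 kHz sampling rate) is an assumption.

```python
import numpy as np

def pitch_period(x, min_lag=32, max_lag=320):
    """Estimate the pitch period of a windowed speech segment x_w(n) as the
    lag k that maximizes the short-time autocorrelation R_m(k).

    min_lag/max_lag bound the search to roughly 50-500 Hz at 16 kHz; they
    are illustrative assumptions, not values stated in the patent.
    """
    x = x.astype(np.float64) - np.mean(x)          # remove any DC offset
    n = len(x)
    r = np.array([np.dot(x[: n - k], x[k:]) for k in range(max_lag + 1)])
    return min_lag + int(np.argmax(r[min_lag:]))   # pitch period in samples
```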
Modifying the speech signals of the noise set and the pure human voice set according to the pitch modification algorithm and the pitch period to obtain the short-time signals comprises: presetting a modification factor α for modifying the speech signal; changing the fundamental frequency of the speech signal according to the modification factor α while keeping the pitch period unchanged; and taking the speech signals over the resulting plurality of adjacent time intervals as the short-time signals.
Synthesizing the short-time signal into new sound to obtain a human voice set, wherein the calculation method adopted by the synthesis is as follows:
x̃(n) = Σ_q β_q · x_q(n) · h_q(t_q − n) / Σ_q h_q(t_q − n)

where x̃(n) is the synthesized signal that forms the human voice set, q is the speech signal segment interception period of the short-time signal, t_q is the center of the synthesis analysis window, x_q(n) is the short-time signal, h_q(t_q − n) is the analysis window, and β_q is a compensation factor that keeps the energy unchanged during synthesis.
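A minimal sketch of this synthesis step follows, implemented as weighted overlap-add in line with the formula reconstructed above. The Hann analysis window and the default compensation factors β_q = 1 are assumptions for illustration; the patent does not publish these choices.

```python
import numpy as np

def overlap_add(segments, centers, out_len, beta=None):
    """Weighted overlap-add of short-time signals x_q(n) centered at t_q.

    beta holds the energy compensation factors beta_q (default 1, an
    assumption); a Hann window stands in for the analysis window h_q.
    """
    beta = np.ones(len(segments)) if beta is None else beta
    num = np.zeros(out_len)
    den = np.full(out_len, 1e-12)              # avoid division by zero
    for b, seg, t in zip(beta, segments, centers):
        w = np.hanning(len(seg))               # assumed analysis window h_q
        start = t - len(seg) // 2
        lo, hi = max(start, 0), min(start + len(seg), out_len)
        num[lo:hi] += b * (seg * w)[lo - start : hi - start]
        den[lo:hi] += w[lo - start : hi - start]
    return num / den                           # synthesized voice signal
```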
And S2, inputting the voice set into a pre-constructed voice coding network for coding operation to obtain a coded voice set.
The voice coding network may preferably adopt a bidirectional network composed of a forward network and a backward network based on gated recurrent units (GRU); this bidirectional structure can encode voice information in both directions simultaneously and increases the amount of voice information captured.
The encoding operation comprises: receiving the human voice set (x(1), x(2), …, x(t), …, x(n)) and presetting an encoded voice set (h(1), h(2), …, h(t), …, h(n)); constructing GRU relationships between the voice set and the encoded voice set according to the forward network and the backward network to obtain a forward-network GRU relationship and a backward-network GRU relationship; and merging the forward-network GRU relationship and the backward-network GRU relationship to obtain the encoded human voice set.
Preferably, the forward network GRU relationship and the backward network GRU relationship are respectively:
h(t)_a = GRU(x(t), h(t−1))

h(t)_b = GRU(x(t), h(t+1))
where h(t)_a is the forward-network GRU relationship, h(t)_b is the backward-network GRU relationship, t indexes the voice signal, h(t−1) and h(t+1) are the respective preset encoded voices, and GRU denotes the computation of the gated recurrent unit.
Preferably, the merging is:
h(t) = GRU(h(t)_a, [h(t)_a, h(t)_b])
where h(t) is the encoded human voice set.
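A compact PyTorch sketch of such a bidirectional GRU encoder is given below. The feature dimension and hidden size are illustrative assumptions, and the framework's built-in bidirectional GRU stands in for the explicit forward/backward merge described above.

```python
import torch
import torch.nn as nn

class VoiceEncoder(nn.Module):
    """Bidirectional GRU encoder mapping a feature sequence (x(1)..x(n))
    to encoded states (h(1)..h(n)); sizes are illustrative assumptions."""

    def __init__(self, input_dim=40, hidden_dim=128):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden_dim,
                          batch_first=True, bidirectional=True)

    def forward(self, x):          # x: (batch, time, input_dim)
        h, _ = self.gru(x)         # h: (batch, time, 2 * hidden_dim)
        return h                   # forward and backward states merged

# usage: encode a batch of 2 utterances, each 100 frames of 40-dim features
encoder = VoiceEncoder()
h = encoder(torch.randn(2, 100, 40))
```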
S3, inputting the coded voice set to a voice attention network for training to obtain a training value, inputting the training value and the label set to a pre-constructed target function for calculation to obtain a target value, if the target value is larger than a preset threshold value, continuing training of the voice attention network, and if the target value is smaller than the preset threshold value, exiting training of the voice attention network to obtain a trained voice attention network.
Preferably, the voice attention network comprises an input layer, a hidden layer and an output layer.
Further, the training comprises: the input layer receives the coded voice set and decodes the coded voice set to obtain a decoded voice set, the decoded voice set is input to the hidden layer to be subjected to perception calculation and input to the output layer, and the output layer is subjected to attention mechanism calculation to obtain the training value.
Preferably, the decoding method of the decoding process is:
e_decode = ω_t · tanh(W[h(t), s_t] + b)
where e_decode represents the decoded voice set, s_t represents the decoding function corresponding to an input unit in the input layer, ω_t is the weight matrix from the input layer to the hidden layer, W is the weight of the hidden layer, and b is a bias.
Preferably, the calculation method of the perception calculation is as follows:
e_per = Σ_t α_t · e_decode
where e_per is the output value of the perception calculation and α_t is a hidden unit of the hidden layer.
Preferably, the attention mechanism is calculated as:
V_training = attention([e_per, h(t)], α_t)
where V_training is the training value and attention() is the attention mechanism calculation function.
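The decoding, perception, and attention steps can be sketched together as an additive-attention scoring layer over the encoded states h(t), as below. The exact wiring of the patent's hidden and output layers is not published, so the layer sizes and the way the perception values feed the output layer are assumptions.

```python
import torch
import torch.nn as nn

class VoiceAttention(nn.Module):
    """Additive attention over encoded states h(t) followed by a per-frame
    speech/noise score; layer sizes and wiring are assumptions."""

    def __init__(self, enc_dim=256, att_dim=128):
        super().__init__()
        self.W = nn.Linear(enc_dim, att_dim)         # hidden-layer weight W
        self.w = nn.Linear(att_dim, 1, bias=False)   # scoring vector omega_t
        self.out = nn.Linear(enc_dim + 1, 1)         # output layer

    def forward(self, h):                  # h: (batch, time, enc_dim)
        e = self.w(torch.tanh(self.W(h)))  # alignment scores, like e_decode
        alpha = torch.softmax(e, dim=1)    # attention weights over time
        per = alpha * e                    # stand-in for perception values
        logits = self.out(torch.cat([h, per], dim=-1))
        return logits.squeeze(-1)          # (batch, time) speech/noise logits
```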
Preferably, the calculation method for obtaining the target value by calculating the target function is as follows:
ℓ = −(1/N) · Σ_{n=1}^{N} log P(V_training(n) | V_label(n))

where ℓ denotes the target value, N is the number of speech signal segments, n indexes the speech signal segments, P(·) is a probability value, and V_label is the label set.
After this training, the voice attention network has the capability of recognizing whether a sound is speech or noise.
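The threshold-controlled training procedure of step S3 can be sketched as follows, with a binary cross-entropy objective standing in for the patent's probability-based target function; the threshold, learning rate, and epoch cap are illustrative assumptions.

```python
import torch
import torch.nn as nn

def train(encoder, attention, loader, threshold=0.05, max_epochs=100):
    """Train until the objective falls below the preset threshold (S3).

    Reuses the VoiceEncoder/VoiceAttention sketches above; the threshold,
    learning rate, and epoch cap are illustrative assumptions, and binary
    cross-entropy stands in for the patent's probability-based objective.
    """
    params = list(encoder.parameters()) + list(attention.parameters())
    opt = torch.optim.Adam(params, lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(max_epochs):
        total, batches = 0.0, 0
        for feats, labels in loader:       # labels: 1 = speech, 0 = noise
            logits = attention(encoder(feats))
            loss = loss_fn(logits, labels.float())
            opt.zero_grad()
            loss.backward()
            opt.step()
            total, batches = total + loss.item(), batches + 1
        if total / batches < threshold:    # target value below threshold:
            break                          # exit training as in step S3
```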
And S4, receiving a voice set input by a user, and sequentially inputting the voice set input by the user to the voice coding network and the trained voice attention network to obtain an effective voice detection result of the voice set.
For example, if a user inputs a recording of Wang Wu working in a workshop, the trained voice attention network can judge which time periods of the recording are noise and which contain non-noise voice data (such as Wang Wu speaking). As another example, in criminal investigation, a segment of speech from a crime scene may be recorded with a device such as a mobile phone; the effective voice detection technique of the invention analyzes this recording to determine which speech frames are effective voice, and the extracted effective voice is then compared against an existing voice database to identify the speaker at the scene, improving the comparison success rate.
The invention also provides an effective voice intelligent detection device. Fig. 2 is a schematic diagram of an internal structure of an effective voice intelligent detection apparatus according to an embodiment of the present invention.
In this embodiment, the effective voice intelligent detection apparatus 1 may be a PC (personal computer), a terminal device such as a smartphone, tablet computer, or portable computer, or a server. The effective voice intelligent detection apparatus 1 comprises at least a memory 11, a processor 12, a communication bus 13, and a network interface 14.
The memory 11 includes at least one type of readable storage medium, including flash memory, hard disks, multimedia cards, card-type memory (e.g., SD or DX memory), magnetic memory, magnetic disks, and optical disks. In some embodiments, the memory 11 may be an internal storage unit of the effective voice intelligent detection apparatus 1, for example a hard disk of the apparatus 1. In other embodiments, the memory 11 may be an external storage device of the apparatus 1, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, or flash card provided on the apparatus 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the apparatus 1. The memory 11 may be used not only to store application software installed in the apparatus 1 and various types of data, such as the code of the effective voice intelligent detection program 01, but also to temporarily store data that has been output or is to be output.
In some embodiments, the processor 12 may be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip, used to run the program code stored in the memory 11 or to process data, for example to execute the effective voice intelligent detection program 01.
The communication bus 13 is used to realize connection communication between these components.
The network interface 14 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), typically used to establish a communication link between the apparatus 1 and other electronic devices.
Optionally, the apparatus 1 may further comprise a user interface, which may include a display and an input unit such as a keyboard, as well as optional standard wired and wireless interfaces. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch liquid crystal display, an OLED (organic light-emitting diode) touch device, or the like. The display, which may also be called a display screen or display unit, is used to display information processed in the effective voice intelligent detection apparatus 1 and a visual user interface.
While fig. 2 only shows the effective voice intelligent detection apparatus 1 with the components 11-14 and the effective voice intelligent detection program 01, those skilled in the art will understand that the structure shown in fig. 2 does not constitute a limitation of the apparatus 1, which may include fewer or more components than shown, combine certain components, or arrange the components differently.
In the embodiment of the apparatus 1 shown in fig. 2, the memory 11 stores therein an active speech intelligent detection program 01; the following steps are implemented when the processor 12 executes the valid speech intelligent detection program 01 stored in the memory 11:
and S1, receiving a noise set and a pure human voice set, and performing voice fusion operation on the noise set and the pure human voice set according to a language autocorrelation function to obtain a human voice set and a label set.
Preferably, the noise set consists of background noise collected in advance from different real scenes (such as public transport venues, rooms, meeting rooms, and stadiums). The pure human voice set is obtained by constructing a noise-free background environment in advance and recording different voice data in that environment (such as the everyday speech of Zhang San and Li Si).
The label set records whether the sound in each period of the human voice set is human voice (labeled speech) or noise (labeled noise). For example, for a human voice set containing voice A, the label set describes voice A as speech in the period a-b and as noise in the period b-c.
Preferably, fusing the noise set and the pure human voice set to obtain the human voice set comprises: extracting pitch periods of the noise set and the pure human voice set based on an autocorrelation function; modifying the speech signals of the noise set and the pure human voice set according to a pitch modification algorithm and the pitch periods to obtain short-time signals; and synthesizing the short-time signals into new voice to obtain the human voice set.
Further, the autocorrelation function is:
R_m(k) = Σ_{n=0}^{N-1-k} x_ω(n) · x_ω(n+k)
where R_m(k) represents the autocorrelation function, x_ω(n) represents the speech signals of the noise set and the pure human voice set, n denotes a speech signal segment, ω denotes the truncation period of the speech signal segment, m denotes the window function, N denotes the speech signal length of the noise set and the pure human voice set, and k denotes the pitch period, measured as the similarity between a speech signal delayed by k points and the original signal.
Modifying the speech signals of the noise set and the pure human voice set according to the pitch modification algorithm and the pitch period to obtain the short-time signals comprises: presetting a modification factor α for modifying the speech signal; changing the fundamental frequency of the speech signal according to the modification factor α while keeping the pitch period unchanged; and taking the speech signals over the resulting plurality of adjacent time intervals as the short-time signals.
Synthesizing the short-time signal into new sound to obtain a human voice set, wherein the calculation method adopted by the synthesis is as follows:
x̃(n) = Σ_q β_q · x_q(n) · h_q(t_q − n) / Σ_q h_q(t_q − n)

where x̃(n) is the synthesized signal that forms the human voice set, q is the speech signal segment interception period of the short-time signal, t_q is the center of the synthesis analysis window, x_q(n) is the short-time signal, h_q(t_q − n) is the analysis window, and β_q is a compensation factor that keeps the energy unchanged during synthesis.
And S2, inputting the voice set into a pre-constructed voice coding network for coding operation to obtain a coded voice set.
The voice coding network may preferably adopt a bidirectional network composed of a forward network and a backward network based on gated recurrent units (GRU); this bidirectional structure can encode voice information in both directions simultaneously and increases the amount of voice information captured.
The encoding operation comprises: receiving the human voice set (x(1), x(2), …, x(t), …, x(n)) and presetting an encoded voice set (h(1), h(2), …, h(t), …, h(n)); constructing GRU relationships between the voice set and the encoded voice set according to the forward network and the backward network to obtain a forward-network GRU relationship and a backward-network GRU relationship; and merging the forward-network GRU relationship and the backward-network GRU relationship to obtain the encoded human voice set.
Preferably, the forward network GRU relationship and the backward network GRU relationship are respectively:
h(t)_a = GRU(x(t), h(t−1))

h(t)_b = GRU(x(t), h(t+1))
where h(t)_a is the forward-network GRU relationship, h(t)_b is the backward-network GRU relationship, t indexes the voice signal, h(t−1) and h(t+1) are the respective preset encoded voices, and GRU denotes the computation of the gated recurrent unit.
Preferably, the merging is:
h(t) = GRU(h(t)_a, [h(t)_a, h(t)_b])
where h(t) is the encoded human voice set.
S3, inputting the coded voice set to a voice attention network for training to obtain a training value, inputting the training value and the label set to a pre-constructed target function for calculation to obtain a target value, if the target value is larger than a preset threshold value, continuing training of the voice attention network, and if the target value is smaller than the preset threshold value, exiting training of the voice attention network to obtain a trained voice attention network.
Preferably, the voice attention network comprises an input layer, a hidden layer and an output layer.
Further, the training comprises: the input layer receives the coded voice set and decodes the coded voice set to obtain a decoded voice set, the decoded voice set is input to the hidden layer to be subjected to perception calculation and input to the output layer, and the output layer is subjected to attention mechanism calculation to obtain the training value.
Preferably, the decoding method of the decoding process is:
e_decode = ω_t · tanh(W[h(t), s_t] + b)
where e_decode represents the decoded voice set, s_t represents the decoding function corresponding to an input unit in the input layer, ω_t is the weight matrix from the input layer to the hidden layer, W is the weight of the hidden layer, and b is a bias.
Preferably, the calculation method of the perception calculation is as follows:
e_per = Σ_t α_t · e_decode
where e_per is the output value of the perception calculation and α_t is a hidden unit of the hidden layer.
Preferably, the attention mechanism is calculated as:
V_training = attention([e_per, h(t)], α_t)
where V_training is the training value and attention() is the attention mechanism calculation function.
Preferably, the calculation method for obtaining the target value by calculating the target function is as follows:
ℓ = −(1/N) · Σ_{n=1}^{N} log P(V_training(n) | V_label(n))

where ℓ denotes the target value, N is the number of speech signal segments, n indexes the speech signal segments, P(·) is a probability value, and V_label is the label set.
After this training, the voice attention network has the capability of recognizing whether a sound is speech or noise.
And S4, receiving a voice set input by a user, and sequentially inputting the voice set input by the user to the voice coding network and the trained voice attention network to obtain an effective voice detection result of the voice set.
For example, if a user inputs a recording of Wang Wu working in a workshop, the trained voice attention network can judge which time periods of the recording are noise and which contain non-noise voice data (such as Wang Wu speaking). As another example, in criminal investigation, a segment of speech from a crime scene may be recorded with a device such as a mobile phone; the effective voice detection technique of the invention analyzes this recording to determine which speech frames are effective voice, and the extracted effective voice is then compared against an existing voice database to identify the speaker at the scene, improving the comparison success rate.
Alternatively, in other embodiments, the active speech intelligent detection program may be further divided into one or more modules, and the one or more modules are stored in the memory 11 and executed by one or more processors (in this embodiment, the processor 12) to implement the present invention.
For example, referring to fig. 3, which shows a schematic diagram of the program modules of the effective voice intelligent detection program in an embodiment of the effective voice intelligent detection apparatus of the present invention, the effective voice intelligent detection program can be divided into a data receiving and fusing module 10, a voice encoding module 20, a voice attention network training module 30, and an effective voice detection module 40. Illustratively:
the data receiving and fusing module 10 is configured to: and receiving a noise set and a pure voice set, and performing voice fusion operation on the noise set and the pure voice set according to a language autocorrelation function to obtain a voice set and a tag set.
The voice encoding module 20 is configured to: input the human voice set into a pre-constructed voice coding network for an encoding operation to obtain an encoded human voice set.
The voice attention network training module 30 is configured to: input the encoded voice set into a voice attention network for training to obtain a training value, input the training value and the label set into a pre-constructed objective function for calculation to obtain a target value, continue training the voice attention network if the target value is greater than a preset threshold value, and exit training to obtain a trained voice attention network if the target value is less than the preset threshold value.
The effective voice detection module 40 is configured to: receive a voice set input by a user, and input the voice set sequentially into the voice coding network and the trained voice attention network to obtain an effective voice detection result for the voice set.
The functions or operation steps implemented by the data receiving and fusing module 10, the voice encoding module 20, the voice attention network training module 30, the effective voice detection module 40, and the other program modules when executed are substantially the same as those of the above embodiments, and are not described again here.
Furthermore, an embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium stores an active speech intelligent detection program, where the active speech intelligent detection program is executable by one or more processors to implement the following operations:
and receiving a noise set and a pure voice set, and performing voice fusion operation on the noise set and the pure voice set according to a language autocorrelation function to obtain a voice set and a tag set.
And inputting the voice set into a pre-constructed voice coding network for coding operation to obtain a coded voice set.
Inputting the coded voice set into a voice attention network for training to obtain a training value, inputting the training value and the label set into a pre-constructed objective function for calculation to obtain a target value, if the target value is greater than a preset threshold value, continuing training by the voice attention network, and if the target value is less than the preset threshold value, exiting training by the voice attention network to obtain a trained voice attention network.
And receiving a voice set input by a user, and sequentially inputting the voice set input by the user to the voice coding network and the trained voice attention network to obtain an effective voice detection result of the voice set.
It should be noted that the above-mentioned numbers of the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments. And the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, apparatus, article, or method that includes the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. An intelligent detection method for effective voice, characterized in that the method comprises:
receiving a noise set and a pure human voice set, and performing a voice fusion operation on the noise set and the pure human voice set according to a speech autocorrelation function to obtain a human voice set and a tag set;
inputting the voice set into a pre-constructed voice coding network to carry out coding operation to obtain a coded voice set;
inputting the coded voice set into a voice attention network for training to obtain a training value, inputting the training value and the label set into a pre-constructed objective function for calculation to obtain a target value, if the target value is greater than a preset threshold value, continuing training by the voice attention network, and if the target value is less than the preset threshold value, quitting training by the voice attention network to obtain a trained voice attention network;
and receiving a voice set input by a user, and sequentially inputting the voice set input by the user to the voice coding network and the trained voice attention network to obtain an effective voice detection result of the voice set.
2. The method for intelligently detecting effective voice according to claim 1, wherein fusing the noise set and the pure human voice set to obtain a human voice set comprises:
extracting pitch periods of the noise set and the pure human voice set based on an autocorrelation function;
and modifying the voice signals of the noise set and the pure voice set according to a pitch modification algorithm and the pitch period to obtain a short-time signal, and synthesizing the short-time signal into new voice to obtain the voice set.
3. The method for intelligent detection of active speech according to claim 2, wherein said autocorrelation function is:
R_m(k) = Σ_{n=0}^{N-1-k} x_ω(n) · x_ω(n+k)
where R_m(k) is the autocorrelation function, x_ω(n) represents the speech signals of the noise set and the pure human voice set, n denotes a speech signal segment, ω denotes the truncation period of the speech signal segment, m denotes the window function, N denotes the speech signal length of the noise set and the pure human voice set, and k denotes the pitch period, measured as the similarity between a speech signal delayed by k points and the original signal.
4. The method of claim 2, wherein said modifying the speech signals of the noise set and the pure human speech set according to a pitch modification algorithm and the pitch period results in a short-time signal, comprising:
presetting a modification factor α for modifying the speech signal;
changing the fundamental frequency of the speech signal in accordance with the modification factor α while keeping the pitch period unchanged;
the change of the fundamental frequency obtains a plurality of voice signals in adjacent time intervals, and the voice signals in the adjacent time intervals are the short-time signals.
5. The method according to claim 1, wherein inputting the encoded human voice set into a voice attention network for training to obtain a training value comprises:
the input layer of the voice attention network receives the coded voice set and decodes the coded voice set to obtain a decoded voice set;
inputting the decoded voice set into a hidden layer of the voice attention network for perception calculation and inputting the decoded voice set into an output layer of the voice attention network;
and the output layer carries out attention mechanism calculation to obtain the training value.
6. An active speech intelligent detection apparatus, comprising a memory and a processor, wherein the memory stores an active speech intelligent detection program operable on the processor, and the active speech intelligent detection program when executed by the processor implements the steps of:
receiving a noise set and a pure human voice set, and performing a voice fusion operation on the noise set and the pure human voice set according to a speech autocorrelation function to obtain a human voice set and a tag set;
inputting the voice set into a pre-constructed voice coding network to carry out coding operation to obtain a coded voice set;
inputting the coded voice set into a voice attention network for training to obtain a training value, inputting the training value and the label set into a pre-constructed objective function for calculation to obtain a target value, if the target value is greater than a preset threshold value, continuing training by the voice attention network, and if the target value is less than the preset threshold value, quitting training by the voice attention network to obtain a trained voice attention network;
and receiving a voice set input by a user, and sequentially inputting the voice set input by the user to the voice coding network and the trained voice attention network to obtain an effective voice detection result of the voice set.
7. The apparatus according to claim 6, wherein fusing the noise set and the pure human voice set to obtain a human voice set comprises:
extracting pitch periods of the noise set and the pure human voice set based on an autocorrelation function;
and modifying the voice signals of the noise set and the pure voice set according to a pitch modification algorithm and the pitch period to obtain a short-time signal, and synthesizing the short-time signal into new voice to obtain the voice set.
8. The active speech intelligent detection device according to claim 6, wherein the autocorrelation function is:
R_m(k) = Σ_{n=0}^{N-1-k} x_ω(n) · x_ω(n+k)
where R_m(k) is the autocorrelation function, x_ω(n) represents the speech signals of the noise set and the pure human voice set, n denotes a speech signal segment, ω denotes the truncation period of the speech signal segment, m denotes the window function, N denotes the speech signal length of the noise set and the pure human voice set, and k denotes the pitch period, measured as the similarity between a speech signal delayed by k points and the original signal.
9. The apparatus for intelligent detection of active speech according to claim 8, wherein said modifying said speech signals for said noise set and said pure human speech set according to a pitch modification algorithm and said pitch period results in a short-time signal comprising:
presetting a modification factor α for modifying the speech signal;
changing the fundamental frequency of the speech signal in accordance with the modification factor α while keeping the pitch period unchanged;
the change of the fundamental frequency obtains a plurality of voice signals in adjacent time intervals, and the voice signals in the adjacent time intervals are the short-time signals.
10. A computer readable storage medium having stored thereon a valid speech smart detection program executable by one or more processors to perform the steps of a valid speech smart detection method according to any one of claims 1 to 5.
CN202010029673.5A 2020-01-10 2020-01-10 Method and device for intelligent detection of effective voice and computer readable storage medium Active CN111243609B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010029673.5A CN111243609B (en) 2020-01-10 2020-01-10 Method and device for intelligent detection of effective voice and computer readable storage medium
PCT/CN2020/112351 WO2021139182A1 (en) 2020-01-10 2020-08-31 Effective intelligent voice detection method and apparatus, device and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010029673.5A CN111243609B (en) 2020-01-10 2020-01-10 Method and device for intelligent detection of effective voice and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111243609A true CN111243609A (en) 2020-06-05
CN111243609B CN111243609B (en) 2023-07-14

Family

ID=70880476

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010029673.5A Active CN111243609B (en) 2020-01-10 2020-01-10 Method and device for intelligent detection of effective voice and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN111243609B (en)
WO (1) WO2021139182A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021139182A1 (en) * 2020-01-10 2021-07-15 平安科技(深圳)有限公司 Effective intelligent voice detection method and apparatus, device and computer-readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107481728A (en) * 2017-09-29 2017-12-15 百度在线网络技术(北京)有限公司 Background sound removing method, device and terminal device
US20190074028A1 (en) * 2017-09-01 2019-03-07 Newton Howard Real-time vocal features extraction for automated emotional or mental state assessment
CN110136743A (en) * 2019-04-04 2019-08-16 平安科技(深圳)有限公司 Monitoring method of health state, device and storage medium based on sound collection
CN110211574A (en) * 2019-06-03 2019-09-06 哈尔滨工业大学 Speech recognition modeling method for building up based on bottleneck characteristic and multiple dimensioned bull attention mechanism
CN110246506A (en) * 2019-05-29 2019-09-17 平安科技(深圳)有限公司 Voice intelligent detecting method, device and computer readable storage medium
US20190318725A1 (en) * 2018-04-13 2019-10-17 Mitsubishi Electric Research Laboratories, Inc. Methods and Systems for Recognizing Simultaneous Speech by Multiple Speakers

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106920545B (en) * 2017-03-21 2020-07-28 百度在线网络技术(北京)有限公司 Speech feature extraction method and device based on artificial intelligence
CN111243609B (en) * 2020-01-10 2023-07-14 平安科技(深圳)有限公司 Method and device for intelligent detection of effective voice and computer readable storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190074028A1 (en) * 2017-09-01 2019-03-07 Newton Howard Real-time vocal features extraction for automated emotional or mental state assessment
CN107481728A (en) * 2017-09-29 2017-12-15 百度在线网络技术(北京)有限公司 Background sound removing method, device and terminal device
US20190318725A1 (en) * 2018-04-13 2019-10-17 Mitsubishi Electric Research Laboratories, Inc. Methods and Systems for Recognizing Simultaneous Speech by Multiple Speakers
CN110136743A (en) * 2019-04-04 2019-08-16 平安科技(深圳)有限公司 Monitoring method of health state, device and storage medium based on sound collection
CN110246506A (en) * 2019-05-29 2019-09-17 平安科技(深圳)有限公司 Voice intelligent detecting method, device and computer readable storage medium
CN110211574A (en) * 2019-06-03 2019-09-06 哈尔滨工业大学 Speech recognition modeling method for building up based on bottleneck characteristic and multiple dimensioned bull attention mechanism

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021139182A1 (en) * 2020-01-10 2021-07-15 平安科技(深圳)有限公司 Effective intelligent voice detection method and apparatus, device and computer-readable storage medium

Also Published As

Publication number Publication date
CN111243609B (en) 2023-07-14
WO2021139182A1 (en) 2021-07-15

Similar Documents

Publication Publication Date Title
US10515627B2 (en) Method and apparatus of building acoustic feature extracting model, and acoustic feature extracting method and apparatus
CN111402891B (en) Speech recognition method, device, equipment and storage medium
CN112071322B (en) End-to-end voiceprint recognition method, device, storage medium and equipment
CN108428446A (en) Audio recognition method and device
CN110277088B (en) Intelligent voice recognition method, intelligent voice recognition device and computer readable storage medium
CN109165563B (en) Pedestrian re-identification method and apparatus, electronic device, storage medium, and program product
CN112562691A (en) Voiceprint recognition method and device, computer equipment and storage medium
WO2020253051A1 (en) Lip language recognition method and apparatus
CN109658921B (en) Voice signal processing method, equipment and computer readable storage medium
CN112767917B (en) Speech recognition method, apparatus and storage medium
CN112967725A (en) Voice conversation data processing method and device, computer equipment and storage medium
CN111429914B (en) Microphone control method, electronic device and computer readable storage medium
WO2020238046A1 (en) Human voice smart detection method and apparatus, and computer readable storage medium
CN112328761A (en) Intention label setting method and device, computer equipment and storage medium
CN112632244A (en) Man-machine conversation optimization method and device, computer equipment and storage medium
CN112418059A (en) Emotion recognition method and device, computer equipment and storage medium
CN113314150A (en) Emotion recognition method and device based on voice data and storage medium
CN107862058A (en) Method and apparatus for generating information
CN110837546A (en) Hidden head pair generation method, device, equipment and medium based on artificial intelligence
CN110570844B (en) Speech emotion recognition method, device and computer readable storage medium
CN111221942B (en) Intelligent text dialogue generation method and device and computer readable storage medium
CN114694255A (en) Sentence-level lip language identification method based on channel attention and time convolution network
CN111243609A (en) Method and device for intelligently detecting effective voice and computer readable storage medium
CN112017638A (en) Voice semantic recognition model construction method, semantic recognition method, device and equipment
CN111326142A (en) Text information extraction method and system based on voice-to-text and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant