CN111243609A - Method and device for intelligently detecting effective voice and computer readable storage medium - Google Patents

Method and device for intelligently detecting effective voice and computer readable storage medium Download PDF

Info

Publication number
CN111243609A
Authority
CN
China
Prior art keywords
voice
speech
noise
training
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010029673.5A
Other languages
Chinese (zh)
Other versions
CN111243609B (en)
Inventor
Ma Kun
Liu Weiwei
Zhao Zhiyan
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202010029673.5A priority Critical patent/CN111243609B/en
Publication of CN111243609A publication Critical patent/CN111243609A/en
Priority to PCT/CN2020/112351 priority patent/WO2021139182A1/en
Application granted granted Critical
Publication of CN111243609B publication Critical patent/CN111243609B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16 - Vocoder architecture
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/70 - Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention relates to artificial intelligence technology and discloses an intelligent effective voice detection method comprising the following steps: receiving a noise set and a pure human voice set; performing a voice fusion operation on the noise set and the pure human voice set according to a speech autocorrelation function to obtain a human voice set and a label set; inputting the human voice set into a pre-constructed voice coding network for an encoding operation to obtain an encoded human voice set; inputting the encoded human voice set into a voice attention network for training to obtain a trained voice attention network; and receiving a voice set input by a user and inputting it sequentially into the voice coding network and the trained voice attention network to obtain an effective voice detection result for that voice set. The invention also provides an intelligent effective voice detection device and a computer readable storage medium. The invention can realize an accurate and efficient intelligent effective voice detection function.

Description

Method and device for intelligently detecting effective voice and computer readable storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a method and device for intelligently detecting effective voice and a computer readable storage medium.
Background
Voiceprint recognition, an important application in the field of artificial intelligence, refers to identifying a speaker from his or her voice. Voiceprint recognition mainly comprises three steps: voice preprocessing, effective voice detection, and speaker recognition. The effective voice detection stage is particularly important because the audio must be judged to determine whether it is effective voice or ineffective noise. Most traditional voice activity detection (VAD) algorithms are based on zero-crossing-rate features, energy features, or combinations of various acoustic features, and adopt statistical models such as GMMs and HMMs or support vector machine (SVM) algorithms to distinguish effective voice from ineffective audio (such as noise).
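As an illustration of the feature-based approach described above, the following is a minimal Python sketch (not taken from the patent) that computes the two classic frame-level features, short-time energy and zero-crossing rate, which such VAD systems feed to a GMM/HMM or SVM classifier; the frame length and hop size are illustrative assumptions corresponding to 25 ms / 10 ms frames at a 16 kHz sampling rate.

```python
import numpy as np

def frame_features(signal, frame_len=400, hop=160):
    """Short-time energy and zero-crossing rate per frame.

    frame_len=400 and hop=160 correspond to 25 ms / 10 ms frames at a
    16 kHz sampling rate; both values are illustrative assumptions.
    """
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    energy = np.empty(n_frames)
    zcr = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len].astype(np.float64)
        energy[i] = np.sum(frame ** 2)                         # short-time energy
        zcr[i] = np.mean(np.abs(np.diff(np.sign(frame))) > 0)  # sign changes
    return energy, zcr
```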
Disclosure of Invention
The invention provides a method and device for intelligently detecting effective voice and a computer readable storage medium, with the main aim of performing intelligent effective voice detection on voice data input by a user.
In order to achieve the above object, the present invention provides an effective voice intelligent detection method, which comprises:
receiving a noise set and a pure human voice set, and performing a voice fusion operation on the noise set and the pure human voice set according to a speech autocorrelation function to obtain a human voice set and a tag set;
inputting the voice set into a pre-constructed voice coding network to carry out coding operation to obtain a coded voice set;
inputting the coded voice set into a voice attention network for training to obtain a training value, inputting the training value and the label set into a pre-constructed objective function for calculation to obtain a target value, if the target value is greater than a preset threshold value, continuing training by the voice attention network, and if the target value is less than the preset threshold value, quitting training by the voice attention network to obtain a trained voice attention network;
and receiving a voice set input by a user, and sequentially inputting the voice set input by the user to the voice coding network and the trained voice attention network to obtain an effective voice detection result of the voice set.
Optionally, performing the fusion operation on the noise set and the pure human voice set to obtain the human voice set comprises:
extracting pitch periods of the noise set and the pure human voice set based on an autocorrelation function;
and modifying the voice signals of the noise set and the pure voice set according to a pitch modification algorithm and the pitch period to obtain a short-time signal, and synthesizing the short-time signal into new voice to obtain the voice set.
Optionally, the autocorrelation function is:
R_m(k) = Σ_{n=0}^{N-1-k} x_ω(n) · x_ω(n+k)
where R_m(k) is the autocorrelation function, x_ω(n) denotes the speech signals of the noise set and the pure human voice set, n denotes a speech signal segment, ω denotes the truncation period of the speech signal segment, m denotes the window function, N denotes the speech signal length of the noise set and the pure human voice set, and k denotes the pitch period, measured as the similarity between a speech signal delayed by k points and the original signal.
Modifying the speech signals of the noise set and the pure human voice set according to the pitch modification algorithm and the pitch period to obtain the short-time signals comprises:
presetting a modification factor α for modifying the speech signal;
changing the fundamental frequency of the speech signal in accordance with the modification factor α while keeping the pitch period unchanged;
the change of the fundamental frequency obtains a plurality of voice signals in adjacent time intervals, and the voice signals in the adjacent time intervals are the short-time signals.
Optionally, inputting the encoded human voice set into a voice attention network for training to obtain a training value comprises:
the input layer of the voice attention network receives the coded voice set and decodes the coded voice set to obtain a decoded voice set;
inputting the decoded voice set into a hidden layer of the voice attention network for perception calculation and inputting the decoded voice set into an output layer of the voice attention network;
and the output layer carries out attention mechanism calculation to obtain the training value.
In addition, in order to achieve the above object, the present invention further provides an apparatus for intelligently detecting valid voices, including a memory and a processor, where the memory stores a valid voice intelligent detection program operable on the processor, and the processor executes the valid voice intelligent detection program to implement the following steps:
receiving a noise set and a pure human voice set, and performing a voice fusion operation on the noise set and the pure human voice set according to a speech autocorrelation function to obtain a human voice set and a tag set;
inputting the voice set into a pre-constructed voice coding network to carry out coding operation to obtain a coded voice set;
inputting the coded voice set into a voice attention network for training to obtain a training value, inputting the training value and the label set into a pre-constructed objective function for calculation to obtain a target value, if the target value is greater than a preset threshold value, continuing training by the voice attention network, and if the target value is less than the preset threshold value, quitting training by the voice attention network to obtain a trained voice attention network;
and receiving a voice set input by a user, and sequentially inputting the voice set input by the user to the voice coding network and the trained voice attention network to obtain an effective voice detection result of the voice set.
Optionally, performing the fusion operation on the noise set and the pure human voice set to obtain the human voice set comprises:
extracting pitch periods of the noise set and the pure human voice set based on an autocorrelation function;
and modifying the voice signals of the noise set and the pure voice set according to a pitch modification algorithm and the pitch period to obtain a short-time signal, and synthesizing the short-time signal into new voice to obtain the voice set.
Optionally, the autocorrelation function is:
R_m(k) = Σ_{n=0}^{N-1-k} x_ω(n) · x_ω(n+k)
where R_m(k) is the autocorrelation function, x_ω(n) denotes the speech signals of the noise set and the pure human voice set, n denotes a speech signal segment, ω denotes the truncation period of the speech signal segment, m denotes the window function, N denotes the speech signal length of the noise set and the pure human voice set, and k denotes the pitch period, measured as the similarity between a speech signal delayed by k points and the original signal.
Modifying the speech signals of the noise set and the pure human voice set according to the pitch modification algorithm and the pitch period to obtain the short-time signals comprises:
presetting a modification factor α for modifying the speech signal;
changing the fundamental frequency of the speech signal in accordance with the modification factor α while keeping the pitch period unchanged;
the change of the fundamental frequency obtains a plurality of voice signals in adjacent time intervals, and the voice signals in the adjacent time intervals are the short-time signals.
In addition, to achieve the above object, the present invention also provides a computer readable storage medium, which stores thereon an effective voice smart detection program, which is executable by one or more processors to implement the steps of the effective voice smart detection method as described above.
By performing the voice fusion operation on the noise set and the pure human voice set through the speech autocorrelation function to obtain the human voice set and the label set, the invention enriches the voice background and improves the learning capability of the subsequent voice attention network. The encoding operation based on the pre-constructed voice coding network effectively improves the feature extraction capability for voice, and inputting the encoded human voice set into the voice attention network for training improves the network's discrimination capability. Therefore, the method and device for intelligently detecting effective voice and the computer readable storage medium provided by the invention can realize an accurate and efficient effective voice detection function.
Drawings
Fig. 1 is a schematic flow chart of an effective voice intelligent detection method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of an internal structure of an effective voice intelligent detection apparatus according to an embodiment of the present invention;
fig. 3 is a schematic block diagram of an effective speech intelligent detection program in the effective speech intelligent detection apparatus according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides an effective voice intelligent detection method. Fig. 1 is a schematic flow chart of an effective voice intelligent detection method according to an embodiment of the present invention. The method may be performed by an apparatus, which may be implemented by software and/or hardware.
In this embodiment, the effective voice intelligent detection method includes:
and S1, receiving a noise set and a pure human voice set, and performing voice fusion operation on the noise set and the pure human voice set according to a language autocorrelation function to obtain a human voice set and a label set.
Preferably, the noise set consists of background noise collected in advance from different real scenes (such as public transport venues, rooms, meeting rooms, and stadiums). The pure human voice set is obtained by constructing a noise-free background environment in advance and recording different voice data in that environment (such as the everyday speech of Zhang San and Li Si).
The label set records whether the sound in each period of the human voice set is human voice (labeled speech) or noise (labeled noise). For example, for a human voice set containing voice A, the label set describes voice A as speech in the period a-b and as noise in the period b-c.
Preferably, fusing the noise set and the pure human voice set to obtain the human voice set comprises: extracting pitch periods of the noise set and the pure human voice set based on an autocorrelation function; modifying the speech signals of the noise set and the pure human voice set according to a pitch modification algorithm and the pitch periods to obtain short-time signals; and synthesizing the short-time signals into new voice to obtain the human voice set.
Further, the autocorrelation function is:
R_m(k) = Σ_{n=0}^{N-1-k} x_ω(n) · x_ω(n+k)
where R_m(k) represents the autocorrelation function, x_ω(n) represents the speech signals of the noise set and the pure human voice set, n denotes a speech signal segment, ω denotes the truncation period of the speech signal segment, m denotes the window function, N denotes the speech signal length of the noise set and the pure human voice set, and k denotes the pitch period, measured as the similarity between a speech signal delayed by k points and the original signal.
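For concreteness, the pitch-period extraction step can be sketched in Python as below. This is a minimal illustration of the short-time autocorrelation reconstructed above, not the patent's implementation; the lag search range (roughly 50-500 Hz at a 16 kHz sampling rate) is an assumption.

```python
import numpy as np

def pitch_period(x, min_lag=32, max_lag=320):
    """Estimate the pitch period of a windowed speech segment x_w(n) as the
    lag k that maximizes the short-time autocorrelation R_m(k).

    min_lag/max_lag bound the search to roughly 50-500 Hz at 16 kHz; they
    are illustrative assumptions, not values stated in the patent.
    """
    x = x.astype(np.float64) - np.mean(x)          # remove any DC offset
    n = len(x)
    r = np.array([np.dot(x[: n - k], x[k:]) for k in range(max_lag + 1)])
    return min_lag + int(np.argmax(r[min_lag:]))   # pitch period in samples
```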
Modifying the speech signals of the noise set and the pure human voice set according to the pitch modification algorithm and the pitch period to obtain the short-time signals comprises: presetting a modification factor α for modifying the speech signal; changing the fundamental frequency of the speech signal according to the modification factor α while keeping the pitch period unchanged; and taking the speech signals over the resulting plurality of adjacent time intervals as the short-time signals.
Synthesizing the short-time signal into new sound to obtain a human voice set, wherein the calculation method adopted by the synthesis is as follows:
x̃(n) = Σ_q β_q · x_q(n) · h_q(t_q − n) / Σ_q h_q(t_q − n)

where x̃(n) is the synthesized signal that forms the human voice set, q is the speech signal segment interception period of the short-time signal, t_q is the center of the synthesis analysis window, x_q(n) is the short-time signal, h_q(t_q − n) is the analysis window, and β_q is a compensation factor that keeps the energy unchanged during synthesis.
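A minimal sketch of this synthesis step follows, implemented as weighted overlap-add in line with the formula reconstructed above. The Hann analysis window and the default compensation factors β_q = 1 are assumptions for illustration; the patent does not publish these choices.

```python
import numpy as np

def overlap_add(segments, centers, out_len, beta=None):
    """Weighted overlap-add of short-time signals x_q(n) centered at t_q.

    beta holds the energy compensation factors beta_q (default 1, an
    assumption); a Hann window stands in for the analysis window h_q.
    """
    beta = np.ones(len(segments)) if beta is None else beta
    num = np.zeros(out_len)
    den = np.full(out_len, 1e-12)              # avoid division by zero
    for b, seg, t in zip(beta, segments, centers):
        w = np.hanning(len(seg))               # assumed analysis window h_q
        start = t - len(seg) // 2
        lo, hi = max(start, 0), min(start + len(seg), out_len)
        num[lo:hi] += b * (seg * w)[lo - start : hi - start]
        den[lo:hi] += w[lo - start : hi - start]
    return num / den                           # synthesized voice signal
```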
And S2, inputting the voice set into a pre-constructed voice coding network for coding operation to obtain a coded voice set.
The voice coding network may preferably adopt a bidirectional network composed of a forward network and a backward network based on gated recurrent units (GRU); this bidirectional structure can encode voice information in both directions simultaneously and increases the amount of voice information captured.
The encoding operation comprises: receiving the human voice set (x(1), x(2), …, x(t), …, x(n)) and presetting an encoded voice set (h(1), h(2), …, h(t), …, h(n)); constructing GRU relationships between the voice set and the encoded voice set according to the forward network and the backward network to obtain a forward-network GRU relationship and a backward-network GRU relationship; and merging the forward-network GRU relationship and the backward-network GRU relationship to obtain the encoded human voice set.
Preferably, the forward network GRU relationship and the backward network GRU relationship are respectively:
h(t)_a = GRU(x(t), h(t−1))

h(t)_b = GRU(x(t), h(t+1))
where h(t)_a is the forward-network GRU relationship, h(t)_b is the backward-network GRU relationship, t indexes the voice signal, h(t−1) and h(t+1) are the respective preset encoded voices, and GRU denotes the computation of the gated recurrent unit.
Preferably, the merging is:
h(t) = GRU(h(t)_a, [h(t)_a, h(t)_b])
where h(t) is the encoded human voice set.
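A compact PyTorch sketch of such a bidirectional GRU encoder is given below. The feature dimension and hidden size are illustrative assumptions, and the framework's built-in bidirectional GRU stands in for the explicit forward/backward merge described above.

```python
import torch
import torch.nn as nn

class VoiceEncoder(nn.Module):
    """Bidirectional GRU encoder mapping a feature sequence (x(1)..x(n))
    to encoded states (h(1)..h(n)); sizes are illustrative assumptions."""

    def __init__(self, input_dim=40, hidden_dim=128):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden_dim,
                          batch_first=True, bidirectional=True)

    def forward(self, x):          # x: (batch, time, input_dim)
        h, _ = self.gru(x)         # h: (batch, time, 2 * hidden_dim)
        return h                   # forward and backward states merged

# usage: encode a batch of 2 utterances, each 100 frames of 40-dim features
encoder = VoiceEncoder()
h = encoder(torch.randn(2, 100, 40))
```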
S3, inputting the coded voice set to a voice attention network for training to obtain a training value, inputting the training value and the label set to a pre-constructed target function for calculation to obtain a target value, if the target value is larger than a preset threshold value, continuing training of the voice attention network, and if the target value is smaller than the preset threshold value, exiting training of the voice attention network to obtain a trained voice attention network.
Preferably, the voice attention network comprises an input layer, a hidden layer and an output layer.
Further, the training comprises: the input layer receives the coded voice set and decodes the coded voice set to obtain a decoded voice set, the decoded voice set is input to the hidden layer to be subjected to perception calculation and input to the output layer, and the output layer is subjected to attention mechanism calculation to obtain the training value.
Preferably, the decoding method of the decoding process is:
e_decode = ω_t · tanh(W[h(t), s_t] + b)
where e_decode represents the decoded voice set, s_t represents the decoding function corresponding to an input unit in the input layer, ω_t is the weight matrix from the input layer to the hidden layer, W is the weight of the hidden layer, and b is a bias.
Preferably, the calculation method of the perception calculation is as follows:
e_per = Σ_t α_t · e_decode
where e_per is the output value of the perception calculation and α_t is a hidden unit of the hidden layer.
Preferably, the attention mechanism is calculated as:
V_training = attention([e_per, h(t)], α_t)
where V_training is the training value and attention() is the attention mechanism calculation function.
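The decoding, perception, and attention steps can be sketched together as an additive-attention scoring layer over the encoded states h(t), as below. The exact wiring of the patent's hidden and output layers is not published, so the layer sizes and the way the perception values feed the output layer are assumptions.

```python
import torch
import torch.nn as nn

class VoiceAttention(nn.Module):
    """Additive attention over encoded states h(t) followed by a per-frame
    speech/noise score; layer sizes and wiring are assumptions."""

    def __init__(self, enc_dim=256, att_dim=128):
        super().__init__()
        self.W = nn.Linear(enc_dim, att_dim)         # hidden-layer weight W
        self.w = nn.Linear(att_dim, 1, bias=False)   # scoring vector omega_t
        self.out = nn.Linear(enc_dim + 1, 1)         # output layer

    def forward(self, h):                  # h: (batch, time, enc_dim)
        e = self.w(torch.tanh(self.W(h)))  # alignment scores, like e_decode
        alpha = torch.softmax(e, dim=1)    # attention weights over time
        per = alpha * e                    # stand-in for perception values
        logits = self.out(torch.cat([h, per], dim=-1))
        return logits.squeeze(-1)          # (batch, time) speech/noise logits
```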
Preferably, the calculation method for obtaining the target value by calculating the target function is as follows:
ℓ = −(1/N) · Σ_{n=1}^{N} log P(V_training(n) | V_label(n))

where ℓ denotes the target value, N is the number of speech signal segments, n indexes the speech signal segments, P(·) is a probability value, and V_label is the label set.
After this training, the voice attention network has the capability of recognizing whether a sound is speech or noise.
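The threshold-controlled training procedure of step S3 can be sketched as follows, with a binary cross-entropy objective standing in for the patent's probability-based target function; the threshold, learning rate, and epoch cap are illustrative assumptions.

```python
import torch
import torch.nn as nn

def train(encoder, attention, loader, threshold=0.05, max_epochs=100):
    """Train until the objective falls below the preset threshold (S3).

    Reuses the VoiceEncoder/VoiceAttention sketches above; the threshold,
    learning rate, and epoch cap are illustrative assumptions, and binary
    cross-entropy stands in for the patent's probability-based objective.
    """
    params = list(encoder.parameters()) + list(attention.parameters())
    opt = torch.optim.Adam(params, lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(max_epochs):
        total, batches = 0.0, 0
        for feats, labels in loader:       # labels: 1 = speech, 0 = noise
            logits = attention(encoder(feats))
            loss = loss_fn(logits, labels.float())
            opt.zero_grad()
            loss.backward()
            opt.step()
            total, batches = total + loss.item(), batches + 1
        if total / batches < threshold:    # target value below threshold:
            break                          # exit training as in step S3
```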
And S4, receiving a voice set input by a user, and sequentially inputting the voice set input by the user to the voice coding network and the trained voice attention network to obtain an effective voice detection result of the voice set.
For example, if a user inputs a recording of Wang Wu working in a workshop, the trained voice attention network can judge which time periods of the recording are noise and which contain non-noise voice data (such as Wang Wu speaking). As another example, in criminal investigation, a segment of speech from a crime scene may be recorded with a device such as a mobile phone; the effective voice detection technique of the invention analyzes this recording to determine which speech frames are effective voice, and the extracted effective voice is then compared against an existing voice database to identify the speaker at the scene, improving the comparison success rate.
The invention also provides an effective voice intelligent detection device. Fig. 2 is a schematic diagram of an internal structure of an effective voice intelligent detection apparatus according to an embodiment of the present invention.
In this embodiment, the effective voice intelligent detection apparatus 1 may be a PC (personal computer), a terminal device such as a smartphone, tablet computer, or portable computer, or a server. The effective voice intelligent detection apparatus 1 comprises at least a memory 11, a processor 12, a communication bus 13, and a network interface 14.
The memory 11 includes at least one type of readable storage medium, including flash memory, hard disks, multimedia cards, card-type memory (e.g., SD or DX memory), magnetic memory, magnetic disks, and optical disks. In some embodiments, the memory 11 may be an internal storage unit of the effective voice intelligent detection apparatus 1, for example a hard disk of the apparatus 1. In other embodiments, the memory 11 may be an external storage device of the apparatus 1, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, or flash card provided on the apparatus 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the apparatus 1. The memory 11 may be used not only to store application software installed in the apparatus 1 and various types of data, such as the code of the effective voice intelligent detection program 01, but also to temporarily store data that has been output or is to be output.
In some embodiments, the processor 12 may be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip, used to run the program code stored in the memory 11 or to process data, for example to execute the effective voice intelligent detection program 01.
The communication bus 13 is used to realize connection communication between these components.
The network interface 14 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), typically used to establish a communication link between the apparatus 1 and other electronic devices.
Optionally, the apparatus 1 may further comprise a user interface, which may include a display and an input unit such as a keyboard, as well as optional standard wired and wireless interfaces. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch liquid crystal display, an OLED (organic light-emitting diode) touch device, or the like. The display, which may also be called a display screen or display unit, is used to display information processed in the effective voice intelligent detection apparatus 1 and a visual user interface.
While fig. 2 only shows the effective voice intelligent detection apparatus 1 with the components 11-14 and the effective voice intelligent detection program 01, those skilled in the art will understand that the structure shown in fig. 2 does not constitute a limitation of the apparatus 1, which may include fewer or more components than shown, combine certain components, or arrange the components differently.
In the embodiment of the apparatus 1 shown in fig. 2, the memory 11 stores therein an active speech intelligent detection program 01; the following steps are implemented when the processor 12 executes the valid speech intelligent detection program 01 stored in the memory 11:
and S1, receiving a noise set and a pure human voice set, and performing voice fusion operation on the noise set and the pure human voice set according to a language autocorrelation function to obtain a human voice set and a label set.
Preferably, the noise set consists of background noise collected in advance from different real scenes (such as public transport venues, rooms, meeting rooms, and stadiums). The pure human voice set is obtained by constructing a noise-free background environment in advance and recording different voice data in that environment (such as the everyday speech of Zhang San and Li Si).
The label set records whether the sound in each period of the human voice set is human voice (labeled speech) or noise (labeled noise). For example, for a human voice set containing voice A, the label set describes voice A as speech in the period a-b and as noise in the period b-c.
Preferably, fusing the noise set and the pure human voice set to obtain the human voice set comprises: extracting pitch periods of the noise set and the pure human voice set based on an autocorrelation function; modifying the speech signals of the noise set and the pure human voice set according to a pitch modification algorithm and the pitch periods to obtain short-time signals; and synthesizing the short-time signals into new voice to obtain the human voice set.
Further, the autocorrelation function is:
R_m(k) = Σ_{n=0}^{N-1-k} x_ω(n) · x_ω(n+k)
where R_m(k) represents the autocorrelation function, x_ω(n) represents the speech signals of the noise set and the pure human voice set, n denotes a speech signal segment, ω denotes the truncation period of the speech signal segment, m denotes the window function, N denotes the speech signal length of the noise set and the pure human voice set, and k denotes the pitch period, measured as the similarity between a speech signal delayed by k points and the original signal.
Modifying the speech signals of the noise set and the pure human voice set according to the pitch modification algorithm and the pitch period to obtain the short-time signals comprises: presetting a modification factor α for modifying the speech signal; changing the fundamental frequency of the speech signal according to the modification factor α while keeping the pitch period unchanged; and taking the speech signals over the resulting plurality of adjacent time intervals as the short-time signals.
Synthesizing the short-time signal into new sound to obtain a human voice set, wherein the calculation method adopted by the synthesis is as follows:
x̃(n) = Σ_q β_q · x_q(n) · h_q(t_q − n) / Σ_q h_q(t_q − n)

where x̃(n) is the synthesized signal that forms the human voice set, q is the speech signal segment interception period of the short-time signal, t_q is the center of the synthesis analysis window, x_q(n) is the short-time signal, h_q(t_q − n) is the analysis window, and β_q is a compensation factor that keeps the energy unchanged during synthesis.
And S2, inputting the voice set into a pre-constructed voice coding network for coding operation to obtain a coded voice set.
The voice coding network may preferably adopt a bidirectional network composed of a forward network and a backward network based on gated recurrent units (GRU); this bidirectional structure can encode voice information in both directions simultaneously and increases the amount of voice information captured.
The encoding operation comprises: receiving the human voice set (x(1), x(2), …, x(t), …, x(n)) and presetting an encoded voice set (h(1), h(2), …, h(t), …, h(n)); constructing GRU relationships between the voice set and the encoded voice set according to the forward network and the backward network to obtain a forward-network GRU relationship and a backward-network GRU relationship; and merging the forward-network GRU relationship and the backward-network GRU relationship to obtain the encoded human voice set.
Preferably, the forward network GRU relationship and the backward network GRU relationship are respectively:
h(t)_a = GRU(x(t), h(t−1))

h(t)_b = GRU(x(t), h(t+1))
where h(t)_a is the forward-network GRU relationship, h(t)_b is the backward-network GRU relationship, t indexes the voice signal, h(t−1) and h(t+1) are the respective preset encoded voices, and GRU denotes the computation of the gated recurrent unit.
Preferably, the merging is:
h(t) = GRU(h(t)_a, [h(t)_a, h(t)_b])
where h(t) is the encoded human voice set.
S3, inputting the coded voice set to a voice attention network for training to obtain a training value, inputting the training value and the label set to a pre-constructed target function for calculation to obtain a target value, if the target value is larger than a preset threshold value, continuing training of the voice attention network, and if the target value is smaller than the preset threshold value, exiting training of the voice attention network to obtain a trained voice attention network.
Preferably, the voice attention network comprises an input layer, a hidden layer and an output layer.
Further, the training comprises: the input layer receives the coded voice set and decodes the coded voice set to obtain a decoded voice set, the decoded voice set is input to the hidden layer to be subjected to perception calculation and input to the output layer, and the output layer is subjected to attention mechanism calculation to obtain the training value.
Preferably, the decoding method of the decoding process is:
e_decode = ω_t · tanh(W[h(t), s_t] + b)
where e_decode represents the decoded voice set, s_t represents the decoding function corresponding to an input unit in the input layer, ω_t is the weight matrix from the input layer to the hidden layer, W is the weight of the hidden layer, and b is a bias.
Preferably, the calculation method of the perception calculation is as follows:
e_per = Σ_t α_t · e_decode
where e_per is the output value of the perception calculation and α_t is a hidden unit of the hidden layer.
Preferably, the attention mechanism is calculated as:
V_training = attention([e_per, h(t)], α_t)
where V_training is the training value and attention() is the attention mechanism calculation function.
Preferably, the calculation method for obtaining the target value by calculating the target function is as follows:
ℓ = −(1/N) · Σ_{n=1}^{N} log P(V_training(n) | V_label(n))

where ℓ denotes the target value, N is the number of speech signal segments, n indexes the speech signal segments, P(·) is a probability value, and V_label is the label set.
After this training, the voice attention network has the capability of recognizing whether a sound is speech or noise.
And S4, receiving a voice set input by a user, and sequentially inputting the voice set input by the user to the voice coding network and the trained voice attention network to obtain an effective voice detection result of the voice set.
For example, if a user inputs a recording of Wang Wu working in a workshop, the trained voice attention network can judge which time periods of the recording are noise and which contain non-noise voice data (such as Wang Wu speaking). As another example, in criminal investigation, a segment of speech from a crime scene may be recorded with a device such as a mobile phone; the effective voice detection technique of the invention analyzes this recording to determine which speech frames are effective voice, and the extracted effective voice is then compared against an existing voice database to identify the speaker at the scene, improving the comparison success rate.
Alternatively, in other embodiments, the active speech intelligent detection program may be further divided into one or more modules, and the one or more modules are stored in the memory 11 and executed by one or more processors (in this embodiment, the processor 12) to implement the present invention.
For example, referring to fig. 3, which shows a schematic diagram of the program modules of the effective voice intelligent detection program in an embodiment of the effective voice intelligent detection apparatus of the present invention, the effective voice intelligent detection program can be divided into a data receiving and fusing module 10, a voice encoding module 20, a voice attention network training module 30, and an effective voice detection module 40. Illustratively:
the data receiving and fusing module 10 is configured to: and receiving a noise set and a pure voice set, and performing voice fusion operation on the noise set and the pure voice set according to a language autocorrelation function to obtain a voice set and a tag set.
The voice encoding module 20 is configured to: input the human voice set into a pre-constructed voice coding network for an encoding operation to obtain an encoded human voice set.
The voice attention network training module 30 is configured to: input the encoded voice set into a voice attention network for training to obtain a training value, input the training value and the label set into a pre-constructed objective function for calculation to obtain a target value, continue training the voice attention network if the target value is greater than a preset threshold value, and exit training to obtain a trained voice attention network if the target value is less than the preset threshold value.
The effective voice detection module 40 is configured to: receive a voice set input by a user, and input the voice set sequentially into the voice coding network and the trained voice attention network to obtain an effective voice detection result for the voice set.
The functions or operation steps implemented by the data receiving and fusing module 10, the voice encoding module 20, the voice attention network training module 30, the effective voice detection module 40, and the other program modules when executed are substantially the same as those of the above embodiments, and are not described again here.
Furthermore, an embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium stores an active speech intelligent detection program, where the active speech intelligent detection program is executable by one or more processors to implement the following operations:
and receiving a noise set and a pure voice set, and performing voice fusion operation on the noise set and the pure voice set according to a language autocorrelation function to obtain a voice set and a tag set.
And inputting the voice set into a pre-constructed voice coding network for coding operation to obtain a coded voice set.
Inputting the coded voice set into a voice attention network for training to obtain a training value, inputting the training value and the label set into a pre-constructed objective function for calculation to obtain a target value, if the target value is greater than a preset threshold value, continuing training by the voice attention network, and if the target value is less than the preset threshold value, exiting training by the voice attention network to obtain a trained voice attention network.
And receiving a voice set input by a user, and sequentially inputting the voice set input by the user to the voice coding network and the trained voice attention network to obtain an effective voice detection result of the voice set.
It should be noted that the above-mentioned numbers of the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments. And the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, apparatus, article, or method that includes the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. An intelligent detection method for effective voice, characterized in that the method comprises:
receiving a noise set and a pure human voice set, and performing a voice fusion operation on the noise set and the pure human voice set according to a speech autocorrelation function to obtain a human voice set and a tag set;
inputting the voice set into a pre-constructed voice coding network to carry out coding operation to obtain a coded voice set;
inputting the coded voice set into a voice attention network for training to obtain a training value, inputting the training value and the label set into a pre-constructed objective function for calculation to obtain a target value, if the target value is greater than a preset threshold value, continuing training by the voice attention network, and if the target value is less than the preset threshold value, quitting training by the voice attention network to obtain a trained voice attention network;
and receiving a voice set input by a user, and sequentially inputting the voice set input by the user to the voice coding network and the trained voice attention network to obtain an effective voice detection result of the voice set.
2. The method for intelligently detecting effective voice according to claim 1, wherein fusing the noise set and the pure human voice set to obtain a human voice set comprises:
extracting pitch periods of the noise set and the pure human voice set based on an autocorrelation function;
and modifying the voice signals of the noise set and the pure voice set according to a pitch modification algorithm and the pitch period to obtain a short-time signal, and synthesizing the short-time signal into new voice to obtain the voice set.
3. The method for intelligent detection of active speech according to claim 2, wherein said autocorrelation function is:
R_m(k) = Σ_{n=0}^{N-1-k} x_ω(n) · x_ω(n+k)
where R_m(k) is the autocorrelation function, x_ω(n) represents the speech signals of the noise set and the pure human voice set, n denotes a speech signal segment, ω denotes the truncation period of the speech signal segment, m denotes the window function, N denotes the speech signal length of the noise set and the pure human voice set, and k denotes the pitch period, measured as the similarity between a speech signal delayed by k points and the original signal.
4. The method of claim 2, wherein said modifying the speech signals of the noise set and the pure human speech set according to a pitch modification algorithm and the pitch period results in a short-time signal, comprising:
presetting a modification factor α for modifying the speech signal;
changing the fundamental frequency of the speech signal in accordance with the modification factor α while keeping the pitch period unchanged;
the change of the fundamental frequency obtains a plurality of voice signals in adjacent time intervals, and the voice signals in the adjacent time intervals are the short-time signals.
5. The method according to claim 1, wherein inputting the encoded human voice set into a voice attention network for training to obtain a training value comprises:
the input layer of the voice attention network receives the coded voice set and decodes the coded voice set to obtain a decoded voice set;
inputting the decoded voice set into a hidden layer of the voice attention network for perception calculation and inputting the decoded voice set into an output layer of the voice attention network;
and the output layer carries out attention mechanism calculation to obtain the training value.
6. An active speech intelligent detection apparatus, comprising a memory and a processor, wherein the memory stores an active speech intelligent detection program operable on the processor, and the active speech intelligent detection program when executed by the processor implements the steps of:
receiving a noise set and a pure human voice set, and performing a voice fusion operation on the noise set and the pure human voice set according to a speech autocorrelation function to obtain a human voice set and a tag set;
inputting the voice set into a pre-constructed voice coding network to carry out coding operation to obtain a coded voice set;
inputting the coded voice set into a voice attention network for training to obtain a training value, inputting the training value and the label set into a pre-constructed objective function for calculation to obtain a target value, if the target value is greater than a preset threshold value, continuing training by the voice attention network, and if the target value is less than the preset threshold value, quitting training by the voice attention network to obtain a trained voice attention network;
and receiving a voice set input by a user, and sequentially inputting the voice set input by the user to the voice coding network and the trained voice attention network to obtain an effective voice detection result of the voice set.
7. The apparatus according to claim 6, wherein fusing the noise set and the pure human voice set to obtain a human voice set comprises:
extracting pitch periods of the noise set and the pure human voice set based on an autocorrelation function;
and modifying the voice signals of the noise set and the pure voice set according to a pitch modification algorithm and the pitch period to obtain a short-time signal, and synthesizing the short-time signal into new voice to obtain the voice set.
8. The active speech intelligent detection device according to claim 6, wherein the autocorrelation function is:
R_m(k) = Σ_{n=0}^{N-1-k} x_ω(n) · x_ω(n+k)
where R_m(k) is the autocorrelation function, x_ω(n) represents the speech signals of the noise set and the pure human voice set, n denotes a speech signal segment, ω denotes the truncation period of the speech signal segment, m denotes the window function, N denotes the speech signal length of the noise set and the pure human voice set, and k denotes the pitch period, measured as the similarity between a speech signal delayed by k points and the original signal.
9. The apparatus for intelligent detection of active speech according to claim 8, wherein said modifying said speech signals for said noise set and said pure human speech set according to a pitch modification algorithm and said pitch period results in a short-time signal comprising:
presetting a modification factor α for modifying the speech signal;
changing the fundamental frequency of the speech signal in accordance with the modification factor α while keeping the pitch period unchanged;
the change of the fundamental frequency obtains a plurality of voice signals in adjacent time intervals, and the voice signals in the adjacent time intervals are the short-time signals.
10. A computer readable storage medium having stored thereon a valid speech smart detection program executable by one or more processors to perform the steps of a valid speech smart detection method according to any one of claims 1 to 5.
CN202010029673.5A 2020-01-10 2020-01-10 Method and device for intelligent detection of effective voice and computer readable storage medium Active CN111243609B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010029673.5A CN111243609B (en) 2020-01-10 2020-01-10 Method and device for intelligent detection of effective voice and computer readable storage medium
PCT/CN2020/112351 WO2021139182A1 (en) 2020-01-10 2020-08-31 Effective intelligent voice detection method and apparatus, device and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010029673.5A CN111243609B (en) 2020-01-10 2020-01-10 Method and device for intelligent detection of effective voice and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111243609A true CN111243609A (en) 2020-06-05
CN111243609B CN111243609B (en) 2023-07-14

Family

ID=70880476

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010029673.5A Active CN111243609B (en) 2020-01-10 2020-01-10 Method and device for intelligent detection of effective voice and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN111243609B (en)
WO (1) WO2021139182A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021139182A1 (en) * 2020-01-10 2021-07-15 平安科技(深圳)有限公司 Effective intelligent voice detection method and apparatus, device and computer-readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107481728A (en) * 2017-09-29 2017-12-15 百度在线网络技术(北京)有限公司 Background sound removing method, device and terminal device
US20190074028A1 (en) * 2017-09-01 2019-03-07 Newton Howard Real-time vocal features extraction for automated emotional or mental state assessment
CN110136743A (en) * 2019-04-04 2019-08-16 平安科技(深圳)有限公司 Monitoring method of health state, device and storage medium based on sound collection
CN110211574A (en) * 2019-06-03 2019-09-06 哈尔滨工业大学 Speech recognition modeling method for building up based on bottleneck characteristic and multiple dimensioned bull attention mechanism
CN110246506A (en) * 2019-05-29 2019-09-17 平安科技(深圳)有限公司 Voice intelligent detecting method, device and computer readable storage medium
US20190318725A1 (en) * 2018-04-13 2019-10-17 Mitsubishi Electric Research Laboratories, Inc. Methods and Systems for Recognizing Simultaneous Speech by Multiple Speakers

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106920545B (en) * 2017-03-21 2020-07-28 百度在线网络技术(北京)有限公司 Speech feature extraction method and device based on artificial intelligence
CN111243609B (en) * 2020-01-10 2023-07-14 平安科技(深圳)有限公司 Method and device for intelligent detection of effective voice and computer readable storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190074028A1 (en) * 2017-09-01 2019-03-07 Newton Howard Real-time vocal features extraction for automated emotional or mental state assessment
CN107481728A (en) * 2017-09-29 2017-12-15 百度在线网络技术(北京)有限公司 Background sound removing method, device and terminal device
US20190318725A1 (en) * 2018-04-13 2019-10-17 Mitsubishi Electric Research Laboratories, Inc. Methods and Systems for Recognizing Simultaneous Speech by Multiple Speakers
CN110136743A (en) * 2019-04-04 2019-08-16 平安科技(深圳)有限公司 Monitoring method of health state, device and storage medium based on sound collection
CN110246506A (en) * 2019-05-29 2019-09-17 平安科技(深圳)有限公司 Voice intelligent detecting method, device and computer readable storage medium
CN110211574A (en) * 2019-06-03 2019-09-06 哈尔滨工业大学 Speech recognition modeling method for building up based on bottleneck characteristic and multiple dimensioned bull attention mechanism

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021139182A1 (en) * 2020-01-10 2021-07-15 平安科技(深圳)有限公司 Effective intelligent voice detection method and apparatus, device and computer-readable storage medium

Also Published As

Publication number Publication date
CN111243609B (en) 2023-07-14
WO2021139182A1 (en) 2021-07-15

Similar Documents

Publication Publication Date Title
US10515627B2 (en) Method and apparatus of building acoustic feature extracting model, and acoustic feature extracting method and apparatus
CN111402891B (en) Speech recognition method, device, equipment and storage medium
CN112071322B (en) End-to-end voiceprint recognition method, device, storage medium and equipment
CN108428446A (en) Audio recognition method and device
CN110277088B (en) Intelligent voice recognition method, intelligent voice recognition device and computer readable storage medium
CN109165563B (en) Pedestrian re-identification method and apparatus, electronic device, storage medium, and program product
CN112562691A (en) Voiceprint recognition method and device, computer equipment and storage medium
WO2020253051A1 (en) Lip language recognition method and apparatus
CN109658921B (en) Voice signal processing method, equipment and computer readable storage medium
CN112767917B (en) Speech recognition method, apparatus and storage medium
CN112967725A (en) Voice conversation data processing method and device, computer equipment and storage medium
CN111429914B (en) Microphone control method, electronic device and computer readable storage medium
WO2020238046A1 (en) Human voice smart detection method and apparatus, and computer readable storage medium
CN112328761A (en) Intention label setting method and device, computer equipment and storage medium
CN112632244A (en) Man-machine conversation optimization method and device, computer equipment and storage medium
CN112418059A (en) Emotion recognition method and device, computer equipment and storage medium
CN113314150A (en) Emotion recognition method and device based on voice data and storage medium
CN107862058A (en) Method and apparatus for generating information
CN110837546A (en) Hidden head pair generation method, device, equipment and medium based on artificial intelligence
CN110570844B (en) Speech emotion recognition method, device and computer readable storage medium
CN111221942B (en) Intelligent text dialogue generation method and device and computer readable storage medium
CN114694255A (en) Sentence-level lip language identification method based on channel attention and time convolution network
CN111243609A (en) Method and device for intelligently detecting effective voice and computer readable storage medium
CN112017638A (en) Voice semantic recognition model construction method, semantic recognition method, device and equipment
CN111326142A (en) Text information extraction method and system based on voice-to-text and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant