CN111863023A - Voice detection method and device, computer equipment and storage medium - Google Patents

Voice detection method and device, computer equipment and storage medium

Info

Publication number
CN111863023A
Authority
CN
China
Prior art keywords
audio data
waveform
waveform width
segments
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010999871.4A
Other languages
Chinese (zh)
Other versions
CN111863023B (en)
Inventor
丁俊豪
彭子娇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Digital Miracle Technology Co ltd
Voiceai Technologies Co ltd
Original Assignee
Voiceai Technologies Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Voiceai Technologies Co ltd
Priority to CN202010999871.4A
Publication of CN111863023A
Application granted
Publication of CN111863023B
Legal status: Active
Anticipated expiration



Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing For Digital Recording And Reproducing (AREA)

Abstract

The application relates to a voice detection method, a voice detection device, computer equipment and a storage medium. The method comprises the following steps: acquiring audio data, and extracting waveform characteristics of the audio data to obtain a waveform width characteristic sequence of the audio data; acquiring a sliding overlapping window corresponding to the waveform width characteristic sequence, and performing matching detection according to the waveform width characteristic of the sliding overlapping window; when at least one group of characteristic sequence segments with the same waveform width exist in the waveform width characteristic sequence of the audio data obtained through sliding overlapping window detection, determining the position information of each group of characteristic sequence segments with the same waveform width; in the audio data, determining corresponding groups of audio data segments according to the position information of the groups of characteristic sequence segments with the same waveform width; and respectively verifying each group of audio data fragments, and taking the audio data fragments successfully verified as voice copy fragments in the audio data. The method can improve the detection efficiency of the voice copy segment.

Description

Voice detection method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of audio processing technologies, and in particular, to a method and an apparatus for voice detection, a computer device, and a storage medium.
Background
With the development of digital audio technology, it has become easier for people to modify audio data, and copying and pasting audio clips is one of the simplest ways to do so. Some lawbreakers use this method to maliciously tamper with audio and forge recorded evidence, which makes audio forensics more difficult for case handlers and seriously obstructs justice. Therefore, in such scenarios it is important to perform copy detection on audio data.
To ensure detection accuracy, traditional voice copy detection methods must perform exhaustive copy-segment matching over all voice sample data when no prior information about the copied segments is available. The amount of computation is huge, and for longer audio in particular the detection is time-consuming, resulting in low detection efficiency.
Disclosure of Invention
In view of the above, it is necessary to provide a voice detection method, apparatus, computer device and storage medium capable of improving detection efficiency.
A method of speech detection, the method comprising:
acquiring audio data, and extracting waveform characteristics of the audio data to obtain a waveform width characteristic sequence of the audio data;
acquiring a sliding overlapping window corresponding to the waveform width characteristic sequence, and performing matching detection according to the waveform width characteristic of the sliding overlapping window;
when at least one group of characteristic sequence segments with the same waveform width exist in the waveform width characteristic sequence of the audio data detected by the sliding overlapping window, determining the position information of each group of characteristic sequence segments with the same waveform width;
in the audio data, determining corresponding groups of audio data fragments according to the position information of the groups of characteristic sequence fragments with the same waveform width;
and respectively verifying each group of audio data fragments, and taking the audio data fragments successfully verified as voice copy fragments in the audio data.
In one embodiment, the extracting the waveform feature of the audio data to obtain a waveform width feature sequence of the audio data includes:
dividing the audio data into sub-waveforms according to the values and the continuity of the sampling points of the audio data, wherein the waveform width of each sub-waveform is defined according to the number of the sampling points, and the waveform direction is defined according to the values of the sampling points, and comprises a positive waveform and a negative waveform;
counting the number of sampling points corresponding to each sub-waveform to obtain the waveform width characteristic corresponding to each sub-waveform;
and obtaining the waveform width characteristic sequence according to the waveform direction of each sub-waveform and the waveform width characteristic corresponding to each sub-waveform, wherein the waveform width characteristic sequence comprises a positive waveform width characteristic sequence, a negative waveform width characteristic sequence and a bidirectional waveform width characteristic sequence.
In one embodiment, the extracting the waveform feature of the audio data to obtain a waveform width feature sequence of the audio data includes:
extracting a plurality of forward waveforms from the audio data according to each sampling point of which the value is greater than a preset threshold value of the forward waveform, and counting the number of sampling points corresponding to each forward waveform to obtain a forward waveform width characteristic sequence;
extracting a plurality of negative waveforms from the audio data according to each sampling point of which the value is smaller than a preset threshold value of the negative waveform, and counting the number of sampling points corresponding to each negative waveform to obtain a negative waveform width characteristic sequence;
and counting the number of sampling points corresponding to each positive waveform and the number of sampling points corresponding to each negative waveform to obtain a bidirectional waveform width characteristic sequence.
In one embodiment, the obtaining a sliding overlapping window corresponding to the waveform width feature sequence, and performing matching detection according to the waveform width feature of the sliding overlapping window includes:
acquiring a waveform width characteristic copy sequence; the waveform width feature replication sequence and the waveform width feature sequence are the same sequence;
connecting the waveform width characteristic sequence and the waveform width characteristic copy sequence end to end, starting to slide in opposite directions, and taking an overlapped area of the waveform width characteristic sequence and the waveform width characteristic copy sequence in the sliding process as a sliding overlapped window;
in a current sliding overlapping window, calculating the difference value between a first sub-feature sequence corresponding to the waveform width feature sequence and a second sub-feature sequence corresponding to the waveform width feature copy sequence to obtain a waveform width feature difference value sequence corresponding to the current sliding overlapping window;
and acquiring a first sub-characteristic sequence segment and a second sub-characteristic sequence segment corresponding to the segment position which accords with a preset difference value in the waveform width characteristic difference value sequence, and taking the first sub-characteristic sequence segment and the second sub-characteristic sequence segment as the same waveform width characteristic sequence segment.
In one embodiment, the verifying the groups of audio data segments separately, and taking the successfully verified audio data segment as a voice copy segment in the audio data includes:
in the same group of audio data segments, when the value of each sampling point to be matched of the current audio segment is correspondingly equal to the value of each sampling point to be matched of other audio segments, determining that the current audio segment and the other audio segments are a group of voice copy segments, and taking each group of voice copy segments as the voice copy segments in the audio data.
In one embodiment, the verifying the groups of audio data segments separately and taking the successfully verified audio data segment as a voice copy segment in the audio data includes:
in the same group of audio data segments, when the value of each sampling point to be matched of the current audio segment is in a proportional relation with the value of each sampling point to be matched of other audio segments, determining that the current audio segment and the other audio segments are a group of voice copy segments, and taking each group of voice copy segments as the voice copy segments in the audio data.
In one embodiment, after determining that the current audio segment and the other audio segments are a set of voice duplicate segments, the method further comprises:
acquiring adjacent sampling points of the current audio clip and adjacent sampling points of other audio clips in the audio data;
correspondingly matching adjacent sampling points of the current audio clip with adjacent sampling points of the other audio clips;
when the matching is successful, combining the current audio clip and the adjacent sampling point of the current audio clip to obtain an expanded current audio clip, and combining the other audio clips and the adjacent sampling points of the other audio clips to obtain other expanded audio clips;
and taking the expanded current audio segment and the expanded other audio segments as a group of voice copy segments.
A speech detection apparatus, the apparatus comprising:
the characteristic extraction module is used for acquiring audio data and extracting waveform characteristics of the audio data to obtain a waveform width characteristic sequence of the audio data;
the characteristic matching module is used for acquiring a sliding overlapping window corresponding to the waveform width characteristic sequence and carrying out matching detection according to the waveform width characteristics of the sliding overlapping window; when at least one group of characteristic sequence segments with the same waveform width exist in the waveform width characteristic sequence of the audio data detected by the sliding overlapping window, determining the position information of each group of characteristic sequence segments with the same waveform width;
an audio segment extracting module, configured to determine, in the audio data, each corresponding group of audio data segments according to the position information of each group of feature sequence segments with the same waveform width;
and the audio segment matching module is used for respectively verifying each group of audio data segments and taking the audio data segments which are successfully verified as voice copy segments in the audio data.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring audio data, and extracting waveform characteristics of the audio data to obtain a waveform width characteristic sequence of the audio data;
acquiring a sliding overlapping window corresponding to the waveform width characteristic sequence, and performing matching detection according to the waveform width characteristic of the sliding overlapping window;
when at least one group of characteristic sequence segments with the same waveform width exist in the waveform width characteristic sequence of the audio data detected by the sliding overlapping window, determining the position information of each group of characteristic sequence segments with the same waveform width;
in the audio data, determining corresponding groups of audio data fragments according to the position information of the groups of characteristic sequence fragments with the same waveform width;
and respectively verifying each group of audio data fragments, and taking the audio data fragments successfully verified as voice copy fragments in the audio data.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring audio data, and extracting waveform characteristics of the audio data to obtain a waveform width characteristic sequence of the audio data;
acquiring a sliding overlapping window corresponding to the waveform width characteristic sequence, and performing matching detection according to the waveform width characteristic of the sliding overlapping window;
when at least one group of characteristic sequence segments with the same waveform width exist in the waveform width characteristic sequence of the audio data detected by the sliding overlapping window, determining the position information of each group of characteristic sequence segments with the same waveform width;
in the audio data, determining corresponding groups of audio data fragments according to the position information of the groups of characteristic sequence fragments with the same waveform width;
and respectively verifying each group of audio data fragments, and taking the audio data fragments successfully verified as voice copy fragments in the audio data.
According to the voice detection method, the voice detection device, the computer equipment and the storage medium, the audio data are obtained, and the waveform feature extraction is carried out on the audio data to obtain the waveform width feature sequence of the audio data; acquiring a sliding overlapping window corresponding to the waveform width characteristic sequence, and performing matching detection according to the waveform width characteristic of the sliding overlapping window; when at least one group of characteristic sequence segments with the same waveform width exist in the waveform width characteristic sequence of the audio data obtained through sliding overlapping window detection, determining the position information of each group of characteristic sequence segments with the same waveform width; in the audio data, determining corresponding groups of audio data segments according to the position information of the groups of characteristic sequence segments with the same waveform width; and respectively verifying each group of audio data fragments, and taking the audio data fragments successfully verified as voice copy fragments in the audio data. Different from the traditional scheme of carrying out copy segment matching detection on all voice sampling data, the method carries out pre-detection on the audio data through the extracted waveform width characteristic sequence, extracts the audio segment which is probably the copy segment from the audio data, and then further verifies the audio segment, so that the detection accuracy is ensured, meanwhile, the calculation amount of matching detection is reduced, the detection time is shortened, and the detection efficiency is improved.
Drawings
FIG. 1 is a flow diagram illustrating a method for speech detection in one embodiment;
FIG. 2 is a waveform diagram of audio data in one embodiment;
FIG. 3 is a schematic flow chart diagram illustrating a method for generating a waveform width signature sequence in one embodiment;
FIG. 4 is a schematic flow chart diagram illustrating a method for generating a waveform width signature sequence in accordance with another embodiment;
FIG. 5 is a flow diagram illustrating matching detection based on sliding overlapping windows, in accordance with an embodiment;
FIG. 6 is a diagram illustrating match detection based on sliding overlapping windows in one embodiment;
FIG. 7 is a flowchart illustrating a method for verifying audio data segments according to an embodiment;
FIG. 8 is a flow chart illustrating a voice detection method according to another embodiment;
FIG. 9 is a block diagram showing the structure of a speech detecting apparatus according to an embodiment;
FIG. 10 is a diagram showing an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In an embodiment, as shown in fig. 1, a voice detection method is provided, and this embodiment is illustrated by applying this method to a terminal, and it is to be understood that this method may also be applied to a server, and may also be applied to a system including a terminal and a server, and is implemented by interaction between the terminal and the server. In this embodiment, the method includes the steps of:
and 102, acquiring audio data, and extracting waveform characteristics of the audio data to obtain a waveform width characteristic sequence of the audio data.
The waveform characteristics refer to statistical characteristics of time domain waveforms of the audio signals. The waveform width feature sequence refers to a sequence composed of waveform width features. The waveform width characteristic may be a time width value of each waveform in the audio data signal, or may be a number of sampling points in each waveform.
Specifically, the terminal may obtain the audio data through an audio acquisition device and perform waveform feature extraction on it. The waveform width feature sequence may be a waveform width feature sequence in a specific direction, such as the positive direction or the negative direction. The terminal samples the audio data to obtain the value of each sampling point. As shown in the waveform diagram of the audio data in fig. 2, the terminal may divide the waveform of the audio data signal according to the values of the sampling points: a run of sampling points whose values are continuously greater than the preset threshold of the positive waveform forms a positive waveform, and a run of sampling points whose values are continuously smaller than the preset threshold of the negative waveform forms a negative waveform. The preset thresholds that determine the waveform direction may be set flexibly according to the statistical characteristics of the sampling points of the audio data. For example, the average of all positive sampling point values multiplied by a coefficient may serve as the preset threshold of the positive waveform, and the average of all negative sampling point values multiplied by a coefficient as the preset threshold of the negative waveform; the intermediate value of all positive sampling point values multiplied by a coefficient may serve as the preset threshold of the positive waveform, and the intermediate value of all negative sampling point values multiplied by a coefficient as the preset threshold of the negative waveform; any other positive value smaller than the maximum positive sampling point value may be set as the preset threshold of the positive waveform, and any negative value larger than the minimum negative sampling point value as the preset threshold of the negative waveform; or both preset thresholds may simply be set to 0. Further, the terminal counts the number of sampling points of each waveform and uses this count as the width of the waveform, yielding the waveform width feature sequence of the audio data.
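As an illustration of the threshold choices above, the following short Python sketch derives direction thresholds from the mean of the positive and negative sampling point values multiplied by a coefficient. The function name and the coefficient value 0.1 are illustrative assumptions, not values fixed by this disclosure.

```python
# Illustrative sketch: mean-based direction thresholds (names and the
# coefficient value are assumptions, not part of the disclosure).
import numpy as np

def direction_thresholds(samples, coeff=0.1):
    samples = np.asarray(samples, dtype=float)
    pos = samples[samples > 0]
    neg = samples[samples < 0]
    pos_threshold = coeff * pos.mean() if pos.size else 0.0
    neg_threshold = coeff * neg.mean() if neg.size else 0.0
    return pos_threshold, neg_threshold   # (positive threshold, negative threshold)
```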
In one embodiment, the waveform width feature sequence may be a single forward waveform width feature sequence that includes a waveform width corresponding to each forward waveform in the audio data. For example, the waveform width feature sequence is [32, 19, …]. The waveform width feature sequence may also be an individual negative-going waveform width feature sequence that includes a waveform width corresponding to each negative-going waveform in the audio data. For example, the waveform width feature sequence is [-26, -30, …]. It will be appreciated that to distinguish the widths of the positive and negative waveforms herein, a negative sign is added to the negative waveform width, with the actual width being the absolute value of the negative number. The waveform width feature sequence may also be a bidirectional waveform width feature sequence, where the bidirectional waveform width feature sequence includes waveform widths corresponding to positive and negative waveforms in the audio data, and the waveform widths are arranged according to the order of appearance of the corresponding waveforms. For example, the bidirectional waveform width feature sequence is [32, -26, 19, -30, …].
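The following Python sketch shows one possible implementation of this feature extraction, producing the positive, negative and bidirectional waveform width feature sequences in a single pass over the sampling points. It assumes the simplest setting in which both direction thresholds are 0; all names are illustrative assumptions and not part of the original disclosure.

```python
# Illustrative sketch only: extracting positive, negative and bidirectional
# waveform width feature sequences from sampled audio.  Negative widths carry
# a minus sign so the two directions stay distinguishable, as described above.
def waveform_width_sequences(samples, pos_threshold=0.0, neg_threshold=0.0):
    positive, negative, bidirectional = [], [], []
    direction, run_length = 0, 0          # direction: +1, -1, or 0 (neither)
    for value in samples:
        d = 1 if value > pos_threshold else (-1 if value < neg_threshold else 0)
        if d == direction and d != 0:
            run_length += 1               # the current sub-waveform continues
            continue
        # close the previous sub-waveform, if any
        if direction == 1:
            positive.append(run_length)
            bidirectional.append(run_length)
        elif direction == -1:
            negative.append(-run_length)
            bidirectional.append(-run_length)
        direction, run_length = d, (1 if d != 0 else 0)
    if direction == 1:                    # close a trailing sub-waveform
        positive.append(run_length)
        bidirectional.append(run_length)
    elif direction == -1:
        negative.append(-run_length)
        bidirectional.append(-run_length)
    return positive, negative, bidirectional

# Toy signal with two positive and two negative sub-waveforms.
toy = [0.2, 0.5, 0.1, -0.3, -0.4, 0.6, 0.7, 0.2, -0.1, -0.2, -0.3]
print(waveform_width_sequences(toy))   # ([3, 3], [-2, -3], [3, -2, 3, -3])
```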
And 104, acquiring a sliding overlapping window corresponding to the waveform width characteristic sequence, and performing matching detection according to the waveform width characteristic of the sliding overlapping window.
When two identical waveform width characteristic sequences are connected end to end and slide in opposite directions, the overlapping area of the two waveform width characteristic sequences is a sliding overlapping window.
Specifically, the terminal makes two identical waveform width feature sequences end to end and slide in opposite directions, determines a sliding overlap window corresponding to the waveform width feature sequences according to an overlap region of the two waveform width feature sequences, and performs matching detection according to two waveform width feature subsequences corresponding to the sliding overlap window, wherein the waveform width feature subsequences are parts of the waveform width feature sequences in the sliding overlap window.
And 106, when at least one group of characteristic sequence segments with the same waveform width exist in the waveform width characteristic sequence of the audio data obtained through the sliding overlapping window detection, determining the position information of each group of characteristic sequence segments with the same waveform width.
A waveform width feature sequence segment is a segment of the waveform width feature sequence. A group of identical waveform width feature sequence segments contains two such segments whose waveform width feature values are equal position by position. A plurality of sliding overlapping windows is obtained during the sliding process. A single sliding overlapping window may detect one or more groups of identical waveform width feature sequence segments, and the groups it detects may be the same as or different from one another. Likewise, different sliding overlapping windows may each detect one or more groups of identical waveform width feature sequence segments, and the groups detected by different windows may be the same or different. The position information of a waveform width feature sequence segment includes at least one of its start and end positions within the waveform width feature sequence and its corresponding start and end positions in the audio data. When the position information is the corresponding start and end position in the audio data, the start position is determined by the first waveform corresponding to the segment and the end position by the last waveform corresponding to the segment.
Specifically, the terminal performs matching detection on the waveform width feature sequence of the audio data through a sliding overlapping window, determines the starting position and the ending position of each group of identical waveform width feature sequence segments in the waveform width feature sequence when detecting that at least one group of identical waveform width feature sequence segments exist in the audio data, and further determines the starting position and the ending position of each group of identical waveform width feature sequence segments in the audio data. For example, the waveform width feature sequence of the audio data is [24, 32, -25, 17, 28, 37, 24, 32, -25, 17], and a group of identical waveform width feature sequence segments [24, 32, -25, 17] is detected. The terminal can determine that the starting position of the first waveform width characteristic sequence segment [24, 32, -25, 17] in the audio data is a first waveform, and the ending position is a fourth waveform; the start position of the second waveform width signature sequence segment [24, 32, -25, 17] in the audio data is the seventh waveform and the end position is the tenth waveform. It is to be understood that when the waveform width signature sequence is a particular direction waveform width signature sequence, the start and end positions refer to the particular direction waveform of the corresponding position.
And 108, determining corresponding groups of audio data segments according to the position information of the groups of the characteristic sequence segments with the same waveform width in the audio data.
Specifically, after determining the position information of each group of feature sequence segments with the same waveform width, the terminal may extract each group of audio data segments corresponding to each group of feature sequence segments with the same waveform width from the audio data according to each group of position information. For example, the audio data segments corresponding to the starting point of the first forward waveform to the end point of the fourth forward waveform are extracted from the audio data.
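A minimal sketch of this localization step follows, assuming positive sub-waveforms delimited by a 0 threshold: it records the sample range of every positive waveform and then maps the first and last waveform indices of a matched feature-sequence segment back to a sample range. The helper names are illustrative assumptions.

```python
# Illustrative helpers (names assumed, not from the disclosure) for mapping the
# position of a matched feature-sequence segment back to a range of samples.
def positive_waveform_bounds(samples, threshold=0.0):
    """Sample index ranges (start, end) of every positive sub-waveform, in order."""
    bounds, start = [], None
    for i, v in enumerate(samples):
        if v > threshold and start is None:
            start = i                      # a positive waveform begins here
        elif v <= threshold and start is not None:
            bounds.append((start, i))      # it ends just before this sample
            start = None
    if start is not None:
        bounds.append((start, len(samples)))
    return bounds

def audio_segment_for_match(samples, first_wave, last_wave, threshold=0.0):
    """Samples covered by positive waveforms first_wave..last_wave (0-based, inclusive)."""
    bounds = positive_waveform_bounds(samples, threshold)
    start, end = bounds[first_wave][0], bounds[last_wave][1]
    return samples[start:end]

# e.g. the segment from the start of the first to the end of the fourth positive
# waveform, as in the example above: audio_segment_for_match(samples, 0, 3)
```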
And step 110, respectively checking each group of audio data fragments, and taking the audio data fragments successfully checked as voice copy fragments in the audio data.
The voice copy segment refers to a voice segment in the audio data that was obtained by copying and pasting. This covers both segments obtained by direct copy-and-paste and segments obtained by copying, scaling in equal proportion, and then pasting.
Specifically, the terminal checks each extracted group of audio data fragments respectively. And in the same group of audio data fragments, the terminal verifies the values of the corresponding sampling points among the audio data fragments. And the terminal takes each group of audio data fragments successfully verified as voice copy fragments in the audio data.
In one embodiment, when the voice copy segment is a segment obtained by direct copy-and-paste, the condition for successful verification may be that the sampling point values of the audio data segments are equal at each position. For example, a group of audio data segments includes an audio data segment A of (1, 2, 3, 4, 5) and an audio data segment B of (1, 2, 3, 4, 5); since the values of the sampling points at the same positions are equal, audio data segment A and audio data segment B can be regarded as voice copy segments in the audio data. When the voice copy segment is obtained by copying, scaling in equal proportion and then pasting, the condition for successful verification may be that the sampling point values of the audio data segments are in a fixed ratio at each position. For example, a group of audio data segments includes an audio data segment A of (1, 2, 3, 4, 5) and an audio data segment B of (2, 4, 6, 8, 10); since the sampling points at the same positions are in the same ratio, audio data segment A and audio data segment B can be regarded as voice copy segments in the audio data.
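The two verification conditions above can be sketched as follows; the tolerance used for the scaled-copy check is an assumption, and the function names are illustrative rather than part of the disclosure.

```python
# Illustrative verification checks (names and the tolerance are assumptions).
import numpy as np

def is_direct_copy(seg_a, seg_b):
    """Direct copy-paste: samples must be equal position by position."""
    return len(seg_a) == len(seg_b) and np.array_equal(seg_a, seg_b)

def is_scaled_copy(seg_a, seg_b, rel_tol=1e-6):
    """Copy, uniform scaling, then paste: one constant ratio at every position."""
    a = np.asarray(seg_a, dtype=float)
    b = np.asarray(seg_b, dtype=float)
    if a.shape != b.shape or not np.any(a):
        return False
    nz = a != 0
    ratios = b[nz] / a[nz]
    # every sample pair must share (approximately) one ratio, and zeros must align
    return np.allclose(ratios, ratios[0], rtol=rel_tol) and np.array_equal(a == 0, b == 0)

print(is_direct_copy([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))    # True
print(is_scaled_copy([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))   # True
```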
In the voice detection method, the audio data is obtained, and the waveform feature extraction is carried out on the audio data to obtain a waveform width feature sequence of the audio data; acquiring a sliding overlapping window corresponding to the waveform width characteristic sequence, and performing matching detection according to the waveform width characteristic of the sliding overlapping window; when at least one group of characteristic sequence segments with the same waveform width exist in the waveform width characteristic sequence of the audio data obtained through sliding overlapping window detection, determining the position information of each group of characteristic sequence segments with the same waveform width; in the audio data, determining corresponding groups of audio data segments according to the position information of the groups of characteristic sequence segments with the same waveform width; and respectively verifying each group of audio data fragments, and taking the audio data fragments successfully verified as voice copy fragments in the audio data. Different from the traditional scheme of carrying out copy segment matching detection on all voice sampling data, the method carries out pre-detection on the audio data through the extracted waveform width characteristic sequence, extracts the audio segment which is probably the copy segment from the audio data, and then further verifies the audio segment, so that the detection accuracy is ensured, meanwhile, the calculation amount of matching detection is reduced, the detection time is shortened, and the detection efficiency is improved.
In one embodiment, as shown in fig. 3, performing waveform feature extraction on the audio data to obtain a waveform width feature sequence of the audio data includes:
step 302, dividing the audio data into sub-waveforms according to the values and the continuity of the sampling points of the audio data, wherein each sub-waveform defines the waveform width according to the number of the sampling points, and defines the waveform direction according to the values of the sampling points, and the waveform direction comprises a positive waveform and a negative waveform.
And step 304, counting the number of sampling points corresponding to each sub-waveform to obtain the waveform width characteristic corresponding to each sub-waveform.
And step 306, obtaining a waveform width characteristic sequence according to the waveform direction of each sub-waveform and the waveform width characteristic corresponding to each sub-waveform, wherein the waveform width characteristic sequence comprises a positive waveform width characteristic sequence, a negative waveform width characteristic sequence and a bidirectional waveform width characteristic sequence.
Wherein a sub-waveform refers to a wave in the audio data signal. The waveform direction is determined by the values of the sampling points, and the waveform direction comprises a positive waveform and a negative waveform. And when the value of each sampling point in the sub-waveform is greater than the preset threshold value of the forward waveform, the waveform direction of the sub-waveform is the forward waveform. And when the value of each sampling point in the sub-waveform is smaller than the negative-direction waveform preset threshold value, the waveform direction of the sub-waveform is a negative-direction waveform.
Specifically, after acquiring the audio data, the terminal may divide the audio data into sub-waveforms according to the values of its sampling points and their continuity. As shown in fig. 2, the terminal may, for example, set the preset thresholds of the positive and negative waveforms to 0, so that sampling points whose values are continuously greater than 0 form a positive waveform and sampling points whose values are continuously less than 0 form a negative waveform. The terminal then counts the number of sampling points corresponding to each sub-waveform to obtain the waveform width feature of that sub-waveform. For example, if a sub-waveform is a positive waveform containing 25 sampling points, its waveform width feature is 25.
And the terminal obtains a waveform width characteristic sequence according to the waveform direction of each sub-waveform and the waveform width characteristic corresponding to each sub-waveform. In order to distinguish a positive waveform from a negative waveform, when the waveform is the positive waveform, a positive sign can be added in front of the number of sampling points corresponding to the sub-waveform; when the waveform is a negative waveform, a negative sign can be added in front of the number of sampling points corresponding to the sub-waveform. In other embodiments, the terminal may distinguish between positive and negative going waveforms in other ways.
In this embodiment, the waveform width feature sequence of the audio data can be extracted by the value and the number of sampling points of each sampling point of the audio data, and an audio segment that may be a voice copy segment in the audio data can be quickly detected by the waveform width feature sequence, so that the amount of calculation for detecting the voice copy segment is reduced, and the efficiency for detecting the voice copy segment is improved.
In one embodiment, as shown in fig. 4, the performing waveform feature extraction on the audio data to obtain a waveform width feature sequence of the audio data includes:
step 402, extracting a plurality of forward waveforms from the audio data according to each sampling point of which the value is greater than a preset threshold of the forward waveform, and counting the number of sampling points corresponding to each forward waveform to obtain a forward waveform width characteristic sequence.
Specifically, after the audio data is acquired, the terminal compares the value of each sampling point with a preset threshold of the forward waveform, and can extract a plurality of forward waveforms from the audio data. Namely, each sampling point continuously larger than the preset threshold value of the forward waveform in the audio data forms a forward waveform of the audio data. And the terminal counts the number of sampling points in each forward waveform, and takes the number of sampling points corresponding to each forward waveform as a forward waveform width feature sequence. For example, the forward waveform width feature sequence is [31, 23, 18, …].
And step 404, extracting a plurality of negative waveforms from the audio data according to each sampling point of which the value is smaller than the preset threshold of the negative waveform, and counting the number of sampling points corresponding to each negative waveform to obtain a negative waveform width characteristic sequence.
Specifically, after the terminal acquires the audio data, the values of the sampling points are compared with a preset negative-going waveform threshold, and a plurality of negative-going waveforms can be extracted from the audio data. That is, the sampling points continuously smaller than the preset threshold of the negative waveform in the audio data constitute a negative waveform of the audio data. And the terminal counts the number of sampling points in each negative-going waveform, and the number of sampling points corresponding to each negative-going waveform is used as the width characteristic of the negative-going waveform. For example, the negative waveform width feature sequence is [-25, -33, -28, …].
And 406, counting the number of sampling points corresponding to each positive waveform and the number of sampling points corresponding to each negative waveform to obtain a bidirectional waveform width characteristic sequence.
Specifically, after the terminal acquires the audio data, the values of the sampling points are respectively compared with the preset positive and negative waveform threshold values, and a plurality of positive waveforms and negative waveforms can be extracted from the audio data. And the terminal counts the number of sampling points in each positive waveform and each negative waveform, and the number of sampling points corresponding to each positive waveform and each negative waveform is used as the width characteristic of the bidirectional waveform. For example, the bidirectional waveform width feature sequence is [31, 25, -33, -28, 11, -15, …].
In this embodiment, the positive waveform width feature sequence, the negative waveform width feature sequence, and the bidirectional waveform width feature sequence of the audio data can be extracted and obtained by the values of the respective sampling points of the audio data and the number of the sampling points. The audio segment which is probably a voice copy segment in the audio data can be quickly detected through any waveform width characteristic sequence, the calculation amount of the voice copy segment detection is reduced, and the efficiency of the voice copy segment detection is improved.
In one embodiment, as shown in fig. 5, acquiring a sliding overlap window corresponding to a waveform width feature sequence, and performing matching detection according to the waveform width feature of the sliding overlap window includes:
step 502, obtaining a waveform width characteristic replication sequence; the waveform width feature copy sequence and the waveform width feature sequence are the same sequence.
And step 504, connecting the waveform width characteristic sequence and the waveform width characteristic copy sequence end to end, starting to slide in opposite directions, and taking an overlapping area of the waveform width characteristic sequence and the waveform width characteristic copy sequence in the sliding process as a sliding overlapping window.
Step 506, in the current sliding overlapping window, calculating a difference value between a first sub-feature sequence corresponding to the waveform width feature sequence and a second sub-feature sequence corresponding to the waveform width feature copy sequence to obtain a waveform width feature difference value sequence corresponding to the current sliding overlapping window.
And step 508, acquiring a first sub-feature sequence segment and a second sub-feature sequence segment corresponding to the segment position meeting a preset difference value in the waveform width feature difference value sequence, and taking the first sub-feature sequence segment and the second sub-feature sequence segment as the feature sequence segments with the same waveform width.
Specifically, the terminal can copy the waveform width feature sequence to obtain a waveform width feature copy sequence, connect the waveform width feature sequence and the copy sequence end to end, and slide them in opposite directions; the sliding overlap window is the overlapping region of the two sequences during sliding. Referring to fig. 6, the overlapping portion of the feature sequence and the feature sequence copy is a sliding overlapping window. Within the current sliding overlapping window, the terminal calculates the difference between the first sub-feature sequence corresponding to the waveform width feature sequence and the second sub-feature sequence corresponding to the waveform width feature copy sequence, obtaining the waveform width feature difference sequence of the current sliding overlapping window. Referring to fig. 6, the subsequence of the feature sequence within the window is subtracted from the subsequence of the feature sequence copy to obtain the feature difference sequence. The terminal then takes the first sub-feature sequence segment and the second sub-feature sequence segment corresponding to the positions that meet the preset difference value in the waveform width feature difference sequence as a group of identical waveform width feature sequence segments. Referring to fig. 6, the preset difference value may be set to 0; the all-zero segments contained in the feature difference sequence are the segments equal to the preset difference value 0, their length may be 1 or more, and the sub-sequence segments of the feature sequence and of the feature sequence copy corresponding to each all-zero segment are taken as a group of identical waveform width feature sequence segments.
It will be appreciated that multiple sliding overlapping windows may be obtained during sliding. The waveform width feature difference sequences of different sliding overlapping windows may each contain different segments that meet the preset difference value, and the difference sequence of a single sliding overlapping window may contain more than one such segment. For example, suppose the waveform width feature sequence of the audio data is [24, 32, 23, 16, 28, 37, 24, 32, 25, 17, 32, 28, 37, 15, 29, 25, 17, 32, 13]; the waveform width feature sequence A = [24, 32, 23, 16, 28, 37, 24, 32, 25, 17, 32, 28, 37, 15, 29, 25, 17, 32, 13] and the waveform width feature copy sequence B = [24, 32, 23, 16, 28, 37, 24, 32, 25, 17, 32, 28, 37, 15, 29, 25, 17, 32, 13] are connected end to end and slid toward each other. During sliding there is a sliding overlapping window in which the first [24, 32] segment of sequence A overlaps the second [24, 32] segment of copy sequence B; the corresponding segment of the waveform width feature difference sequence is the all-zero segment [0, 0], which meets the preset difference value 0, so a group of identical waveform width feature sequence segments [24, 32] can be determined to exist in that sliding overlapping window. During sliding there is also another sliding overlapping window in which the segment [28, 37, 24, 32, 25, 17, 32] of sequence A overlaps the segment [28, 37, 15, 29, 25, 17, 32] of copy sequence B; the corresponding segment of the waveform width feature difference sequence is [0, 0, 9, 3, 0, 0, 0], which contains two all-zero segments meeting the preset difference value 0: the segment [0, 0] corresponding to the waveform width feature sequence segment [28, 37], and the segment [0, 0, 0] corresponding to the waveform width feature sequence segment [25, 17, 32]. It can therefore be determined that two groups of identical waveform width feature sequence segments, [28, 37] and [25, 17, 32], exist in the current sliding overlapping window. In summary, three groups of identical waveform width feature sequence segments exist in the waveform width feature sequence of the audio data: [24, 32], [28, 37] and [25, 17, 32].
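The example above can be reproduced with a small Python sketch of the sliding-overlap-window pre-detection: every non-zero relative shift (lag) between the feature sequence and its copy corresponds to one overlap window, and runs of zeros in the per-window difference sequence mark identical feature-sequence segments. The minimum run length of 2 is an assumption made here so that the output matches the example; the disclosure allows runs of length 1 or more.

```python
# Sketch of the sliding-overlap-window pre-detection (preset difference 0).
# find_identical_segments and min_len are assumed names/values for illustration.
def find_identical_segments(widths, min_len=2):
    n = len(widths)
    groups = []  # (start of first occurrence, start of second occurrence, length)
    for lag in range(1, n):                        # one overlap window per relative shift
        diffs = [widths[i] - widths[i + lag] for i in range(n - lag)]
        run_start = None
        for i, d in enumerate(diffs + [None]):     # trailing sentinel closes an open run
            if d == 0 and run_start is None:
                run_start = i
            elif d != 0 and run_start is not None:
                if i - run_start >= min_len:
                    groups.append((run_start, run_start + lag, i - run_start))
                run_start = None
    return groups

widths = [24, 32, 23, 16, 28, 37, 24, 32, 25, 17, 32, 28, 37, 15, 29, 25, 17, 32, 13]
for first, second, length in find_identical_segments(widths):
    print(widths[first:first + length], "at positions", first, "and", second)
```

Running this on the example sequence prints the three groups [24, 32], [28, 37] and [25, 17, 32] at the positions discussed above.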
In this embodiment, matching detection is performed according to the waveform width features of the sliding overlapping window, so that the feature sequence segments with the same waveform width can be quickly extracted, and all the feature sequence segments with the same waveform width can be effectively ensured to be extracted.
In one embodiment, the verifying the sets of audio data segments separately, and taking the audio data segment successfully verified as the voice copy segment in the audio data includes: in the same group of audio data segments, when the value of each sampling point to be matched of the current audio segment is correspondingly equal to the value of each sampling point to be matched of other audio segments, determining that the current audio segment and other audio segments are a group of voice copy segments, and taking each group of voice copy segments as voice copy segments in the audio data.
Specifically, a set of audio data segments includes two audio data segments, one of which serves as the current audio data segment and the other of which serves as the other audio data segment. In the same group of audio data segments, the terminal correspondingly matches each sampling point to be matched of the two audio data segments. For example, a set of audio data segments includes an audio data segment a and an audio data segment B. The sampling points to be matched of the audio data segment a comprise (1, 2, 3, 4, 5), the sampling points to be matched of the audio data segment B comprise (1, 2, 3, 4, 5), the values of the sampling points to be matched in the same position in the audio data segment a and the audio data segment B are equal, and then the terminal can take the audio data segment a and the audio data segment B as a group of voice copy segments. And respectively verifying each group of audio data fragments to obtain a plurality of groups of voice copy fragments, and taking each group of voice copy fragments as voice copy fragments in the audio data.
In this embodiment, the audio segments that were obtained by the waveform width feature sequence pre-detection and that may be voice copy segments are further verified, which improves the accuracy of voice copy segment detection.
In one embodiment, the verifying the sets of audio data segments separately, and taking the audio data segment successfully verified as the voice copy segment in the audio data includes: in the same group of audio data segments, when the value of each sampling point to be matched of the current audio segment is in a proportional relation with the value of each sampling point to be matched of other audio segments, the current audio segment and other audio segments are determined to be a group of voice copy segments, and each group of voice copy segments are used as voice copy segments in the audio data.
Specifically, in the same group of audio data segments, the terminal matches the sampling points to be matched of the two audio data segments position by position. For example, a group of audio data segments includes an audio data segment A and an audio data segment B. The sampling points to be matched of audio data segment A are (1, 2, 3, 4, 5) and those of audio data segment B are (2, 4, 6, 8, 10); the value of each sampling point to be matched in audio data segment B is twice the value of the sampling point at the corresponding position in audio data segment A, so the terminal can take audio data segment A and audio data segment B as a group of voice copy segments. Each group of audio data segments is verified in this way to obtain the groups of voice copy segments, and each group is taken as voice copy segments in the audio data.
In one embodiment, the proportional relation between the values of the sampling points to be matched of the current audio data segment and those of the other audio data segment may also hold piecewise, that is, segment by segment. For example, the sampling points to be matched of audio data segment A are (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) and those of audio data segment B are (3, 6, 9, 12, 15, 12, 14, 16, 18, 20). For the first 5 sampling points, the value of each sampling point to be matched in segment B is 3 times the value of the sampling point at the corresponding position in segment A; for the last 5 sampling points, it is 2 times.
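For the piecewise case, a hedged sketch is to compute the per-sample ratio and report the runs over which it stays constant; how long such runs must be for the pair to count as a copy is left open here, since the disclosure does not fix it. The name and the assumption that the reference segment contains no zero-valued samples are illustrative.

```python
# Sketch for the piecewise-proportional case (function name is illustrative;
# seg_a is assumed to contain no zero-valued samples).
def ratio_segments(seg_a, seg_b):
    """Runs of constant per-sample ratio b/a, returned as (start, length, ratio)."""
    ratios = [b / a for a, b in zip(seg_a, seg_b)]
    runs, run_start = [], 0
    if not ratios:
        return runs
    for i in range(1, len(ratios)):
        if ratios[i] != ratios[i - 1]:
            runs.append((run_start, i - run_start, ratios[run_start]))
            run_start = i
    runs.append((run_start, len(ratios) - run_start, ratios[run_start]))
    return runs

a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
b = [3, 6, 9, 12, 15, 12, 14, 16, 18, 20]
print(ratio_segments(a, b))   # [(0, 5, 3.0), (5, 5, 2.0)]
```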
In this embodiment, the audio segments that were obtained by the waveform width feature sequence pre-detection and that may be voice copy segments are further verified, which improves the accuracy of voice copy segment detection.
In one embodiment, as shown in fig. 7, after determining that the current audio segment and the other audio segments are a set of voice replication segments, the method further comprises:
step 702, in the audio data, an adjacent sampling point of the current audio segment and adjacent sampling points of other audio segments are obtained.
And step 704, correspondingly matching the adjacent sampling points of the current audio clip with the adjacent sampling points of other audio clips.
Step 706, when the matching is successful, combining the current audio clip and the adjacent sampling point of the current audio clip to obtain the expanded current audio clip, and combining the other audio clips and the adjacent sampling points of the other audio clips to obtain the expanded other audio clips.
Step 708, the expanded current audio segment and the expanded other audio segments are used as a group of voice copy segments.
The adjacent sampling points of the audio segment refer to the previous adjacent sampling points and the subsequent adjacent sampling points of the audio data segment in the audio data. The preceding adjacent sample points are sample points within a certain range before the first sample point of the audio data segment in the audio data. The subsequent adjacent sample points are sample points within a certain range after the last sample point of the audio data segment in the audio data.
Specifically, in the same group of audio data segments, if the current audio data segment is successfully matched with other audio data segments, the terminal obtains a first sampling point and a subsequent adjacent first sampling point of the current audio data segment, a first sampling point and a subsequent adjacent first sampling point of other audio data segments in the audio data, verifies the first sampling point and the subsequent adjacent first sampling point of other audio data segments, and verifies the first sampling point and the subsequent adjacent first sampling point of other audio data segments. For example, if the value of the first sampling point of the previous adjacent audio data segment is the same as the value of the first sampling point of the previous adjacent audio data segment, the verification is determined to be successful.
When the first sampling point adjacent to the front of the current audio data segment and the first sampling point adjacent to the front of other audio data segments are verified successfully, the terminal continuously acquires the last sampling point adjacent to the front of the current audio data segment and the last sampling point adjacent to the front of other audio data segments in the audio data, and verifies the last sampling point adjacent to the front of the current audio data segment and the last sampling point adjacent to the front of other audio data segments. And continuously acquiring the previous adjacent sampling point of the current audio data segment and the previous adjacent sampling points of other audio data segments from the audio data, and correspondingly checking the previous adjacent sampling point of the current audio data segment and the previous adjacent sampling points of other audio data segments until the checking fails.
Similarly, when the first sampling point next to and adjacent to the current audio data segment and the first sampling point next to and adjacent to the other audio data segment are verified successfully, the terminal continues to acquire the next sampling point next to and adjacent to the current audio data segment and the next sampling point next to and adjacent to the other audio data segment in the audio data, and verifies the next sampling point next to and adjacent to the current audio data segment and the next sampling point next to and adjacent to the other audio data segment. And continuously acquiring subsequent adjacent sampling points of the current audio data segment and subsequent adjacent sampling points of other audio data segments from the audio data, and correspondingly checking the subsequent adjacent sampling points of the current audio data segment and the subsequent adjacent sampling points of the other audio data segments until the checking fails.
Further, the terminal merges the successfully verified preceding and subsequent adjacent sampling points of the current audio segment with the current audio segment, merges the successfully verified preceding and subsequent adjacent sampling points of the other audio segment with the other audio segment, and takes the merged current audio segment and the merged other audio segment as a group of voice copy segments in the audio data.
For example, assume that the audio data is (1, 6, 8, 2, 7, 10, …, -17, -8, -5, -1, 2, 8, 2, 7, 10, …, -17, -8, -6, …), the current audio data segment is (10, …, -17), the other audio data segment is (10, …, -17), and the two segments have been successfully verified. By the above method, both the expanded current audio data segment and the expanded other audio data segment are (8, 2, 7, 10, …, -17, -8), and this pair of expanded segments is taken as a group of voice copy segments in the audio data.
In this embodiment, the adjacent sampling points of the current audio segment and of the other audio segments are obtained from the audio data and verified correspondingly; the successfully verified adjacent sampling points are merged with the current audio segment and with the other audio segments, and the merged current audio segment and merged other audio segments are used as voice copy segments in the audio data. This avoids missed detections and improves the completeness of voice copy detection.
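For illustration, the following is a minimal sketch of this boundary-extension step in Python. It assumes the audio data is a plain list of sample values and that the two matched segments are given by their start positions and a common length; the function name extend_copy_segments is an assumption, and the value 3 merely stands in for the samples elided by the ellipsis in the example above.

```python
def extend_copy_segments(audio, start_a, start_b, length):
    """Grow two matched segments outward while their adjacent samples still agree."""
    # Extend to the left: compare the preceding adjacent samples of both
    # segments pair by pair until they differ or the data begins.
    left = 0
    while (start_a - left - 1 >= 0 and start_b - left - 1 >= 0
           and audio[start_a - left - 1] == audio[start_b - left - 1]):
        left += 1

    # Extend to the right: compare the subsequent adjacent samples the same way.
    right = 0
    while (start_a + length + right < len(audio)
           and start_b + length + right < len(audio)
           and audio[start_a + length + right] == audio[start_b + length + right]):
        right += 1

    seg_a = audio[start_a - left:start_a + length + right]
    seg_b = audio[start_b - left:start_b + length + right]
    return seg_a, seg_b


# Re-using the numeric example above, with 3 standing in for the elided samples:
audio = [1, 6, 8, 2, 7, 10, 3, -17, -8, -5, -1, 2, 8, 2, 7, 10, 3, -17, -8, -6]
print(extend_copy_segments(audio, start_a=5, start_b=15, length=3))
# -> ([8, 2, 7, 10, 3, -17, -8], [8, 2, 7, 10, 3, -17, -8])
```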
In a specific embodiment, as shown in fig. 8, there is provided a voice detection method comprising the steps of:
1. Feature extraction
The terminal acquires the audio data and performs waveform feature extraction on it to obtain a forward waveform width feature sequence of the audio data. Specifically, the audio data may be divided into sub-waveforms according to the values of its sampling points and their continuity; each sub-waveform has a waveform width defined by the number of sampling points it contains and a waveform direction defined by the values of those sampling points, the waveform direction being either positive or negative. Counting the sampling points of all forward waveforms in the audio data yields the forward waveform width feature sequence. A negative or bidirectional waveform width feature sequence can be obtained in the same way as required. Pre-detecting the audio data through the waveform width feature sequence greatly reduces the amount of data to be examined.
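As an illustration only, the following Python sketch extracts waveform width features under the assumption that a sub-waveform is a maximal run of consecutive sampling points on the same side of zero; the function name waveform_width_features and this zero-crossing convention are assumptions, since the description also allows preset positive and negative thresholds.

```python
def waveform_width_features(samples, direction="positive"):
    """Return the waveform width feature sequence for the chosen waveform direction."""
    def sign(v):
        return 1 if v > 0 else (-1 if v < 0 else 0)

    runs = []                      # (direction, width) for every sub-waveform
    run_sign, run_len = 0, 0
    for v in samples:
        s = sign(v)
        if s == run_sign:
            run_len += 1
        else:
            if run_sign != 0:      # close the previous positive/negative run
                runs.append((run_sign, run_len))
            run_sign, run_len = s, 1

    if run_sign != 0:              # close the final run
        runs.append((run_sign, run_len))

    if direction == "positive":
        return [w for d, w in runs if d > 0]
    if direction == "negative":
        return [w for d, w in runs if d < 0]
    return [w for d, w in runs]    # bidirectional: all widths in sequence order


# Example: positive runs (1, 6, 8, 2, 7, 10) and (2, 8) give widths 6 and 2.
print(waveform_width_features([1, 6, 8, 2, 7, 10, -17, -8, -5, -1, 2, 8]))  # -> [6, 2]
```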
2. Feature matching
The terminal performs sliding-window matching on the extracted forward waveform width feature sequence, and when at least one group of identical waveform width feature sequence segments is matched, records the start and end positions of the two feature sequence segments in each group.
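A minimal sketch of this matching step is given below; it assumes the sliding overlap window can be realized by comparing the feature sequence with a shifted copy of itself and keeping the runs where the element-wise difference stays zero. The minimum run length min_len and the returned (start, copy start, length) triples are illustrative assumptions.

```python
def find_repeated_feature_segments(features, min_len=4):
    """Return (start, copy_start, length) for runs of identical waveform width features."""
    n = len(features)
    matches = []
    for shift in range(1, n):           # each relative offset = one sliding overlap window
        overlap = n - shift             # length of the overlapping region
        run_start, run_len = None, 0
        for i in range(overlap):
            if features[i] - features[i + shift] == 0:   # zero difference at this position
                if run_start is None:
                    run_start = i
                run_len += 1
            else:
                if run_len >= min_len:
                    matches.append((run_start, run_start + shift, run_len))
                run_start, run_len = None, 0
        if run_len >= min_len:          # close a run that reaches the window edge
            matches.append((run_start, run_start + shift, run_len))
    return matches
```

Each reported match gives the start positions of the two identical feature sequence segments together with their length; step 3 then maps these feature positions back to sampling-point positions in the audio data.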
3. Replica fragment localization
According to the position information of the two identical waveform width feature sequence segments in each group, the terminal locates, in the audio data, the corresponding group of candidate voice copy segments.
4. Data verification
The terminal performs data verification on the two candidate voice copy segments in each group and determines each group of target voice copy segments, so as to ensure the accuracy of copy detection. The verification covers two cases: either the two candidate voice copy segments are identical, or they are equally scaled versions of each other.
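A minimal sketch of this verification, assuming the two candidate segments are equal-length lists of sample values; the use of math.isclose for the ratio comparison is an added assumption for floating-point safety, while the two checks themselves (identical values, or values in a constant ratio) follow the description above.

```python
from math import isclose

def verify_copy_segments(seg_a, seg_b):
    """Return True when seg_b is an identical or equally scaled copy of seg_a."""
    if len(seg_a) != len(seg_b) or not seg_a:
        return False
    # Check 1: the two candidate segments are identical.
    if seg_a == seg_b:
        return True
    # Check 2: the two candidate segments are equally scaled; derive the ratio
    # from the first pair of non-zero samples and confirm every pair keeps it.
    ratio = None
    for a, b in zip(seg_a, seg_b):
        if a == 0 or b == 0:
            if a != b:             # one sample is zero and the other is not
                return False
            continue
        if ratio is None:
            ratio = b / a
        elif not isclose(b / a, ratio):
            return False
    return ratio is not None
```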
5. Boundary extension
The terminal performs boundary extension on each group of target voice copy segments and performs data verification on the extended portions, so as to ensure the completeness of the voice copy segments.
In this embodiment, the waveform width feature sequence of the audio data is extracted, and audio segments with identical waveform width feature sequences can be found by examining that sequence. Since the waveform width feature sequence contains far less data than the audio data itself, it can be searched much faster. Examining the waveform width feature sequence effectively screens the audio data: it yields candidate audio segments that may be voice copy segments, and only those candidates are then verified precisely at the sampling-point level to determine the actual voice copy segments, which reduces the amount of precise computation and improves detection efficiency. Furthermore, adjacent sampling points of the segments determined to be voice copy segments are matched, and the voice copy segments are boundary-extended according to the matching results, further improving the completeness of voice copy detection.
It should be understood that, although the steps in the above flowcharts are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in the above flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be performed at different moments; their order of execution is not necessarily sequential, and they may be performed in turns or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 9, there is provided a voice detection apparatus including: a feature extraction module 902, a feature matching module 904, an audio segment extraction module 906, and an audio segment matching module 908, wherein:
the feature extraction module 902 is configured to obtain audio data, and perform waveform feature extraction on the audio data to obtain a waveform width feature sequence of the audio data.
A feature matching module 904, configured to obtain a sliding overlap window corresponding to the waveform width feature sequence, and perform matching detection according to the waveform width feature of the sliding overlap window; when at least one group of characteristic sequence segments with the same waveform width exist in the characteristic sequence of the waveform width of the audio data obtained through the sliding overlapping window detection, the position information of each group of characteristic sequence segments with the same waveform width is determined.
The audio segment extracting module 906 is configured to determine, in the audio data, each corresponding group of audio data segments according to the position information of each group of feature sequence segments with the same waveform width.
The audio segment matching module 908 is configured to check each group of audio data segments, and use the successfully checked audio data segment as a voice copy segment in the audio data.
In one embodiment, the feature extraction module is further configured to divide the audio data into sub-waveforms according to values of sampling points of the audio data and continuity of the sampling points, where each sub-waveform defines a waveform width according to the number of the sampling points, and defines a waveform direction according to the values of the sampling points, and the waveform direction includes a positive waveform and a negative waveform; counting the number of sampling points corresponding to each sub-waveform to obtain the waveform width characteristic corresponding to each sub-waveform; and obtaining a waveform width characteristic sequence according to the waveform direction of each sub-waveform and the waveform width characteristic corresponding to each sub-waveform, wherein the waveform width characteristic sequence comprises a positive waveform width characteristic sequence, a negative waveform width characteristic sequence and a bidirectional waveform width characteristic sequence.
In one embodiment, the feature extraction module is further configured to extract a plurality of forward waveforms from the audio data according to each sampling point whose value is greater than a preset threshold of the forward waveform, and count the number of sampling points corresponding to each forward waveform to obtain a forward waveform width feature sequence; extracting a plurality of negative waveforms from the audio data according to each sampling point of which the value is smaller than a preset threshold of the negative waveforms, and counting the number of sampling points corresponding to each negative waveform to obtain a negative waveform width characteristic sequence; and counting the number of sampling points corresponding to each positive waveform and the number of sampling points corresponding to each negative waveform to obtain a bidirectional waveform width characteristic sequence.
In one embodiment, the feature matching module is further configured to obtain a waveform width feature copy sequence; the waveform width characteristic copy sequence and the waveform width characteristic sequence are the same sequence; connecting the waveform width characteristic sequence and the waveform width characteristic copy sequence end to end, starting to slide in opposite directions, and taking an overlapped area of the waveform width characteristic sequence and the waveform width characteristic copy sequence in the sliding process as a sliding overlapped window; in the current sliding overlapping window, calculating the difference value of a first sub-feature sequence corresponding to the waveform width feature sequence and a second sub-feature sequence corresponding to the waveform width feature copy sequence to obtain a waveform width feature difference value sequence corresponding to the current sliding overlapping window; and acquiring a first sub-characteristic sequence segment and a second sub-characteristic sequence segment corresponding to the segment position which accords with a preset difference value in the waveform width characteristic difference value sequence, and taking the first sub-characteristic sequence segment and the second sub-characteristic sequence segment as the same waveform width characteristic sequence segment.
In one embodiment, the audio segment matching module is further configured to determine, in the same group of audio data segments, that a current audio segment and other audio segments are a group of voice copy segments when a value of each to-be-matched sampling point of the current audio segment is equal to a value of each to-be-matched sampling point of the other audio segments, and use each group of voice copy segments as a voice copy segment in the audio data.
In one embodiment, the audio segment matching module is further configured to determine, in the same group of audio data segments, that a current audio segment and other audio segments are a group of voice copy segments when values of respective to-be-matched sampling points of the current audio segment are in a proportional relationship with values of respective to-be-matched sampling points of the other audio segments, and use each group of voice copy segments as a voice copy segment in the audio data.
In one embodiment, the audio segment matching module is further configured to obtain, in the audio data, adjacent sampling points of the current audio segment and adjacent sampling points of other audio segments; correspondingly matching adjacent sampling points of the current audio clip with adjacent sampling points of other audio clips; when the matching is successful, combining the current audio clip and the adjacent sampling point of the current audio clip to obtain an expanded current audio clip, and combining other audio clips and the adjacent sampling points of other audio clips to obtain other expanded audio clips; and taking the expanded current audio segment and the expanded other audio segments as a group of voice copy segments.
For the specific limitations of the voice detection apparatus, reference may be made to the limitations of the voice detection method above, which are not repeated here. Each module in the voice detection apparatus may be implemented wholly or partially by software, hardware, or a combination thereof. The modules may be embedded in or independent of a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke them and perform the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a terminal or a server, and its internal structure diagram may be as shown in fig. 10. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a speech detection method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 10 is merely a block diagram of part of the structure related to the disclosed solution and does not limit the computer device to which the solution is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory storing a computer program; the processor implements the steps of the above method embodiments when executing the computer program.
in an embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, and the like. Volatile memory may include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as there is no contradiction in a combination, it should be considered to fall within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but should not therefore be construed as limiting the scope of the invention patent. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method for speech detection, the method comprising:
acquiring audio data, and extracting waveform characteristics of the audio data to obtain a waveform width characteristic sequence of the audio data;
acquiring a sliding overlapping window corresponding to the waveform width characteristic sequence, and performing matching detection according to the waveform width characteristic of the sliding overlapping window;
when at least one group of characteristic sequence segments with the same waveform width exist in the waveform width characteristic sequence of the audio data detected by the sliding overlapping window, determining the position information of each group of characteristic sequence segments with the same waveform width;
in the audio data, determining corresponding groups of audio data fragments according to the position information of the groups of characteristic sequence fragments with the same waveform width;
and respectively verifying each group of audio data fragments, and taking the audio data fragments successfully verified as voice copy fragments in the audio data.
2. The method of claim 1, wherein the performing waveform feature extraction on the audio data to obtain a waveform width feature sequence of the audio data comprises:
dividing the audio data into sub-waveforms according to the values and the continuity of the sampling points of the audio data, wherein the waveform width of each sub-waveform is defined according to the number of the sampling points, and the waveform direction is defined according to the values of the sampling points, and comprises a positive waveform and a negative waveform;
counting the number of sampling points corresponding to each sub-waveform to obtain the waveform width characteristic corresponding to each sub-waveform;
and obtaining the waveform width characteristic sequence according to the waveform direction of each sub-waveform and the waveform width characteristic corresponding to each sub-waveform, wherein the waveform width characteristic sequence comprises a positive waveform width characteristic sequence, a negative waveform width characteristic sequence and a bidirectional waveform width characteristic sequence.
3. The method of claim 2, wherein the performing waveform feature extraction on the audio data to obtain a waveform width feature sequence of the audio data comprises:
extracting a plurality of forward waveforms from the audio data according to each sampling point of which the value is greater than a preset threshold value of the forward waveform, and counting the number of sampling points corresponding to each forward waveform to obtain a forward waveform width characteristic sequence;
extracting a plurality of negative waveforms from the audio data according to each sampling point of which the value is smaller than a preset threshold of the negative waveforms, and counting the number of sampling points corresponding to each negative waveform to obtain a negative waveform width characteristic sequence;
and counting the number of sampling points corresponding to each positive waveform and the number of sampling points corresponding to each negative waveform to obtain a bidirectional waveform width characteristic sequence.
4. The method according to claim 1, wherein the obtaining of the sliding overlapping window corresponding to the waveform width feature sequence and the performing of the matching detection according to the waveform width feature of the sliding overlapping window comprises:
acquiring a waveform width characteristic copy sequence; the waveform width feature replication sequence and the waveform width feature sequence are the same sequence;
connecting the waveform width characteristic sequence and the waveform width characteristic copy sequence end to end, starting to slide in opposite directions, and taking an overlapped area of the waveform width characteristic sequence and the waveform width characteristic copy sequence in the sliding process as a sliding overlapped window;
in a current sliding overlapping window, calculating the difference value between a first sub-feature sequence corresponding to the waveform width feature sequence and a second sub-feature sequence corresponding to the waveform width feature copy sequence to obtain a waveform width feature difference value sequence corresponding to the current sliding overlapping window;
and acquiring a first sub-characteristic sequence segment and a second sub-characteristic sequence segment corresponding to the segment position which accords with a preset difference value in the waveform width characteristic difference value sequence, and taking the first sub-characteristic sequence segment and the second sub-characteristic sequence segment as the same waveform width characteristic sequence segment.
5. The method of claim 1, wherein the verifying the sets of audio data segments separately and using the successfully verified audio data segment as a voice copy segment in the audio data comprises:
in the same group of audio data segments, when the value of each sampling point to be matched of the current audio segment is correspondingly equal to the value of each sampling point to be matched of other audio segments, determining that the current audio segment and the other audio segments are a group of voice copy segments, and taking each group of voice copy segments as the voice copy segments in the audio data.
6. The method of claim 1, wherein the verifying the sets of audio data segments separately and using the successfully verified audio data segment as a voice copy segment in the audio data comprises:
in the same group of audio data segments, when the value of each sampling point to be matched of the current audio segment is in a proportional relation with the value of each sampling point to be matched of other audio segments, determining that the current audio segment and the other audio segments are a group of voice copy segments, and taking each group of voice copy segments as the voice copy segments in the audio data.
7. The method of any of claims 5 to 6, wherein after determining that the current audio segment and the other audio segments are a set of voice replication segments, the method further comprises:
acquiring adjacent sampling points of the current audio clip and adjacent sampling points of other audio clips in the audio data;
correspondingly matching adjacent sampling points of the current audio clip with adjacent sampling points of the other audio clips;
when the matching is successful, combining the current audio clip and the adjacent sampling point of the current audio clip to obtain an expanded current audio clip, and combining the other audio clips and the adjacent sampling points of the other audio clips to obtain other expanded audio clips;
and taking the expanded current audio segment and the expanded other audio segments as a group of voice copy segments.
8. A speech detection apparatus, characterized in that the apparatus comprises:
the characteristic extraction module is used for acquiring audio data and extracting waveform characteristics of the audio data to obtain a waveform width characteristic sequence of the audio data;
the characteristic matching module is used for acquiring a sliding overlapping window corresponding to the waveform width characteristic sequence and carrying out matching detection according to the waveform width characteristics of the sliding overlapping window; when at least one group of characteristic sequence segments with the same waveform width exist in the waveform width characteristic sequence of the audio data detected by the sliding overlapping window, determining the position information of each group of characteristic sequence segments with the same waveform width;
an audio segment extracting module, configured to determine, in the audio data, each corresponding group of audio data segments according to the position information of each group of feature sequence segments with the same waveform width;
and the audio segment matching module is used for respectively verifying each group of audio data segments and taking the audio data segments which are successfully verified as voice copy segments in the audio data.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202010999871.4A 2020-09-22 2020-09-22 Voice detection method and device, computer equipment and storage medium Active CN111863023B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010999871.4A CN111863023B (en) 2020-09-22 2020-09-22 Voice detection method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111863023A true CN111863023A (en) 2020-10-30
CN111863023B CN111863023B (en) 2021-01-08

Family

ID=72968534

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010999871.4A Active CN111863023B (en) 2020-09-22 2020-09-22 Voice detection method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111863023B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090089326A1 (en) * 2007-09-28 2009-04-02 Yahoo!, Inc. Method and apparatus for providing multimedia content optimization
CN102024033A (en) * 2010-12-01 2011-04-20 北京邮电大学 Method for automatically detecting audio templates and chaptering videos
CN104021791A (en) * 2014-06-24 2014-09-03 贵州大学 Detecting method based on digital audio waveform sudden changes
CN108510994A (en) * 2018-01-25 2018-09-07 华南理工大学 A kind of homologous altering detecting method of audio using byte interframe amplitude spectrum correlation
CN109284717A (en) * 2018-09-25 2019-01-29 华中师范大学 It is a kind of to paste the detection method and system for distorting operation towards digital audio duplication
WO2020055141A1 (en) * 2018-09-12 2020-03-19 Samsung Electronics Co., Ltd. Method and device for detecting duplicate content
CN111402921A (en) * 2020-03-13 2020-07-10 合肥工业大学 Voice copy paste tamper detection method and system

Also Published As

Publication number Publication date
CN111863023B (en) 2021-01-08

Similar Documents

Publication Publication Date Title
JP5479340B2 (en) Detect and classify matches between time-based media
CN110047095B (en) Tracking method and device based on target detection and terminal equipment
US20150286464A1 (en) Method, system and storage medium for monitoring audio streaming media
AU785140B2 (en) Pattern collation device and pattern collating method thereof, and pattern collation program
TWI760671B (en) A kind of audio and video information processing method and device, electronic device and computer-readable storage medium
US10600190B2 (en) Object detection and tracking method and system for a video
WO2020000743A1 (en) Webshell detection method and related device
US10198455B2 (en) Sampling-based deduplication estimation
CN109118420B (en) Watermark identification model establishing and identifying method, device, medium and electronic equipment
US8355540B2 (en) Detector and method for identifying a road lane boundary
CN111863023B (en) Voice detection method and device, computer equipment and storage medium
US10395121B2 (en) Comparing video sequences using fingerprints
CN109086186B (en) Log detection method and device
CN111210817B (en) Data processing method and device
CN116226681B (en) Text similarity judging method and device, computer equipment and storage medium
CN109684633B (en) Search processing method, device, equipment and storage medium
CN104637496B (en) Computer system and audio comparison method
CN105721090A (en) Detection and recognition method for illegal FM broadcasting station
CN112567377A (en) Expression recognition using character skipping
CN110598199A (en) Data stream processing method and device, computer equipment and storage medium
CN113128543B (en) Image matching method, application testing method, device and system
CN106951851B (en) Fingerprint template updating method and device
CN112215812B (en) Image detection method, device, electronic equipment and readable storage medium
WO2020113563A1 (en) Facial image quality evaluation method, apparatus and device, and storage medium
CN113433513B (en) Radar pulse signal channel detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230712

Address after: 518000 Room 201, building A, 1 front Bay Road, Shenzhen Qianhai cooperation zone, Shenzhen, Guangdong

Patentee after: VOICEAI TECHNOLOGIES Co.,Ltd.

Patentee after: Shenzhen Digital Miracle Technology Co.,Ltd.

Address before: 518000 Room 201, building A, No. 1, Qian Wan Road, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong (Shenzhen Qianhai business secretary Co., Ltd.)

Patentee before: VOICEAI TECHNOLOGIES Co.,Ltd.