CN113516969B - Spliced voice identification method and device, electronic equipment and storage medium - Google Patents

Spliced voice identification method and device, electronic equipment and storage medium

Info

Publication number
CN113516969B
CN113516969B (application CN202111072051.1A)
Authority
CN
China
Prior art keywords
voice
identified
spliced
speech
segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111072051.1A
Other languages
Chinese (zh)
Other versions
CN113516969A (en)
Inventor
孟凡芹
郑榕
邓菁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yuanjian Information Technology Co Ltd
Original Assignee
Beijing Yuanjian Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yuanjian Information Technology Co Ltd filed Critical Beijing Yuanjian Information Technology Co Ltd
Priority to CN202111072051.1A priority Critical patent/CN113516969B/en
Publication of CN113516969A publication Critical patent/CN113516969A/en
Application granted granted Critical
Publication of CN113516969B publication Critical patent/CN113516969B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/04 - Segmentation; Word boundary detection
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application provides a spliced speech identification method and apparatus, an electronic device and a storage medium. The acquired speech to be identified is cut into a plurality of speech segments to be identified; the speech segment type of each segment is determined from its fused speech features using a spliced speech identification model; the speech to be identified is then smoothed to determine whether it is spliced speech, and if so, the number of splice points and the splice positions are determined from the number of target merged spliced speech segments and the relative position of each target merged spliced speech segment in the speech to be identified. In this way, the application discriminates each speech segment based on its fused speech features, determines through smoothing whether the speech to be identified is spliced speech, and uses the target merged spliced speech segments obtained after smoothing to determine the number of splice points and the splice positions, thereby improving the accuracy and fineness of speech discrimination.

Description

Spliced voice identification method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a method and an apparatus for identifying a spliced speech, an electronic device, and a storage medium.
Background
With the continuous progress of society and the development of technology, people can easily record speech with devices such as mobile phones, voice recorders and cameras, and a great deal of audio-editing software makes it just as easy to forge speech by splicing operations such as cutting, copying and pasting. In some civil litigation cases, recorded evidence has become an important link in the chain of evidence, yet once a recording has been spliced its authenticity and integrity are difficult to judge. Identifying whether a piece of speech material has been spliced has therefore become a hot issue in audio forensics.
Existing identification methods for speech forged by splicing can generally only determine whether the speech to be identified is spliced speech; they cannot accurately identify the number of splices, the splice positions or the splice type.
Disclosure of Invention
In view of this, an object of the present application is to provide a spliced speech identification method and apparatus, an electronic device and a storage medium. The method determines whether each speech segment to be identified is a spliced speech segment from its fused speech features, smooths the speech to be identified when it contains spliced speech segments, and checks whether the smoothed speech contains a target merged spliced speech segment, thereby determining whether the speech is spliced speech; the number of splice points and the splice positions are then determined from the target merged spliced speech segments, which improves the accuracy and fineness of spliced speech identification.
The embodiment of the application provides an identification method of spliced voice, which comprises the following steps:
cutting the acquired voice to be identified into a plurality of voice sections to be identified;
aiming at each voice section to be identified, extracting fused voice features for expressing the characteristics of the voice section to be identified from the voice section to be identified;
inputting the fusion voice features into a pre-trained spliced voice identification model, and determining the voice section type of the voice section to be identified;
when the voice section type of any voice section to be identified indicates that the voice section to be identified is a spliced voice section, smoothing all the voice sections to be identified included in the voice to be identified, and determining whether the voice to be identified after smoothing includes a target merged spliced voice section or not;
when the voice to be identified comprises a target merged spliced voice section, determining the voice type of the voice to be identified as spliced voice, and acquiring at least one target merged spliced voice section generated after smoothing;
and determining the number of the voice splicing points of the spliced voice based on the number of the target merged spliced voice sections in the spliced voice, and determining the voice splicing position of the spliced voice based on the relative position of the target merged spliced voice sections in the spliced voice.
Optionally, the cutting the acquired speech to be identified into a plurality of speech segments to be identified includes:
moving a clipping window over the speech to be identified in time order according to the preset window length and window shift of the clipping window, and clipping out the speech located within the clipping window at each move, thereby cutting out a plurality of speech segments to be identified.
Optionally, the speech segment types include: a natural speech section and a spliced speech section;
the spliced speech segments include homologous spliced speech segments and heterologous spliced speech segments.
Optionally, the voice type includes natural voice and spliced voice;
the spliced voices comprise homologous spliced voices, heterologous spliced voices and mixed spliced voices.
Optionally, the inputting the fusion voice feature into a pre-trained spliced voice identification model to determine the voice segment type of the voice segment to be identified includes:
for each voice section to be identified in the voice to be identified, inputting the fusion voice characteristics of the voice section to be identified into a pre-trained voice identification model, and determining the probability that the voice section to be identified belongs to each voice section type;
and determining the voice section type corresponding to the maximum value of the probability that the voice section to be identified belongs to each voice section type as the voice section type to which the voice section to be identified belongs.
Optionally, when the speech segment type of any speech segment to be identified indicates that the speech segment to be identified is a spliced speech segment, performing smoothing on the speech to be identified to determine whether the speech to be identified includes a target merged spliced speech segment, including:
dividing the voice sections to be identified into at least one voice section group to be identified according to the time sequence; the voice section group to be identified comprises a preset first number of voice sections to be identified, wherein the preset first number of voice sections to be identified are time-continuous voice sections to be identified;
aiming at each voice segment group to be identified, determining the splicing type of each voice segment group to be identified according to the voice segment type of each voice segment to be identified in the voice segment group to be identified;
when the splicing type of the voice segment group to be identified is continuous splicing, combining a preset first number of voice segments to be identified in the voice segment group to be identified to generate a synthesized voice segment, and determining the synthesized voice segment as a target combined spliced voice segment;
and when the splicing type of any voice section group to be identified included in the voice to be identified is continuous splicing, determining that the voice to be identified includes a target merging spliced voice section.
Optionally, the determining, for each to-be-identified speech segment group, a splicing type of the to-be-identified speech segment group according to a speech segment type of each to-be-identified speech segment in the to-be-identified speech segment group includes:
and when the number of the continuous spliced voice sections in the voice group to be identified exceeds a preset second number, determining the splicing type of the voice group to be identified as continuous splicing.
Optionally, the determining the number of the voice splicing points of the spliced voice based on the number of the target merged spliced voice segments in the spliced voice, and determining the voice splicing position of the spliced voice based on the relative position of the target merged spliced voice segments in the spliced voice includes:
determining the total number of target combined spliced voice segments included in the spliced voice as the number of voice splicing points of the spliced voice;
determining the mapping position of the middle position of each target merged spliced voice segment in the voice to be identified according to the mapping relation between the spliced voice and each target merged spliced voice segment;
and determining the mapping position of each target merging and splicing voice segment as the voice splicing position of the voice to be identified.
Optionally, the spliced speech discrimination model is constructed by the following method:
acquiring a voice training sample set formed by natural voice, homologous splicing voice and heterologous splicing voice; the frame number of each voice training sample in the voice training sample set is the same, namely the sample length is the same;
aiming at each voice training sample in the voice training sample set, performing voice feature extraction on the voice training sample by adopting a plurality of voice feature extraction methods to obtain a plurality of voice features of the voice training sample;
for each voice training sample in the voice training sample set, determining a fused voice feature of the voice training sample based on a Fisher criterion and a plurality of voice features of the voice training sample;
and performing iterative training on a preset neural network by using the fusion voice characteristics of each voice training sample in the voice training sample set to generate a spliced voice identification model.
Optionally, the preset neural network includes an LCNN sub-network and a GRU sub-network, and the activation function of the LCNN sub-network is a CELU function.
The embodiment of the present application further provides an authentication apparatus for spliced speech, where the authentication apparatus includes:
the cutting module is used for cutting the acquired voice to be identified into a plurality of voice sections to be identified;
the extraction module is used for extracting fused voice features used for expressing the characteristics of the voice sections to be identified from the voice sections to be identified aiming at each voice section to be identified;
the voice section identification module is used for inputting the fusion voice characteristics into a pre-trained spliced voice identification model and determining the voice section type of the voice section to be identified;
the smoothing processing module is used for smoothing all the voice sections to be identified included in the voice to be identified when the voice section type of any voice section to be identified indicates that the voice section to be identified is a spliced voice section, and determining whether the voice to be identified after smoothing processing includes a target merged spliced voice section or not;
the acquiring module is used for determining the voice type of the voice to be identified as spliced voice when the voice to be identified comprises a target merged spliced voice section, and acquiring at least one target merged spliced voice section generated after smoothing processing;
and the splicing point identification module is used for determining the number of the voice splicing points of the spliced voice based on the number of the target merged spliced voice sections in the spliced voice, and determining the voice splicing position of the spliced voice based on the relative position of the target merged spliced voice sections in the spliced voice.
Optionally, when the clipping module is configured to clip the acquired speech to be identified into a plurality of speech segments to be identified, the clipping module is configured to:
moving a clipping window over the speech to be identified in time order according to the preset window length and window shift of the clipping window, and clipping out the speech located within the clipping window at each move, thereby cutting out a plurality of speech segments to be identified.
Optionally, the speech segment types include: a natural speech section and a spliced speech section;
the spliced speech segments include homologous spliced speech segments and heterologous spliced speech segments.
Optionally, the voice type includes natural voice and spliced voice;
the spliced voices comprise homologous spliced voices, heterologous spliced voices and mixed spliced voices.
Optionally, when the speech segment identification module is configured to input the fused speech feature into a pre-trained spliced speech identification model and determine a speech segment type of the speech segment to be identified, the speech segment identification module is configured to:
for each voice section to be identified in the voice to be identified, inputting the fusion voice characteristics of the voice section to be identified into a pre-trained voice identification model, and determining the probability that the voice section to be identified belongs to each voice section type;
and determining the voice section type corresponding to the maximum value of the probability that the voice section to be identified belongs to each voice section type as the voice section type to which the voice section to be identified belongs.
Optionally, when the smoothing processing module is configured to smooth the speech to be identified when the speech segment type of any speech segment to be identified indicates that the speech segment to be identified is a spliced speech segment, and to determine whether the speech to be identified includes a target merged spliced speech segment, the smoothing processing module is configured to:
dividing the voice sections to be identified into at least one voice section group to be identified according to the time sequence; the voice section group to be identified comprises a preset first number of voice sections to be identified, wherein the preset first number of voice sections to be identified are time-continuous voice sections to be identified;
aiming at each voice segment group to be identified, determining the splicing type of each voice segment group to be identified according to the voice segment type of each voice segment to be identified in the voice segment group to be identified;
when the splicing type of the voice segment group to be identified is continuous splicing, combining a preset first number of voice segments to be identified in the voice segment group to be identified to generate a synthesized voice segment, and determining the synthesized voice segment as a target combined spliced voice segment;
and when the splicing type of any voice section group to be identified included in the voice to be identified is continuous splicing, determining that the voice to be identified includes a target merging spliced voice section.
Optionally, when the smoothing processing module is configured to determine, for each to-be-identified speech segment group, a splicing type of the to-be-identified speech segment group according to a speech segment type of each to-be-identified speech segment in the to-be-identified speech segment group, the smoothing processing module is configured to:
and when the number of the continuous spliced voice sections in the voice group to be identified exceeds a preset second number, determining the splicing type of the voice group to be identified as continuous splicing.
Optionally, when the splice point identification module is configured to determine the number of voice splicing points of the spliced voice based on the number of target merged spliced voice segments in the spliced voice and to determine the voice splicing position of the spliced voice based on the relative position of the target merged spliced voice segments in the spliced voice, the splice point identification module is configured to:
determining the total number of target combined spliced voice segments included in the spliced voice as the number of voice splicing points of the spliced voice;
determining the mapping position of the middle position of each target merged spliced voice segment in the voice to be identified according to the mapping relation between the spliced voice and each target merged spliced voice segment;
and determining the mapping position of each target merging and splicing voice segment as the voice splicing position of the voice to be identified.
Optionally, the identification apparatus further includes a model building module, and the model building module is configured to:
acquiring a voice training sample set formed by natural voice, homologous splicing voice and heterologous splicing voice; the frame number of each voice training sample in the voice training sample set is the same;
aiming at each voice training sample in the voice training sample set, performing voice feature extraction on the voice training sample by adopting a plurality of voice feature extraction methods to obtain a plurality of voice features of the voice training sample;
for each voice training sample in the voice training sample set, determining a fused voice feature of the voice training sample based on a Fisher criterion and a plurality of voice features of the voice training sample;
and performing iterative training on a preset neural network by using the fusion voice characteristics of each voice training sample in the voice training sample set to generate a spliced voice identification model.
Optionally, the preset neural network includes an LCNN sub-network and a GRU sub-network, and the activation function of the LCNN sub-network is a CELU function.
An embodiment of the present application further provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the electronic device is operating, the machine-readable instructions when executed by the processor performing the steps of the method of authenticating a spliced voice as described above.
The embodiment of the present application further provides a computer-readable storage medium, in which a computer program is stored, and the computer program is executed by a processor to perform the steps of the method for authenticating a spliced voice as described above.
The application provides a spliced voice identification method, a spliced voice identification device, electronic equipment and a storage medium, wherein the identification method comprises the following steps: cutting the acquired voice to be identified into a plurality of voice sections to be identified; aiming at each voice section to be identified, extracting fused voice features for expressing the characteristics of the voice section to be identified from the voice section to be identified; inputting the fusion voice features into a pre-trained spliced voice identification model, and determining the voice section type of the voice section to be identified; when the voice section type of any voice section to be identified indicates that the voice section to be identified is a spliced voice section, smoothing all the voice sections to be identified included in the voice to be identified, and determining whether the voice to be identified after smoothing includes a target merged spliced voice section or not; when the voice to be identified comprises a target merged spliced voice section, determining the voice type of the voice to be identified as spliced voice, and acquiring at least one target merged spliced voice section generated after smoothing; and determining the number of the voice splicing points of the spliced voice based on the number of the target merged spliced voice sections in the spliced voice, and determining the voice splicing position of the spliced voice based on the relative position of the target merged spliced voice sections in the spliced voice.
Therefore, fixing the number of speech frames of the speech training samples improves the coverage of splice points in spliced speech; the speech training samples are subdivided into natural speech, homologous splicing and heterologous splicing, multiple speech features are extracted from every training sample and fused, and a neural network combining LCNN and GRU is trained to obtain a speech identification model that can distinguish multiple speech types; identifying speech with a sliding window makes it possible to determine the number of splice points and the splice positions of spliced speech; and smoothing the speech segments effectively improves the accuracy of identifying the number of splice points and the splice positions of the spliced speech.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
Fig. 1 is a flowchart of an authentication method for spliced speech according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for constructing a spliced speech discrimination model according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an apparatus for discriminating a spliced speech according to an embodiment of the present application;
fig. 4 is a second schematic structural diagram of a spliced speech discrimination apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. Every other embodiment that can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present application falls within the protection scope of the present application.
With the continuous progress of society and the development of technology, people can easily record speech with devices such as mobile phones, voice recorders and cameras, and a great deal of audio-editing software makes it just as easy to forge speech by splicing operations such as cutting, copying and pasting. In some civil litigation cases, recorded evidence has become an important link in the chain of evidence, yet once a recording has been spliced its authenticity and integrity are difficult to judge. Identifying whether a piece of speech material has been spliced has therefore become a hot issue in audio forensics.
Existing identification methods for speech forged by splicing can generally only determine whether the speech to be identified is spliced speech; they cannot accurately identify the number of splices, the splice positions or the splice type.
Based on the extracted fused speech features, each speech segment in the speech to be identified is finely discriminated, so that the method can not only determine whether the speech to be identified is spliced speech, but also accurately identify the number of splice points and the splice positions in the spliced speech, improving the accuracy of identification.
Referring to fig. 1, fig. 1 is a flowchart illustrating an authentication method for spliced speech according to an embodiment of the present disclosure. As shown in fig. 1, the method for identifying a spliced speech provided in the embodiment of the present application includes:
s101, cutting the acquired voice to be identified into a plurality of voice sections to be identified.
In this step, the speech whose spliced status needs to be determined is taken as the speech to be identified; it may be acquired from a variety of speech environments. The acquired speech to be identified is then cut with a preset speech clipping window to obtain a plurality of speech segments to be identified.
Optionally, the cutting the acquired speech to be identified into a plurality of speech segments to be identified includes: moving a clipping window over the speech to be identified in time order according to the preset window length and window shift of the clipping window, and clipping out the speech located within the clipping window at each move, thereby cutting out a plurality of speech segments to be identified.
In this step, a speech clipping window is defined in a preset speech clipping rule, together with its window length and window shift; the clipping window is then moved over the speech to be identified according to the set window shift, following the playback time order, and the speech located within the window is clipped at each move, cutting out the plurality of speech segments to be identified.
Here, the window length of the clipping window is determined by the number of speech frames of the speech training samples used to generate the spliced speech identification model; the two lengths are the same. The window shift is shorter than the window length, so two consecutive speech segments to be identified contain overlapping speech.
For example, when the window length of the clipping window is set to N frames, the window shift may be chosen between N/4 and N/2 frames.
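As an illustration of this sliding-window cutting, a minimal Python/NumPy sketch follows; the frame dimension, window length and window shift are hypothetical values chosen for the example, not parameters prescribed by the patent.

```python
import numpy as np

def cut_into_segments(frames: np.ndarray, win_len: int, win_shift: int):
    """Cut a sequence of frames into overlapping segments to be identified.

    frames    : array of shape (num_frames, ...) for the speech to be identified
    win_len   : window length N in frames (matches the training-sample length)
    win_shift : window shift, e.g. between N/4 and N/2
    """
    segments = []
    start = 0
    while start + win_len <= len(frames):
        segments.append(frames[start:start + win_len])
        start += win_shift
    return segments

# Example: 1000 frames, window of 200 frames, shift of 100 frames (N/2)
dummy = np.random.randn(1000, 39)           # hypothetical 39-dim frame features
segs = cut_into_segments(dummy, win_len=200, win_shift=100)
print(len(segs))                            # number of speech segments to be identified
```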
S102, aiming at each voice section to be identified, extracting fused voice features used for expressing the characteristics of the voice section to be identified from the voice section to be identified.
In this step, after the plurality of speech segments to be identified have been obtained, several speech features are extracted from each segment using several feature extraction methods, and these features are then fused to obtain the fused speech feature of that segment.
When fusing the multiple speech features of a speech segment to be identified, the Fisher ratio of each dimension component of every feature is determined based on the Fisher criterion, the components of a preset number of dimensions with the larger Fisher ratios are retained for each feature, and the retained components of all features of the segment are fused into the fused speech feature of the segment.
The plurality of speech features may include, but are not limited to, MFCC features, LPCC features, STFT features, and the like, the pre-set dimension is selected to be the same as the speech feature dimension of the speech training sample, and the dimension of each speech feature is the same.
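For illustration only, the sketch below extracts two of the named feature types (MFCC and an STFT magnitude spectrogram) for a single segment using librosa; the sampling rate, frame settings and feature dimensions are assumptions, and LPCC extraction is omitted for brevity.

```python
import numpy as np
import librosa

def extract_segment_features(y: np.ndarray, sr: int = 16000):
    """Extract several speech features for one segment to be identified.

    Returns a dict of feature matrices (feature_dim x num_frames). MFCC and a
    log STFT magnitude are shown; LPCC would be computed analogously from LPC
    coefficients.
    """
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20,
                                n_fft=400, hop_length=160)      # 25 ms / 10 ms frames
    stft = np.abs(librosa.stft(y, n_fft=400, hop_length=160))   # magnitude spectrogram
    return {"mfcc": mfcc, "stft": np.log1p(stft)}

# Usage: features = extract_segment_features(segment_waveform, sr=16000)
```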
S103, inputting the fusion voice features into a pre-trained spliced voice identification model, and determining the voice section type of the voice section to be identified.
In this step, after the fused speech feature of any speech segment to be identified has been extracted, the fused speech feature vector of that segment is input into the pre-trained spliced speech identification model, which determines from it whether the segment is a spliced speech segment, thereby determining the speech segment type of the segment.
Optionally, the speech segment types include: a natural speech section and a spliced speech section; the spliced speech segments include homologous spliced speech segments and heterologous spliced speech segments.
Here, a natural speech segment is speech uttered directly by a real person without artificial alteration; a spliced speech segment is a forged segment produced by splicing, i.e. a synthetically forged speech segment. A homologous spliced speech segment is formed by splicing speech acquired at different times with the same audio acquisition device; a heterologous spliced speech segment is formed by splicing speech acquired at different times with different audio acquisition devices. Spliced speech segments may also include other types of spliced speech, which is not limited here.
Optionally, the inputting the fusion voice feature into a pre-trained spliced voice identification model to determine the voice segment type of the voice segment to be identified includes: for each voice section to be identified in the voice to be identified, inputting the fusion voice characteristics of the voice section to be identified into a pre-trained voice identification model, and determining the probability that the voice section to be identified belongs to each voice section type; and determining the voice section type corresponding to the maximum value of the probability that the voice section to be identified belongs to each voice section type as the voice section type to which the voice section to be identified belongs.
In this step, the type of each speech segment to be identified is determined with the spliced speech identification model as follows: for each speech segment cut out of the speech to be identified, its fused speech feature is input into the pre-trained model, which estimates the probability that the segment belongs to each speech segment type (natural speech segment, homologous spliced speech segment, heterologous spliced speech segment), yielding several probability values; the segment type with the largest probability value is then taken as the type to which the segment belongs.
For example, if the probability of identifying a speech segment to be identified as belonging to a natural speech segment through the speech identification model is 0.1, the probability of belonging to a homologous spliced speech segment is 0.6, and the probability of belonging to a heterologous spliced speech segment is 0.8, the speech segment to be identified belongs to a heterologous spliced speech segment.
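A minimal sketch of this decision rule, assuming the identification model exposes a hypothetical `predict` call that returns one probability per segment type:

```python
SEGMENT_TYPES = ["natural", "homologous_spliced", "heterologous_spliced"]

def classify_segment(model, fused_feature):
    """Return the speech-segment type with the highest model probability."""
    probs = model.predict(fused_feature)      # hypothetical API: one score per type
    best = max(range(len(SEGMENT_TYPES)), key=lambda i: probs[i])
    return SEGMENT_TYPES[best], probs[best]

# e.g. probs = [0.1, 0.6, 0.8]  ->  ("heterologous_spliced", 0.8)
```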
Optionally, please refer to fig. 2, and fig. 2 is a flowchart of a method for constructing a spliced speech discrimination model according to an embodiment of the present application. As shown in fig. 2, the method for constructing a spliced speech discrimination model provided in the embodiment of the present application includes:
s201, acquiring a voice training sample set formed by natural voice, homologous splicing voice and heterologous splicing voice; the number of frames of each voice training sample in the voice training sample set is the same.
In this step, the speech training sample set is obtained as follows: a number of original speech recordings of several users are acquired, and multiple pieces of speech of equal length are cut from them according to preset requirements; some of the cut pieces are selected as the natural speech of the training set; speech of the same person is then spliced according to the recording device, spliced speech containing exactly one splice point is retained, and it is cut into spliced speech with the same number of frames as the natural speech, giving the homologous spliced speech and the heterologous spliced speech. The splice point in a homologous or heterologous spliced sample lies at a random position within the spliced speech segment.
Here, the speech training sample refers to any one of natural speech, homologous splicing speech, and heterologous splicing speech, and the frame number of the speech training sample may be selected by a researcher according to research requirements, and the frame number of the speech training sample is the same as the frame number of the speech segment to be identified in S101.
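The exact sample-construction rules are not spelled out beyond the above, so the following sketch only illustrates the idea of building a fixed-length spliced sample with a single, randomly placed splice point; the frame counts and feature dimension are assumptions.

```python
import numpy as np

def make_spliced_sample(voice_a: np.ndarray, voice_b: np.ndarray,
                        num_frames: int, rng: np.random.Generator):
    """Build one spliced training sample of exactly `num_frames` frames.

    voice_a / voice_b : frame sequences from the same speaker; same device for a
                        homologous sample, different devices for a heterologous one.
    The single splice point lands at a random frame index inside the sample.
    """
    splice_at = int(rng.integers(1, num_frames))      # random splice position
    part_a = voice_a[:splice_at]
    part_b = voice_b[:num_frames - splice_at]
    return np.concatenate([part_a, part_b], axis=0), splice_at

rng = np.random.default_rng(0)
a = np.random.randn(500, 39)    # hypothetical frame features from recording A
b = np.random.randn(500, 39)    # recording B (same speaker, other device/time)
sample, splice_at = make_spliced_sample(a, b, num_frames=200, rng=rng)
```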
S202, aiming at each voice training sample in the voice training sample set, performing voice feature extraction on the voice training sample by adopting a plurality of voice feature extraction methods to obtain a plurality of voice features of the voice training sample.
In the step, after a voice training sample set for constructing the spliced voice identification model is determined, for each voice training sample in the training set, a plurality of voice feature extraction methods are adopted to extract a plurality of voice features of the voice training sample.
Here, the plurality of speech features includes, but is not limited to, MFCC, LPCC, STFT, and the like.
S203, aiming at each voice training sample in the voice training sample set, determining the fusion voice feature of the voice training sample based on the Fisher criterion and the multiple voice features of the voice training sample.
In this step, after the multiple speech features of every speech training sample have been obtained, the Fisher ratio of each dimension feature component of every feature is determined for each sample from the Fisher criterion formula, the best feature components of a preset number of dimensions are selected for every feature, and the selected components of all features are fused to give the fused speech feature of the sample.
Here, the Fisher ratio of the k-th feature component is calculated as

$$F(k) = \frac{\sigma_b(k)}{\sigma_w(k)}$$

where $F(k)$ is the Fisher ratio of the feature component, $\sigma_b(k)$ is its between-class dispersion, and $\sigma_w(k)$ is its within-class dispersion. The classes here are natural speech and the speech types formed by the different splicing approaches (homologous and heterologous).

The between-class dispersion $\sigma_b(k)$ and the within-class dispersion $\sigma_w(k)$ of the feature component are calculated, respectively, as

$$\sigma_b(k) = \frac{1}{M}\sum_{i} M_i\left(\mu_{i,k} - \mu_k\right)^2$$

$$\sigma_w(k) = \frac{1}{M}\sum_{i}\sum_{j=1}^{M_i}\left(x_{i,j,k} - \mu_{i,k}\right)^2$$

In the first formula, $\sigma_b(k)$, the between-class dispersion of the feature component, is the dispersion of the mean values of the component across the different speech classes and reflects the degree of difference between different speech training samples; $M$ is the total number of speech training samples over all classes (each class of speech contains multiple samples), $\mu_{i,k}$ is the mean of the k-th dimension feature component over all training samples of speech class $i$, and $\mu_k$ is the mean of the k-th dimension component over all speech samples.

In the second formula, $\sigma_w(k)$, the within-class dispersion of the feature component, is the mean dispersion of the component within a speech class; $M$ is the total number of speech training samples over all classes, $M_i$ is the number of samples of speech class $i$, $\mu_{i,k}$ is the mean of the k-th dimension feature component over all training samples of class $i$, and $x_{i,j,k}$ is the k-th dimension feature component of the j-th speech sample of class $i$.
For each speech training sample in the speech training sample set, assume that Q features such as MFCC, LPCC and STFT are extracted and that each feature has L-dimensional feature components. The Fisher ratio $F(k)$ of each of the L components of every feature is calculated with the formula above, the K components with the largest $F(k)$ values are retained for each feature, and the retained components are fused to obtain the Q×K-dimensional fused speech feature of the training sample.
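A sketch of this Fisher-ratio-based selection and fusion is given below. It assumes each of the Q feature types has already been reduced to one L-dimensional vector per training sample (for example by averaging over frames, which is an assumption and not stated above).

```python
import numpy as np

def fisher_ratios(X: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Fisher ratio F(k) = sigma_b(k) / sigma_w(k) for each of the L components.

    X      : (M, L) one feature vector per training sample
    labels : (M,) class index per sample (natural / homologous / heterologous)
    """
    M, L = X.shape
    mu = X.mean(axis=0)                              # global mean per component
    sigma_b = np.zeros(L)
    sigma_w = np.zeros(L)
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)                       # class mean per component
        sigma_b += len(Xc) * (mu_c - mu) ** 2        # between-class dispersion
        sigma_w += ((Xc - mu_c) ** 2).sum(axis=0)    # within-class dispersion
    return (sigma_b / M) / (sigma_w / M + 1e-12)

def fuse_features(per_feature_X: list, labels: np.ndarray, K: int) -> np.ndarray:
    """Keep the K components with the largest Fisher ratio in each of the Q
    features and concatenate them into a Q*K-dimensional fused feature."""
    kept = []
    for X in per_feature_X:                          # one (M, L) matrix per feature type
        top_k = np.argsort(fisher_ratios(X, labels))[::-1][:K]
        kept.append(X[:, top_k])
    return np.concatenate(kept, axis=1)              # shape (M, Q*K)
```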
And S204, performing iterative training on a preset neural network by using the fusion voice characteristics of each voice training sample in the voice training sample set to generate a spliced voice identification model.
In the step, after the fusion feature of each voice training sample is extracted, the fusion voice feature of each voice training sample is used as input, a natural voice tag or a homologism splicing voice tag or a heterology splicing voice tag corresponding to each voice training sample is used as output, iterative training is carried out on a preset neural network, and a splicing voice identification model is generated.
Optionally, the preset neural network includes an LCNN sub-network and a GRU sub-network, and the activation function of the LCNN sub-network is the CELU function. Here, the preset neural network consists of 5 convolutional layers, 5 MaxPooling layers, 3 BatchNorm layers, 1 AdaptiveAvgPool layer, CELU activation functions, 1 GRU layer, 1 Dropout layer and, finally, one fully connected layer for classification.
During iterative training of the preset neural network, the fused speech features of the speech training samples are first fed into the LCNN network; the high-level features output by the LCNN are used as the input of the GRU, which performs feature selection before its output is passed to the fully connected layer for classification. The activation function commonly used in LCNN networks is MFM (max-feature-map), but because MFM performs feature selection and reduces the dimensionality of the input features, it tends to make the network overfit; it is therefore replaced here by CELU (continuously differentiable exponential linear unit). CELU keeps the average output of the network close to zero and performs no feature selection, so it avoids overfitting while speeding up model convergence. Current models generally use only one of LCNN or GRU, whereas the invention uses the two networks in combination: the LCNN exploits its translation invariance in time and space to capture the feature information of frame-level data effectively, and the GRU learns the long-term dependence of the subsequent high-level features and captures the high-order correlation of spliced speech in the frequency domain. Combining LCNN and GRU therefore fully captures the differences between natural speech and spliced speech in space and time and yields a more accurate classification. The CELU function is given below, where α is a learnable parameter.
$$\mathrm{CELU}(x) = \max(0, x) + \min\left(0,\ \alpha\left(e^{x/\alpha} - 1\right)\right)$$
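The patent text names the building blocks but not their hyper-parameters, so the PyTorch sketch below is only one plausible arrangement of five convolution and max-pooling stages, three BatchNorm layers, CELU activations, adaptive average pooling, one GRU layer, Dropout and a final fully connected classifier; the channel widths, kernel sizes, fixed α and three-class output are assumptions.

```python
import torch
import torch.nn as nn

class LCNNGRUClassifier(nn.Module):
    """Illustrative LCNN + GRU discriminator for fused speech features."""

    def __init__(self, num_classes: int = 3, gru_hidden: int = 128):
        super().__init__()
        chans = [1, 16, 32, 64, 64, 64]                    # assumed channel widths
        blocks = []
        for i in range(5):                                 # 5 conv + 5 max-pool stages
            blocks += [nn.Conv2d(chans[i], chans[i + 1], kernel_size=3, padding=1),
                       nn.CELU(alpha=1.0),                 # CELU in place of MFM (alpha fixed here;
                       nn.MaxPool2d(kernel_size=2)]        #  the patent treats it as learnable)
            if i < 3:
                blocks.append(nn.BatchNorm2d(chans[i + 1]))  # 3 BatchNorm layers
        self.cnn = nn.Sequential(*blocks)
        self.pool = nn.AdaptiveAvgPool2d((1, 4))           # collapse frequency, keep 4 time steps
        self.gru = nn.GRU(input_size=chans[-1], hidden_size=gru_hidden, batch_first=True)
        self.dropout = nn.Dropout(0.3)
        self.fc = nn.Linear(gru_hidden, num_classes)       # final fully connected classifier

    def forward(self, x):                                  # x: (batch, 1, feat_dim, frames)
        h = self.pool(self.cnn(x))                         # (batch, C, 1, 4)
        h = h.squeeze(2).transpose(1, 2)                   # (batch, 4, C), time-major for the GRU
        out, _ = self.gru(h)
        return self.fc(self.dropout(out[:, -1]))           # logits over the 3 segment types

# logits = LCNNGRUClassifier()(torch.randn(4, 1, 60, 200))   # e.g. Q*K = 60 dims, 200 frames
```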
S104, when the voice section type of any voice section to be identified indicates that the voice section to be identified is the spliced voice section, smoothing all the voice sections to be identified included in the voice to be identified, and determining whether the voice to be identified after smoothing includes the target merged spliced voice section.
In this step, after the speech segments of the speech to be identified have been classified by the spliced speech identification model, as soon as any segment is identified as a spliced speech segment, all speech segments to be identified included in the speech to be identified are smoothed, and after smoothing it is determined whether the speech to be identified includes a target merged spliced speech segment. Here, the target merged spliced speech segment is the deciding factor in determining whether the speech to be identified is spliced speech.
The purpose of smoothing the speech to be identified is to reduce, as far as possible, misidentification of natural speech as spliced speech.
Optionally, when the speech segment type of any speech segment to be identified indicates that the speech segment to be identified is a spliced speech segment, performing smoothing on the speech to be identified to determine whether the speech to be identified includes a target merged spliced speech segment, including: dividing the voice sections to be identified into at least one voice section group to be identified according to the time sequence; the voice section group to be identified comprises a preset first number of voice sections to be identified, wherein the preset first number of voice sections to be identified are time-continuous voice sections to be identified; aiming at each voice segment group to be identified, determining the splicing type of each voice segment group to be identified according to the voice segment type of each voice segment to be identified in the voice segment group to be identified; when the splicing type of the voice segment group to be identified is continuous splicing, combining a preset first number of voice segments to be identified in the voice segment group to be identified to generate a synthesized voice segment, and determining the synthesized voice segment as a target combined spliced voice segment; and when the splicing type of any voice section group to be identified included in the voice to be identified is continuous splicing, determining that the voice to be identified includes a target merging spliced voice section.
In this step, after identifying the speech segment type of the speech segment to be identified included in the speech to be identified, when any speech segment to be identified is the spliced speech segment, the speech to be identified needs to be smoothed to determine whether the speech to be identified includes the target merged spliced speech segment, and the specific steps of smoothing the speech to be identified are as follows:
firstly, according to the time sequence corresponding to the voice to be identified, every N continuous voice sections to be identified are grouped into one group, and a voice section group to be identified is obtained. Here, N is the first number, and N may be selected for applicability as desired.
The number of the speech segments to be identified contained in the last speech segment group to be identified may be less than N.
In an example, assuming that there are 9 speech segments to be identified in the speech to be identified, the first number is 3, the 1 st to 3 th speech segments to be identified form a speech segment group to be identified, the 4 th to 6 th speech segments to be identified form a speech segment group to be identified, and the 7 th to 9 th speech segments to be identified form a speech segment group to be identified. It should be noted that the number of speech segments to be identified in the speech to be identified is often much larger than 9.
Then, after a plurality of speech segment groups to be identified are determined, for each speech segment group to be identified, the splicing type of the speech segment group to be identified can be determined according to the speech segment type and the position of each speech segment to be identified included in the speech segment group to be identified.
Optionally, the determining, for each to-be-identified speech segment group, a splicing type of the to-be-identified speech segment group according to a speech segment type of each to-be-identified speech segment in the to-be-identified speech segment group includes: and when the number of the continuous spliced voice sections in the voice group to be identified exceeds a preset second number, determining the splicing type of the voice group to be identified as continuous splicing. And when the number of the continuous spliced voice sections in the voice group to be identified does not exceed the preset second number, determining the splicing type of the voice group to be identified as discontinuous splicing.
This step is exemplified by the following example. Assuming that the first number is 3 and the second number is 2, the speech group to be identified has a first speech segment to be identified, a second speech segment to be identified, and a third speech segment to be identified. When the first voice section to be identified and the second voice section to be identified are spliced voice sections and the third voice section to be identified is a natural voice section, the voice section group to be identified is continuously spliced; when the second voice segment to be identified and the third voice segment to be identified are spliced voice segments and the first voice segment to be identified is a natural voice segment, the voice segment group to be identified is continuously spliced; when the first voice section to be identified, the second voice section to be identified and the third voice section to be identified are spliced voice sections, the voice section group to be identified is continuously spliced; in other cases, the speech segment group to be identified is non-continuously spliced, and all the speech segments to be identified in the speech segment group to be identified are regarded as natural speech segments.
Then, if the splicing type of the voice segment group to be identified is determined to be continuous splicing, synthesizing all the voice segments to be identified in the voice segment group to be identified into a voice segment according to the time sequence to generate a synthesized voice segment, wherein the synthesized voice segment is a target combined spliced voice segment of the spliced voice; and if the splicing type of the voice section group to be identified is determined to be discontinuous splicing, combining the voice section group to be identified into a natural voice section.
For example, it is assumed that the speech group to be identified has a first speech segment to be identified, a second speech segment to be identified, and a third speech segment to be identified. When the first voice section to be identified and the second voice section to be identified are spliced voice sections and the third voice section to be identified is a natural voice section, the voice section group to be identified is continuously spliced, the first voice section to be identified, the second voice section to be identified and the third voice section to be identified are synthesized into one voice section according to the time sequence, and the voice section is a target combined spliced voice section; when the first speech section to be identified and the third speech section to be identified are spliced speech sections and the second speech section to be identified is a natural speech section, the speech section group to be identified is discontinuously spliced, the first speech section to be identified to the third speech section to be identified are combined into one speech section according to the time sequence, and the combined speech section is determined to be the natural speech section.
It should be noted that the second number can be selected according to the requirement, and the second number should not be greater than the first number.
And finally, if the splicing type of any one voice section group to be identified included in the voice to be identified is continuous splicing, determining that the voice to be identified includes the target merged spliced voice section.
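A compact sketch of this smoothing step, operating on the per-segment labels produced earlier; the first number (3) and second number (2) are the example values used above.

```python
def smooth_segments(segment_types, first_num=3, second_num=2):
    """Group consecutive segment labels and merge continuously spliced groups.

    segment_types : list of per-segment labels, e.g. "natural",
                    "homologous_spliced", "heterologous_spliced"
    Returns a list of (start_idx, end_idx, is_target_merged) tuples, one per group.
    """
    groups = []
    for start in range(0, len(segment_types), first_num):
        group = segment_types[start:start + first_num]
        longest = run = 0                       # longest run of consecutive spliced segments
        for label in group:
            run = run + 1 if label != "natural" else 0
            longest = max(longest, run)
        is_merged = longest >= second_num       # "continuous splicing" -> target merged segment
        groups.append((start, start + len(group) - 1, is_merged))
    return groups

labels = ["natural", "homologous_spliced", "homologous_spliced",
          "natural", "natural", "natural",
          "heterologous_spliced", "natural", "heterologous_spliced"]
print(smooth_segments(labels))   # only the first group becomes a target merged segment
```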
S105, when the voice to be identified comprises the target merged spliced voice section, determining the voice type of the voice to be identified as spliced voice, and acquiring at least one target merged spliced voice section generated after smoothing processing.
In this step, when the smoothed speech to be identified is determined to include a target merged spliced speech segment, the speech type of the speech to be identified is determined to be spliced speech, i.e. the speech to be identified is spliced speech. In order to determine the splice points and splice positions of the spliced speech accurately, all target merged spliced speech segments included in the smoothed speech to be identified are obtained at the same time. Here, the speech to be identified contains at least one target merged spliced speech segment, and these segments are ultimately used to determine the splice points and the splice positions.
Optionally, the voice type includes natural voice and spliced voice; the spliced voices comprise homologous spliced voices, heterologous spliced voices and mixed spliced voices.
Here, homologous splicing refers to speech of the same person collected by the same audio acquisition device at different times and spliced together manually; the spliced speech segments it contains are only homologous spliced speech segments. Heterologous splicing refers to speech of the same person collected by different audio acquisition devices at the same or different times and spliced together manually; the spliced speech segments it contains are only heterologous spliced speech segments. Mixed spliced speech contains both heterologous and homologous spliced speech segments.
S106, determining the number of the voice splicing points of the spliced voice based on the number of the target merged spliced voice sections in the spliced voice, and determining the voice splicing position of the spliced voice based on the relative position of the target merged spliced voice sections in the spliced voice.
In this step, when the voice to be identified is determined to be spliced voice, the total number of target merged spliced voice segments it contains is counted to obtain the number of voice splicing points of the spliced voice, and the voice splicing positions of the spliced voice are determined based on the position, within the spliced voice, of the middle point of each target merged spliced voice segment.
Optionally, the determining the number of voice splicing points of the spliced voice based on the number of target merged spliced voice segments in the spliced voice, and determining the voice splicing position of the spliced voice based on the relative position of the target merged spliced voice segments in the spliced voice, includes: determining the total number of target merged spliced voice segments included in the spliced voice as the number of voice splicing points of the spliced voice; determining the mapping position, in the voice to be identified, of the middle position of each target merged spliced voice segment according to the mapping relation between the spliced voice and each target merged spliced voice segment; and determining the mapping position of each target merged spliced voice segment as the voice splicing position of the voice to be identified.
Here, each target merged spliced speech segment is clipped from the spliced speech, and the middle position of each target merged spliced speech segment has a corresponding mapping position in the spliced speech.
In one example, after the voice to be identified is determined to be spliced voice and is found to include 3 target merged spliced voice segments, the spliced voice accordingly has 3 voice splicing points and 3 voice splicing positions.
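A minimal sketch of this counting and mapping step is given below. It assumes the voice segments to be identified were cut with a fixed window length and window shift, so segment indices can be mapped back to time linearly; the parameter names and the midpoint rule are illustrative, not taken from the text.

```python
from typing import List, Tuple

def locate_splice_points(merged_segments: List[Tuple[int, int]],
                         window_len_s: float,
                         window_shift_s: float) -> Tuple[int, List[float]]:
    """Count voice splicing points and map each target merged spliced segment
    back to a splicing position (in seconds) in the voice to be identified.

    `merged_segments` holds (start_index, end_index) pairs of segment indices,
    for example as returned by `smooth_segments` above.
    """
    positions = []
    for start_idx, end_idx in merged_segments:
        seg_start = start_idx * window_shift_s                # start time of merged segment
        seg_end = end_idx * window_shift_s + window_len_s     # end time of merged segment
        positions.append((seg_start + seg_end) / 2.0)         # middle position as splice position
    return len(merged_segments), positions
```

For the example above, three merged segments yield a splice-point count of 3 and three splicing positions.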
The application provides an identification method of spliced voice, which comprises the following steps: cutting the acquired voice to be identified into a plurality of voice segments to be identified; for each voice segment to be identified, extracting from it a fused voice feature used to express the characteristics of that voice segment; inputting the fused voice feature into a pre-trained spliced voice identification model, and determining the voice segment type of the voice segment to be identified; when the voice segment type of any voice segment to be identified indicates that it is a spliced voice segment, determining that the voice type of the voice to be identified is spliced voice; when the voice to be identified is spliced voice, performing smoothing processing on the spliced voice, and determining the target merged spliced voice segments included in the spliced voice; and determining the number of voice splicing points of the spliced voice based on the number of target merged spliced voice segments in the spliced voice, and determining the voice splicing position of the spliced voice based on the relative position of the target merged spliced voice segments in the spliced voice.
Therefore, fixing the number of voice frames in each voice training sample improves the coverage of splicing points in spliced voice; the voice training samples are subdivided into natural voice, homologous spliced voice and heterologous spliced voice, multiple voice features are extracted from every voice training sample and fused, and model training is performed with a combined LCNN and GRU neural network, yielding a voice identification model capable of identifying multiple voice types; performing voice identification in a sliding-window manner makes it possible to determine the number of voice splicing points and the splicing positions of the spliced voice; and smoothing the voice segments effectively improves the accuracy with which the number of voice splicing points and the splicing positions of the spliced voice are identified.
Referring to fig. 3 and fig. 4, fig. 3 is a first schematic structural diagram of a spliced voice identification apparatus according to an embodiment of the present application, and fig. 4 is a second schematic structural diagram of the spliced voice identification apparatus according to an embodiment of the present application. As shown in fig. 3, the identification apparatus 300 includes:
the cutting module 310 is configured to cut the acquired voice to be identified into a plurality of voice segments to be identified;
an extracting module 320, configured to, for each to-be-identified speech segment, extract a fused speech feature representing characteristics of the to-be-identified speech segment from the to-be-identified speech segment;
a speech segment identification module 330, configured to input the fusion speech features into a pre-trained spliced speech identification model, and determine a speech segment type of the speech segment to be identified;
a smoothing module 340, configured to, when a speech segment type of any speech segment to be identified indicates that the speech segment to be identified is a spliced speech segment, smooth all speech segments to be identified included in the speech to be identified, and determine whether the speech to be identified after being smoothed includes a target merged spliced speech segment;
an obtaining module 350, configured to determine that the voice type of the voice to be identified is a spliced voice when the voice to be identified includes a target merged spliced voice segment, and obtain at least one target merged spliced voice segment generated after smoothing processing;
and the splicing point identification module 360 is used for determining the number of the voice splicing points of the spliced voice based on the number of the target merged spliced voice sections in the spliced voice, and determining the voice splicing position of the spliced voice based on the relative position of the target merged spliced voice sections in the spliced voice.
Optionally, when the clipping module 310 is configured to clip the acquired speech to be identified into a plurality of speech segments to be identified, the clipping module 310 is configured to:
and according to the preset window length and window movement of the cutting window, moving the cutting window on the voice to be identified according to the window movement according to the time sequence, cutting the voice positioned in the cutting window in each moving process, and cutting a plurality of voice sections to be identified.
Optionally, the speech segment types include: a natural speech section and a spliced speech section;
the spliced speech segments include homologous spliced speech segments and heterologous spliced speech segments.
Optionally, the voice type includes natural voice and spliced voice;
the spliced voices comprise homologous spliced voices, heterologous spliced voices and mixed spliced voices.
Optionally, when the speech segment identification module 330 is configured to input the fused speech feature into a pre-trained spliced speech identification model, and determine a speech segment type of the speech segment to be identified, the speech segment identification module 330 is configured to:
for each voice section to be identified in the voice to be identified, inputting the fusion voice characteristics of the voice section to be identified into a pre-trained voice identification model, and determining the probability that the voice section to be identified belongs to each voice section type;
and determining the voice section type corresponding to the maximum value of the probability that the voice section to be identified belongs to each voice section type as the voice section type to which the voice section to be identified belongs.
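As a sketch of this maximum-probability decision, assuming the trained model is exposed as a function that returns one probability per segment type (the interface and the type names are illustrative assumptions):

```python
import numpy as np

# Assumed label set; the text names natural, homologous spliced and heterologous spliced segments.
SEGMENT_TYPES = ("natural", "homologous_spliced", "heterologous_spliced")

def classify_segment(predict_fn, fused_feature: np.ndarray) -> str:
    """Return the segment type whose predicted probability is the largest."""
    probs = np.asarray(predict_fn(fused_feature))  # one probability per segment type
    return SEGMENT_TYPES[int(np.argmax(probs))]
```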
Optionally, when the speech segment type of any speech segment to be identified indicates that the speech segment to be identified is a spliced speech segment, the smoothing processing module 340 is configured to smooth the speech to be identified, and determine whether the speech to be identified includes a target merged spliced speech segment, where the smoothing processing module 340 is configured to:
dividing the voice sections to be identified into at least one voice section group to be identified according to the time sequence; the voice section group to be identified comprises a preset first number of voice sections to be identified, wherein the preset first number of voice sections to be identified are time-continuous voice sections to be identified;
aiming at each voice segment group to be identified, determining the splicing type of each voice segment group to be identified according to the voice segment type of each voice segment to be identified in the voice segment group to be identified;
when the splicing type of the voice segment group to be identified is continuous splicing, combining a preset first number of voice segments to be identified in the voice segment group to be identified to generate a synthesized voice segment, and determining the synthesized voice segment as a target combined spliced voice segment;
and when the splicing type of any voice section group to be identified included in the voice to be identified is continuous splicing, determining that the voice to be identified includes a target merging spliced voice section.
Optionally, when the smoothing processing module 340 is configured to determine, for each to-be-identified speech segment group, a splicing type of the to-be-identified speech segment group according to a speech segment type of each to-be-identified speech segment in the to-be-identified speech segment group, the smoothing processing module 340 is configured to:
when the number of consecutive spliced voice segments in the voice segment group to be identified exceeds a preset second number, determining the splicing type of the voice segment group to be identified as continuous splicing.
Optionally, the splice point identification module 360 is configured to determine the number of the voice splice points of the spliced voice based on the number of the target merged spliced voice segments in the spliced voice, and when determining the voice splice position of the spliced voice based on the relative position of the target merged spliced voice segments in the spliced voice, the splice point identification module 360 is configured to:
determining the total number of target combined spliced voice segments included in the spliced voice as the number of voice splicing points of the spliced voice;
determining the mapping position of the middle position of each target merged spliced voice segment in the voice to be identified according to the mapping relation between the spliced voice and each target merged spliced voice segment;
and determining the mapping position of each target merged spliced voice segment as the voice splicing position of the voice to be identified.
Optionally, as shown in fig. 4, the identification apparatus 300 further includes a model building module 370, where the model building module 370 is configured to:
acquiring a voice training sample set formed by natural voice, homologous splicing voice and heterologous splicing voice; the frame number of each voice training sample in the voice training sample set is the same;
aiming at each voice training sample in the voice training sample set, performing voice feature extraction on the voice training sample by adopting a plurality of voice feature extraction methods to obtain a plurality of voice features of the voice training sample;
for each voice training sample in the voice training sample set, determining a fused voice feature of the voice training sample based on a Fisher criterion and a plurality of voice features of the voice training sample;
and performing iterative training on a preset neural network by using the fusion voice characteristics of each voice training sample in the voice training sample set to generate a spliced voice identification model.
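The Fisher-criterion fusion mentioned above is not spelled out in detail; one plausible reading, sketched here under that assumption, is to concatenate the extracted feature streams and weight each dimension by its Fisher ratio (between-class scatter over within-class scatter).

```python
import numpy as np

def fisher_ratio(features: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Per-dimension Fisher ratio: between-class scatter divided by within-class scatter."""
    classes = np.unique(labels)
    overall_mean = features.mean(axis=0)
    between = np.zeros(features.shape[1])
    within = np.zeros(features.shape[1])
    for c in classes:
        cls = features[labels == c]
        between += len(cls) * (cls.mean(axis=0) - overall_mean) ** 2
        within += ((cls - cls.mean(axis=0)) ** 2).sum(axis=0)
    return between / (within + 1e-8)

def fuse_features(feature_streams, labels):
    """Concatenate several voice features of the same training samples and
    weight each dimension by its Fisher ratio to form the fused voice feature."""
    stacked = np.concatenate(feature_streams, axis=1)  # (num_samples, total_feature_dim)
    return stacked * fisher_ratio(stacked, labels)
```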
Optionally, the preset neural network includes an LCNN sub-network and a GRU sub-network, and the activation function of the LCNN sub-network is a CELU function.
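A rough PyTorch sketch of such a combined network is shown below. Only the LCNN-followed-by-GRU structure and the CELU activation come from the text; all layer counts, channel sizes, kernel sizes and the pooling scheme are assumptions, and `feat_dim` is assumed divisible by 4.

```python
import torch
import torch.nn as nn

class SplicedSpeechDetector(nn.Module):
    """Illustrative LCNN-style convolutional front end with CELU activations,
    followed by a GRU and a per-segment-type classifier."""

    def __init__(self, feat_dim: int = 60, num_classes: int = 3):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.CELU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.CELU(),
            nn.MaxPool2d(2),
        )
        self.gru = nn.GRU(input_size=32 * (feat_dim // 4), hidden_size=64,
                          batch_first=True)
        self.fc = nn.Linear(64, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_frames, feat_dim) fused features of one voice segment
        h = self.cnn(x.unsqueeze(1))              # (batch, 32, num_frames / 4, feat_dim / 4)
        h = h.permute(0, 2, 1, 3).flatten(2)      # (batch, num_frames / 4, 32 * feat_dim / 4)
        _, last_hidden = self.gru(h)              # last_hidden: (1, batch, 64)
        return self.fc(last_hidden.squeeze(0))    # logits over the segment types
```

Training would then iterate over the fused features of the voice training samples with a standard cross-entropy loss over the natural, homologous spliced and heterologous spliced segment types.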
Referring to fig. 5, fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 5, the electronic device 500 includes a processor 510, a memory 520, and a bus 530.
The memory 520 stores machine-readable instructions executable by the processor 510, when the electronic device 500 runs, the processor 510 communicates with the memory 520 through the bus 530, and when the machine-readable instructions are executed by the processor 510, the steps of the method in the method embodiments shown in fig. 1 and fig. 2 may be performed.
An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method in the method embodiments shown in fig. 1 and fig. 2 may be executed.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present application, and are used for illustrating the technical solutions of the present application, but not limiting the same, and the scope of the present application is not limited thereto, and although the present application is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope disclosed in the present application; such modifications, changes or substitutions do not depart from the spirit and scope of the exemplary embodiments of the present application, and are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (12)

1. An authentication method of spliced voices, the authentication method comprising:
cutting the acquired voice to be identified into a plurality of voice sections to be identified;
aiming at each voice section to be identified, extracting fused voice features for expressing the characteristics of the voice section to be identified from the voice section to be identified;
inputting the fusion voice features into a pre-trained spliced voice identification model, and determining the voice section type of the voice section to be identified;
when the voice section type of any voice section to be identified indicates that the voice section to be identified is a spliced voice section, smoothing all the voice sections to be identified included in the voice to be identified, and determining whether the voice to be identified after smoothing includes a target merged spliced voice section or not;
when the voice to be identified comprises a target merged spliced voice section, determining the voice type of the voice to be identified as spliced voice, and acquiring at least one target merged spliced voice section generated after smoothing;
determining the number of voice splicing points of the spliced voice based on the number of target combined spliced voice sections in the spliced voice, and determining the voice splicing position of the spliced voice based on the relative position of the target combined spliced voice sections in the spliced voice;
when the voice segment type of any voice segment to be identified indicates that the voice segment to be identified is a spliced voice segment, performing smoothing processing on the voice to be identified to determine whether the voice to be identified comprises a target merged spliced voice segment, including:
dividing the voice sections to be identified into at least one voice section group to be identified according to the time sequence; the voice section group to be identified comprises a preset first number of voice sections to be identified, wherein the preset first number of voice sections to be identified are time-continuous voice sections to be identified;
aiming at each voice segment group to be identified, determining the splicing type of each voice segment group to be identified according to the voice segment type of each voice segment to be identified in the voice segment group to be identified;
when the splicing type of the voice segment group to be identified is continuous splicing, combining a preset first number of voice segments to be identified in the voice segment group to be identified to generate a synthesized voice segment, and determining the synthesized voice segment as a target combined spliced voice segment;
and when the splicing type of any voice section group to be identified included in the voice to be identified is continuous splicing, determining that the voice to be identified includes a target merging spliced voice section.
2. The authentication method according to claim 1, wherein the clipping the acquired speech to be authenticated into a plurality of speech segments to be authenticated comprises:
according to a preset window length and window shift of a cutting window, moving the cutting window over the voice to be identified in time order by the window shift, clipping the voice located inside the cutting window at each move, and thereby cutting out a plurality of voice segments to be identified.
3. An authentication method according to claim 1, wherein the speech segment types comprise: a natural speech section and a spliced speech section;
the spliced speech segments include homologous spliced speech segments and heterologous spliced speech segments.
4. The authentication method according to claim 1, wherein the voice type includes natural voice and spliced voice;
the spliced voices comprise homologous spliced voices, heterologous spliced voices and mixed spliced voices.
5. The identification method according to claim 1, wherein said inputting the fused speech feature into a pre-trained spliced speech identification model to determine the speech segment type of the speech segment to be identified comprises:
for each voice section to be identified in the voice to be identified, inputting the fusion voice characteristics of the voice section to be identified into a pre-trained voice identification model, and determining the probability that the voice section to be identified belongs to each voice section type;
and determining the voice section type corresponding to the maximum value of the probability that the voice section to be identified belongs to each voice section type as the voice section type to which the voice section to be identified belongs.
6. The method according to claim 1, wherein said determining, for each speech segment group to be identified, a splicing type of the speech segment group to be identified according to a speech segment type of each speech segment to be identified in the speech segment group to be identified comprises:
and when the number of continuous spliced voice sections in the voice section group to be identified exceeds a preset second number, determining the splicing type of the voice section group to be identified as continuous splicing.
7. The identification method according to claim 1, wherein the determining the number of voice splicing points of the spliced voice based on the number of target merged spliced voice segments in the spliced voice, and determining the voice splicing position of the spliced voice based on the relative position of the target merged spliced voice segments in the spliced voice comprises:
determining the total number of target combined spliced voice segments included in the spliced voice as the number of voice splicing points of the spliced voice;
determining the mapping position of the middle position of each target merged spliced voice segment in the voice to be identified according to the mapping relation between the spliced voice and each target merged spliced voice segment;
and determining the mapping position of each target merged spliced voice segment as the voice splicing position of the voice to be identified.
8. The authentication method of claim 1, wherein the spliced speech authentication model is constructed by:
acquiring a voice training sample set formed by natural voice, homologous splicing voice and heterologous splicing voice; the frame number of each voice training sample in the voice training sample set is the same;
aiming at each voice training sample in the voice training sample set, performing voice feature extraction on the voice training sample by adopting a plurality of voice feature extraction methods to obtain a plurality of voice features of the voice training sample;
for each voice training sample in the voice training sample set, determining a fused voice feature of the voice training sample based on a Fisher criterion and a plurality of voice features of the voice training sample;
and performing iterative training on a preset neural network by using the fusion voice characteristics of each voice training sample in the voice training sample set to generate a spliced voice identification model.
9. The method of claim 8, wherein the neural network comprises an LCNN sub-network and a GRU sub-network, and the activation function of the LCNN sub-network is a CELU function.
10. An authentication apparatus for spliced voices, said authentication apparatus comprising:
the cutting module is used for cutting the acquired voice to be identified into a plurality of voice sections to be identified;
the extraction module is used for extracting fused voice features used for expressing the characteristics of the voice sections to be identified from the voice sections to be identified aiming at each voice section to be identified;
the voice section identification module is used for inputting the fusion voice characteristics into a pre-trained spliced voice identification model and determining the voice section type of the voice section to be identified;
the smoothing processing module is used for smoothing all the voice sections to be identified included in the voice to be identified when the voice section type of any voice section to be identified indicates that the voice section to be identified is a spliced voice section, and determining whether the voice to be identified after smoothing processing includes a target merged spliced voice section or not;
the acquiring module is used for determining the voice type of the voice to be identified as spliced voice when the voice to be identified comprises a target merged spliced voice section, and acquiring at least one target merged spliced voice section generated after smoothing processing;
the spliced point identification module is used for determining the number of the voice spliced points of the spliced voice based on the number of the target combined spliced voice sections in the spliced voice and determining the voice splicing position of the spliced voice based on the relative position of the target combined spliced voice sections in the spliced voice;
the smoothing processing module is configured to, when a speech segment type of any speech segment to be identified indicates that the speech segment to be identified is a spliced speech segment, smooth the speech to be identified, and determine whether the speech to be identified includes a target merged spliced speech segment, the smoothing processing module is configured to:
dividing the voice sections to be identified into at least one voice section group to be identified according to the time sequence; the voice section group to be identified comprises a preset first number of voice sections to be identified, wherein the preset first number of voice sections to be identified are time-continuous voice sections to be identified;
aiming at each voice segment group to be identified, determining the splicing type of each voice segment group to be identified according to the voice segment type of each voice segment to be identified in the voice segment group to be identified;
when the splicing type of the voice segment group to be identified is continuous splicing, combining a preset first number of voice segments to be identified in the voice segment group to be identified to generate a synthesized voice segment, and determining the synthesized voice segment as a target combined spliced voice segment;
and when the splicing type of any voice section group to be identified included in the voice to be identified is continuous splicing, determining that the voice to be identified includes a target merging spliced voice section.
11. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the electronic device is operating, the machine-readable instructions being executed by the processor to perform the steps of the method for authenticating a spliced voice according to any one of claims 1 to 9.
12. A computer-readable storage medium, having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method for authenticating a spliced speech as claimed in any one of claims 1 to 9.
CN202111072051.1A 2021-09-14 2021-09-14 Spliced voice identification method and device, electronic equipment and storage medium Active CN113516969B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111072051.1A CN113516969B (en) 2021-09-14 2021-09-14 Spliced voice identification method and device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN113516969A CN113516969A (en) 2021-10-19
CN113516969B (en) 2021-12-14

Family

ID=78063336

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111072051.1A Active CN113516969B (en) 2021-09-14 2021-09-14 Spliced voice identification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113516969B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116543796B (en) * 2023-07-06 2023-09-15 腾讯科技(深圳)有限公司 Audio processing method and device, computer equipment and storage medium
CN118016051B (en) * 2024-04-07 2024-07-19 中国科学院自动化研究所 Model fingerprint clustering-based generated voice tracing method and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110782877A (en) * 2019-11-19 2020-02-11 合肥工业大学 Speech identification method and system based on Fisher mixed feature and neural network

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5799279A (en) * 1995-11-13 1998-08-25 Dragon Systems, Inc. Continuous speech recognition of text and commands
EP2996269A1 (en) * 2014-09-09 2016-03-16 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Audio splicing concept
CN108538312B (en) * 2018-04-28 2020-06-02 华中师范大学 Bayesian information criterion-based automatic positioning method for digital audio tamper points
CN108831506B (en) * 2018-06-25 2020-07-10 华中师范大学 GMM-BIC-based digital audio tamper point detection method and system
CN109284717A (en) * 2018-09-25 2019-01-29 华中师范大学 It is a kind of to paste the detection method and system for distorting operation towards digital audio duplication
CN111009238B (en) * 2020-01-02 2023-06-23 厦门快商通科技股份有限公司 Method, device and equipment for recognizing spliced voice
CN112102825B (en) * 2020-08-11 2021-11-26 湖北亿咖通科技有限公司 Audio processing method and device based on vehicle-mounted machine voice recognition and computer equipment
CN112992126B (en) * 2021-04-22 2022-02-25 北京远鉴信息技术有限公司 Voice authenticity verification method and device, electronic equipment and readable storage medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110782877A (en) * 2019-11-19 2020-02-11 合肥工业大学 Speech identification method and system based on Fisher mixed feature and neural network

Also Published As

Publication number Publication date
CN113516969A (en) 2021-10-19

Similar Documents

Publication Publication Date Title
CN113516969B (en) Spliced voice identification method and device, electronic equipment and storage medium
CN107393554B (en) Feature extraction method for fusion inter-class standard deviation in sound scene classification
CN108831506B (en) GMM-BIC-based digital audio tamper point detection method and system
CN110852215A (en) Multi-mode emotion recognition method and system and storage medium
CN111081223B (en) Voice recognition method, device, equipment and storage medium
CN112712809B (en) Voice detection method and device, electronic equipment and storage medium
CN110910891A (en) Speaker segmentation labeling method and device based on long-time memory neural network
CN111340035A (en) Train ticket identification method, system, equipment and medium
CN106302987A (en) A kind of audio frequency recommends method and apparatus
CN110570870A (en) Text-independent voiceprint recognition method, device and equipment
KR102334018B1 (en) Apparatus and method for validating self-propagated unethical text
CN113808612B (en) Voice processing method, device and storage medium
CN115565533A (en) Voice recognition method, device, equipment and storage medium
CN111785303A (en) Model training method, simulated sound detection method, device, equipment and storage medium
CN116153336B (en) Synthetic voice detection method based on multi-domain information fusion
CN116935889A (en) Audio category determining method and device, electronic equipment and storage medium
KR102113879B1 (en) The method and apparatus for recognizing speaker's voice by using reference database
Lee et al. Automatic melody extraction algorithm using a convolutional neural network
KR101925248B1 (en) Method and apparatus utilizing voice feature vector for optimization of voice authentication
CN111475634B (en) Representative speaking segment extraction device and method based on seat voice segmentation
Luna-Jiménez et al. GTH-UPM System for Albayzin Multimodal Diarization Challenge 2020
Gupta et al. Phoneme Discretized Saliency Maps for Explainable Detection of AI-Generated Voice
Darekar et al. Toward improved performance of emotion detection: multimodal approach
CN113495974B (en) Sound classification processing method, device, equipment and medium
CN116451678B (en) Data relation recognition and data table integration method

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant