CN112434263A - Method and device for extracting similar segments of audio file - Google Patents


Publication number
CN112434263A
CN112434263A
Authority
CN
China
Prior art keywords
audio, signal data, audio signal, file, data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011101357.0A
Other languages
Chinese (zh)
Inventor
徐单恒
Current Assignee
Hangzhou Ancun Network Technology Co ltd
Original Assignee
Hangzhou Ancun Network Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Ancun Network Technology Co ltd
Priority to CN202011101357.0A
Publication of CN112434263A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/10: Protecting distributed programs or content, e.g. vending or licensing of copyrighted material; Digital rights management [DRM]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60: Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683: Retrieval characterised by using metadata automatically derived from the content

Abstract

The application provides a method and a device for extracting similar segments of audio files. The method comprises: obtaining a first audio file and a second audio file; extracting audio signal data from the first audio file and the second audio file to obtain first audio signal data and second audio signal data; framing the first audio signal data and the second audio signal data; cutting out the framed first audio signal data and second audio signal data in a sliding manner with a predetermined duration and step size to obtain first audio segments and second audio segments; calculating the similarity of the first audio segments and the second audio segments; and extracting the similar audio segments in the first audio file and the second audio file. According to the technical solution of the application, audio segments are cut out and their similarities are calculated in turn to judge whether two audios are similar and to extract the similar audio segments, so that an infringed party can conveniently match and locate the infringing portion in the infringing audio.

Description

Method and device for extracting similar segments of audio file
Technical Field
The application relates to the field of audio copyright verification, in particular to a method and a device for extracting similar segments of an audio file.
Background
At present, with growing awareness of intellectual property, audio producers are increasingly conscious of the ownership of their own audio works and seek evidence immediately after infringement such as unauthorized recording or unauthorized cover performance occurs. In the prior art, the original audio and the infringing audio are both submitted as evidence in their entirety, yet the infringing audio often does not reproduce the original illegally from start to finish. As a result, considerable time is wasted locating the infringing portion during verification of the evidence.
In this background section, the above information disclosed is only for enhancement of understanding of the background of the application and therefore it may contain prior art information that does not constitute a part of the common general knowledge of a person skilled in the art.
Disclosure of Invention
The application provides a method and a device for extracting similar segments of an audio file, which are convenient for an infringed person to match and search an infringed part in infringed audio, and save the verification time of the infringed person.
According to an aspect of the present application, a method for extracting similar segments of audio files is provided, which includes obtaining a first audio file and a second audio file; extracting audio signal data of the first audio file and the second audio file to obtain first audio signal data and second audio signal data; framing the first audio signal data and the second audio signal data, i.e., dividing them in time; cutting out the framed first audio signal data and second audio signal data in a sliding manner with a predetermined duration and step size to obtain a first audio clip and a second audio clip; calculating the similarity of the first audio clip and the second audio clip; and, according to the similarity result, extracting similar audio clips in the first audio file and the second audio file.
According to some embodiments, the method further comprises pre-processing the first audio signal data and the second audio signal data before framing the first audio signal data and the second audio signal data.
According to some embodiments, pre-processing the first audio signal data and the second audio signal data comprises normalizing the first audio signal data and the second audio signal data; denoising the normalized first audio signal data and the normalized second audio signal data.
According to some embodiments, denoising the normalized first audio signal data and the normalized second audio signal data comprises obtaining function window data using a window function; performing a convolution operation on the normalized first audio signal data and the normalized second audio signal data respectively with the function window data; and respectively taking out the data of the completely overlapped region after the convolution of the first audio signal data with the function window data and after the convolution of the second audio signal data with the function window data.
According to some embodiments, the obtaining the function window data using the window function includes calculating weights using a hanning function to generate a hanning window, obtaining the function window data.
According to some embodiments, calculating the similarity of the first audio piece and the second audio piece comprises setting an audio similarity threshold; sequentially calculating the similarity of each first audio piece of the first audio signal data and each second audio piece of the second audio signal data using a cross-correlation function; and comparing the similarity with the similarity threshold.
According to some embodiments, extracting similar audio segments of the first and second audio files comprises recording start times and end times of the first and second audio segments, respectively, if the similarity is greater than the similarity threshold; respectively sequencing the first audio clip and the second audio clip according to the starting time of the audio clips; merging the first audio segment and the second audio segment with adjacent or intersecting start times; updating start and end times of the first and second audio segments; extracting similar audio segments of the first audio file and the second audio file respectively by using the updated start time and end time of the first audio segment and the second audio segment.
According to an aspect of the present application, an apparatus for extracting similar segments of an audio file is provided, which includes: the audio acquisition module is used for acquiring a first audio file and a second audio file; the audio data conversion module is used for extracting audio signal data of the first audio file and the second audio file to obtain first audio signal data and second audio signal data; the audio segmentation module is used for intercepting the first audio signal data and the second audio signal data in a sliding mode according to a preset frame rate, a preset duration and a preset step length to obtain a first audio segment and a second audio segment; the audio similarity calculation module is used for calculating the similarity of the first audio segment and the second audio segment; and the audio extraction module is used for extracting the audio similar segments of the first audio file and the second audio file.
According to some embodiments, the aforementioned apparatus further comprises an audio pre-processing module for pre-processing the first audio signal data and the second audio signal data.
According to some embodiments, the audio pre-processing module comprises a normalization module and a noise reduction module; the normalization module is configured to normalize the first audio signal data and the second audio signal data; the noise reduction module is used for reducing noise of the normalized first audio signal data and the normalized second audio signal data.
According to an aspect of the application, an electronic device is provided, comprising one or more processors and a storage means for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method as described above.
According to an aspect of the application, a computer-readable medium is provided, on which a computer program is stored which, when executed by a processor, carries out the method as described above.
According to the method and the device for extracting the similar segments of the audio file, provided by the embodiment of the application, whether the two audios are similar or not is judged and the similar audio segments are extracted by intercepting the audio segments and sequentially calculating the similarity of the intercepted audio segments, so that an infringed person can conveniently match and search an infringed part in the infringed audio, and the verification time of the infringed person is saved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The foregoing and other objects, features and advantages of the present application will become more apparent from the following detailed description of exemplary embodiments thereof, which is to be read in connection with the accompanying drawings for the purpose of more clearly illustrating the technical solutions in the embodiments of the present application. It will be clear to a person skilled in the art that other figures can also be obtained from these figures without going beyond the scope of protection claimed by the present application.
Fig. 1 shows a flowchart of a method of extracting similar segments of an audio file according to an example embodiment of the present application.
Fig. 2 shows a flow chart of a method of pre-processing audio signal data according to an example embodiment of the present application.
Fig. 3 shows a flow chart of a method of denoising audio signal data according to an example embodiment of the present application.
Fig. 4 shows a flowchart of a method of calculating similarity of audio signal data and extracting similar audio file segments according to an example embodiment of the present application.
Fig. 5 illustrates an apparatus diagram for extracting similar segments of an audio file according to an example embodiment of the present application.
Fig. 6 illustrates another apparatus diagram for extracting similar sections of an audio file according to an exemplary embodiment of the present application.
FIG. 7 illustrates an apparatus diagram for pre-processing audio signal data according to an example embodiment of the present application.
Fig. 8 shows a block diagram of an electronic device according to an example embodiment of the present application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals denote the same or similar parts in the drawings, and thus, a repetitive description thereof will be omitted.
The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the embodiments of the disclosure can be practiced without one or more of the specific details, or with other means, components, materials, devices, or operations. In such cases, well-known structures, methods, devices, implementations, materials, or operations are not shown or described in detail.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
The terms "first," "second," and the like in the description and claims of the present application and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Apparatus embodiments of the present application may be used to perform method embodiments of the present application. For details not disclosed in the apparatus embodiments of the present application, reference may be made to the method embodiments of the present application.
Technical solutions according to embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Fig. 1 shows a flowchart of a method of extracting similar segments of an audio file according to an example embodiment of the present application. A method of extracting similar sections of an audio file according to an embodiment of the present application will be described in detail below with reference to fig. 1.
Referring to fig. 1, at S101, a first audio file and a second audio file are acquired.
According to some example embodiments of the present application, the first audio file and the second audio file are, respectively, a copyrighted audio file and an audio file for which infringement needs to be determined. The first audio file and the second audio file can be stored locally or in the cloud.
In S103, audio signal data of the first audio file and the second audio file are extracted to obtain first audio signal data and second audio signal data.
According to some example embodiments of the present application, an audio signal of an audio file uploaded by a user may be read with an open source library pydub.
According to some example embodiments of the present application, after the audio signal data of the first audio file and the second audio file are acquired, a preprocessing operation needs to be performed on the audio signal data. According to some examples of the application, the preprocessing operation may be performed on the audio signal data using the open source library numpy. The specific preprocessing steps are shown in Fig. 2.
At S105, the first audio signal data and the second audio signal data are frame-processed.
Since the first audio signal data and the second audio signal data extracted at S103 consist of an audio array and audio attribute information, and audio similarity is compared in units of time, the first audio signal data and the second audio signal data must first be divided in time. According to some example embodiments of the present application, a framing process is performed on the first audio signal data and the second audio signal data using the frame rate, dividing them into blocks. The audio attribute information includes the frame rate of the audio, that is, how many frames of data are read per second, one audio datum being one frame.
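The framing step above can be sketched in Python. This is a minimal illustration under the assumption that the signal is a plain list of samples; the function name `frame_blocks` is chosen here for illustration only:

```python
def frame_blocks(data, frame_rate):
    """Divide audio signal data into one-second blocks of frame_rate
    samples each, i.e. the framing process of S105."""
    return [data[i:i + frame_rate] for i in range(0, len(data), frame_rate)]
```

With 1000 samples at a frame rate of 20, this yields 50 one-second blocks, matching the blocking result of Table 1.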
At S107, the first audio signal data and the second audio signal data are cut out in a sliding manner with a predetermined duration and step size to obtain a first audio segment and a second audio segment.
According to some embodiments of the present application, the sliding segmentation is performed on the first audio signal data and the second audio signal data by a predetermined length and step size using the blocking result of the framing process of S105. Specific segmentation operations can be seen in tables 1-3.
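The sliding cut-out of S107 can be sketched as follows, assuming the blocking result is a list of one-second blocks; the helper name `slide_segments` is illustrative:

```python
def slide_segments(blocks, duration_s, step_s):
    """Slide a window of duration_s seconds over the one-second blocks in
    steps of step_s seconds and cut out the audio segments."""
    segments = []
    for start in range(0, len(blocks) - duration_s + 1, step_s):
        window = blocks[start:start + duration_s]
        # flatten the blocks of one window back into a single sample list
        segments.append([sample for block in window for sample in block])
    return segments
```

With 50 one-second blocks, a 10-second duration and a 5-second step this produces the 9 segments of Table 2.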
At S109, the similarity of the first audio piece and the second audio piece is calculated.
According to some example embodiments of the present application, the similarity of the first audio piece and the second audio piece is calculated using a cross-correlation function. As will be readily understood by those skilled in the art, the correlation of two sets of data is typically compared using a cross-correlation function; the basis for choosing it to compare audio similarity is as follows. The closer the absolute value of the result calculated by the cross-correlation function is to 1, the more similar the two audio pieces are; the closer the absolute value is to 0, the less similar the two audio pieces are. When the absolute value equals 0, the two audio pieces are completely dissimilar.
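One common normalized form of the cross-correlation (a Pearson-style coefficient) can be sketched as follows; the patent does not fix the exact formula, so this is an assumption for illustration:

```python
import math

def cross_corr_similarity(a, b):
    """Normalized cross-correlation of two equal-length segments: |r| near 1
    means very similar, near 0 means dissimilar."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a)
                    * sum((y - mean_b) ** 2 for y in b))
    return num / den if den else 0.0
```

A segment compared with a scaled copy of itself yields an absolute value of 1, while an inverted copy yields -1, whose absolute value is also 1.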
At S111, similar audio clips in the first audio file and the second audio file are extracted according to the result of the similarity.
According to some example embodiments of the present application, using the result of the calculation of S109, the start time and the end time of the similar audio piece are determined, and then the pieces of similar audio of the first audio file and the second audio file are extracted.
According to some example embodiments of the present application, specific steps of calculating the similarity of the first audio piece and the second audio piece and extracting similar audio pieces of the first audio file and the second audio file may refer to fig. 4.
Fig. 2 shows a flow chart of a method of pre-processing audio signal data according to an example embodiment of the present application.
Referring to fig. 2, according to some example embodiments of the present application, the process of preprocessing includes normalizing audio signal data S201 and a noise reduction audio signal S203.
At S201, the first audio signal data and the second audio signal data are normalized.
The audio signal data are normalized to eliminate scale differences between the first audio signal data and the second audio signal data so that the two audio data become comparable. According to some embodiments of the present application, normalization methods include maximum-minimum normalization, mean normalization, and non-linear normalization.
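As a minimal sketch of the first listed method, maximum-minimum normalization maps both signals onto [0, 1]; the helper name is illustrative:

```python
def min_max_normalize(data):
    """Maximum-minimum normalization: map samples to [0, 1] so the two
    audio signals become comparable regardless of their original scales."""
    lo, hi = min(data), max(data)
    if hi == lo:                        # constant signal: nothing to scale
        return [0.0 for _ in data]
    return [(x - lo) / (hi - lo) for x in data]
```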
At S203, the normalized first audio signal data and second audio signal data are denoised.
The purpose of denoising the normalized first audio signal data and the normalized second audio signal data is to remove spurious peaks in the audio signal data and reduce the influence of noise on the result. According to some embodiments of the present application, the specific denoising steps are shown in Fig. 3.
Fig. 3 shows a flow chart of a method of denoising audio signal data according to an example embodiment of the present application.
Referring to fig. 3, at S2031, function window data is generated using a window function.
The function window is generated by calling a window function. Common window functions are rectangular windows (i.e., no window), triangular windows, hanning windows, hamming windows, gaussian windows, etc. According to some example embodiments of the present application, the hanning window data is generated by computing weights by calling a hanning window function.
At S2033, the normalized first audio signal data and the normalized second audio signal data are respectively subjected to convolution operation using the function window data.
In S2035, the data of the completely overlapped region after the convolution of the first audio signal data with the function window data, and of the completely overlapped region after the convolution of the second audio signal data with the function window data, are taken out; these are the denoised first audio signal data and the denoised second audio signal data.
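The denoising steps S2031 to S2035 can be sketched with numpy, which the description names for preprocessing; the window length of 11 is an arbitrary assumption, and numpy's "valid" convolution mode keeps exactly the completely overlapped region:

```python
import numpy as np

def hanning_denoise(signal, win_len=11):
    """Smooth a signal by convolving it with normalized Hanning-window
    weights and keeping only the fully overlapped region."""
    window = np.hanning(win_len)        # S2031: weights from the Hanning function
    weights = window / window.sum()     # normalize so the signal level is preserved
    return np.convolve(signal, weights, mode="valid")   # S2033 and S2035
```

For a signal of length N the result has length N - win_len + 1, since only the fully overlapped part of the convolution is kept.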
Fig. 4 shows a flowchart of a method of calculating similarity of audio signal data and extracting similar audio file segments according to an example embodiment of the present application.
Referring to fig. 4, at S1071, an audio similarity threshold is set.
According to some example embodiments of the present application, a cross-correlation function is used to calculate the similarity of two audio pieces. The closer the absolute value of the result calculated by the cross-correlation function is to 1, the more similar the two audio pieces are; the closer the absolute value is to 0, the less similar they are. When the absolute value equals 0, the two audio pieces are completely dissimilar. According to some example embodiments of the present application, the audio similarity threshold is set to 0.7.
At S1073, the similarity of each first audio piece of the first audio signal data and each second audio piece of the second audio signal data is sequentially calculated using the cross-correlation function.
According to some example embodiments of the present application, each first audio piece of the first audio signal data needs to be compared pairwise with each second audio piece of the second audio signal data.
At S1075, the similarity is compared with the similarity threshold.
At S1091, if the similarity is greater than the similarity threshold, the start times and end times of the first audio clip and the second audio clip are recorded, respectively.
According to some example embodiments of the present application, if the calculated similarity of two audio segments is greater than a set audio similarity threshold, the start time and the end time of the two audio segments are recorded. According to the step, pairwise similarities of all first audio segments of the first audio signal data and all second audio segments of the second audio signal data are sequentially calculated.
At S1093, the first audio piece and the second audio piece are sorted by the start time of the audio piece, respectively.
According to some example embodiments of the present application, all similar first audio pieces of the first audio data and all similar second audio pieces of the second audio data are respectively sorted by their recorded start times.
At S1095, the first audio piece and the second audio piece of adjacent or intersecting start times are merged.
According to some example embodiments of the present application, after sorting all similar first audio pieces of the first audio data and all similar second audio pieces of the second audio data, a merging operation is performed on audio pieces having neighboring or intersecting start times.
At S1097, the start times and end times of the first audio piece and the second audio piece are updated.
According to some example embodiments of the present application, after merging audio segments with adjacent or intersecting start times, the start times and the end times of the first audio segment and the second audio segment need to be updated, so as to obtain final similar start times and end times of the first audio segment and the second audio segment.
At S1099, similar audio clips of the first audio file and the second audio file are extracted using the updated start time and end time of the first audio clip and the second audio clip, respectively.
According to some example embodiments of the present application, similar audio segments of the first audio file and the second audio file are extracted according to the start time and the end time of the similar audio segments of the first audio file and the second audio file obtained in S1097.
According to the technical solution, by cutting out audio segments and sequentially calculating the similarity of the cut-out audio segments, whether two audios are similar is judged and the similar audio segments are extracted, so that an infringed party can conveniently match and locate the infringing portion in the infringing audio, improving the efficiency and accuracy of finding infringing audio.
The following describes the specific implementation method according to the technical solution of the present application in detail according to the specific implementation examples shown in tables 1 to 3.
According to some example embodiments of the present application, if the sampling rates of the first audio file and the second audio file differ, the audio file with the higher sampling rate needs to be resampled so that its sampling frequency matches the audio with the lower sampling rate. For example, if the original audio is 44 kHz and the infringing audio is 22 kHz, the comparison is performed at 22 kHz; that is, the original audio array is sampled at intervals (e.g., [1, 2, 3, 4, ...] becomes [1, 3, ...]).
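The interval sampling described above is plain decimation, which can be sketched as follows; note that a production resampler would normally apply a low-pass filter first, which the description does not mention:

```python
def downsample(data, src_rate, dst_rate):
    """Resample the higher-rate audio by keeping every n-th sample so its
    rate matches the lower-rate audio (interval sampling)."""
    factor = src_rate // dst_rate       # e.g. 44 kHz / 22 kHz -> 2
    return data[::factor]
```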
For convenience of description, it is assumed that the first audio file and the second audio file have the same sampling rate and duration. The following takes the first audio file as an example and details the segmentation process. The segmentation process of the second audio file is consistent with that of the first audio file and is not described again.
Assume that the total length of the audio signal data read from the first audio file is 1000 and that the frame rate is 20 audio signal data per second, so the 1000 audio signal data correspond to 50 seconds in total.
When the 1000 audio signal data are segmented, firstly, frame division processing is performed according to a frame rate, that is, the 1000 audio signal data are segmented according to the number of frames read per second.
Assuming that the frame rate is 20 audio signal data per second, the block structure of 1000 audio signal data is as shown in table 1.
Table 1: audio signal data blocking result
Block name Audio frame start and end positions
Second 1 0~20
Second 2 21~40
Second 3 41~60
Second 4 61~80
Second 5 81~100
Second 6 101~120
Second 7 121~140
Second 8 141~160
Second 9 161~180
Second 10 181~200
…
Second 40 781~800
Second 41 801~820
Second 42 821~840
Second 43 841~860
Second 44 861~880
Second 45 881~900
Second 46 901~920
Second 47 921~940
Second 48 941~960
Second 49 961~980
Second 50 981~1000
The 50 seconds of audio data are then segmented according to the set step size and duration. Assuming a step size of 5 seconds and a duration of 10 seconds, the segmentation result of the 50-second audio data is shown in Table 2. The segmentation uses the duration as a window that slides over the 50-second audio data by the step value.
Table 2: audio signal data segmentation results.
Name of segment Segment time
Paragraph 1 1s~10s
Paragraph 2 5s~15s
Paragraph 3 10s~20s
Paragraph 4 15s~25s
Paragraph 5 20s~30s
Paragraph 6 25s~35s
Paragraph 7 30s~40s
Paragraph 8 35s~45s
Paragraph 9 40s~50s
According to some example embodiments of the present application, when segmenting audio with a duration of 10 seconds, any audio piece shorter than 10 seconds is padded with zeros up to 10 seconds. For example, suppose the first audio signal data lasts 46 seconds and the second audio signal data lasts 48 seconds. When segmenting with a duration of 10 seconds, the last group of the first audio signal data covers 40-50 seconds and needs 4 seconds of empty data appended; the last group of the second audio signal data covers 40-50 seconds and needs 2 seconds of empty data appended.
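The zero-padding of short segments can be sketched as follows (`pad_to_duration` is an illustrative name):

```python
def pad_to_duration(segment, duration_s, frame_rate):
    """Pad an audio segment shorter than duration_s seconds with zeros
    ("empty data") so that all segments have equal length."""
    target = duration_s * frame_rate
    return segment + [0] * max(0, target - len(segment))
```

In the 46-second example above, the last segment holds 6 seconds of data (120 samples at a frame rate of 20) and is padded to the full 200 samples.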
After the first audio signal data and the second audio signal data are segmented, the first audio segment of the first audio signal data and the second audio segment of the second audio signal data need to be compared in sequence, and the specific comparison mode is shown in table 3.
Table 3: audio clip comparison table
First audio clip Second audio clip
Paragraph 1 Paragraph 1
Paragraph 1 Paragraph 2
Paragraph 1 Paragraph 3
Paragraph 1 Paragraph 4
Paragraph 1 Paragraph 5
Paragraph 1 Paragraph 6
Paragraph 1 Paragraph 7
Paragraph 1 Paragraph 8
Paragraph 1 Paragraph 9
Paragraph 2 Paragraph 1
Paragraph 2 Paragraph 2
Paragraph 2 Paragraph 3
Paragraph 2 Paragraph 4
Paragraph 2 Paragraph 5
Paragraph 2 Paragraph 6
Paragraph 2 Paragraph 7
Paragraph 2 Paragraph 8
Paragraph 2 Paragraph 9
…
Paragraph 9 Paragraph 1
Paragraph 9 Paragraph 2
Paragraph 9 Paragraph 3
Paragraph 9 Paragraph 4
Paragraph 9 Paragraph 5
Paragraph 9 Paragraph 6
Paragraph 9 Paragraph 7
Paragraph 9 Paragraph 8
Paragraph 9 Paragraph 9
After each first audio segment of the first audio signal data is compared pairwise with each second audio segment of the second audio signal data, if the comparison result is greater than the preset audio similarity threshold, the start times and end times of the first audio segment and the second audio segment are recorded.
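The pairwise comparison and time recording can be sketched as follows; the similarity function is passed in as a parameter, the segment start times follow from the step size as in Table 2, and all names are illustrative:

```python
def record_similar_pairs(segs_a, segs_b, step_s, duration_s, threshold, sim):
    """Compare every first audio segment with every second audio segment
    (as in Table 3) and record the (start, end) times of every pair whose
    similarity exceeds the threshold."""
    times_a, times_b = [], []
    for i, a in enumerate(segs_a):
        for j, b in enumerate(segs_b):
            if sim(a, b) > threshold:
                times_a.append((i * step_s, i * step_s + duration_s))
                times_b.append((j * step_s, j * step_s + duration_s))
    return times_a, times_b
```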
Assume that the similar audio segments of the first audio file and the second audio file are [(0,10), (5,15), (25,35)] and [(10,20), (25,35), (35,45)], respectively. After the similar audio segments are obtained, those with adjacent or intersecting start times are merged according to their start and end times; the merged results for the first audio file and the second audio file are [(0,15), (25,35)] and [(10,20), (25,45)], respectively. The merged times are then multiplied by the frame rate to obtain the actual similar audio segment positions, i.e., [(0, 15×20), (25×20, 35×20)] and [(10×20, 20×20), (25×20, 45×20)].
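The merging of adjacent or intersecting time intervals can be sketched as follows, with the worked example above serving as a check:

```python
def merge_intervals(intervals):
    """Sort recorded (start, end) times by start time and merge every pair
    of intervals that is adjacent or intersecting."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:          # touches or overlaps
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Applied to the example, [(0,10), (5,15), (25,35)] merges to [(0,15), (25,35)], and [(10,20), (25,35), (35,45)] merges to [(10,20), (25,45)].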
According to the obtained intervals, the similar audio segments are extracted from the first audio file and the second audio file, respectively; these extracted segments constitute the final desired audio evidence files of the infringement.
Fig. 5 illustrates an apparatus diagram for extracting similar segments of an audio file according to an example embodiment of the present application.
As shown in fig. 5, the apparatus for extracting similar segments of an audio file includes an audio acquisition module 501, an audio data conversion module 503, an audio segmentation module 505, an audio similarity calculation module 507, and an audio extraction module 509.
The audio acquisition module 501 is configured to obtain a first audio file and a second audio file.
The audio data conversion module 503 is configured to extract audio signal data of the first audio file and the second audio file to obtain first audio signal data and second audio signal data.
The audio segmentation module 505 is configured to slidingly intercept the first audio signal data and the second audio signal data at a predetermined frame rate, duration, and step length to obtain a first audio segment and a second audio segment.
The audio similarity calculation module 507 is configured to calculate the similarity of the first audio segment and the second audio segment.
The audio extraction module 509 is configured to extract the similar audio segments of the first audio file and the second audio file.
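As a rough illustration of the segmentation module's sliding interception, the following sketch cuts a one-dimensional signal into fixed-duration windows at a given frame rate and step length (the function and parameter names are assumptions for illustration, not from the patent):

```python
def slide_segments(signal, frame_rate, duration, step):
    """Slidingly intercept fixed-length windows from a 1-D sample sequence.

    frame_rate: frames per second; duration and step are in seconds.
    Returns a list of (start_time, end_time, samples) tuples.
    """
    win = int(duration * frame_rate)   # window length in frames
    hop = int(step * frame_rate)       # hop length in frames
    return [(i / frame_rate, (i + win) / frame_rate, signal[i:i + win])
            for i in range(0, len(signal) - win + 1, hop)]
```

For example, a 10-second signal at 10 frames/s, with a 2-second window and a 1-second step, yields 9 overlapping segments.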
Fig. 6 illustrates another apparatus diagram for extracting similar sections of an audio file according to an exemplary embodiment of the present application.
Compared with the apparatus of fig. 5, the apparatus shown in fig. 6 additionally includes an audio pre-processing module 511.
The audio pre-processing module 511 is configured to pre-process the first audio signal data and the second audio signal data.
According to some example embodiments of the present application, as shown in fig. 7, the audio pre-processing module 511 includes an audio normalization module 5111 and an audio noise reduction module 5113.
The normalization module 5111 is configured to normalize the first audio signal data and the second audio signal data;
the noise reduction module 5113 is configured to reduce noise of the normalized first audio signal data and the normalized second audio signal data.
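A minimal sketch of this pre-processing chain, assuming peak normalization and a 31-point Hanning smoothing window (both the window length and the normalization choice are illustrative assumptions), using numpy's 'valid' convolution mode to keep only the fully overlapped region:

```python
import numpy as np

def preprocess(signal):
    """Normalize a 1-D signal to [-1, 1], then smooth it by convolving with
    a Hanning window, keeping only the fully overlapped region."""
    x = np.asarray(signal, dtype=float)
    x = x / np.max(np.abs(x))          # peak normalization (illustrative choice)
    w = np.hanning(31)                 # Hanning window weights (length assumed)
    w = w / w.sum()                    # unit gain so amplitudes are preserved
    return np.convolve(x, w, mode='valid')  # fully overlapped region only
```

With mode='valid', an input of N samples and a window of length M produce N − M + 1 output samples, which corresponds to taking only the region where the signal and the function window completely overlap.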
FIG. 8 shows a block diagram of an electronic device according to an example embodiment.
An electronic device 200 according to this embodiment of the present application is described below with reference to fig. 8. The electronic device 200 shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 8, the electronic device 200 is embodied in the form of a general purpose computing device. The components of the electronic device 200 may include, but are not limited to: at least one processing unit 210, at least one memory unit 220, a bus 230 connecting different system components (including the memory unit 220 and the processing unit 210), a display unit 240, and the like.
Wherein the storage unit stores program code that can be executed by the processing unit 210 such that the processing unit 210 performs the methods according to various exemplary embodiments of the present application described herein. For example, processing unit 210 may perform a method as shown in fig. 2.
The storage unit 220 may include readable media in the form of volatile storage units, such as a random access memory unit (RAM)2201 and/or a cache memory unit 2202, and may further include a read only memory unit (ROM) 2203.
The storage unit 220 may also include a program/utility 2204 having a set (at least one) of program modules 2205, such program modules 2205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 230 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 200 may also communicate with one or more external devices 300 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 200, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 200 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 250. Also, the electronic device 200 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 260. The network adapter 260 may communicate with other modules of the electronic device 200 via the bus 230. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 200, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The theoretical basis for comparing the similarity of two sets of audio data using a cross-correlation function is detailed below.
In practical applications, the similarity of two audio signals can be compared through the degree of similarity of their waveforms. Assume that the two signals are x(t) and y(t), respectively; an appropriate multiple a can be selected so that a·y(t) approximates x(t). The value of a is obtained by finding the extremum of an error energy function, where the error energy is given by:

    ε = ∫ [x(t) − a·y(t)]² dt        (1)

Differentiating with respect to a and setting the derivative to zero yields the optimal value of a:

    a = ∫ x(t)y(t) dt / ∫ y²(t) dt        (2)

Substituting the optimal value (2) of a into the error energy formula (1) gives the minimum of the error energy:

    ε_min = ∫ x²(t) dt − [∫ x(t)y(t) dt]² / ∫ y²(t) dt        (3)

The first term on the right side of the equation represents the energy of the original signal x(t). Normalizing the above equation by the original signal energy turns it into a relative error:

    ε_min / ∫ x²(t) dt = 1 − [∫ x(t)y(t) dt]² / (∫ x²(t) dt · ∫ y²(t) dt)        (4)

Let

    ρ_xy = ∫ x(t)y(t) dt / √(∫ x²(t) dt · ∫ y²(t) dt)        (5)

Then the normalized error energy function (4) can be expressed as:

    ε_min / ∫ x²(t) dt = 1 − ρ_xy²        (6)

ρ_xy is usually called the correlation coefficient of the signals y(t) and x(t); since x(t) and y(t) are both real signals, ρ_xy is a real number. Furthermore, by the Schwarz inequality for integrals,

    [∫ x(t)y(t) dt]² ≤ ∫ x²(t) dt · ∫ y²(t) dt        (7)

it follows that |ρ_xy| ≤ 1.

Since the energy of an energy-limited signal is fixed, the correlation coefficient ρ_xy is determined only by the integral of x(t)y(t). If two completely dissimilar waveforms have amplitudes and occurrence times that are mutually independent, then x(t)y(t) integrates to 0, so their correlation coefficient is 0 and their similarity is worst, i.e., they are uncorrelated. When the correlation coefficient is 1, the error energy is 0, indicating that the two signals are highly similar and linearly correlated. It is therefore entirely reasonable to use the correlation coefficient as a measure of the similarity (or linear correlation) of two signal waveforms.
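A discrete version of the correlation coefficient ρ_xy described above, with the integrals replaced by sums over signal samples, can be sketched as:

```python
import numpy as np

def correlation_coefficient(x, y):
    """rho_xy = sum(x*y) / sqrt(sum(x^2) * sum(y^2)), the discrete analogue
    of the integral definition; by the Schwarz inequality |rho_xy| <= 1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    num = np.dot(x, y)
    den = np.sqrt(np.dot(x, x) * np.dot(y, y))
    return num / den
```

Linearly correlated signals (y = a·x) give ρ_xy = ±1, while orthogonal waveforms give ρ_xy = 0, matching the discussion above.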
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. The technical solution according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, or a network device, etc.) to execute the above method according to the embodiments of the present application.
The software product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java or C++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform the functions described above.
Those skilled in the art will appreciate that the modules described above may be distributed in the apparatus as described in the embodiments, or may be changed correspondingly and located in one or more apparatuses different from those of the embodiments. The modules of the above embodiments may be combined into one module, or further split into multiple sub-modules.
According to some embodiments of the present application, audio segments are intercepted and their similarities are calculated in sequence to judge whether two audios are similar and to extract the similar audio segments, so that an infringed party can conveniently locate the infringing part in the infringing audio, which improves the efficiency and accuracy of searching for infringing audio.
The foregoing detailed description of the embodiments illustrates the principles and implementations of the present application; the description of the embodiments is only intended to help in understanding the method of the present application and its core concepts. Meanwhile, a person skilled in the art may, following the idea of the present application, make changes to the specific embodiments and the scope of application. In view of the above, the content of this description should not be construed as limiting the present application.

Claims (12)

1. A method of extracting similar segments of an audio file, comprising:
acquiring a first audio file and a second audio file;
extracting audio signal data of the first audio file and the second audio file to obtain first audio signal data and second audio signal data;
frame-dividing the first audio signal data and the second audio signal data, and time-dividing the first audio signal data and the second audio signal data;
slidingly intercepting the framed first audio signal data and the framed second audio signal data according to a preset duration and step length to obtain a first audio segment and a second audio segment;
calculating the similarity of the first audio segment and the second audio segment;
and extracting similar audio segments from the first audio file and the second audio file according to the similarity result.
2. The method of claim 1, further comprising, prior to framing the first audio signal data and the second audio signal data:
pre-processing the first audio signal data and the second audio signal data.
3. The method of claim 2, wherein pre-processing the first audio signal data and the second audio signal data comprises:
normalizing the first audio signal data and the second audio signal data;
denoising the normalized first audio signal data and the normalized second audio signal data.
4. The method of claim 3, wherein denoising the normalized first audio signal data and the normalized second audio signal data comprises:
obtaining function window data by using a window function;
performing convolution operation on the normalized first audio signal data and the normalized second audio signal data respectively by using the function window data;
and respectively taking out the data of the completely overlapped regions after the convolution operation of the first audio signal data with the function window data and after the convolution operation of the second audio signal data with the function window data.
5. The method of claim 4, wherein the using a window function to obtain function window data comprises:
and calculating the weight by using a Hanning function to generate a Hanning window, and obtaining the function window data.
6. The method of claim 1, wherein calculating the similarity of the first audio segment and the second audio segment comprises:
setting an audio similarity threshold;
sequentially calculating the similarity of the first audio segment of the first audio signal data and the second audio segment of the second audio signal data by using a cross-correlation function; and
comparing the similarity with the similarity threshold.
7. The method of claim 6, wherein extracting similar audio segments of the first audio file and the second audio file comprises:
if the similarity is greater than the similarity threshold, recording the start time and the end time of the first audio clip and the second audio clip respectively;
respectively sequencing the first audio clip and the second audio clip according to the starting time of the audio clips;
merging the first audio segment and the second audio segment with adjacent or intersecting start times;
updating start and end times of the first and second audio segments;
extracting similar audio segments of the first audio file and the second audio file respectively by using the updated start time and end time of the first audio segment and the second audio segment.
8. An apparatus for extracting similar segments of an audio file, comprising:
the audio acquisition module is used for acquiring a first audio file and a second audio file;
the audio data conversion module is used for extracting audio signal data of the first audio file and the second audio file to obtain first audio signal data and second audio signal data;
the audio segmentation module is used for intercepting the first audio signal data and the second audio signal data in a sliding mode according to a preset frame rate, a preset duration and a preset step length to obtain a first audio segment and a second audio segment;
the audio similarity calculation module is used for calculating the similarity of the first audio segment and the second audio segment;
and the audio extraction module is used for extracting the audio similar segments of the first audio file and the second audio file.
9. The apparatus of claim 8, further comprising:
an audio pre-processing module for pre-processing the first audio signal data and the second audio signal data.
10. The apparatus of claim 9, wherein:
the audio preprocessing module comprises a normalization module and a noise reduction module;
the normalization module is configured to normalize the first audio signal data and the second audio signal data;
the noise reduction module is used for reducing noise of the normalized first audio signal data and the normalized second audio signal data.
11. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
12. A computer-readable medium having a computer program stored thereon, comprising:
the program when executed by a processor implementing the method of any one of claims 1 to 7.
CN202011101357.0A 2020-10-15 2020-10-15 Method and device for extracting similar segments of audio file Pending CN112434263A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011101357.0A CN112434263A (en) 2020-10-15 2020-10-15 Method and device for extracting similar segments of audio file

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011101357.0A CN112434263A (en) 2020-10-15 2020-10-15 Method and device for extracting similar segments of audio file

Publications (1)

Publication Number Publication Date
CN112434263A true CN112434263A (en) 2021-03-02

Family

ID=74690669

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011101357.0A Pending CN112434263A (en) 2020-10-15 2020-10-15 Method and device for extracting similar segments of audio file

Country Status (1)

Country Link
CN (1) CN112434263A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113556491A (en) * 2021-07-08 2021-10-26 上海松鼠课堂人工智能科技有限公司 Online teaching screen recording method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103440270A (en) * 2013-08-02 2013-12-11 清华大学 System and method for realizing audio file repeating pattern finding
CN107204183A (en) * 2016-03-18 2017-09-26 百度在线网络技术(北京)有限公司 A kind of audio file detection method and device
CN107967922A (en) * 2017-12-19 2018-04-27 成都嗨翻屋文化传播有限公司 A kind of music copyright recognition methods of feature based
CN108021635A (en) * 2017-11-27 2018-05-11 腾讯科技(深圳)有限公司 The definite method, apparatus and storage medium of a kind of audio similarity
CN108763492A (en) * 2018-05-29 2018-11-06 四川远鉴科技有限公司 A kind of audio template extracting method and device


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Yao Zhongping, "Digital Signal Processing: MATLAB Implementation and Experiments", Shanghai Jiao Tong University Press, 31 August 2019 *
Jiang Nan, "Speech Signal Recognition Technology and Practice", Northeastern University Press, 31 January 2020 *


Similar Documents

Publication Publication Date Title
CN110569335B (en) Triple verification method and device based on artificial intelligence and storage medium
US11907659B2 (en) Item recall method and system, electronic device and readable storage medium
CN110532352B (en) Text duplication checking method and device, computer readable storage medium and electronic equipment
Kiktova-Vozarikova et al. Feature selection for acoustic events detection
CN110866110A (en) Conference summary generation method, device, equipment and medium based on artificial intelligence
CN110705235B (en) Information input method and device for business handling, storage medium and electronic equipment
CN110825941A (en) Content management system identification method, device and storage medium
CN110377750B (en) Comment generation method, comment generation device, comment generation model training device and storage medium
CN112988753B (en) Data searching method and device
CN111078849A (en) Method and apparatus for outputting information
CN112434263A (en) Method and device for extracting similar segments of audio file
CN113096687B (en) Audio and video processing method and device, computer equipment and storage medium
CN112084448B (en) Similar information processing method and device
CN111488450A (en) Method and device for generating keyword library and electronic equipment
CN116561298A (en) Title generation method, device, equipment and storage medium based on artificial intelligence
US9779076B2 (en) Utilizing classification and text analytics for optimizing processes in documents
CN112685534B (en) Method and apparatus for generating context information of authored content during authoring process
CN115859273A (en) Method, device and equipment for detecting abnormal access of database and storage medium
CN112115362B (en) Programming information recommendation method and device based on similar code recognition
CN115858776A (en) Variant text classification recognition method, system, storage medium and electronic equipment
CN113053393B (en) Audio annotation processing device
CN113821650A (en) Information retrieval system based on big data
CN110378378B (en) Event retrieval method and device, computer equipment and storage medium
CN113505595A (en) Text phrase extraction method and device, computer equipment and storage medium
CN107168997B (en) Webpage originality assessment method and device based on artificial intelligence and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210302)