CN114495984A - Real-time audio stream comparison method and system - Google Patents


Info

Publication number
CN114495984A
Authority
CN
China
Prior art keywords
audio
frame
matching
main
frames
Prior art date
Legal status
Granted
Application number
CN202210335546.7A
Other languages
Chinese (zh)
Other versions
CN114495984B (en)
Inventor
田野
彭建川
奚新明
Current Assignee
Beijing Lanling Technology Co ltd
Original Assignee
Beijing Lanling Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Lanling Technology Co ltd filed Critical Beijing Lanling Technology Co ltd
Priority to CN202210335546.7A
Publication of CN114495984A
Application granted
Publication of CN114495984B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 to G10L21/00
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters

Abstract

The invention relates to a method and system for comparing real-time audio streams, belongs to the technical field of audio processing, and solves the prior-art problems of insensitivity to small audio changes and low recognition sensitivity. Real-time main and standby audio streams are received and preprocessed to obtain main and standby audio frame signals. A logfbank feature vector is extracted from each audio frame and converted into a 0-1 vector, yielding the main and standby audio frame fingerprint data to be compared. The fingerprint data are added to main and standby queues, and initial matching and continuous matching are performed in turn to obtain matching frames, with the delay time derived from the position difference of the matching frames. Initial matching yields an initial matching frame. In continuous matching, a matrix of audio pairs to be compared is built from the initial matching frame and its pairs are compared in order; if a matching frame is found, the initial matching frame is updated and a new matrix is built and compared; otherwise the main and standby queues are updated and initial matching is performed again. Robust, high-sensitivity audio comparison is thereby achieved.

Description

Real-time audio stream comparison method and system
Technical Field
The invention relates to the technical field of audio processing, in particular to a method and a system for comparing real-time audio streams.
Background
In the broadcast field, to ensure safe play-out, independent main and standby streams are generally used for transmission, and whether the data of the two streams are consistent must be monitored in real time. If the main stream is affected by interference, the anomaly can be found by comparing the main and standby streams, and play-out can then be kept unaffected by switching to the standby stream. Comparison of audio streams is therefore a common technical means of ensuring safe broadcasting.
In broadcast play-out, time alignment of audio streams is often required, usually by adding delay to a stream. An instantaneous increase in delay causes an audible change in the sound and noticeably disturbs the program, so to keep the change smooth the delay must be increased gradually: a delay of 1 second may take 30 seconds to apply, and a delay of 10 seconds may take 5 minutes. While the delay is being increased, the audio data stream is slightly deformed even though the human ear can hardly perceive it, so traditional strict data comparison fails: the system judges the audio during this period to be abnormal and raises an alarm on what is actually normal play-out.
To address this false-alarm problem, two approximately identical audio streams must be compared while retaining good ability to detect small interference. In addition, to ensure timely alarms, the whole process must run with high real-time performance.
Two methods currently dominate real-time audio stream comparison: the first compares the audio stream data directly, and the second converts the streams into spectra and compares those. The first method only works for two streams whose data differ minimally; once the sound is slightly deformed, for example by device encoding and decoding or by delay, the comparison easily fails. The second method compares spectra with precision generally at the level of seconds, and some obvious interfering sounds go undetected, so the miss rate is high.
Disclosure of Invention
In view of the foregoing analysis, embodiments of the present invention are directed to a method and a system for comparing real-time audio streams, so as to solve the existing problems of insensitivity to small changes in audio and low recognition sensitivity.
In one aspect, an embodiment of the present invention provides a method for comparing real-time audio streams, including the following steps:
receiving real-time main and standby audio streams, and preprocessing to obtain main and standby audio frame signals;
extracting logfbank characteristic vectors for each audio frame in the main and standby audio frame signals, and converting the logfbank characteristic vectors into 0-1 vectors to obtain main and standby audio frame fingerprint data to be compared;
respectively adding the fingerprint data of the main and standby audio frames to be compared into a main and standby queue, sequentially performing initial matching and continuous matching to obtain matched frames, and obtaining delay time according to the position difference of the matched frames;
in the initial matching process, sequentially comparing each audio frame in the main queue with each audio frame in the standby queue, obtaining a pair of audio frames with the similarity meeting a threshold condition of the audio frames as a first matching frame, and obtaining the initial matching frame according to the first matching frame; in the continuous matching process, according to the initial matching frame, acquiring an audio pair matrix to be compared and sequentially comparing audio pairs in the matrix, if the matching frame is acquired, updating the initial matching frame into the matching frame, and acquiring the audio pair matrix to be compared again and sequentially comparing the audio pair matrix; and if the matching frame is not obtained, updating the main/standby queue, and performing initial matching again to obtain the initial matching frame.
Based on a further improvement of the method, converting the logfbank feature vector into a 0-1 vector comprises:
taking the difference of each pair of adjacent feature values in the logfbank feature vector: if the difference is greater than 0, the corresponding element is set to 1; if it is less than or equal to 0, it is set to 0.
Based on the further improvement of the method, in the initial matching process, acquiring the initial matching frame according to the first matching frame includes:
acquiring, from the main and standby queues respectively, the M frames following the first matching frame, forming M pairs of audio frames in order and comparing them; if the similarity of all M pairs meets the threshold condition, taking the last of the M pairs as the initial matching frame; otherwise, obtaining a new first matching frame and again seeking the initial matching frame from it.
Based on the further improvement of the method, the similarity of the audio frames is calculated according to a pair of audio frame fingerprint data to be compared, and the method comprises the following steps:
calculating a Hamming distance according to a pair of audio frame fingerprint data to be compared;
obtaining a total distance according to the dimension number of 0-1 vectors in the fingerprint data of each audio frame;
and dividing the difference value of the total distance and the Hamming distance by the total distance to obtain the similarity of a pair of audio frames to be compared.
Based on further improvement of the method, the length of the main/standby queue is fixed, and in the initial matching process, if the comparison of the main/standby queue is finished and the first matching frame or the initial matching frame meeting the threshold condition is not obtained, the main/standby queue is updated according to the fingerprint data of the main/standby audio frames to be compared.
Based on the further improvement of the method, in the continuous matching process, according to the initial matching frame, the audio pair matrix to be compared is obtained, and the method comprises the following steps:
skipping a preset frame number N according to the initial matching frame to obtain a second matching frame;
and, according to the positions of the second matching frame in the main and standby queues, taking N further frames from each queue and forming an N×N two-dimensional audio-pair matrix with the N main-queue frames and N standby-queue frames as rows and columns, where each element of the matrix is an audio pair formed from the frames corresponding to its row and column.
Based on the further improvement of the method, the audio pairs in the matrix are compared in sequence, and if the matching frame is obtained, the initial matching frame is updated to the matching frame, which comprises the following steps:
based on the audio pair matrix to be compared, the audio pair to be compared in the current Tth round is taken out, which comprises the following steps: taking the audio pairs of the 1 st, … th, T-th row and the T-th column, and the audio pairs of the T-th row, 1 st, … th, T-1 column, T =1, …, N;
identifying, among the audio pairs of the current round T, whether any pair meets the threshold condition; if so, taking the pair with the maximum audio-frame similarity as the matching frame, updating the initial matching frame to that matching frame, and ending the comparison; if not, taking out the next round of audio pairs from the matrix and comparing, until the traversal is complete.
Based on the further improvement of the method, in the initial matching process, a matching frame is obtained along the direction from the tail part to the head part of the main/standby queue; in the continuous matching process, the matching frame is obtained along the direction from the head to the tail of the main/standby queue.
Based on the further improvement of the method, the comparison method further comprises the following steps: in the initial matching and the continuous matching, if the matching frame is not obtained, a signal with abnormal comparison is output, otherwise, a delay signal is output according to the delay time.
In another aspect, an embodiment of the present invention provides a system for comparing real-time audio streams, including: the audio stream access module is used for receiving real-time main and standby audio streams, and obtaining main and standby audio frame signals after preprocessing;
the audio fingerprint generation module is used for extracting logfbank characteristic vectors of each audio frame in the main and standby audio frame signals, and converting the logfbank characteristic vectors into 0-1 vectors to obtain main and standby audio frame fingerprint data to be compared;
the audio fingerprint comparison module is used for respectively adding the fingerprint data of the main and standby audio frames to be compared into the main and standby queues, sequentially performing initial matching and continuous matching to obtain matched frames, and obtaining delay time according to the position difference of the matched frames;
in the initial matching process, sequentially comparing each audio frame in the main queue with each audio frame in the standby queue, obtaining a pair of audio frames with the similarity meeting a threshold condition of the audio frames as a first matching frame, and obtaining the initial matching frame according to the first matching frame; in the continuous matching process, according to the initial matching frame, acquiring an audio pair matrix to be compared and sequentially comparing audio pairs in the matrix, if the matching frame is acquired, updating the initial matching frame into the matching frame, and acquiring the audio pair matrix to be compared again and sequentially comparing the audio pair matrix; and if the matching frame is not obtained, updating the main/standby queue, and performing initial matching again to obtain the initial matching frame.
Compared with the prior art, the invention can realize at least one of the following beneficial effects:
1. Through dimensionality reduction, the audio frame data is quantized into a small logfbank feature vector that retains most of the information in the frame and discriminates well between different audio types; through relative conversion, the absolute-valued energy spectrum is converted into a relative 0-1 fingerprint vector that better expresses the relative strength of frequencies and is more stable when representing two similar streams.
2. Initial matching starts from the newest data in the queue, and confirmation over subsequent frames improves matching stability.
3. Continuous matching builds on the most recent successful continuous or initial match and assumes the alignment is stable over a short interval, which improves matching efficiency. Each round compares the newest data over a comparison range that grows from small to large, guaranteeing that nearby matches are found first and helping sensitivity; each round evaluates several points and takes the best, strengthening matching stability. Especially when interference or similar sounds appear, the continuous matching algorithm runs stably and resists interference well.
In the invention, the technical schemes can be combined with each other to realize more preferable combination schemes. Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, wherein like reference numerals are used to designate like parts throughout the drawings;
fig. 1 is a flowchart of a method for comparing real-time audio streams according to embodiment 1 of the present invention;
fig. 2 is a schematic structural diagram of a comparison system of real-time audio streams in embodiment 2 of the present invention.
Detailed Description
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate preferred embodiments of the invention and together with the description, serve to explain the principles of the invention and not to limit the scope of the invention.
Example 1
A specific embodiment of the present invention discloses a method for comparing real-time audio streams, as shown in fig. 1, including the following steps:
s11: receiving real-time main and standby audio streams, and preprocessing to obtain main and standby audio frame signals;
it should be noted that the main and standby audio streams are TS streams (Transport Stream) sent in a UDP (User Datagram Protocol) multicast manner, and are decoded after being received, original audio Stream data of a corresponding broadcast program is extracted from the TS streams, and then the original audio Stream data is preprocessed, including: framing, pre-emphasis and windowing, so as to obtain main and standby audio frame signals as the basis for subsequent feature extraction.
In particular, since the audio signal is unbounded, it needs to be cut into fixed-length audio pieces for convenience of processing. Based on how quickly audio varies, a slice of 25-30 milliseconds is generally chosen; preferably the slice time is 30 milliseconds, i.e., each frame contains 30 milliseconds of original audio data.
In addition, to avoid losing information at frame boundaries, a certain overlap is required between audio frames, so that no audio information is missed and instability in the audio is smoothed over. The overlap is generally about 50%. Illustratively, for frames containing 30 milliseconds of audio data, the overlap is set to 20 milliseconds, so that adjacent frames are offset by 10 milliseconds.
Because sound attenuates, and high-frequency components are usually lost more than low-frequency ones, pre-emphasis is applied to the audio signal to improve high-frequency resolution, generally implemented as high-pass filtering. Furthermore, since framing truncates the audio, a window function is applied to suppress spectral leakage at the two edges of each frame; illustratively, a Hamming window function is used.
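The preprocessing steps above (framing with overlap, pre-emphasis, Hamming windowing) can be sketched as follows; the 30 ms slice and 10 ms hop come from the text, while the sample rate and the pre-emphasis coefficient 0.97 are illustrative assumptions:

```python
import numpy as np

FRAME_MS = 30    # each frame contains 30 ms of audio (per the text)
HOP_MS = 10      # adjacent frames offset by 10 ms, i.e. 20 ms overlap
PREEMPH = 0.97   # assumed pre-emphasis coefficient (not given in the text)

def preprocess(signal: np.ndarray, sample_rate: int) -> np.ndarray:
    """Split a mono signal into overlapping, pre-emphasized, windowed frames."""
    # Pre-emphasis: first-order high-pass filter boosting high frequencies
    emphasized = np.append(signal[0], signal[1:] - PREEMPH * signal[:-1])
    frame_len = int(sample_rate * FRAME_MS / 1000)
    hop_len = int(sample_rate * HOP_MS / 1000)
    n_frames = 1 + (len(emphasized) - frame_len) // hop_len
    window = np.hamming(frame_len)  # suppress spectral leakage at frame edges
    return np.stack([
        emphasized[i * hop_len : i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])

# 1 second of audio at an assumed 16 kHz: 480-sample frames, 160-sample hop
frames = preprocess(np.random.randn(16000), 16000)
print(frames.shape)  # (98, 480)
```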
After the preprocessing, the obtained main and standby audio frame signals respectively have a plurality of audio frames.
S12: extracting logfbank characteristic vectors of each audio frame in the main and standby audio frame signals, and converting the logfbank characteristic vectors into 0-1 vectors to obtain fingerprint data of the main and standby audio frames to be compared;
It should be noted that the main and standby audio frame signals obtained in step S11 are time-domain signals. A fast Fourier transform is applied to each audio frame to obtain frequency-domain coefficients, whose squares give the energy spectrum coefficients; triangular band-pass filters on the Mel frequency scale are then applied to the energy spectrum coefficients to generate the logfbank feature vector.
It should be noted that logfbank is an intermediate product of the MFCC pipeline and discriminates well between different types of audio. If the dimensionality of the logfbank feature vector is too high, it is easily disturbed by audio fluctuation and the fingerprint is unstable; in terms of both fingerprint accuracy and stability, a logfbank vector of 30-60 dimensions works well. Preferably, this embodiment generates 41-dimensional logfbank feature vectors.
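A minimal sketch of the logfbank computation described above (FFT, energy spectrum, triangular Mel filter bank, log); the 16 kHz sample rate and 512-point FFT are assumptions, since the text does not specify them:

```python
import numpy as np

def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def logfbank_frame(frame, sample_rate=16000, n_filters=41, n_fft=512):
    """Log Mel filter-bank energies for one windowed frame."""
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2          # energy spectrum
    # Triangular band-pass filters spaced evenly on the Mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2),
                          n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fbank[i, k] = (k - left) / max(center - left, 1)    # rising edge
        for k in range(center, right):
            fbank[i, k] = (right - k) / max(right - center, 1)  # falling edge
    energies = fbank @ power
    return np.log(np.maximum(energies, 1e-10))  # floor avoids log(0)

feat = logfbank_frame(np.random.randn(480))
print(feat.shape)  # (41,)
```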
Testing shows that using logfbank features directly for similarity calculation gives low fingerprint discrimination, easily causes mismatches, and fails to identify some obvious noises. In addition, the value ranges of different dimensions differ greatly, so the comparison is easily dominated by high-valued dimensions. This embodiment therefore applies a 0-1 transform to the relative differences between adjacent dimensions of the logfbank feature, which gives better stability, balances the influence of each dimension, and achieves both accuracy and robustness.
Specifically, converting the logfbank feature vector into a 0-1 vector comprises:
taking the difference of each pair of adjacent feature values in the logfbank feature vector: if the difference is greater than 0, the corresponding element is set to 1; if it is less than or equal to 0, it is set to 0.
Illustratively, a 41-dimensional logfbank feature vector is extracted for each audio frame and converted into a 40-dimensional 0-1 vector.
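The 41-to-40-dimension conversion can be expressed in one line with `numpy.diff`; this is a sketch of the rule stated above, not necessarily the patent's exact implementation:

```python
import numpy as np

def to_fingerprint(logfbank_vec: np.ndarray) -> np.ndarray:
    """1 where the next feature value exceeds the current one, else 0."""
    return (np.diff(logfbank_vec) > 0).astype(np.uint8)

# A 41-dimensional feature vector yields a 40-dimensional 0-1 fingerprint
print(to_fingerprint(np.arange(41.0)).size)            # 40
print(to_fingerprint(np.array([1.0, 3.0, 2.0, 2.0])))  # [1 0 0]
```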
S13: respectively adding the fingerprint data of the main and standby audio frames to be compared into a main and standby queue, sequentially performing initial matching and continuous matching to obtain matched frames, and obtaining delay time according to the position difference of the matched frames; in the initial matching process, sequentially comparing each audio frame in the main queue with each audio frame in the standby queue, obtaining a pair of audio frames with the similarity meeting a threshold condition of the audio frames as a first matching frame, and obtaining the initial matching frame according to the first matching frame; in the continuous matching process, according to the initial matching frame, acquiring an audio pair matrix to be compared and sequentially comparing audio pairs in the matrix, if the matching frame is acquired, updating the initial matching frame into the matching frame, and acquiring the audio pair matrix to be compared again and sequentially comparing the audio pair matrix; and if the matched frame is not obtained, updating the main/standby queue, and performing initial matching again to obtain the initial matched frame.
It should be noted that the length of the main and standby queues is fixed. The fingerprint data of the main and standby audio frames to be compared are added to the respective queues and undergo initial matching and continuous matching in turn; after the whole queue has been compared, new audio frames are added and the main and standby queues are updated.
Preferably, the main and standby queues store the audio frame data of the most recent 30 seconds; with 30 milliseconds per frame, the queue length is 1000.
When the audio frames in the main and standby queues are compared, the similarity is calculated according to the fingerprint data of a pair of audio frames to be compared, and then whether the similarity meets the threshold condition is judged. The similarity calculation method comprises the following steps:
calculating a Hamming distance according to a pair of audio frame fingerprint data to be compared;
obtaining a total distance according to the dimension number of 0-1 vectors in the fingerprint data of each audio frame;
and dividing the difference value of the total distance and the Hamming distance by the total distance to obtain the similarity of a pair of audio frames to be compared.
Illustratively, each audio frame's fingerprint is a 40-dimensional 0-1 vector, so the total distance is 40; if a pair of fingerprints differs in 2 dimensions, the Hamming distance is 2 and the similarity = (40-2)/40 = 0.95.
Preferably, the threshold condition is set to be greater than 0.85, i.e.: when the similarity of a pair of audio frames to be compared is greater than 0.85, the pair of audio frames is successfully matched.
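The similarity and threshold test described above can be sketched as follows, reproducing the worked example of two fingerprints that differ in 2 of 40 dimensions:

```python
import numpy as np

THRESHOLD = 0.85  # a pair of frames matches when similarity exceeds this

def similarity(fp_a: np.ndarray, fp_b: np.ndarray) -> float:
    """(total distance - Hamming distance) / total distance."""
    total = fp_a.size                               # 40 for these fingerprints
    hamming = int(np.count_nonzero(fp_a != fp_b))   # number of differing dims
    return (total - hamming) / total

a = np.zeros(40, dtype=np.uint8)
b = a.copy()
b[[3, 17]] = 1                       # differ in 2 of 40 dimensions
print(similarity(a, b))              # 0.95
print(similarity(a, b) > THRESHOLD)  # True, the pair matches
```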
The process of initial matching and successive matching is described in detail below:
1) initial matching procedure
It should be noted that there is no concept of a matching starting point in the initial matching process. Following the first-in-first-out property of the queues, the initial matching frame is sought from the tail of the main and standby queues toward the head; matching starts from the newest data in the queue, quickly locating the matching point closest to the latest data and improving sensitivity.
Preferably, the initial matching process starts once the main and standby queues hold 1 second of data; the queues need not be completely filled, which keeps the processing delay within 1 second and speeds up the response.
Specifically, each audio frame in the main queue is sequentially compared with each audio frame in the standby queue, a pair of audio frames with the similarity meeting a threshold condition is obtained and used as a first matching frame, and then an initial matching frame is obtained according to the first matching frame.
The process of obtaining the initial matching frame according to the first matching frame comprises the following steps:
the M frames following the first matching frame are taken from the main and standby queues respectively and formed, in order, into M pairs of audio frames, which are compared; if the similarity of all M pairs meets the threshold condition, the last of the M pairs is taken as the initial matching frame; otherwise, a new first matching frame is obtained and the initial matching frame is again sought from it.
It should be noted that obtaining the first matching frame and obtaining the initial matching frame belong to the same comparison pass: if no initial matching frame is obtained from the current first matching frame, the search for a new first matching frame resumes from where the previous scan of the main and standby queues left off.
The delay time is obtained from the position difference of the matching frames, i.e., delay time = position difference × audio frame slice time, and a delay signal is output.
It should be noted that, given the fingerprint accuracy, when the audio-frame fingerprint similarity threshold is 0.85, also matching the 3 frames following the first matching frame brings the error probability below one in ten thousand, and matching 5 consecutive frames essentially guarantees no error. M is therefore preferably set to 4 in this embodiment, confirming the 4 frames following the first matching frame.
Illustratively, if the 3rd audio frame B3 in the standby queue matches the 5th audio frame M5 in the main queue, (B3, M5) becomes the first matching frame; the subsequent 4 pairs (B4, M6), (B5, M7), (B6, M8) and (B7, M9) are then checked, and if all of them match, (B7, M9) is the initial matching frame and the delay time is output: (7-9) × 30 = -60 milliseconds. Otherwise, audio frames continue to be taken from the main and standby queues for comparison, and a new first matching frame is sought.
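A hypothetical sketch of this confirmation step; the function name and queue representation are assumptions, but M = 4, the 30 ms slice time and the resulting -60 ms delay follow the text (indices below are 0-based, so the first match (B3, M5) becomes (2, 4)):

```python
M = 4            # confirmation frames following the first matching frame
FRAME_MS = 30    # slice time per audio frame, in milliseconds

def confirm_initial_match(standby_fps, main_fps, b, m, match):
    """After a first matching frame at (standby index b, main index m),
    require the next M frame pairs to match as well; return the initial
    matching frame indices and the delay in ms, or None on failure."""
    for k in range(1, M + 1):
        if b + k >= len(standby_fps) or m + k >= len(main_fps):
            return None
        if not match(standby_fps[b + k], main_fps[m + k]):
            return None
    bi, mi = b + M, m + M                  # last confirmed pair
    return (bi, mi), (bi - mi) * FRAME_MS  # delay = position diff x slice time

# First match (B3, M5) is (2, 4) 0-based; assume all confirmation pairs match
fps = list(range(20))
result = confirm_initial_match(fps, fps, 2, 4, lambda x, y: True)
print(result)  # ((6, 8), -60): initial match (B7, M9), delay -60 ms
```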
And if the comparison of the main and standby queues is finished and the first matching frame or the initial matching frame meeting the threshold condition is not obtained, updating the main and standby queues according to the fingerprint data of the main and standby audio frames to be compared. Preferably, a signal of the initial alignment anomaly is output.
The initial matching process starts from the latest data in the queue, and the stability of matching is improved according to the confirmation of the subsequent multiframes.
2) Continuous matching process
The continuous matching process is performed only after initial matching succeeds; otherwise initial matching continues. Once the initial matching point is determined, continuous matching seeks matching frames from the head of the main and standby queues toward the tail, so that new data can be matched quickly and delay adjustment of the real-time audio streams responds promptly.
Specifically, in the continuous matching process, according to the initial matching frame, the audio pair matrix to be compared is obtained, which includes:
skipping a preset frame number N according to an initial matching frame to obtain a second matching frame;
it should be noted that: the continuous matching is carried out on the basis of successful initial matching or successful last continuous matching, the matching is carried out in a subsequent short time range of an initial matching frame without repeated verification, so that a section of audio frame is skipped based on the initial matching frame, the matching is started, and the response speed and the sensitivity are improved.
Preferably, the alignment is assumed to hold for 300 milliseconds after the initial matching frame; with 30 milliseconds per audio frame slice, the second matching frame is obtained by skipping 10 frames from the initial matching frame.
Then, according to the positions of the second matching frame in the main and standby queues, N further frames are taken from each queue, and an N×N two-dimensional audio-pair matrix is formed with the N main-queue frames and N standby-queue frames as rows and columns; each element of the matrix is an audio pair formed from the frames corresponding to its row and column.
The number of frames taken out equals the number of frames skipped in the previous step, meaning that at most N rounds of comparison are performed on the basis of the current second matching frame.
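A sketch of how the N×N audio-pair matrix might be indexed, assuming element (i, j) pairs the i-th standby-queue frame after the second matching frame with the j-th main-queue frame after it; the exact layout is not spelled out in the text:

```python
N = 10  # preset skip / matrix size: 10 frames = 300 ms at 30 ms per frame

def build_pair_matrix(b: int, m: int, n: int = N):
    """Element (i, j) pairs standby frame b+1+i with main frame m+1+j,
    where b and m are the second matching frame's queue positions."""
    return [[(b + 1 + i, m + 1 + j) for j in range(n)] for i in range(n)]

matrix = build_pair_matrix(50, 47)
print(len(matrix), len(matrix[0]))  # 10 10
print(matrix[0][0])                 # (51, 48): the pair right after the match
```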
Specifically, the comparing is performed in sequence based on the audio pair matrix to be compared, and if the matching frame is obtained, the updating of the initial matching frame into the matching frame includes:
taking out the audio pairs to be compared in the current T-th round from the audio pair matrix, namely the audio pairs in rows 1, …, T of column T and the audio pairs in row T of columns 1, …, T−1, where T = 1, …, N;
identifying whether an audio pair satisfying the threshold condition exists among the audio pairs of the current T-th round; if so, taking the audio pair with the maximum audio-frame similarity as the matching frame, updating the initial matching frame to that matching frame, and ending the comparison; if not, taking out the audio pairs of the next round from the audio pair matrix to be compared and comparing them, until the traversal is completed.
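The round-by-round traversal and the threshold test can be sketched as follows. This is a minimal illustration under assumptions: fingerprints are equal-length bit tuples, similarity follows the (total − Hamming distance) / total rule stated in claim 4, and the threshold value passed in is illustrative rather than taken from the patent.

```python
def similarity(fp_a, fp_b):
    """Similarity of two 0-1 fingerprints: (total - Hamming) / total."""
    total = len(fp_a)
    hamming = sum(a != b for a, b in zip(fp_a, fp_b))
    return (total - hamming) / total

def round_elements(t):
    """0-based indices taken in round t: column t of rows 1..t plus row t
    of columns 1..t-1, i.e. the L-shaped border of the t x t submatrix."""
    return [(i, t - 1) for i in range(t)] + [(t - 1, j) for j in range(t - 1)]

def continuous_match(main_fps, backup_fps, n=10, threshold=0.9):
    """Scan the n x n pair matrix round by round; in the first round that
    contains a pair above threshold, return that round's best pair's
    (main, standby) indices, else None."""
    for t in range(1, n + 1):
        candidates = [(similarity(main_fps[i], backup_fps[j]), i, j)
                      for i, j in round_elements(t)]
        best = max(candidates)
        if best[0] >= threshold:
            return best[1], best[2]  # new matching frame positions
    return None  # traversal finished: fall back to initial matching
```

In 0-based terms, `round_elements(3)` yields `[(0, 2), (1, 2), (2, 2), (2, 0), (2, 1)]`, i.e. the five elements T13, T23, T33, T31, T32 of the 3rd round in the 10×10 example.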
It should be noted that, in the continuous matching process, if the matching succeeds, the delay time is obtained from the position difference of the matched frames and a delay signal is output; otherwise, a continuous-comparison-abnormal signal is output.
Illustratively, for a 10×10 two-dimensional matrix, only the element T11 in row 1, column 1 is taken during the 1st round of comparison. If the audio pair corresponding to this element does not meet the threshold condition, the 2nd round takes out T12, T21 and T22; if none of the audio pairs corresponding to these 3 elements meets the threshold condition, the 3rd round takes out T13, T23, T33, T31 and T32. If, among these 5 elements, the audio pair corresponding to T33 has the maximum similarity and meets the threshold condition, the matching ends: the delay time is output according to the position difference of the audio pair corresponding to T33, the audio pair corresponding to T33 is taken as the new initial matching frame, 10 frames are skipped to obtain a new second matching frame, a new audio pair matrix to be compared is obtained from the subsequent 10 consecutive frames, and each round of comparison is performed again. If all 10 rounds of comparison are completed without obtaining a matching frame, the process returns to initial matching: the main and standby queues are updated, initial matching is performed again to obtain the initial matching frame, and a continuous-comparison-abnormal signal is output.
Based on the audio pair matrix to be compared, the number of audio pairs taken out in each round increases with the iteration count, and the position difference between the main-queue and standby-queue frames corresponding to each element grows gradually. This gives priority to matching new data at short distances, which improves sensitivity, while taking the best candidate after computing several points in each round strengthens matching stability. In particular, under interference or similar-sounding audio, the continuous matching algorithm runs stably and has strong anti-interference capability.
Compared with the prior art, the real-time audio stream comparison method provided by this embodiment is a robust, high-sensitivity audio comparison method: an energy spectrum is generated from short-time audio, quantized and transformed to produce an audio fingerprint, and the fingerprints are then used to compare the audio streams in real time. The method has good robustness, is insensitive to tiny changes in the audio, and can identify obvious audio-stream mismatches. Through the two stages of initial matching and continuous matching, continuous matching is executed quickly on the basis of a stable initial matching point, which improves matching sensitivity, and taking the best candidate after computing several points in each round strengthens matching stability.
Example 2
In another embodiment of the present invention, a comparison system for real-time audio streams is disclosed to implement the comparison method of embodiment 1. For the concrete implementation of each module, refer to the corresponding description in embodiment 1. As shown in fig. 2, the system includes:
the audio stream access module is used for receiving real-time main and standby audio streams, and obtaining main and standby audio frame signals after preprocessing;
the audio fingerprint generation module is used for extracting logfbank characteristic vectors of each audio frame in the main and standby audio frame signals, and converting the logfbank characteristic vectors into 0-1 vectors to obtain main and standby audio frame fingerprint data to be compared;
the audio fingerprint comparison module is used for respectively adding the fingerprint data of the main and standby audio frames to be compared into the main and standby queues, sequentially performing initial matching and continuous matching to obtain matched frames, and obtaining delay time according to the position difference of the matched frames;
in the initial matching process, each audio frame in the main queue is compared in turn with each audio frame in the standby queue, a pair of audio frames whose similarity satisfies the threshold condition is obtained as the first matching frame, and the initial matching frame is obtained according to the first matching frame; in the continuous matching process, the audio pair matrix to be compared is acquired according to the initial matching frame and the audio pairs in the matrix are compared in sequence; if a matching frame is acquired, the initial matching frame is updated to that matching frame, and the audio pair matrix to be compared is acquired again and compared in sequence; if no matching frame is obtained, the main and standby queues are updated and initial matching is performed again to obtain the initial matching frame.
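The 0-1 conversion performed by the audio fingerprint generation module above (comparing adjacent logfbank coefficients) can be sketched as below; a minimal illustration that assumes the feature vector is a plain list of floats, with an invented function name.

```python
# Illustrative sketch of the adjacent-difference fingerprint rule:
# bit i is 1 if coefficient i+1 minus coefficient i is greater than 0, else 0.

def to_fingerprint(logfbank_vec):
    return [1 if b - a > 0 else 0
            for a, b in zip(logfbank_vec, logfbank_vec[1:])]

bits = to_fingerprint([0.2, 0.5, 0.5, 0.1, 0.9])
# adjacent differences: +0.3, 0.0, -0.4, +0.8  ->  [1, 0, 0, 1]
```

Note the fingerprint has one fewer dimension than the feature vector; that dimension count is the "total distance" used in the similarity formula of claim 4.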
Preferably, the system further comprises a comprehensive studying and judging module for processing the received comparison signal, including comparing the abnormal signal and obtaining the delay signal according to the delay time.
Specifically, whether to output an alarm signal to the main control system in response to an abnormal signal is decided according to the configured strategy and threshold; after the alarm signal is output to the main control system, the main control system completes the other subsequent processes. The received abnormal signals and delay signals can also be stored in a database to generate data records, facilitating subsequent data display and statistical analysis.
For example, the strategy may be configured so that no alarm is raised within a certain period of time, or so that an alarm signal is sent to the main control system only when the number of received abnormal signals exceeds a certain threshold.
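One way such a threshold strategy could be coded; a sketch under assumptions (sliding-window counting of abnormal signals, monotonic timestamps), with all class, method, and parameter names invented for illustration rather than taken from the patent.

```python
import time

class AlarmPolicy:
    """Suppress alarms until abnormal signals exceed a configured count
    within a sliding time window (illustrative policy, invented names)."""

    def __init__(self, max_abnormal=5, window_s=60.0):
        self.max_abnormal = max_abnormal
        self.window_s = window_s
        self.events = []  # timestamps of received abnormal signals

    def on_abnormal(self, now=None):
        """Record one abnormal signal; return True if an alarm signal
        should be sent to the main control system."""
        now = time.monotonic() if now is None else now
        self.events.append(now)
        # drop events that have fallen out of the sliding window
        self.events = [t for t in self.events if now - t <= self.window_s]
        return len(self.events) > self.max_abnormal
```

A quiet-period policy would be a simple variant that ignores `on_abnormal` calls before a configured start time.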
The system of this embodiment is built in a distributed, highly fault-tolerant manner, which facilitates stable processing of a large number of audio streams.
It should be noted that the system modules and the comparison method may be cross-referenced, and the details are not repeated here. Since the system embodiment follows the same principle as the method embodiment, the system also achieves the corresponding technical effects of the method embodiment.
Those skilled in the art will appreciate that all or part of the flow of the methods in the above embodiments may be implemented by a computer program instructing related hardware, the program being stored in a computer-readable storage medium. The computer-readable storage medium may be a magnetic disk, an optical disk, a read-only memory, or a random access memory.
The above description covers only preferred embodiments of the present invention, but the protection scope of the present invention is not limited thereto; any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method for comparing real-time audio streams is characterized by comprising the following steps:
receiving real-time main and standby audio streams, and preprocessing to obtain main and standby audio frame signals;
extracting logfbank characteristic vectors for each audio frame in the main and standby audio frame signals, and converting the logfbank characteristic vectors into 0-1 vectors to obtain main and standby audio frame fingerprint data to be compared;
respectively adding the fingerprint data of the main and standby audio frames to be compared into a main and standby queue, sequentially performing initial matching and continuous matching to obtain matched frames, and obtaining delay time according to the position difference of the matched frames;
in the initial matching process, sequentially comparing each audio frame in the main queue with each audio frame in the standby queue, obtaining a pair of audio frames whose similarity satisfies a threshold condition as a first matching frame, and obtaining the initial matching frame according to the first matching frame; in the continuous matching process, according to the initial matching frame, acquiring an audio pair matrix to be compared and sequentially comparing audio pairs in the matrix; if the matching frame is acquired, updating the initial matching frame into the matching frame, and acquiring the audio pair matrix to be compared again and sequentially comparing it; and if the matching frame is not obtained, updating the main and standby queues, and performing initial matching again to obtain the initial matching frame.
2. The method for comparing real-time audio streams according to claim 1, wherein the converting the logfbank feature vector into 0-1 vector comprises:
and comparing the difference value of every two adjacent characteristic values in the logfbank characteristic vector, if the difference value is greater than 0, setting the difference value to be 1, and if the difference value is less than or equal to 0, setting the difference value to be 0.
3. The method of claim 2, wherein the obtaining an initial matching frame according to the first matching frame in the initial matching process comprises:
respectively acquiring M frames subsequent to the first matching frame from a main and standby queue according to the first matching frame, sequentially forming M pairs of audio frames in the main and standby queue, and comparing, wherein if the similarity of the M pairs of audio frames meets the threshold condition, the last pair of audio frames in the M pairs of audio frames is used as an initial matching frame; otherwise, the first matching frame is obtained again, and then the initial matching frame is obtained according to the first matching frame.
4. The method according to claim 3, wherein the similarity of the audio frames is calculated according to a pair of audio frame fingerprint data to be compared, and comprises:
calculating a Hamming distance according to a pair of audio frame fingerprint data to be compared;
obtaining a total distance according to the dimensionality number of 0-1 vectors in the fingerprint data of each audio frame;
and dividing the difference value of the total distance and the Hamming distance by the total distance to obtain the similarity of a pair of audio frames to be compared.
5. The method according to claim 4, wherein the length of the primary and secondary queues is fixed, and in the initial matching process, if the primary and secondary queues are compared to each other and the first matching frame or the initial matching frame meeting the threshold condition is not obtained, the primary and secondary queues are updated according to the fingerprint data of the primary and secondary audio frames to be compared.
6. The method according to claim 5, wherein the obtaining the audio pair matrix to be compared according to the initial matching frame in the continuous matching process comprises:
skipping a preset frame number N according to the initial matching frame to obtain a second matching frame;
and according to the respective positions of second matched frames in the main and standby queues, respectively and continuously taking out N frames backwards, and forming an NxN two-dimensional audio pair matrix by taking the N frames in the main and standby queues as rows and columns, wherein each element in the matrix is an audio pair formed by audio frames in the N frames corresponding to the corresponding rows and columns.
7. The method for comparing real-time audio streams according to claim 6, wherein the sequentially comparing audio pairs in the matrix, and if a matching frame is obtained, updating the initial matching frame to the matching frame includes:
based on the audio pair matrix to be compared, taking out the audio pairs to be compared in the current T-th round, namely the audio pairs in rows 1, …, T of column T and the audio pairs in row T of columns 1, …, T−1, where T = 1, …, N;
identifying whether an audio pair meeting a threshold condition exists according to the audio pair to be compared in the current Tth round, if so, taking the audio pair with the maximum similarity of the audio frames as a matching frame, updating the initial matching frame into the matching frame, and finishing the comparison; and if the comparison result does not exist, taking out the audio pairs to be compared in the next round from the audio pair matrix to be compared, and comparing until traversal is completed.
8. The method according to claim 1, wherein in the initial matching process, the matching frame is obtained along a direction from a tail portion to a head portion of the main/standby queue; in the continuous matching process, the matching frame is obtained along the direction from the head to the tail of the main/standby queue.
9. The method for matching real-time audio streams according to claim 8, further comprising: in the initial matching and the continuous matching, if the matching frame is not obtained, a signal with abnormal comparison is output, otherwise, a delay signal is output according to delay time.
10. A system for comparing real-time audio streams, comprising:
the audio stream access module is used for receiving real-time main and standby audio streams, and obtaining main and standby audio frame signals after preprocessing;
the audio fingerprint generation module is used for extracting logfbank characteristic vectors from each audio frame in the main and standby audio frame signals, and converting the logfbank characteristic vectors into 0-1 vectors to obtain main and standby audio frame fingerprint data to be compared;
the audio fingerprint comparison module is used for respectively adding the fingerprint data of the main and standby audio frames to be compared into the main and standby queues, sequentially performing initial matching and continuous matching to obtain matched frames, and obtaining delay time according to the position difference of the matched frames;
in the initial matching process, sequentially comparing each audio frame in the main queue with each audio frame in the standby queue, acquiring a pair of audio frames of which the similarity meets a threshold condition as first matching frames, and acquiring initial matching frames according to the first matching frames; in the continuous matching process, according to the initial matching frame, acquiring an audio pair matrix to be compared and sequentially comparing audio pairs in the matrix, if the matching frame is acquired, updating the initial matching frame into the matching frame, and acquiring the audio pair matrix to be compared again and sequentially comparing the audio pair matrix; and if the matching frame is not obtained, updating the main/standby queue, and performing initial matching again to obtain the initial matching frame.
CN202210335546.7A 2022-04-01 2022-04-01 Real-time audio stream comparison method and system Active CN114495984B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210335546.7A CN114495984B (en) 2022-04-01 2022-04-01 Real-time audio stream comparison method and system

Publications (2)

Publication Number Publication Date
CN114495984A true CN114495984A (en) 2022-05-13
CN114495984B CN114495984B (en) 2022-06-28

Family

ID=81488382

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210335546.7A Active CN114495984B (en) 2022-04-01 2022-04-01 Real-time audio stream comparison method and system

Country Status (1)

Country Link
CN (1) CN114495984B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140114456A1 (en) * 2012-10-22 2014-04-24 Arbitron Inc. Methods and Systems for Clock Correction and/or Synchronization for Audio Media Measurement Systems
CN204089832U (en) * 2014-09-26 2015-01-07 浙江传媒学院 Double-channel audio otherness checkout gear
CN104505101A (en) * 2014-12-24 2015-04-08 北京巴越赤石科技有限公司 Real-time audio comparison method
CN104992713A (en) * 2015-05-14 2015-10-21 电子科技大学 Fast audio comparing method
CN107481738A (en) * 2017-06-27 2017-12-15 中央电视台 Real-time audio comparison method and device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116346583A (en) * 2023-02-17 2023-06-27 广州市保伦电子有限公司 Main and standby audio switching method and system based on decoding end
CN116346583B (en) * 2023-02-17 2024-05-03 广东保伦电子股份有限公司 Main and standby audio switching method and system based on decoding end

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant