CN112019285A - Black broadcast audio recognition method - Google Patents

Black broadcast audio recognition method

Info

Publication number
CN112019285A
CN112019285A (application CN202010935451.XA)
Authority
CN
China
Prior art keywords
audio
signal
similarity
semantic
vocabulary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010935451.XA
Other languages
Chinese (zh)
Inventor
郑鑫 (Zheng Xin)
汤善武 (Tang Shanwu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Huaqian Technology Co ltd
Original Assignee
Chengdu Huaqian Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Huaqian Technology Co ltd filed Critical Chengdu Huaqian Technology Co ltd
Priority to CN202010935451.XA priority Critical patent/CN112019285A/en
Publication of CN112019285A publication Critical patent/CN112019285A/en
Withdrawn legal-status Critical Current

Classifications

    • H ELECTRICITY
        • H04 ELECTRIC COMMUNICATION TECHNIQUE
            • H04H BROADCAST COMMUNICATION
                • H04H20/00 Arrangements for broadcast or for distribution combined with broadcast
                    • H04H20/12 Arrangements for observation, testing or troubleshooting
                        • H04H20/14 Arrangements for observation, testing or troubleshooting for monitoring programmes
                • H04H60/00 Arrangements for broadcast applications with a direct linking to broadcast information or broadcast space-time; Broadcast-related systems
                    • H04H60/29 Arrangements for monitoring broadcast services or broadcast-related services
    • G PHYSICS
        • G10 MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
                • G10L15/00 Speech recognition
                    • G10L15/08 Speech classification or search
                        • G10L15/18 Speech classification or search using natural language modelling
                            • G10L15/1822 Parsing for meaning understanding
                    • G10L15/26 Speech to text systems
                    • G10L15/28 Constructional details of speech recognition systems
                        • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
                • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
                    • G10L25/03 characterised by the type of extracted parameters
                        • G10L25/18 the extracted parameters being spectral information of each sub-band
                        • G10L25/21 the extracted parameters being power information
                    • G10L25/48 specially adapted for particular use
                        • G10L25/51 for comparison or discrimination

Abstract

The invention provides a black broadcast audio identification method, which comprises the following steps: S1, extracting the signal features of the returned audio and the reference audio; S2, extracting the semantic features of the returned audio and the reference audio; S3, calculating the signal similarity and the semantic similarity between the returned audio and the reference audio from the signal features and the semantic features; S4, comparing first by the semantic similarity: if the semantic similarity is judged high, the comparison result is obtained; if the semantic similarity is judged low, the signal similarity is compared to obtain the comparison result, and identification of the black broadcast audio is completed according to the comparison result. The invention has better robustness and better suppresses the influence of noise and transmission delay: when a single feature fails because of noise or similar factors, the remaining features still provide a reference; meanwhile, transmission delay has relatively little influence on semantic analysis, and the stability of semantic analysis under delay can, to a certain extent, offset the instability of signal analysis under delay.

Description

Black broadcast audio recognition method
Technical Field
The invention relates to the field of black broadcast identification, and in particular to a black broadcast audio identification method.
Background
With the development of information technology and broadcast media technology, black broadcasting has drawn increasing attention in recent years. Black broadcasts cause significant social harm. Black broadcast base stations are mostly erected in residential districts, seriously affecting residents' health; black broadcasts are filled with false information, such as advertisements for counterfeit medicines and shoddy products; black broadcasts can even threaten family and social stability. Black broadcasting must therefore be resolutely combated, and the precondition for combating it is finding it effectively. However, black broadcasting technology also keeps developing, and its behavior has become more concealed: some black broadcasts even occupy the frequency points of normal broadcasts, and their content increasingly "looks" like normal broadcast content. Identifying black broadcasts therefore requires more comprehensive and intelligent technical means and processing methods.
Audio comparison is an effective idea for finding black broadcasts. Its core is: receive the broadcast audio signal of a certain frequency point at a certain location and transmit the signal back to a comparison center. The comparison center compares the returned audio with the reference audio; if the two are inconsistent, the frequency-point signal received at that location is a black broadcast signal, and a black broadcast signal source may exist near that location. On audio comparison technology, Chen Yujie et al. describe an audio comparison system and method used at the Guangxi radio station: the AES signal from the mixing console, the ASI code stream from the encoder and the FM/AM signal received from the transmitting station are compared, with the Mel cepstrum coefficients in the audio frequency domain as the comparison index, in order to find illegal interference, signal insertion and other black broadcast phenomena; this is a comparison method based on a single frequency-domain feature. Similarly, Li Chunshuang, Deng Chuxiong and Zhao Qi describe the audio comparison systems of the Tianjin, Guangdong and Liaoning Chaoyang broadcasting stations respectively, where audio signals are taken from the mixing console and the transmitter front end for comparison, so that abnormal phenomena such as cross-broadcasting and mis-broadcasting can be found in time. Zhang Lin et al. describe an audio similarity comparison algorithm that measures the similarity of audio signals through characteristic parameters such as waveform, envelope and zero-crossing rate; this is a multi-feature audio comparison method, but its comparison parameters are still concentrated on the frequency, time and space domains at the signal level. The audio comparison system of the Phoenix Mountain transmitting station proposed by Zheng et al. describes a safety monitoring scheme for tunnel broadcasting that ensures the safety of broadcast information through audio comparison, the comparison indexes mainly being the Mel cepstrum coefficients, spectral centroid, average energy and short-time zero-crossing rate; this too is essentially a multi-feature comparison at the signal level. Yan et al. of Xihua University convert audio into text using speech recognition and detect sensitive words in the text with a black broadcast keyword library to find black broadcasts; this approach can be seen as a beneficial supplement to the mainstream signal-level detection approaches.
Analysis of the existing audio comparison systems and methods suggests the following. When audio comparison is applied within a local closed-loop system, the noise introduced inside the system is small or even negligible, so comparison analysis at the signal level alone is suitable. However, if broadcast reception signals from monitoring nodes located among urban buildings or in villages are transmitted back for comparison analysis outside the broadcasting station or broadcast transmission system, the influence of noise must be considered: noise easily causes variation in some single characteristic quantities and finally invalidates the comparison. In addition, in a real wide-area scene there is always a transmission delay between the far-end received signal and the reference signal, and sliding-window matching should be added before comparison, which significantly increases the time complexity of audio comparison. The delay factor superposed on the noise factor further reduces the accuracy of signal-level comparison, whereas semantic comparison can better suppress the influence of delay.
Disclosure of Invention
Aiming at the problems in the prior art, a fusion comparison method for identifying black broadcasts based on broadcast electrical-signal features and content semantic features is provided. The core process is as follows: first, at the signal level, the characteristics of the broadcast signal are reflected by calculating indexes such as the short-time energy, the short-time zero-crossing rate and the spectral centroid; second, at the semantic level, the content characteristics of the broadcast are reflected by performing text word-frequency statistics after speech recognition; finally, a multi-level fusion decision rule is established on the signal features and the semantic features to detect whether a black broadcast occupying a normal broadcast frequency point exists. Experiments reflect the effectiveness and engineering applicability of the invention.
The technical scheme adopted by the invention is as follows: a black broadcast audio recognition method, comprising:
S1, extracting the signal features of the returned audio and the reference audio;
S2, extracting the semantic features of the returned audio and the reference audio;
S3, calculating the signal similarity and the semantic similarity between the returned audio and the reference audio from the signal features and the semantic features;
S4, comparing first by the semantic similarity: if the semantic similarity is judged high, the comparison result is obtained; if the semantic similarity is judged low, the signal similarity is compared to obtain the comparison result, and identification of the black broadcast audio is completed according to the comparison result.
Further, in S1, the signal features include the spectral centroid, the short-time average energy and the short-time zero-crossing rate, all calculated from the frequency data of the decoded audio file.
Further, the sub-steps of S2 include:
S21, recognizing the audio file through a plurality of speech recognition interfaces to obtain a plurality of texts output by the corresponding interfaces;
S22, performing word frequency analysis on each output text to form word frequency dictionaries;
S23, summarizing the word frequency dictionaries formed from the texts output by the plurality of interfaces, adding the weights, and taking the words whose word frequency in the summarized dictionary is larger than a set threshold as keywords to obtain the semantic features of the audio.
Further, in S21, the speech recognition interfaces number 3, including at least 1 network interface and 1 local interface.
Further, in S22, the specific process of forming the word frequency dictionary by the word frequency analysis includes:
S221, segmenting the text, storing the result in a word array, initializing the word frequency dictionary, and setting the word-array subscript i = 0;
S222, taking the i-th word of the word array and judging whether it is a null (stop) word; if so, entering S224, otherwise entering S223;
S223, judging whether the word is in the dictionary; if so, adding 1 to the frequency of that word in the dictionary; otherwise, adding the word to the dictionary and setting its frequency to 1;
S224, judging whether the word array has been fully traversed; if so, entering S225; otherwise, adding 1 to the value of i and returning to S222;
S225, the word frequency dictionary is formed.
Further, the sub-steps of S23 include:
S231, summarizing the word frequency dictionaries:
ci_dic_j = {(w_1^j, c_1^j), (w_2^j, c_2^j), …, (w_{N_j}^j, c_{N_j}^j)}, j = 0, 1
wherein j = 0 denotes the reference audio word frequency dictionary and j = 1 the returned audio word frequency dictionary; w_i^0 denotes a word in the reference audio and c_i^0 its word frequency; w_i^1 denotes a word in the returned audio and c_i^1 its word frequency; N_1 and N_2 denote the number of words in the reference audio and the returned audio respectively;
S232, taking the words whose word frequency is larger than the set threshold as keywords:
key_set_j = (key_1, key_2, … key_i …)
wherein key_i are the higher-frequency words in the dictionary ci_dic_j, j = 0, 1; key_set_0 is denoted the reference audio keyword set and key_set_1 the returned audio keyword set.
Further, the signal similarity calculation in S3 specifically includes:
S311, performing dimensionality reduction on the signal features:
step = L / M
v(i) = Σ_{j = i·step}^{(i+1)·step - 1} s(j), i = 0, 1, …, M - 1
wherein L is the length of the vector s, M is the length of the new vector v after dimensionality reduction and can be set as required, and step is the step size, every step consecutive values s(j) being summed to form one v(i);
S312, normalizing the reduced signal features:
v'(i) = (v(i) - min(v)) / (max(v) - min(v))
wherein v' is the normalized signal feature vector and each component lies in [0, 1];
S313, calculating the similarity between the returned audio and the reference audio from the normalized signal features, the similarity being computed as:
sim = Σ_i (a_i - mean(a))(b_i - mean(b)) / ( √(Σ_i (a_i - mean(a))²) · √(Σ_i (b_i - mean(b))²) )
wherein a and b are the reduced, normalized feature vectors of the returned audio and the reference audio, and the similarities of the spectral centroid, short-time average energy and short-time zero-crossing rate features are denoted sim_1, sim_2 and sim_3 respectively.
Further, the semantic similarity calculation in S3 includes:
S321, calculating the number of common keywords in the semantic features of the returned audio and the reference audio:
sim_num = NUM(key_set_0 ∩ key_set_1)
wherein NUM(·) denotes the number of elements in the set;
S322, calculating the semantic similarity from the number of common keywords:
sim_4 = HIGH, if sim_num ≥ threshold; sim_4 = LOW, if sim_num < threshold
wherein threshold is the decision threshold, HIGH denotes high similarity, and LOW denotes low similarity.
Further, in S4, the multi-level fusion decision rule is: compare first according to the semantic similarity; when the semantic comparison cannot judge the audios similar, perform the signal similarity comparison; if the overall comparison result is smaller than the similarity threshold, a black broadcast phenomenon exists at the far-end receiving node.
Further, the comparison result is specifically calculated as:
Final_sim = 1, if sim_4 = HIGH; Final_sim = signal_sim, if sim_4 = LOW
wherein Final_sim is the overall comparison result and signal_sim is the signal feature comparison result obtained by combining sim_1, sim_2 and sim_3.
Compared with the prior art, the beneficial effects of this technical scheme are as follows: the invention has better robustness and better suppresses the influence of noise and transmission delay. When a single feature fails because of noise or similar factors, the remaining features still provide a reference; meanwhile, transmission delay has relatively little influence on semantic analysis, and within a certain range no sliding-window matching is needed. The stability of semantic analysis under delay can, to a certain extent, offset the instability of signal analysis under delay.
Drawings
Fig. 1 is a diagram of a black broadcast audio recognition process of the present invention.
Fig. 2 is a diagram of a semantic feature extraction process in the present invention.
FIG. 3 is a flow chart of forming a word frequency dictionary in the present invention.
Fig. 4 is a frequency waveform characteristic diagram of a segment of audio under superimposed noise and delay effects in an embodiment of the invention.
Fig. 5 is a diagram of the signal features extracted from the row-1 audio in fig. 4.
Fig. 6 is a diagram of the signal features extracted from the row-2 audio in fig. 4.
Fig. 7 is a diagram of the signal features extracted from the row-3 audio in fig. 4.
Fig. 8 is a diagram of the signal features extracted from the row-4 audio in fig. 4.
Fig. 9 is a diagram of the signal features extracted from the row-5 audio in fig. 4.
Fig. 10 is a diagram of the signal features extracted from the row-6 audio in fig. 4.
FIG. 11 is a graph showing the effect of various comparison methods under noise conditions.
FIG. 12 is a graph showing the effect of various comparison methods under delay conditions.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in fig. 1, the present invention provides a fusion comparison method for black broadcast identification based on broadcast electrical-signal features and content semantic features, which includes:
S1, extracting the signal features of the returned audio and the reference audio;
S2, extracting the semantic features of the returned audio and the reference audio;
S3, calculating the signal similarity and the semantic similarity between the returned audio and the reference audio from the signal features and the semantic features;
S4, comparing first by the semantic similarity: if the semantic comparison judges the audios similar, the comparison result is obtained; if it cannot judge them similar, the signal similarity comparison is performed to obtain the comparison result, and identification of the black broadcast audio is completed according to the comparison result.
The specific scheme is as follows:
S1, extracting the audio signal features
The signal features used by the invention comprise the spectral centroid, the short-time average energy and the short-time zero-crossing rate; these 3 features can be calculated from frequency data. The frequency data come from a decoded audio file, such as a wav file.
The spectral centroid describes the brightness of a sound: dull, low-quality sounds tend to contain more low-frequency content and have a relatively low spectral centroid, while bright, cheerful sounds are mostly concentrated at high frequencies and have a relatively high spectral centroid. The spectral centroid is calculated as:
C = Σ_n f(n)·E(n) / Σ_n E(n)
where f(n) is the frequency of the audio signal from the audio file, and E(n) is the spectral energy at the corresponding frequency after a short-time Fourier transform of the continuous time-domain signal x(t).
The short-time energy / short-time average energy is a statistic of the speech energy within a time window and an important index for audio feature analysis. Important uses of the short-time energy include distinguishing unvoiced from voiced sounds and judging voiced versus silent segments. The short-time average energy is calculated as:
E_m = (1/N) Σ_{n=m}^{m+N-1} x²(n)
where N is the sliding window length.
The short-time average zero-crossing rate is a characteristic parameter in the time-domain analysis of speech signals. The zero-crossing rate is the number of times a signal crosses zero in unit time; the zero-crossing rate over a period of time is called the average zero-crossing rate. The short-time average zero-crossing rate can be used to distinguish unvoiced from voiced speech: a high zero-crossing rate indicates unvoiced speech, and a low zero-crossing rate indicates voiced speech. The short-time zero-crossing rate is calculated as:
Z_m = (1/2N) Σ_{n=m+1}^{m+N-1} |sign[x(n)] - sign[x(n-1)]|
where sign[·] is the sign function, namely:
sign[x] = 1 for x ≥ 0; sign[x] = -1 for x < 0
The spectral centroid, the short-time energy and the short-time zero-crossing rate are all represented as vectors whose length relates to the sliding window size N, and they are used in the subsequent multi-feature comparison. An illustrative sketch of these features follows.
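The following minimal Python sketch computes the three features, assuming a mono PCM wav file; the frame length of 1024 samples and hop of 512 are illustrative choices, not values fixed by this description.

```python
# Minimal sketch of the three signal features, assuming a mono PCM wav file.
# Frame length 1024 and hop 512 are illustrative, not mandated by the patent.
import numpy as np
from scipy.io import wavfile

def frame_signal(x, frame_len=1024, hop=512):
    n = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def spectral_centroid(frames, sr):
    E = np.abs(np.fft.rfft(frames, axis=1))               # spectral energy E(n)
    f = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)      # frequency axis f(n)
    return (E * f).sum(axis=1) / (E.sum(axis=1) + 1e-10)  # sum f(n)E(n) / sum E(n)

def short_time_energy(frames):
    return (frames ** 2).mean(axis=1)                     # average energy per window

def zero_crossing_rate(frames):
    s = np.where(frames >= 0, 1.0, -1.0)                  # sign[x]: 1 if x >= 0 else -1
    return 0.5 * np.abs(np.diff(s, axis=1)).mean(axis=1)  # (1/2N) sum |sign differences|

sr, x = wavfile.read("audio.wav")                         # decoded audio file
frames = frame_signal(x.astype(np.float64))
s1 = [spectral_centroid(frames, sr), short_time_energy(frames), zero_crossing_rate(frames)]
```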
S2, audio semantic feature extraction
As shown in fig. 2, the audio semantic features are the Chinese word frequencies and subject words contained in the audio file, which reflect the general meaning of the audio content. The input of the semantic extraction process is an audio file and the output is a word frequency list and a subject word list. The extraction process specifically comprises:
s21, identifying the audio files through a plurality of voice identification interfaces to obtain a plurality of texts output by the corresponding interfaces;
s22, respectively carrying out word frequency analysis on the output texts to form word frequency dictionaries;
and S23, summarizing word frequency dictionaries formed by a plurality of interface output texts, adding weights, and taking words with the word frequency larger than a set threshold in the summarized word frequency dictionaries as keywords to obtain the semantic features of the audio.
Considering system robustness and reliability, the speech recognition uses 3 interface channels, including at least 1 network interface and 1 local interface. Preferably, the network interface can be a Baidu interface, an iFlytek interface or the like; the local interface can be a PocketSphinx interface. For robustness, when a network interface is interrupted, the local interface ensures that black broadcast identification keeps working normally; for reliability, to suppress the influence of noise, only the keywords common to multiple interface channels are extracted as reliable semantic features. A sketch of this multi-channel arrangement follows.
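A minimal sketch of the multi-channel recognition with local fallback is given below; recognize_sphinx is the speech_recognition package's PocketSphinx binding (a Mandarin acoustic model must be installed for language="zh-CN"), while the network channels are hypothetical wrappers standing in for the vendors' real SDK calls, which are not shown here.

```python
# Sketch of multi-channel recognition with a local PocketSphinx fallback.
# network_channels are hypothetical wrappers around the Baidu/iFlytek SDKs.
import speech_recognition as sr

def recognize_local(path):
    r = sr.Recognizer()
    with sr.AudioFile(path) as source:
        audio = r.record(source)
    return r.recognize_sphinx(audio, language="zh-CN")  # local interface, works offline

def recognize_all(path, network_channels=()):
    texts = [recognize_local(path)]          # local channel is always available
    for channel in network_channels:         # e.g. Baidu / iFlytek wrappers
        try:
            texts.append(channel(path))
        except Exception:                    # network interruption: skip this channel
            pass
    return texts
```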
Word frequency analysis is needed to go from text to keywords, and its premise is word segmentation. The invention performs word segmentation with the open-source jieba tool, which has 3 segmentation modes: full mode, precise mode and search-engine mode. Precise mode attempts to cut the sentence most accurately and is suitable for text analysis; this embodiment uses precise mode.
After word segmentation, the word frequency statistics are completed by means of a dictionary data structure. As shown in fig. 3, the specific process of forming the word frequency dictionary by word frequency analysis includes:
S221, segmenting the text, storing the result in a word array, initializing the word frequency dictionary, and setting the word-array subscript i = 0;
S222, taking the i-th word of the word array and judging whether it is a null (stop) word; if so, entering S224, otherwise entering S223;
S223, judging whether the word is in the dictionary; if so, adding 1 to the frequency of that word in the dictionary; otherwise, adding the word to the dictionary and setting its frequency to 1;
S224, judging whether the word array has been fully traversed; if so, entering S225; otherwise, adding 1 to the value of i and returning to S222;
S225, the word frequency dictionary is formed. A sketch of this procedure follows.
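The following sketch mirrors steps S221-S225, using jieba precise mode; the stop-word set is an illustrative placeholder, not one specified by this description.

```python
# Minimal sketch of the S221-S225 word-frequency procedure.
import jieba

STOP_WORDS = {"", " ", "的", "了", "是", "在"}   # hypothetical null/stop words

def word_freq_dict(text):
    words = jieba.lcut(text, cut_all=False)     # S221: precise-mode segmentation
    ci_dic = {}                                 # initialized word frequency dictionary
    for w in words:                             # S222/S224: traverse the word array
        if w in STOP_WORDS:                     # null word: skip to the next one
            continue
        ci_dic[w] = ci_dic.get(w, 0) + 1        # S223: count +1, or insert with frequency 1
    return ci_dic                               # S225: the formed dictionary
```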
Because noise in a real system affects the speech recognition result, the word frequencies counted on the different channels may differ. The word frequency dictionaries formed by the channels are therefore summarized and their weights added: words that appear in many channels gain more weight, while words that appear in only individual channels gain little, and the words whose word frequency exceeds the set threshold are taken as keywords. In this embodiment the keywords are also abstracted into vectors for the subsequent multi-feature comparison. A sketch of the merge follows.
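A sketch of the summarizing step under one reading of the description: the per-channel counts are simply added, so words seen on several channels accumulate more weight; threshold = 2 here mirrors the embodiment's keyword threshold and is not mandated.

```python
# Sketch of merging per-channel dictionaries and selecting keywords.
# Adding raw counts as the "weight" is our reading; threshold is illustrative.
def merge_and_select(channel_dicts, threshold=2):
    merged = {}
    for d in channel_dicts:                      # one ci_dic per recognition channel
        for w, c in d.items():
            merged[w] = merged.get(w, 0) + c     # weight grows with each channel
    return {w for w, c in merged.items() if c > threshold}   # keyword set (key_set)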
S3 similarity evaluation of signal features and semantic features
The signal feature similarity evaluation comprises:
Let s1 and s2 be the signal features of the returned audio and the reference audio respectively. In this embodiment, s1 and s2 stand in turn for the spectral centroid, short-time energy and short-time zero-crossing rate features of the returned audio and the reference audio; s1 and s2 are vectors of equal initial dimension whose size relates to the audio duration. The audio signal similarity is calculated as follows:
(1) Vector dimensionality reduction, which reduces computation and suppresses noise interference.
Because an audio file contains a large number of frequencies, the dimensions of s1 and s2 are large. Reducing the dimension of the signal features not only reduces the amount of calculation but also helps suppress noise interference. The reduction uses the following formulas:
step = L / M
v(i) = Σ_{j = i·step}^{(i+1)·step - 1} s(j), i = 0, 1, …, M - 1
where L is the length of the vector s and M is the length of the new vector v after dimensionality reduction; step is the step size, and every step consecutive values s(j) are summed to form one v(i). M is set to 100 in this embodiment. The new vector v keeps the contour characteristics of the original vector s while reducing the data volume and increasing robustness. It should be noted that a properly sized M can suppress the effect of delay to some extent.
(2) Normalization
Dimensionality reduction unifies the dimension; to further unify the value range of each component, the new vector v is normalized as follows:
v'(i) = (v(i) - min(v)) / (max(v) - min(v))
where v' is the normalized vector and each component lies in [0, 1]. A sketch of the reduction and normalization follows.
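A sketch of S311 and S312, with M = 100 as in the embodiment; truncating the vector to step·M samples is an assumption for lengths that do not divide evenly.

```python
# Sketch of dimensionality reduction (S311) and normalization (S312).
import numpy as np

def reduce_dim(s, M=100):
    step = len(s) // M                                   # step = L / M
    idx = np.arange(0, step * M, step)
    s = np.asarray(s[: step * M], dtype=float)           # truncation: our assumption
    return np.add.reduceat(s, idx)                       # each v(i) sums step values s(j)

def normalize(v):
    return (v - v.min()) / (v.max() - v.min() + 1e-10)   # every component in [0, 1]
```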
(3) Vector comparison
The two vectors are compared by calculating their similarity, which can be computed by the cosine method, the Pearson coefficient method or a distance method. This embodiment uses the Pearson coefficient method, where the similarity is calculated as:
sim = Σ_i (a_i - mean(a))(b_i - mean(b)) / ( √(Σ_i (a_i - mean(a))²) · √(Σ_i (b_i - mean(b))²) )
where a and b are the reduced, normalized vectors of the returned audio and the reference audio respectively, a_i is the i-th element of a and b_i the i-th element of b. Applying this formula to the spectral centroid, short-time energy and short-time zero-crossing rate features yields their similarities, denoted sim_1, sim_2 and sim_3 respectively. A sketch of this calculation follows.
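A sketch of the Pearson-coefficient similarity of S313; a and b are the reduced, normalized feature vectors of the returned audio and the reference audio.

```python
# Sketch of the Pearson-coefficient similarity used for each feature pair.
import numpy as np

def pearson_sim(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    da, db = a - a.mean(), b - b.mean()
    return (da * db).sum() / (np.sqrt((da ** 2).sum() * (db ** 2).sum()) + 1e-10)

# sim1, sim2, sim3 come from applying pearson_sim to the centroid,
# energy and zero-crossing-rate vectors in turn.
```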
Similarity evaluation at the semantic level:
For both the returned audio and the reference audio, the semantic features are defined as:
ci_dic_j = {(w_1^j, c_1^j), (w_2^j, c_2^j), …, (w_{N_j}^j, c_{N_j}^j)}, j = 0, 1
where j = 0 denotes the reference audio features and j = 1 the returned audio features; w_i^0 denotes a word in the reference audio and c_i^0 its word frequency; w_i^1 denotes a word in the returned audio and c_i^1 its word frequency. The numbers of words in the reference audio and the returned audio may differ and are denoted N_1 and N_2 respectively.
First, keyword analysis is performed on the returned audio and the reference audio; the words with higher word frequency are the keywords:
key_set_j = (key_1, key_2, … key_i …)
where key_i are the higher-frequency words in the dictionary ci_dic_j, j = 0, 1; key_set_0 is denoted the reference audio keyword set and key_set_1 the returned audio keyword set.
The number of keywords common to the two is taken:
sim_num = NUM(key_set_0 ∩ key_set_1)
where NUM(·) denotes the number of elements in the set. The semantic similarity is then expressed as:
sim_4 = HIGH, if sim_num ≥ threshold; sim_4 = LOW, if sim_num < threshold
In the above formula, sim_4 is the semantic feature similarity, distinguished from sim_1, sim_2 and sim_3; threshold is the decision threshold. In this embodiment threshold = 2, meaning that if the two keyword sets share 2 or more words the similarity is considered high; otherwise it is low. A sketch of this decision follows.
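A sketch of the S321/S322 semantic decision, with threshold = 2 as in the embodiment:

```python
# Sketch of the semantic similarity decision (threshold = 2 per the embodiment).
def semantic_sim(key_set0, key_set1, threshold=2):
    sim_num = len(key_set0 & key_set1)               # NUM(key_set0 ∩ key_set1)
    return "HIGH" if sim_num >= threshold else "LOW" # sim4
```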
S4, comprehensive evaluation
Noise and delay affect different features differently. Specifically for the features of the invention, noise has a large influence on speech recognition and hence on semantic analysis, but delay has little influence on semantic analysis. In a practical system, noise can be suppressed by various means, but delay is always present, and for some features delay introduces extra registration operations. The invention therefore adopts a multi-level comparison evaluation method: first the semantic similarity is compared; if it is judged high, the comparison result is obtained; if it is judged low, the signal similarity is compared to obtain the comparison result, and identification of the black broadcast audio is completed according to the comparison result. The comparison result is expressed as:
Final_sim = 1, if sim_4 = HIGH; Final_sim = signal_sim, if sim_4 = LOW
The expression above shows that the semantic feature comparison is performed first; when the semantic comparison cannot judge the audios similar, signal-level multi-feature comparison is performed, the signal features being the spectral centroid, energy and zero-crossing rate similarities sim_1, sim_2 and sim_3 combined into signal_sim.
When the comparison result Final_sim is smaller than a certain threshold, a black broadcast phenomenon may exist at the far-end receiving node on that frequency point, and an alarm must be raised immediately so that the relevant personnel can handle it. A sketch of the fusion decision follows.
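A sketch of the multi-level fusion decision; averaging sim_1, sim_2 and sim_3 into signal_sim is one reading of the description (the exact combination is not spelled out), and the alarm threshold of 0.5 is illustrative within the 0.3-1 range named in claim 9.

```python
# Sketch of the multi-level fusion decision.
# Averaging sim1..sim3 into signal_sim is our reading; alarm_threshold is illustrative.
def final_sim(sim4, sim1, sim2, sim3):
    if sim4 == "HIGH":
        return 1.0                           # semantic comparison already decides
    return (sim1 + sim2 + sim3) / 3.0        # fall back to signal-level multi-feature result

def is_black_broadcast(final, alarm_threshold=0.5):
    return final < alarm_threshold           # below threshold: raise the alarm
```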
This embodiment provides an experimental test of extracting and comparing the signal features and semantic features in the presence of noise and delay interference; the duration of the audio in all comparisons is about 30 seconds. Fig. 4 shows the frequency waveform characteristics of a segment of audio under superimposed noise and delay effects. Row 1 is noise-free, delay-free audio representing the original reference signal; rows 2 to 6 represent noisy, delayed audio received remotely. The noise of rows 2 and 3 is small, that of rows 4 and 5 larger, and that of row 6 largest; the delay of rows 2, 4 and 6 is small and that of rows 3 and 5 large. It can be seen that as noise and delay increase, the reference audio and the far-end audio signals differ more at the same instant.
Figs. 5-10 reflect the feature extraction effect of the comparison method of the invention on the audio of fig. 4, covering both signal features and semantic features. Specifically, fig. 5 shows the signal features (spectral centroid, short-time energy, zero-crossing rate) and semantic features extracted from the row-1 (reference) audio of fig. 4, the semantic features being ['computer science', 'research', 'old talk', 'programming', 'problem', 'progress', 'update', 'difficult', 'maybe', 'why']; fig. 6 shows those of the row-2 audio, with semantic features ['computer science', 'old life', 'programming', 'research', 'book', 'progress', 'back', 'hard race', 'learning', 'maybe']; fig. 7 shows those of the row-3 audio, with semantic features ['computer science', 'old-fashioned', 'programming', 'do not ask', 'question', 'book', 'back', 'hard race', 'learning', 'may']; fig. 8 shows those of the row-4 audio, with semantic features ['computer science', 'old age', 'programming', 'study', 'back', 'up-to-date', 'difficult track', 'learning', 'calculation', 'maybe']; fig. 9 shows those of the row-5 audio, with semantic features ['computer science', 'old-fashioned', 'programming', 'research', 'question', 'book', 'cell phone', 'back', 'hard track', 'learning']; and fig. 10 shows those of the row-6 audio, with semantic features ['science'].
It can be seen that the signal-level features already differ significantly, so sliding-window matching must be introduced to improve comparison accuracy, which in turn increases the computation by an order of magnitude and also introduces errors; the semantic-level features, by contrast, remain relatively stable. Figs. 5-10 also show that with the comparison method of the invention, the comparison can in most cases be completed directly at the semantic level without the extra registration operations caused by delay. However, when the noise grows beyond a certain degree, semantic feature extraction fails and can no longer serve as the basis of audio comparison, and the signal-level features are used for comparison instead, although the signal features are also strongly affected at that point.
In another embodiment, the comparison between the returned audio and the reference audio is carried out on a real broadcast transmitter subsystem, a far-end node acquisition-and-return subsystem and a data-center analysis subsystem. From real acquired data and theoretical analysis, the comparison effect of the method of the invention versus the traditional signal-level single-feature and multi-feature comparison methods is obtained. Fig. 11 reflects the relationship between the comparison methods and the returned-audio noise intensity without delay: at the initial stage the accuracy of all comparison methods is high; as the noise intensity increases, the accuracy of every method begins to decrease, but the method of the invention and the traditional multi-feature method are affected relatively less; as the noise increases further, the method of the invention degenerates into a signal-level multi-feature comparison method and the effects of the various methods gradually converge.
Fig. 12 reflects the relationship between the comparison methods and the returned-audio delay: at the initial stage the accuracy of all methods is high; as the delay increases, the accuracy of the signal-based comparison methods begins to decrease while that of the method of the invention remains basically unchanged; as the delay increases further, the method of the invention degenerates into a signal-level multi-feature comparison method, and finally the comparison accuracies converge. Combining figs. 11 and 12, the comparison method of the invention has the best overall performance and the strongest tolerance to interference under noise and delay in the returned audio.
The invention is not limited to the foregoing embodiments. The invention extends to any novel feature or any novel combination of features disclosed in this specification and any novel method or process steps or any novel combination of features disclosed. Those skilled in the art to which the invention pertains will appreciate that insubstantial changes or modifications can be made without departing from the spirit of the invention as defined by the appended claims.
All of the features disclosed in this specification, or all of the steps in any method or process so disclosed, may be combined in any combination, except combinations of features and/or steps that are mutually exclusive.
Any feature disclosed in this specification may be replaced by alternative features serving equivalent or similar purposes, unless expressly stated otherwise. That is, unless expressly stated otherwise, each feature is only an example of a generic series of equivalent or similar features.

Claims (9)

1. A black broadcast audio recognition method, comprising:
S1, extracting the signal features of the returned audio and the reference audio;
S2, extracting the semantic features of the returned audio and the reference audio;
S3, calculating the signal similarity and the semantic similarity between the returned audio and the reference audio from the signal features and the semantic features;
S4, comparing first by the semantic similarity: if the semantic similarity is judged high, obtaining the comparison result; if the semantic similarity is judged low, comparing the signal similarity to obtain the comparison result, and completing the identification of the black broadcast audio according to the comparison result.
2. The black broadcast audio recognition method of claim 1, wherein in S1 the signal features comprise a spectral centroid, a short-time average energy and a short-time zero-crossing rate, calculated from the frequency data of the decoded audio file.
3. The black broadcast audio recognition method according to claim 1, wherein S2 specifically includes:
S21, recognizing the audio file through a plurality of speech recognition interfaces to obtain a plurality of texts output by the corresponding interfaces;
S22, performing word frequency analysis on each output text to form word frequency dictionaries;
S23, summarizing the word frequency dictionaries formed from the texts output by the plurality of interfaces, adding the weights, and taking the words whose word frequency in the summarized dictionary is larger than a set threshold as keywords to obtain the semantic features of the audio.
4. The black broadcast audio recognition method of claim 1, wherein in S21 the speech recognition interfaces number 3, including at least 1 network interface and 1 local interface.
5. The black broadcasting audio recognition method of claim 1, wherein in S22, the specific process of forming the word frequency dictionary by word frequency analysis comprises:
S221, segmenting the text, storing the result in a word array, initializing the word frequency dictionary, and setting the word-array subscript i = 0;
S222, taking the i-th word of the word array and judging whether it is a null (stop) word; if so, entering S224, otherwise entering S223;
S223, judging whether the word is in the dictionary; if so, adding 1 to the frequency of that word in the dictionary; otherwise, adding the word to the dictionary and setting its frequency to 1;
S224, judging whether the word array has been fully traversed; if so, entering S225; otherwise, adding 1 to the value of i and returning to S222;
S225, the word frequency dictionary is formed.
6. The black broadcast audio recognition method according to claim 1, wherein S23 specifically includes:
S231, summarizing the word frequency dictionaries:
ci_dic_j = {(w_1^j, c_1^j), (w_2^j, c_2^j), …, (w_{N_j}^j, c_{N_j}^j)}, j = 0, 1
wherein j = 0 denotes the reference audio word frequency dictionary and j = 1 the returned audio word frequency dictionary; w_i^0 denotes a word in the reference audio and c_i^0 its word frequency; w_i^1 denotes a word in the returned audio and c_i^1 its word frequency; N_1 and N_2 denote the number of words in the reference audio and the returned audio respectively;
S232, taking the words whose word frequency is larger than the set threshold as keywords:
key_set_j = (key_1, key_2, … key_i …)
wherein key_i are the higher-frequency words in the dictionary ci_dic_j, j = 0, 1; key_set_0 is denoted the reference audio keyword set and key_set_1 the returned audio keyword set.
7. The black broadcast audio recognition method according to claim 1, wherein the signal similarity calculation in S3 specifically comprises:
S311, performing dimensionality reduction on the signal features to form a new vector v = [v(0), v(1), …, v(M-1)]:
step = L / M
v(i) = Σ_{j = i·step}^{(i+1)·step - 1} s(j), i = 0, 1, …, M - 1
wherein L is the length of the vector s, M is the length of the new vector v after dimensionality reduction and can be set as required, and step is the step size, every step consecutive values s(j) being summed to form one v(i);
S312, normalizing the reduced signal features:
v'(i) = (v(i) - min(v)) / (max(v) - min(v))
wherein v' is the normalized signal feature vector and each component lies in [0, 1];
S313, calculating the similarity between the returned audio and the reference audio from the normalized signal features:
sim = Σ_i (a_i - mean(a))(b_i - mean(b)) / ( √(Σ_i (a_i - mean(a))²) · √(Σ_i (b_i - mean(b))²) )
wherein a and b are the reduced, normalized feature vectors of the returned audio and the reference audio, and the similarities of the spectral centroid, short-time average energy and short-time zero-crossing rate features are denoted sim_1, sim_2 and sim_3 respectively.
8. The black broadcast audio recognition method of claim 1, wherein the semantic similarity calculation in S3 comprises:
s321, calculating the number of common keywords in the semantic features of the returned audio and the reference audio:
sim_num = NUM(key_set_0 ∩ key_set_1)
wherein NUM(·) represents the number of elements in the set;
S322, calculating the semantic similarity from the number of common keywords:
sim_4 = HIGH, if sim_num ≥ threshold; sim_4 = LOW, if sim_num < threshold
wherein threshold is a decision threshold, HIGH represents high similarity, and LOW represents low similarity.
9. The method of claim 1, wherein in S4 the comparison result is specifically calculated as:
Final_sim = 1, if sim_4 = HIGH; Final_sim = signal_sim, if sim_4 = LOW
wherein Final_sim is the comparison result and signal_sim is the signal feature comparison result combining sim_1, sim_2 and sim_3; when the comparison result Final_sim is smaller than a threshold value, it is determined that a black broadcast phenomenon exists at the remote receiving node on that frequency point; the threshold value is set between 0.3 and 1, and the higher it is set, the stricter the determination.
CN202010935451.XA 2020-09-08 2020-09-08 Black broadcast audio recognition method Withdrawn CN112019285A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010935451.XA CN112019285A (en) 2020-09-08 2020-09-08 Black broadcast audio recognition method


Publications (1)

Publication Number Publication Date
CN112019285A 2020-12-01

Family

ID=73516133

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010935451.XA Withdrawn CN112019285A (en) 2020-09-08 2020-09-08 Black broadcast audio recognition method

Country Status (1)

Country Link
CN (1) CN112019285A (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106100777A (en) * 2016-05-27 2016-11-09 西华大学 Broadcast support method based on speech recognition technology
CN109995450A (en) * 2019-04-08 2019-07-09 南京航空航天大学 One kind is based on cloud speech recognition and Intelligent detecting " black broadcast " method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
郑鑫、卢宇 (Zheng Xin, Lu Yu): "An audio multi-feature comparison method for identifying illegal broadcasts" (一种用于识别非法广播的音频多特征比对方法), Radio & TV Broadcast Engineering (广播与电视技术), page 1 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20201201)