CN112019285A - Black broadcast audio recognition method - Google Patents

Black broadcast audio recognition method

Info

Publication number
CN112019285A
CN112019285A (application CN202010935451.XA)
Authority
CN
China
Prior art keywords
audio
signal
similarity
semantic
vocabulary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010935451.XA
Other languages
Chinese (zh)
Inventor
郑鑫 (Zheng Xin)
汤善武 (Tang Shanwu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Huaqian Technology Co ltd
Original Assignee
Chengdu Huaqian Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Huaqian Technology Co ltd filed Critical Chengdu Huaqian Technology Co ltd
Priority to CN202010935451.XA priority Critical patent/CN112019285A/en
Publication of CN112019285A publication Critical patent/CN112019285A/en
Withdrawn legal-status Critical Current

Classifications

    • H ELECTRICITY
        • H04 ELECTRIC COMMUNICATION TECHNIQUE
            • H04H BROADCAST COMMUNICATION
                • H04H20/00 Arrangements for broadcast or for distribution combined with broadcast
                    • H04H20/12 Arrangements for observation, testing or troubleshooting
                        • H04H20/14 Arrangements for observation, testing or troubleshooting for monitoring programmes
                • H04H60/00 Arrangements for broadcast applications with a direct linking to broadcast information or broadcast space-time; Broadcast-related systems
                    • H04H60/29 Arrangements for monitoring broadcast services or broadcast-related services
    • G PHYSICS
        • G10 MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
                • G10L15/00 Speech recognition
                    • G10L15/08 Speech classification or search
                        • G10L15/18 Speech classification or search using natural language modelling
                            • G10L15/1822 Parsing for meaning understanding
                    • G10L15/26 Speech to text systems
                    • G10L15/28 Constructional details of speech recognition systems
                        • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
                • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
                    • G10L25/03 characterised by the type of extracted parameters
                        • G10L25/18 the extracted parameters being spectral information of each sub-band
                        • G10L25/21 the extracted parameters being power information
                    • G10L25/48 specially adapted for particular use
                        • G10L25/51 for comparison or discrimination

Abstract

The invention provides a black broadcast audio identification method, which comprises the following steps: S1, extracting the signal features of the returned audio and the reference audio; S2, extracting the semantic features of the returned audio and the reference audio; S3, calculating the signal similarity and the semantic similarity between the returned audio and the reference audio from the signal features and the semantic features; S4, comparing first by the semantic similarity: if the semantic similarity is judged high, the comparison result is obtained; if the semantic similarity is judged low, the signal similarity is compared to obtain the comparison result, and identification of the black broadcast audio is completed according to the comparison result. The invention has better robustness and better suppresses the influence of noise and transmission delay: when a single feature fails because of noise or similar factors, the remaining features still provide a reference; meanwhile, transmission delay has relatively little influence on semantic analysis, and the stability of semantic analysis under delay can, to a certain extent, offset the instability of signal analysis under delay.

Description

Black broadcast audio recognition method
Technical Field
The invention relates to the field of black broadcast identification, and in particular to a black broadcast audio identification method.
Background
With the development of information technology and broadcast media technology, black broadcasting has drawn increasing attention in recent years. Black broadcasts cause significant social harm. Black broadcast base stations are mostly erected in residential districts, seriously affecting residents' health; black broadcasts are filled with false information, such as advertisements for counterfeit medicines and shoddy products; black broadcasts can even threaten family and social stability. Black broadcasting must therefore be resolutely combated, and the precondition for combating it is finding it effectively. However, black broadcasting technology also keeps developing, and its behavior has become more concealed: some black broadcasts even occupy the frequency points of normal broadcasts, and their content increasingly "looks" like normal broadcast content. Identifying black broadcasts therefore requires more comprehensive and intelligent technical means and processing methods.
Audio comparison is an effective idea for finding black broadcasts. Its core is: receive the broadcast audio signal of a certain frequency point at a certain location and transmit the signal back to a comparison center. The comparison center compares the returned audio with the reference audio; if the two are inconsistent, the frequency-point signal received at that location is a black broadcast signal, and a black broadcast signal source may exist near that location. On audio comparison technology, Chen Yujie et al. describe an audio comparison system and method used at the Guangxi radio station: the AES signal from the mixing console, the ASI code stream from the encoder and the FM/AM signal received from the transmitting station are compared, with the Mel cepstrum coefficients in the audio frequency domain as the comparison index, in order to find illegal interference, signal insertion and other black broadcast phenomena; this is a comparison method based on a single frequency-domain feature. Similarly, Li Chunshuang, Deng Chuxiong and Zhao Qi describe the audio comparison systems of the Tianjin, Guangdong and Liaoning Chaoyang broadcasting stations respectively, where audio signals are taken from the mixing console and the transmitter front end for comparison, so that abnormal phenomena such as cross-broadcasting and mis-broadcasting can be found in time. Zhang Lin et al. describe an audio similarity comparison algorithm that measures the similarity of audio signals through characteristic parameters such as waveform, envelope and zero-crossing rate; this is a multi-feature audio comparison method, but its comparison parameters are still concentrated on the frequency, time and space domains at the signal level. The audio comparison system of the Phoenix Mountain transmitting station proposed by Zheng et al. describes a safety monitoring scheme for tunnel broadcasting that ensures the safety of broadcast information through audio comparison, the comparison indexes mainly being the Mel cepstrum coefficients, spectral centroid, average energy and short-time zero-crossing rate; this too is essentially a multi-feature comparison at the signal level. Yan et al. of Xihua University convert audio into text using speech recognition and detect sensitive words in the text with a black broadcast keyword library to find black broadcasts; this approach can be seen as a beneficial supplement to the mainstream signal-level detection approaches.
Analysis of the existing audio comparison systems and methods suggests the following. When audio comparison is applied within a local closed-loop system, the noise introduced inside the system is small or even negligible, so comparison analysis at the signal level alone is suitable. However, if broadcast reception signals from monitoring nodes located among urban buildings or in villages are transmitted back for comparison analysis outside the broadcasting station or broadcast transmission system, the influence of noise must be considered: noise easily causes variation in some single characteristic quantities and finally invalidates the comparison. In addition, in a real wide-area scene there is always a transmission delay between the far-end received signal and the reference signal, and sliding-window matching should be added before comparison, which significantly increases the time complexity of audio comparison. The delay factor superposed on the noise factor further reduces the accuracy of signal-level comparison, whereas semantic comparison can better suppress the influence of delay.
Disclosure of Invention
Aiming at the problems in the prior art, a fusion comparison method for identifying black broadcasts based on broadcast electrical-signal features and content semantic features is provided. The core process is as follows: first, at the signal level, the characteristics of the broadcast signal are reflected by calculating indexes such as the short-time energy, the short-time zero-crossing rate and the spectral centroid; second, at the semantic level, the content characteristics of the broadcast are reflected by performing text word-frequency statistics after speech recognition; finally, a multi-level fusion decision rule is established on the signal features and the semantic features to detect whether a black broadcast occupying a normal broadcast frequency point exists. Experiments reflect the effectiveness and engineering applicability of the invention.
The technical scheme adopted by the invention is as follows: a black broadcast audio recognition method, comprising:
S1, extracting the signal features of the returned audio and the reference audio;
S2, extracting the semantic features of the returned audio and the reference audio;
S3, calculating the signal similarity and the semantic similarity between the returned audio and the reference audio from the signal features and the semantic features;
S4, comparing first by the semantic similarity: if the semantic similarity is judged high, the comparison result is obtained; if the semantic similarity is judged low, the signal similarity is compared to obtain the comparison result, and identification of the black broadcast audio is completed according to the comparison result.
Further, in S1, the signal features include the spectral centroid, the short-time average energy and the short-time zero-crossing rate, all calculated from the frequency data of the decoded audio file.
Further, the sub-steps of S2 include:
S21, recognizing the audio file through a plurality of speech recognition interfaces to obtain a plurality of texts output by the corresponding interfaces;
S22, performing word frequency analysis on each output text to form word frequency dictionaries;
S23, summarizing the word frequency dictionaries formed from the texts output by the plurality of interfaces, adding the weights, and taking the words whose word frequency in the summarized dictionary is larger than a set threshold as keywords to obtain the semantic features of the audio.
Further, in S21, the speech recognition interfaces number 3, including at least 1 network interface and 1 local interface.
Further, in S22, the specific process of forming the word frequency dictionary by the word frequency analysis includes:
S221, segmenting the text, storing the result in a word array, initializing the word frequency dictionary, and setting the word-array subscript i = 0;
S222, taking the i-th word of the word array and judging whether it is a null (stop) word; if so, entering S224, otherwise entering S223;
S223, judging whether the word is in the dictionary; if so, adding 1 to the frequency of that word in the dictionary; otherwise, adding the word to the dictionary and setting its frequency to 1;
S224, judging whether the word array has been fully traversed; if so, entering S225; otherwise, adding 1 to the value of i and returning to S222;
S225, the word frequency dictionary is formed.
Further, the sub-steps of S23 include:
S231, summarizing the word frequency dictionaries:
ci_dic_j = {(w_1^j, c_1^j), (w_2^j, c_2^j), …, (w_{N_j}^j, c_{N_j}^j)}, j = 0, 1
wherein j = 0 denotes the reference audio word frequency dictionary and j = 1 the returned audio word frequency dictionary; w_i^0 denotes a word in the reference audio and c_i^0 its word frequency; w_i^1 denotes a word in the returned audio and c_i^1 its word frequency; N_1 and N_2 denote the number of words in the reference audio and the returned audio respectively;
S232, taking the words whose word frequency is larger than the set threshold as keywords:
key_set_j = (key_1, key_2, … key_i …)
wherein key_i are the higher-frequency words in the dictionary ci_dic_j, j = 0, 1; key_set_0 is denoted the reference audio keyword set and key_set_1 the returned audio keyword set.
Further, the signal similarity calculation in S3 specifically includes:
S311, performing dimensionality reduction on the signal features:
step = L / M
v(i) = Σ_{j = i·step}^{(i+1)·step - 1} s(j), i = 0, 1, …, M - 1
wherein L is the length of the vector s, M is the length of the new vector v after dimensionality reduction and can be set as required, and step is the step size, every step consecutive values s(j) being summed to form one v(i);
S312, normalizing the reduced signal features:
v'(i) = (v(i) - min(v)) / (max(v) - min(v))
wherein v' is the normalized signal feature vector and each component lies in [0, 1];
S313, calculating the similarity between the returned audio and the reference audio from the normalized signal features, the similarity being computed as:
sim = Σ_i (a_i - mean(a))(b_i - mean(b)) / ( √(Σ_i (a_i - mean(a))²) · √(Σ_i (b_i - mean(b))²) )
wherein a and b are the reduced, normalized feature vectors of the returned audio and the reference audio, and the similarities of the spectral centroid, short-time average energy and short-time zero-crossing rate features are denoted sim_1, sim_2 and sim_3 respectively.
Further, the semantic similarity calculation in S3 includes:
S321, calculating the number of common keywords in the semantic features of the returned audio and the reference audio:
sim_num = NUM(key_set_0 ∩ key_set_1)
wherein NUM(·) denotes the number of elements in the set;
S322, calculating the semantic similarity from the number of common keywords:
sim_4 = HIGH, if sim_num ≥ threshold; sim_4 = LOW, if sim_num < threshold
wherein threshold is the decision threshold, HIGH denotes high similarity, and LOW denotes low similarity.
Further, in S4, the multi-level fusion decision rule is: compare first according to the semantic similarity; when the semantic comparison cannot judge the audios similar, perform the signal similarity comparison; if the overall comparison result is smaller than the similarity threshold, a black broadcast phenomenon exists at the far-end receiving node.
Further, the comparison result is specifically calculated as:
Final_sim = 1, if sim_4 = HIGH; Final_sim = signal_sim, if sim_4 = LOW
wherein Final_sim is the overall comparison result and signal_sim is the signal feature comparison result obtained by combining sim_1, sim_2 and sim_3.
Compared with the prior art, the beneficial effects of this technical scheme are as follows: the invention has better robustness and better suppresses the influence of noise and transmission delay. When a single feature fails because of noise or similar factors, the remaining features still provide a reference; meanwhile, transmission delay has relatively little influence on semantic analysis, and within a certain range no sliding-window matching is needed. The stability of semantic analysis under delay can, to a certain extent, offset the instability of signal analysis under delay.
Drawings
Fig. 1 is a diagram of a black broadcast audio recognition process of the present invention.
Fig. 2 is a diagram of a semantic feature extraction process in the present invention.
FIG. 3 is a flow chart of forming a word frequency dictionary in the present invention.
Fig. 4 is a frequency waveform characteristic diagram of a segment of audio under superimposed noise and delay effects in an embodiment of the invention.
Fig. 5 is a diagram of the signal features extracted from the row-1 audio in fig. 4.
Fig. 6 is a diagram of the signal features extracted from the row-2 audio in fig. 4.
Fig. 7 is a diagram of the signal features extracted from the row-3 audio in fig. 4.
Fig. 8 is a diagram of the signal features extracted from the row-4 audio in fig. 4.
Fig. 9 is a diagram of the signal features extracted from the row-5 audio in fig. 4.
Fig. 10 is a diagram of the signal features extracted from the row-6 audio in fig. 4.
FIG. 11 is a graph showing the effect of various comparison methods under noise conditions.
FIG. 12 is a graph showing the effect of various comparison methods under delay conditions.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in fig. 1, the present invention provides a fusion comparison method for black broadcast identification based on broadcast electrical-signal features and content semantic features, which includes:
S1, extracting the signal features of the returned audio and the reference audio;
S2, extracting the semantic features of the returned audio and the reference audio;
S3, calculating the signal similarity and the semantic similarity between the returned audio and the reference audio from the signal features and the semantic features;
S4, comparing first by the semantic similarity: if the semantic comparison judges the audios similar, the comparison result is obtained; if it cannot judge them similar, the signal similarity comparison is performed to obtain the comparison result, and identification of the black broadcast audio is completed according to the comparison result.
The specific scheme is as follows:
S1, extracting the audio signal features
The signal features used by the invention comprise the spectral centroid, the short-time average energy and the short-time zero-crossing rate; these 3 features can be calculated from frequency data. The frequency data come from a decoded audio file, such as a wav file.
The spectral centroid describes the brightness of a sound: dull, low-quality sounds tend to contain more low-frequency content and have a relatively low spectral centroid, while bright, cheerful sounds are mostly concentrated at high frequencies and have a relatively high spectral centroid. The spectral centroid is calculated as:
C = Σ_n f(n)·E(n) / Σ_n E(n)
where f(n) is the frequency of the audio signal from the audio file, and E(n) is the spectral energy at the corresponding frequency after a short-time Fourier transform of the continuous time-domain signal x(t).
The short-time energy / short-time average energy is a statistic of the speech energy within a time window and an important index for audio feature analysis. Important uses of the short-time energy include distinguishing unvoiced from voiced sounds and judging voiced versus silent segments. The short-time average energy is calculated as:
E_m = (1/N) Σ_{n=m}^{m+N-1} x²(n)
where N is the sliding window length.
The short-time average zero-crossing rate is a characteristic parameter in the time-domain analysis of speech signals. The zero-crossing rate is the number of times a signal crosses zero in unit time; the zero-crossing rate over a period of time is called the average zero-crossing rate. The short-time average zero-crossing rate can be used to distinguish unvoiced from voiced speech: a high zero-crossing rate indicates unvoiced speech, and a low zero-crossing rate indicates voiced speech. The short-time zero-crossing rate is calculated as:
Z_m = (1/2N) Σ_{n=m+1}^{m+N-1} |sign[x(n)] - sign[x(n-1)]|
where sign[·] is the sign function, namely:
sign[x] = 1 for x ≥ 0; sign[x] = -1 for x < 0
The spectral centroid, the short-time energy and the short-time zero-crossing rate are all represented as vectors whose length relates to the sliding window size N, and they are used in the subsequent multi-feature comparison. An illustrative sketch of these features follows.
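The following minimal Python sketch computes the three features, assuming a mono PCM wav file; the frame length of 1024 samples and hop of 512 are illustrative choices, not values fixed by this description.

```python
# Minimal sketch of the three signal features, assuming a mono PCM wav file.
# Frame length 1024 and hop 512 are illustrative, not mandated by the patent.
import numpy as np
from scipy.io import wavfile

def frame_signal(x, frame_len=1024, hop=512):
    n = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def spectral_centroid(frames, sr):
    E = np.abs(np.fft.rfft(frames, axis=1))               # spectral energy E(n)
    f = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)      # frequency axis f(n)
    return (E * f).sum(axis=1) / (E.sum(axis=1) + 1e-10)  # sum f(n)E(n) / sum E(n)

def short_time_energy(frames):
    return (frames ** 2).mean(axis=1)                     # average energy per window

def zero_crossing_rate(frames):
    s = np.where(frames >= 0, 1.0, -1.0)                  # sign[x]: 1 if x >= 0 else -1
    return 0.5 * np.abs(np.diff(s, axis=1)).mean(axis=1)  # (1/2N) sum |sign differences|

sr, x = wavfile.read("audio.wav")                         # decoded audio file
frames = frame_signal(x.astype(np.float64))
s1 = [spectral_centroid(frames, sr), short_time_energy(frames), zero_crossing_rate(frames)]
```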
S2, audio semantic feature extraction
As shown in fig. 2, the audio semantic features are the Chinese word frequencies and subject words contained in the audio file, which reflect the general meaning of the audio content. The input of the semantic extraction process is an audio file and the output is a word frequency list and a subject word list. The extraction process specifically comprises:
s21, identifying the audio files through a plurality of voice identification interfaces to obtain a plurality of texts output by the corresponding interfaces;
s22, respectively carrying out word frequency analysis on the output texts to form word frequency dictionaries;
and S23, summarizing word frequency dictionaries formed by a plurality of interface output texts, adding weights, and taking words with the word frequency larger than a set threshold in the summarized word frequency dictionaries as keywords to obtain the semantic features of the audio.
Considering system robustness and reliability, the speech recognition uses 3 interface channels, including at least 1 network interface and 1 local interface. Preferably, the network interface can be a Baidu interface, an iFlytek interface or the like; the local interface can be a PocketSphinx interface. For robustness, when a network interface is interrupted, the local interface ensures that black broadcast identification keeps working normally; for reliability, to suppress the influence of noise, only the keywords common to multiple interface channels are extracted as reliable semantic features. A sketch of this multi-channel arrangement follows.
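A minimal sketch of the multi-channel recognition with local fallback is given below; recognize_sphinx is the speech_recognition package's PocketSphinx binding (a Mandarin acoustic model must be installed for language="zh-CN"), while the network channels are hypothetical wrappers standing in for the vendors' real SDK calls, which are not shown here.

```python
# Sketch of multi-channel recognition with a local PocketSphinx fallback.
# network_channels are hypothetical wrappers around the Baidu/iFlytek SDKs.
import speech_recognition as sr

def recognize_local(path):
    r = sr.Recognizer()
    with sr.AudioFile(path) as source:
        audio = r.record(source)
    return r.recognize_sphinx(audio, language="zh-CN")  # local interface, works offline

def recognize_all(path, network_channels=()):
    texts = [recognize_local(path)]          # local channel is always available
    for channel in network_channels:         # e.g. Baidu / iFlytek wrappers
        try:
            texts.append(channel(path))
        except Exception:                    # network interruption: skip this channel
            pass
    return texts
```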
Word frequency analysis is needed to go from text to keywords, and its premise is word segmentation. The invention performs word segmentation with the open-source jieba tool, which has 3 segmentation modes: full mode, precise mode and search-engine mode. Precise mode attempts to cut the sentence most accurately and is suitable for text analysis; this embodiment uses precise mode.
After word segmentation, the word frequency statistics are completed by means of a dictionary data structure. As shown in fig. 3, the specific process of forming the word frequency dictionary by word frequency analysis includes:
S221, segmenting the text, storing the result in a word array, initializing the word frequency dictionary, and setting the word-array subscript i = 0;
S222, taking the i-th word of the word array and judging whether it is a null (stop) word; if so, entering S224, otherwise entering S223;
S223, judging whether the word is in the dictionary; if so, adding 1 to the frequency of that word in the dictionary; otherwise, adding the word to the dictionary and setting its frequency to 1;
S224, judging whether the word array has been fully traversed; if so, entering S225; otherwise, adding 1 to the value of i and returning to S222;
S225, the word frequency dictionary is formed. A sketch of this procedure follows.
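The following sketch mirrors steps S221-S225, using jieba precise mode; the stop-word set is an illustrative placeholder, not one specified by this description.

```python
# Minimal sketch of the S221-S225 word-frequency procedure.
import jieba

STOP_WORDS = {"", " ", "的", "了", "是", "在"}   # hypothetical null/stop words

def word_freq_dict(text):
    words = jieba.lcut(text, cut_all=False)     # S221: precise-mode segmentation
    ci_dic = {}                                 # initialized word frequency dictionary
    for w in words:                             # S222/S224: traverse the word array
        if w in STOP_WORDS:                     # null word: skip to the next one
            continue
        ci_dic[w] = ci_dic.get(w, 0) + 1        # S223: count +1, or insert with frequency 1
    return ci_dic                               # S225: the formed dictionary
```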
Because noise in a real system affects the speech recognition result, the word frequencies counted on the different channels may differ. The word frequency dictionaries formed by the channels are therefore summarized and their weights added: words that appear in many channels gain more weight, while words that appear in only individual channels gain little, and the words whose word frequency exceeds the set threshold are taken as keywords. In this embodiment the keywords are also abstracted into vectors for the subsequent multi-feature comparison. A sketch of the merge follows.
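A sketch of the summarizing step under one reading of the description: the per-channel counts are simply added, so words seen on several channels accumulate more weight; threshold = 2 here mirrors the embodiment's keyword threshold and is not mandated.

```python
# Sketch of merging per-channel dictionaries and selecting keywords.
# Adding raw counts as the "weight" is our reading; threshold is illustrative.
def merge_and_select(channel_dicts, threshold=2):
    merged = {}
    for d in channel_dicts:                      # one ci_dic per recognition channel
        for w, c in d.items():
            merged[w] = merged.get(w, 0) + c     # weight grows with each channel
    return {w for w, c in merged.items() if c > threshold}   # keyword set (key_set)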
S3 similarity evaluation of signal features and semantic features
The signal feature similarity evaluation comprises:
Let s1 and s2 be the signal features of the returned audio and the reference audio respectively. In this embodiment, s1 and s2 stand in turn for the spectral centroid, short-time energy and short-time zero-crossing rate features of the returned audio and the reference audio; s1 and s2 are vectors of equal initial dimension whose size relates to the audio duration. The audio signal similarity is calculated as follows:
(1) Vector dimensionality reduction, which reduces computation and suppresses noise interference.
Because an audio file contains a large number of frequencies, the dimensions of s1 and s2 are large. Reducing the dimension of the signal features not only reduces the amount of calculation but also helps suppress noise interference. The reduction uses the following formulas:
step = L / M
v(i) = Σ_{j = i·step}^{(i+1)·step - 1} s(j), i = 0, 1, …, M - 1
where L is the length of the vector s and M is the length of the new vector v after dimensionality reduction; step is the step size, and every step consecutive values s(j) are summed to form one v(i). M is set to 100 in this embodiment. The new vector v keeps the contour characteristics of the original vector s while reducing the data volume and increasing robustness. It should be noted that a properly sized M can suppress the effect of delay to some extent.
(2) Normalization
Dimensionality reduction unifies the dimension; to further unify the value range of each component, the new vector v is normalized as follows:
v'(i) = (v(i) - min(v)) / (max(v) - min(v))
where v' is the normalized vector and each component lies in [0, 1]. A sketch of the reduction and normalization follows.
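A sketch of S311 and S312, with M = 100 as in the embodiment; truncating the vector to step·M samples is an assumption for lengths that do not divide evenly.

```python
# Sketch of dimensionality reduction (S311) and normalization (S312).
import numpy as np

def reduce_dim(s, M=100):
    step = len(s) // M                                   # step = L / M
    idx = np.arange(0, step * M, step)
    s = np.asarray(s[: step * M], dtype=float)           # truncation: our assumption
    return np.add.reduceat(s, idx)                       # each v(i) sums step values s(j)

def normalize(v):
    return (v - v.min()) / (v.max() - v.min() + 1e-10)   # every component in [0, 1]
```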
(3) Vector comparison
The two vectors are compared by calculating their similarity, which can be computed by the cosine method, the Pearson coefficient method or a distance method. This embodiment uses the Pearson coefficient method, where the similarity is calculated as:
sim = Σ_i (a_i - mean(a))(b_i - mean(b)) / ( √(Σ_i (a_i - mean(a))²) · √(Σ_i (b_i - mean(b))²) )
where a and b are the reduced, normalized vectors of the returned audio and the reference audio respectively, a_i is the i-th element of a and b_i the i-th element of b. Applying this formula to the spectral centroid, short-time energy and short-time zero-crossing rate features yields their similarities, denoted sim_1, sim_2 and sim_3 respectively. A sketch of this calculation follows.
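A sketch of the Pearson-coefficient similarity of S313; a and b are the reduced, normalized feature vectors of the returned audio and the reference audio.

```python
# Sketch of the Pearson-coefficient similarity used for each feature pair.
import numpy as np

def pearson_sim(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    da, db = a - a.mean(), b - b.mean()
    return (da * db).sum() / (np.sqrt((da ** 2).sum() * (db ** 2).sum()) + 1e-10)

# sim1, sim2, sim3 come from applying pearson_sim to the centroid,
# energy and zero-crossing-rate vectors in turn.
```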
Similarity evaluation at the semantic level:
For both the returned audio and the reference audio, the semantic features are defined as:
ci_dic_j = {(w_1^j, c_1^j), (w_2^j, c_2^j), …, (w_{N_j}^j, c_{N_j}^j)}, j = 0, 1
where j = 0 denotes the reference audio features and j = 1 the returned audio features; w_i^0 denotes a word in the reference audio and c_i^0 its word frequency; w_i^1 denotes a word in the returned audio and c_i^1 its word frequency. The numbers of words in the reference audio and the returned audio may differ and are denoted N_1 and N_2 respectively.
First, keyword analysis is performed on the returned audio and the reference audio; the words with higher word frequency are the keywords:
key_set_j = (key_1, key_2, … key_i …)
where key_i are the higher-frequency words in the dictionary ci_dic_j, j = 0, 1; key_set_0 is denoted the reference audio keyword set and key_set_1 the returned audio keyword set.
The number of keywords common to the two is taken:
sim_num = NUM(key_set_0 ∩ key_set_1)
where NUM(·) denotes the number of elements in the set. The semantic similarity is then expressed as:
sim_4 = HIGH, if sim_num ≥ threshold; sim_4 = LOW, if sim_num < threshold
In the above formula, sim_4 is the semantic feature similarity, distinguished from sim_1, sim_2 and sim_3; threshold is the decision threshold. In this embodiment threshold = 2, meaning that if the two keyword sets share 2 or more words the similarity is considered high; otherwise it is low. A sketch of this decision follows.
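A sketch of the S321/S322 semantic decision, with threshold = 2 as in the embodiment:

```python
# Sketch of the semantic similarity decision (threshold = 2 per the embodiment).
def semantic_sim(key_set0, key_set1, threshold=2):
    sim_num = len(key_set0 & key_set1)               # NUM(key_set0 ∩ key_set1)
    return "HIGH" if sim_num >= threshold else "LOW" # sim4
```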
S4, comprehensive evaluation
Noise and delay affect different features differently. Specifically for the features of the invention, noise has a large influence on speech recognition and hence on semantic analysis, but delay has little influence on semantic analysis. In a practical system, noise can be suppressed by various means, but delay is always present, and for some features delay introduces extra registration operations. The invention therefore adopts a multi-level comparison evaluation method: first the semantic similarity is compared; if it is judged high, the comparison result is obtained; if it is judged low, the signal similarity is compared to obtain the comparison result, and identification of the black broadcast audio is completed according to the comparison result. The comparison result is expressed as:
Final_sim = 1, if sim_4 = HIGH; Final_sim = signal_sim, if sim_4 = LOW
The expression above shows that the semantic feature comparison is performed first; when the semantic comparison cannot judge the audios similar, signal-level multi-feature comparison is performed, the signal features being the spectral centroid, energy and zero-crossing rate similarities sim_1, sim_2 and sim_3 combined into signal_sim.
When the comparison result Final_sim is smaller than a certain threshold, a black broadcast phenomenon may exist at the far-end receiving node on that frequency point, and an alarm must be raised immediately so that the relevant personnel can handle it. A sketch of the fusion decision follows.
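A sketch of the multi-level fusion decision; averaging sim_1, sim_2 and sim_3 into signal_sim is one reading of the description (the exact combination is not spelled out), and the alarm threshold of 0.5 is illustrative within the 0.3-1 range named in claim 9.

```python
# Sketch of the multi-level fusion decision.
# Averaging sim1..sim3 into signal_sim is our reading; alarm_threshold is illustrative.
def final_sim(sim4, sim1, sim2, sim3):
    if sim4 == "HIGH":
        return 1.0                           # semantic comparison already decides
    return (sim1 + sim2 + sim3) / 3.0        # fall back to signal-level multi-feature result

def is_black_broadcast(final, alarm_threshold=0.5):
    return final < alarm_threshold           # below threshold: raise the alarm
```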
This embodiment provides an experimental test of extracting and comparing the signal features and semantic features in the presence of noise and delay interference; the duration of the audio in all comparisons is about 30 seconds. Fig. 4 shows the frequency waveform characteristics of a segment of audio under superimposed noise and delay effects. Row 1 is noise-free, delay-free audio representing the original reference signal; rows 2 to 6 represent noisy, delayed audio received remotely. The noise of rows 2 and 3 is small, that of rows 4 and 5 larger, and that of row 6 largest; the delay of rows 2, 4 and 6 is small and that of rows 3 and 5 large. It can be seen that as noise and delay increase, the reference audio and the far-end audio signals differ more at the same instant.
Figs. 5-10 reflect the feature extraction effect of the comparison method of the invention on the audio of fig. 4, covering both signal features and semantic features. Specifically, fig. 5 shows the signal features (spectral centroid, short-time energy, zero-crossing rate) and semantic features extracted from the row-1 (reference) audio of fig. 4, the semantic features being ['computer science', 'research', 'old talk', 'programming', 'problem', 'progress', 'update', 'difficult', 'maybe', 'why']; fig. 6 shows those of the row-2 audio, with semantic features ['computer science', 'old life', 'programming', 'research', 'book', 'progress', 'back', 'hard race', 'learning', 'maybe']; fig. 7 shows those of the row-3 audio, with semantic features ['computer science', 'old-fashioned', 'programming', 'do not ask', 'question', 'book', 'back', 'hard race', 'learning', 'may']; fig. 8 shows those of the row-4 audio, with semantic features ['computer science', 'old age', 'programming', 'study', 'back', 'up-to-date', 'difficult track', 'learning', 'calculation', 'maybe']; fig. 9 shows those of the row-5 audio, with semantic features ['computer science', 'old-fashioned', 'programming', 'research', 'question', 'book', 'cell phone', 'back', 'hard track', 'learning']; and fig. 10 shows those of the row-6 audio, with semantic features ['science'].
It can be seen that the signal-level features already differ significantly, so sliding-window matching must be introduced to improve comparison accuracy, which in turn increases the computation by an order of magnitude and also introduces errors; the semantic-level features, by contrast, remain relatively stable. Figs. 5-10 also show that with the comparison method of the invention, the comparison can in most cases be completed directly at the semantic level without the extra registration operations caused by delay. However, when the noise grows beyond a certain degree, semantic feature extraction fails and can no longer serve as the basis of audio comparison, and the signal-level features are used for comparison instead, although the signal features are also strongly affected at that point.
In another embodiment, the comparison between the returned audio and the reference audio is carried out on a real broadcast transmitter subsystem, a far-end node acquisition-and-return subsystem and a data-center analysis subsystem. From real acquired data and theoretical analysis, the comparison effect of the method of the invention versus the traditional signal-level single-feature and multi-feature comparison methods is obtained. Fig. 11 reflects the relationship between the comparison methods and the returned-audio noise intensity without delay: at the initial stage the accuracy of all comparison methods is high; as the noise intensity increases, the accuracy of every method begins to decrease, but the method of the invention and the traditional multi-feature method are affected relatively less; as the noise increases further, the method of the invention degenerates into a signal-level multi-feature comparison method and the effects of the various methods gradually converge.
Fig. 12 reflects the relationship between the comparison methods and the returned-audio delay: at the initial stage the accuracy of all methods is high; as the delay increases, the accuracy of the signal-based comparison methods begins to decrease while that of the method of the invention remains basically unchanged; as the delay increases further, the method of the invention degenerates into a signal-level multi-feature comparison method, and finally the comparison accuracies converge. Combining figs. 11 and 12, the comparison method of the invention has the best overall performance and the strongest tolerance to interference under noise and delay in the returned audio.
The invention is not limited to the foregoing embodiments. The invention extends to any novel feature or any novel combination of features disclosed in this specification and any novel method or process steps or any novel combination of features disclosed. Those skilled in the art to which the invention pertains will appreciate that insubstantial changes or modifications can be made without departing from the spirit of the invention as defined by the appended claims.
All of the features disclosed in this specification, or all of the steps in any method or process so disclosed, may be combined in any combination, except combinations of features and/or steps that are mutually exclusive.
Any feature disclosed in this specification may be replaced by alternative features serving equivalent or similar purposes, unless expressly stated otherwise. That is, unless expressly stated otherwise, each feature is only an example of a generic series of equivalent or similar features.

Claims (9)

1. A black broadcast audio recognition method, comprising:
S1, extracting the signal features of the returned audio and the reference audio;
S2, extracting the semantic features of the returned audio and the reference audio;
S3, calculating the signal similarity and the semantic similarity between the returned audio and the reference audio from the signal features and the semantic features;
S4, comparing first by the semantic similarity: if the semantic similarity is judged high, obtaining the comparison result; if the semantic similarity is judged low, comparing the signal similarity to obtain the comparison result, and completing the identification of the black broadcast audio according to the comparison result.
2. The black broadcast audio recognition method of claim 1, wherein in S1 the signal features comprise a spectral centroid, a short-time average energy and a short-time zero-crossing rate, calculated from the frequency data of the decoded audio file.
3. The black broadcast audio recognition method according to claim 1, wherein S2 specifically includes:
S21, recognizing the audio file through a plurality of speech recognition interfaces to obtain a plurality of texts output by the corresponding interfaces;
S22, performing word frequency analysis on each output text to form word frequency dictionaries;
S23, summarizing the word frequency dictionaries formed from the texts output by the plurality of interfaces, adding the weights, and taking the words whose word frequency in the summarized dictionary is larger than a set threshold as keywords to obtain the semantic features of the audio.
4. The black broadcast audio recognition method of claim 1, wherein in S21 the speech recognition interfaces number 3, including at least 1 network interface and 1 local interface.
5. The black broadcasting audio recognition method of claim 1, wherein in S22, the specific process of forming the word frequency dictionary by word frequency analysis comprises:
S221, segmenting the text, storing the result in a word array, initializing the word frequency dictionary, and setting the word-array subscript i = 0;
S222, taking the i-th word of the word array and judging whether it is a null (stop) word; if so, entering S224, otherwise entering S223;
S223, judging whether the word is in the dictionary; if so, adding 1 to the frequency of that word in the dictionary; otherwise, adding the word to the dictionary and setting its frequency to 1;
S224, judging whether the word array has been fully traversed; if so, entering S225; otherwise, adding 1 to the value of i and returning to S222;
S225, the word frequency dictionary is formed.
6. The black broadcast audio recognition method according to claim 1, wherein S23 specifically includes:
S231, summarizing the word frequency dictionaries:
ci_dic_j = {(w_1^j, c_1^j), (w_2^j, c_2^j), …, (w_{N_j}^j, c_{N_j}^j)}, j = 0, 1
wherein j = 0 denotes the reference audio word frequency dictionary and j = 1 the returned audio word frequency dictionary; w_i^0 denotes a word in the reference audio and c_i^0 its word frequency; w_i^1 denotes a word in the returned audio and c_i^1 its word frequency; N_1 and N_2 denote the number of words in the reference audio and the returned audio respectively;
S232, taking the words whose word frequency is larger than the set threshold as keywords:
key_set_j = (key_1, key_2, … key_i …)
wherein key_i are the higher-frequency words in the dictionary ci_dic_j, j = 0, 1; key_set_0 is denoted the reference audio keyword set and key_set_1 the returned audio keyword set.
7. The black broadcast audio recognition method according to claim 1, wherein the signal similarity calculation in S3 specifically comprises:
S311, performing dimensionality reduction on the signal features to form a new vector v = [v(0), v(1), …, v(M-1)]:
step = L / M
v(i) = Σ_{j = i·step}^{(i+1)·step - 1} s(j), i = 0, 1, …, M - 1
wherein L is the length of the vector s, M is the length of the new vector v after dimensionality reduction and can be set as required, and step is the step size, every step consecutive values s(j) being summed to form one v(i);
S312, normalizing the reduced signal features:
v'(i) = (v(i) - min(v)) / (max(v) - min(v))
wherein v' is the normalized signal feature vector and each component lies in [0, 1];
S313, calculating the similarity between the returned audio and the reference audio from the normalized signal features:
sim = Σ_i (a_i - mean(a))(b_i - mean(b)) / ( √(Σ_i (a_i - mean(a))²) · √(Σ_i (b_i - mean(b))²) )
wherein a and b are the reduced, normalized feature vectors of the returned audio and the reference audio, and the similarities of the spectral centroid, short-time average energy and short-time zero-crossing rate features are denoted sim_1, sim_2 and sim_3 respectively.
8. The black broadcast audio recognition method of claim 1, wherein the semantic similarity calculation in S3 comprises:
s321, calculating the number of common keywords in the semantic features of the returned audio and the reference audio:
sim_num = NUM(key_set_0 ∩ key_set_1)
wherein NUM(·) represents the number of elements in the set;
S322, calculating the semantic similarity from the number of common keywords:
sim_4 = HIGH, if sim_num ≥ threshold; sim_4 = LOW, if sim_num < threshold
wherein threshold is a decision threshold, HIGH represents high similarity, and LOW represents low similarity.
9. The method of claim 1, wherein in S4 the comparison result is specifically calculated as:
Final_sim = 1, if sim_4 = HIGH; Final_sim = signal_sim, if sim_4 = LOW
wherein Final_sim is the comparison result and signal_sim is the signal feature comparison result combining sim_1, sim_2 and sim_3; when the comparison result Final_sim is smaller than a threshold value, it is determined that a black broadcast phenomenon exists at the remote receiving node on that frequency point; the threshold value is set between 0.3 and 1, and the higher it is set, the stricter the determination.
CN202010935451.XA 2020-09-08 2020-09-08 Black broadcast audio recognition method Withdrawn CN112019285A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010935451.XA CN112019285A (en) 2020-09-08 2020-09-08 Black broadcast audio recognition method


Publications (1)

Publication Number Publication Date
CN112019285A 2020-12-01

Family

ID=73516133

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010935451.XA Withdrawn CN112019285A (en) 2020-09-08 2020-09-08 Black broadcast audio recognition method

Country Status (1)

Country Link
CN (1) CN112019285A (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106100777A (en) * 2016-05-27 2016-11-09 西华大学 Broadcast support method based on speech recognition technology
CN109995450A (en) * 2019-04-08 2019-07-09 南京航空航天大学 One kind is based on cloud speech recognition and Intelligent detecting " black broadcast " method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
郑鑫、卢宇 (Zheng Xin, Lu Yu): "An audio multi-feature comparison method for identifying illegal broadcasts" (一种用于识别非法广播的音频多特征比对方法), Radio & TV Broadcast Engineering (广播与电视技术), page 1 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20201201)