CN116072123A - Broadcast information playing method and device, readable storage medium and electronic equipment - Google Patents

Broadcast information playing method and device, readable storage medium and electronic equipment

Info

Publication number
CN116072123A
Authority
CN
China
Prior art keywords
voice information
personnel
voice
feature
broadcasting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310202075.7A
Other languages
Chinese (zh)
Other versions
CN116072123B (en)
Inventor
邱晓健
连峰
邱正峰
崔韧
吴鼎元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanchang Hang Tian Guang Xin Technology Co ltd
Original Assignee
Nanchang Hang Tian Guang Xin Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanchang Hang Tian Guang Xin Technology Co ltd filed Critical Nanchang Hang Tian Guang Xin Technology Co ltd
Priority to CN202310202075.7A priority Critical patent/CN116072123B/en
Publication of CN116072123A publication Critical patent/CN116072123A/en
Application granted granted Critical
Publication of CN116072123B publication Critical patent/CN116072123B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/04 - Training, enrolment or model building
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04H - BROADCAST COMMUNICATION
    • H04H20/00 - Arrangements for broadcast or for distribution combined with broadcast
    • H04H20/86 - Arrangements characterised by the broadcast information itself
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/70 - Reducing energy consumption in communication networks in wireless communication networks

Abstract

The invention discloses a broadcast information playing method and apparatus, a readable storage medium and an electronic device. The broadcast information playing method comprises the following steps: acquiring voice information collected by a microphone, and extracting acoustic features from the voice information; inputting the acoustic features into a voiceprint recognition model to identify the current voice person; judging, according to the recognition result, whether the current voice person is a person in a preset list; if so, sending the voice information to a broadcast terminal for playing; if not, extracting the content of the voice information and analyzing it to judge whether the content meets the broadcasting requirement; and, when the broadcasting requirement is met, sending the voice information to the broadcast terminal for playing.

Description

Broadcast information playing method and device, readable storage medium and electronic equipment
Technical Field
The present invention relates to the field of broadcasting devices, and in particular, to a broadcasting information playing method and apparatus, a readable storage medium, and an electronic device.
Background
Broadcasting systems are widely used in many fields. Places such as campuses, hospitals, parks and shopping malls are equipped with broadcasting systems, which are mainly used for music playing, emergency notification, news broadcasting, paging and the like. A broadcast terminal, such as a loudspeaker box, is the terminal device of a network broadcasting system and is connected to an upper computer (such as a server) through a switch via a wireless communication link.
An existing broadcasting system generally comprises a control platform, and at least one microphone and at least one broadcast terminal connected to the control platform. Any person can use such a system to broadcast information, so its use cannot be effectively controlled; this leads to abuse of the broadcasting system and makes it easy for improper information to be spread.
Disclosure of Invention
In view of the foregoing, it is necessary to provide a broadcast information playing method, apparatus, readable storage medium and electronic device, aiming at the problem that the use of the broadcast system in the prior art cannot be effectively controlled.
The invention discloses a broadcast information playing method, which comprises the following steps:
acquiring voice information acquired by a microphone, and extracting acoustic features in the voice information;
inputting the acoustic features into a voiceprint recognition model to identify the current voice personnel;
judging whether the current voice personnel are personnel in a preset list according to the identification result;
when the current voice personnel are personnel in a preset list, sending the voice information to a broadcasting terminal for playing;
when the current voice personnel are not personnel in a preset list, extracting the content of the voice information, and analyzing to judge whether the content of the voice information meets the broadcasting requirement;
and when the content of the voice information meets the broadcasting requirement, sending the voice information to a broadcasting terminal for broadcasting.
Further, in the broadcast information playing method, the step of extracting acoustic features in the voice information includes:
extracting MEL spectrum cepstrum features and Bottleneck features in the voice information;
calculating the weight coefficient of each dimension characteristic component of the MEL spectrum cepstrum feature, and carrying out weighted calculation on the MEL spectrum cepstrum feature according to the weight coefficient of each dimension characteristic component;
and carrying out feature fusion on the MEL spectrum cepstrum feature and the Bottleneck feature after weighted calculation to obtain acoustic features in the voice information.
Further, in the broadcast information playing method, the step of calculating the weight coefficient of each dimension feature component of the MEL spectrum cepstrum feature includes:
calculating contribution degrees of each dimension characteristic component of the MEL spectrum cepstrum characteristic to the speaker identity recognition rate respectively;
carrying out standardization processing on the contribution degree of each dimension characteristic component to the speaker identity recognition rate by adopting a min-max standardization method;
and determining the weight coefficient of each dimension characteristic component according to the contribution degree after the normalization processing.
Further, in the broadcast information playing method, the step of extracting the Bottleneck feature in the voice information includes:
pre-emphasis, framing and windowing are carried out on the voice information;
converting the processed voice information through FFT, and obtaining a corresponding frequency spectrum after taking an absolute value or a square value;
inputting the corresponding frequency spectrum into a Mel filter bank, and obtaining the Mel frequency spectrum output by the Mel filter bank;
taking logarithm of the MEL spectrum to obtain FBanks characteristics;
inputting the FBanks characteristics into a DNN model, and extracting node excitation values of a Bottleneck layer in the DNN model to obtain Bottleneck characteristics.
Further, in the broadcast information playing method, the step of extracting the content of the voice information and analyzing to determine whether the content of the voice information meets the broadcast requirement includes:
identifying the content in the voice information through a voice identification algorithm, and matching with a sensitive word database;
judging whether the voice information contains sensitive words or not according to the matching result;
if not, determining that the voice information meets the broadcasting requirement.
The invention also discloses a broadcasting information playing device, which comprises:
the feature extraction module is used for acquiring voice information acquired by the microphone and extracting acoustic features in the voice information;
the identity recognition module is used for inputting the acoustic characteristics into a voiceprint recognition model so as to recognize the identity of the current voice personnel;
the first judging module is used for judging whether the current voice personnel are personnel in a preset list or not according to the identification result;
the first sending module is used for sending the voice information to a broadcasting terminal for playing when the current voice personnel are personnel in a preset list;
the second judging module is used for extracting the content of the voice information when the current voice personnel are not personnel in a preset list and analyzing the content to judge whether the content of the voice information meets the broadcasting requirement or not;
and the second sending module is used for sending the voice information to a broadcasting terminal for playing when the content of the voice information meets the broadcasting requirement.
Further, in the broadcast information playing device, the feature extraction module is specifically configured to:
extracting MEL spectrum cepstrum features and Bottleneck features in the voice information;
calculating the weight coefficient of each dimension characteristic component of the MEL spectrum cepstrum feature, and carrying out weighted calculation on the MEL spectrum cepstrum feature according to the weight coefficient of each dimension characteristic component;
and carrying out feature fusion on the MEL spectrum cepstrum feature and the Bottleneck feature after weighted calculation to obtain acoustic features in the voice information.
Further, in the broadcast information playing device, the step of calculating the weight coefficient of each dimension feature component of the MEL spectrum cepstrum feature includes:
calculating contribution degrees of each dimension characteristic component of the MEL spectrum cepstrum characteristic to the speaker identity recognition rate respectively;
carrying out standardization processing on the contribution degree of each dimension characteristic component to the speaker identity recognition rate by adopting a min-max standardization method;
and determining the weight coefficient of each dimension characteristic component according to the contribution degree after the normalization processing.
The invention also discloses a readable storage medium, on which a computer program is stored, which when executed by a processor, implements the broadcast information playing method of any one of the above.
The invention also discloses an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the broadcast information playing method of any one of the above is realized when the processor executes the computer program.
According to the invention, acoustic features are extracted from the voice information collected by the microphone, the identity of the current voice person is recognized with a voiceprint recognition model, and whether the current voice person is a person in the preset list is judged according to the recognition result. If so, the voice information of the current voice person is played; if not, the content of the voice information is analyzed to judge whether it meets the broadcasting requirement, and the voice information is played only when it does. By recognizing the identity of the current voice person and checking the voice content, the invention standardizes the use of the broadcasting system.
Drawings
Fig. 1 is a flowchart of a broadcast information playing method in an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating steps of acoustic feature extraction in an embodiment of the present invention;
fig. 3 is a block diagram of a broadcast information playing device according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
Embodiments of the present invention will be apparent with reference to the following description and attached drawings. In the description and drawings, particular implementations of embodiments of the invention are disclosed in detail as being indicative of some of the ways in which the principles of embodiments of the invention may be employed, but it is understood that the scope of the embodiments of the invention is not limited correspondingly. On the contrary, the embodiments of the invention include all alternatives, modifications and equivalents as may be included within the spirit and scope of the appended claims.
Referring to fig. 1, the broadcast information playing method in the embodiment of the invention includes steps S11 to S15.
Step S11, voice information collected by a microphone is obtained, and acoustic features in the voice information are extracted.
Typically a broadcast system will be provided with one or more microphones and at least one broadcast terminal. The user can broadcast voice information through one of the microphones.
Because every person's voice is different, different speakers can be identified from their voice characteristics, which makes it possible to manage the users of the broadcasting system and prevent abuse.
After the voice information of the current voice person collected by the microphone is obtained, acoustic features are extracted from it. The acoustic feature is a voiceprint feature used to identify the current voice person; in one embodiment of the invention, it may be, for example, a fusion of a Mel-frequency cepstral coefficient (MFCC, referred to herein as the MEL spectrum cepstrum feature) and a Bottleneck feature. The MEL spectrum cepstrum feature consists of the coefficients that form the Mel-frequency cepstrum; because it emphasizes the auditory perception of the human ear, it reflects the surface-level speech characteristics of different speakers well and offers good discriminability.
The Bottleneck feature can be extracted with a deep neural network (DNN) in which one hidden layer has a small number of nodes; this hidden layer is the Bottleneck layer. The activation values of the Bottleneck-layer nodes form the Bottleneck feature, which carries strongly discriminative information.
The acoustic feature is obtained by fusing the MEL spectrum cepstrum feature with the Bottleneck feature. The fused acoustic feature inherits the advantages of both, strengthening the individuality of a speaker's voice characteristics and improving recognition performance.
Specifically, as shown in fig. 2, in one implementation of the present invention, the step of extracting the acoustic feature in the voice information includes:
step S111, extracting MEL spectrum cepstrum features and Bottleneck features in the voice information;
step S112, calculating weight coefficients of each dimension characteristic component of the MEL spectrum cepstrum feature, and carrying out weighted calculation on the MEL spectrum cepstrum feature according to the weight coefficients of each dimension characteristic component;
and step S113, carrying out feature fusion on the MEL frequency spectrum cepstrum feature and the Bottleneck feature after the weighted calculation to obtain acoustic features in the voice information.
Specifically, the step of extracting MEL spectrum cepstrum features in the voice information includes:
pre-emphasis, framing and windowing are carried out on the voice information;
converting the processed voice information through FFT, and obtaining a corresponding frequency spectrum after taking an absolute value or a square value;
inputting the corresponding frequency spectrum into a Mel filter bank, and obtaining the Mel frequency spectrum output by the Mel filter bank;
and carrying out cepstrum analysis on the MEL spectrum to obtain MEL spectrum cepstrum characteristics.
Pre-emphasis, framing and windowing of the voice information reduce the interference of noise, enhance the signal-to-noise ratio of the voice signal and improve accuracy. Each processed frame is then transformed by FFT (fast Fourier transform), and the absolute value or squared magnitude is taken to obtain the corresponding energy spectrum. The spectrum is fed into a Mel filter bank, which converts its physical frequency scale into the Mel scale, i.e., converts the linear spectrum into a Mel spectrum that reflects the auditory characteristics of the human ear.
Cepstral analysis of the Mel spectrum mainly consists of taking the logarithm and applying an inverse transform, which is usually implemented with a discrete cosine transform (DCT). Cepstral analysis of the Mel spectrum yields the Mel-frequency cepstral coefficients, which constitute the MEL spectrum cepstrum feature.
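As a concrete illustration of these steps only, the following minimal sketch extracts MFCC-style features with pre-emphasis, framing, Hamming windowing, FFT, a Mel filter bank, a logarithm and a DCT. It is an assumption for illustration, not code from the patent; the frame length, hop size, FFT size and filter counts are arbitrary example values.

# Minimal MFCC sketch: pre-emphasis, framing/windowing, FFT power spectrum,
# Mel filter bank, log, DCT. Parameters are illustrative assumptions.
import numpy as np
from scipy.fftpack import dct
import librosa

def extract_mfcc(y, sr, frame_len=400, hop=160, n_fft=512, n_mels=26, n_mfcc=13):
    # Pre-emphasis boosts high frequencies and improves the usable SNR
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])
    # Framing and Hamming windowing (assumes len(y) >= frame_len)
    n_frames = 1 + (len(y) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = y[idx] * np.hamming(frame_len)
    # FFT and power spectrum (square of the magnitude)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2            # (n_frames, n_fft//2 + 1)
    # Mel filter bank maps the linear spectrum onto the Mel scale
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel_spec = power @ mel_fb.T                                # (n_frames, n_mels)
    # Logarithm of the Mel spectrum; this quantity is also the FBanks feature
    log_mel = np.log(mel_spec + 1e-10)
    # DCT (cepstral analysis) yields the MFCC features
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_mfcc]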
Specifically, the step of extracting the Bottleneck feature in the voice information includes:
pre-emphasis, framing and windowing are carried out on the voice information;
converting the processed voice information through FFT, and obtaining a corresponding frequency spectrum after taking an absolute value or a square value;
inputting the corresponding frequency spectrum into a Mel filter bank, and obtaining the Mel frequency spectrum output by the Mel filter bank;
taking logarithm of the MEL spectrum to obtain FBanks characteristics;
inputting the FBanks characteristics into a DNN model, and extracting node excitation values of a Bottleneck layer in the DNN model to obtain Bottleneck characteristics.
The Bottleneck feature is the activation value of the Bottleneck-layer nodes in the DNN. In this embodiment the DNN is used as a feature extractor: its input is the FBanks feature, its training target is the speaker identity, and the Bottleneck feature is taken from the intermediate Bottleneck layer. The FBanks feature itself is obtained by taking the logarithm of the Mel spectrum, i.e., by computing the log energy of each Mel filter output, and corresponds to the quantity computed just before the DCT step of MFCC extraction.
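For illustration only, the sketch below shows one way such a bottleneck extractor could look: a speaker-classification network with a narrow hidden layer whose activations are kept as features. The use of PyTorch, the layer sizes and the 32-dimensional bottleneck are assumptions, not details taken from the patent.

# Sketch of a DNN bottleneck feature extractor trained for speaker classification.
import torch
import torch.nn as nn

class BottleneckDNN(nn.Module):
    def __init__(self, n_fbanks=26, n_speakers=100, bottleneck_dim=32):
        super().__init__()
        self.front = nn.Sequential(
            nn.Linear(n_fbanks, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
        )
        self.bottleneck = nn.Linear(512, bottleneck_dim)      # narrow hidden layer
        self.classifier = nn.Sequential(
            nn.ReLU(), nn.Linear(bottleneck_dim, n_speakers)  # speaker-identity output
        )

    def forward(self, x):
        z = self.bottleneck(self.front(x))    # bottleneck activations
        return self.classifier(z), z

# After training on (FBanks, speaker-id) pairs, only the bottleneck
# activations are kept as the Bottleneck features:
model = BottleneckDNN()
fbanks = torch.randn(200, 26)                 # 200 frames of 26-dim FBanks (dummy data)
_, bottleneck_feats = model(fbanks)           # shape (200, 32)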
The MEL spectrum cepstrum feature comprises multiple feature components (dimensions), and each dimension differs in its ability to discriminate between speakers. In some embodiments of the invention, the MEL spectrum cepstrum feature is therefore weighted according to the weight coefficient of each of its dimension feature components, which improves the characterization capacity and the discriminability of the feature.
Specifically, in one implementation of the present invention, the weight coefficient of each dimension feature component may be calculated according to the following formula:
[formula not reproduced in this text]
where r_p denotes the weight coefficient of the p-th dimension feature component and N denotes the total dimension of the MEL spectrum cepstrum feature.
It can be appreciated that, in another implementation of the present invention, the weight coefficient of each dimension feature component may also be determined according to the contribution of that dimension to the speaker recognition rate, so as to highlight the features that contribute most to recognition and improve the overall recognition rate. In a specific implementation, the contribution of each dimension feature component to the speaker recognition rate can be calculated with an increase/decrease component method according to the following formula:
[formula not reproduced in this text]
where the recognition-rate term in the formula is the recognition rate obtained with the MEL spectrum cepstrum feature components from dimension i to dimension j, and N is the total dimension of the MEL spectrum cepstrum feature. R(i) is the average contribution of the i-th dimension feature component to the recognition rate: a positive R(i) means that adding the feature component improves the recognition rate, and a negative R(i) means that adding it reduces the recognition rate.
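The patent's exact formula is given in the original filing. Purely as an illustration of the general increase/decrease idea, the sketch below estimates each dimension's contribution by comparing the recognition rate with and without that dimension; the evaluate() callback and the simple difference are assumptions, not the patent's formula.

# Hedged sketch: per-dimension contribution estimated by removing one MFCC
# dimension at a time and measuring the drop (or gain) in recognition rate.
import numpy as np

def contribution_per_dimension(evaluate, n_dims):
    """evaluate(dims) -> recognition rate obtained using only the listed MFCC dimensions."""
    full = list(range(n_dims))
    base = evaluate(full)                      # recognition rate with all dimensions
    R = np.zeros(n_dims)
    for i in range(n_dims):
        reduced = [d for d in full if d != i]
        R[i] = base - evaluate(reduced)        # > 0: dimension i helps; < 0: it hurts
    return R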
After the contribution of each dimension feature component to the recognition rate is obtained, the contributions are normalized, for example with the min-max normalization method. Specifically, the feature component with the largest contribution is assigned a weight of 1 and the feature component with the smallest contribution a weight of 0.5; the weight coefficients of the remaining dimension feature components of the MEL spectrum cepstrum feature are then obtained by min-max normalization, so that every weight coefficient lies within [0.5, 1].
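A minimal sketch of this min-max mapping, assuming the contributions are already available as a vector; the [0.5, 1] range follows the description above, while the function name and the equal-contribution fallback are assumptions.

# Map contributions to weight coefficients in [0.5, 1] via min-max normalization.
import numpy as np

def contributions_to_weights(R, lo=0.5, hi=1.0):
    R = np.asarray(R, dtype=float)
    span = R.max() - R.min()
    if span == 0:                    # all dimensions contribute equally
        return np.full_like(R, hi)
    return lo + (hi - lo) * (R - R.min()) / span

weights = contributions_to_weights([0.8, -0.1, 0.3])   # -> [1.0, 0.5, ~0.72]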
Furthermore, the calculated weight coefficients of the dimension feature components can be fitted with a Fourier series, so that the transition between adjacent weight coefficients is smoother.
The weighted MEL spectrum cepstrum feature is then fused with the Bottleneck feature, i.e., the two features are spliced (concatenated) along the vector dimension, yielding an acoustic feature that carries more feature information.
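A minimal sketch of the weighted fusion, assuming frame-aligned MFCC and Bottleneck matrices; the shapes and variable names are illustrative assumptions.

# Scale each MFCC dimension by its weight and concatenate with the Bottleneck
# features along the feature (vector) dimension, frame by frame.
import numpy as np

def fuse_features(mfcc, weights, bottleneck):
    # mfcc: (n_frames, n_mfcc), weights: (n_mfcc,), bottleneck: (n_frames, d_bn)
    weighted_mfcc = mfcc * np.asarray(weights)[None, :]
    return np.concatenate([weighted_mfcc, bottleneck], axis=1)   # (n_frames, n_mfcc + d_bn)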
Step S12, inputting the acoustic features into a voiceprint recognition model to identify the current voice personnel, and judging whether the current voice personnel are the personnel in a preset list according to the recognition result.
And step S13, when the current voice personnel are personnel in a preset list, the voice information is sent to a broadcasting terminal for playing.
The obtained acoustic features are input into the voiceprint recognition model and matched against the voiceprint feature data of different people, and the identity information of the current voice person is output. The voiceprint recognition model is, for example, a UBM/i-vector model which, after being trained on a data set in advance, can accurately recognize the identity of a voice person.
The preset personnel list records the identity information of a number of people, generally those who are allowed to use the broadcasting system. The identity information of the current voice person output by the voiceprint recognition model is compared with the preset personnel list to determine whether the current voice person is on it. If so, use of the broadcasting system is allowed, i.e., the voice information of the current voice person is sent to the broadcast terminal for playing.
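The patent names a UBM/i-vector model; as a hedged illustration of the whitelist check only, the sketch below scores a fixed-length voiceprint vector (e.g. an i-vector or embedding, not computed here) against enrolled voiceprints with cosine similarity and accepts the best match only if it is on the preset list. The threshold value and the cosine scoring are assumptions, not requirements of the patent.

# Whitelist check: identify the best-matching enrolled speaker, then verify
# that the identified person appears on the preset list.
import numpy as np

def identify_speaker(voiceprint, enrolled, threshold=0.7):
    """enrolled: dict mapping person-id -> enrolled voiceprint vector."""
    best_id, best_score = None, -1.0
    for person_id, ref in enrolled.items():
        score = np.dot(voiceprint, ref) / (np.linalg.norm(voiceprint) * np.linalg.norm(ref))
        if score > best_score:
            best_id, best_score = person_id, score
    return best_id if best_score >= threshold else None

def is_authorized(voiceprint, enrolled, preset_list):
    person = identify_speaker(voiceprint, enrolled)
    return person is not None and person in preset_list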
And S14, when the current voice personnel are not personnel in a preset list, extracting the content of the voice information, and analyzing to judge whether the content of the voice information meets the broadcasting requirement.
And step S15, when the content of the voice information meets the broadcasting requirement, the voice information is sent to a broadcasting terminal for playing.
If the current voice person is not a person in the preset personnel list, content recognition is performed on his or her voice information, and the content is analyzed to determine whether it meets the broadcasting requirement.
In a specific implementation, the content of the voice information is recognized with a speech recognition algorithm and matched against a sensitive-word database to judge whether it contains sensitive words. If it does, the voice information is determined not to meet the playing requirement: it is not played, and an early warning can be raised. If it does not, the content is judged to meet the requirement and is sent to the broadcast terminal for playing, as sketched below.
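The following illustration combines the content check with the overall dispatch decision. It is a sketch under stated assumptions: transcribe(), send_to_terminal() and raise_alert() are placeholder callbacks, and the authorized flag is the result of a whitelist check such as the is_authorized helper sketched above; none of these interfaces is defined by the patent.

# Content check and dispatch: whitelisted speakers are played directly;
# everyone else is played only when no sensitive word is found.
def meets_broadcast_requirement(audio, transcribe, sensitive_words):
    text = transcribe(audio)                      # ASR placeholder callback
    return not any(word in text for word in sensitive_words)

def handle_voice(audio, authorized, transcribe, sensitive_words,
                 send_to_terminal, raise_alert):
    if authorized:
        send_to_terminal(audio)                   # speaker on the preset list: play directly
    elif meets_broadcast_requirement(audio, transcribe, sensitive_words):
        send_to_terminal(audio)                   # content passes the sensitive-word check
    else:
        raise_alert(audio)                        # sensitive content: do not play, warn instead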
In this embodiment, acoustic features are extracted from the voice information collected by the microphone, the identity of the current voice person is recognized with a voiceprint recognition model, and whether the current voice person is a person in the preset list is judged according to the recognition result. If so, the voice information is played; if not, its content is analyzed to judge whether it meets the broadcasting requirement, and it is played only when it does. By recognizing the identity of the current voice person and checking the voice content, this embodiment standardizes the use of the broadcasting system.
Referring to fig. 3, a broadcast information playing device according to an embodiment of the invention includes:
the feature extraction module 31 is configured to obtain voice information collected by the microphone, and extract acoustic features in the voice information;
the identity recognition module 32 is configured to input the acoustic features into a voiceprint recognition model to identify a current voice person;
a first judging module 33, configured to judge whether the current voice person is a person in a preset list according to the recognition result;
a first sending module 34, configured to send the voice information to a broadcast terminal for playing when the current voice person is a person in a preset list;
a second judging module 35, configured to extract the content of the voice information and analyze the content to determine whether the content of the voice information meets the broadcasting requirement when the current voice person is not a person in the preset list;
and the second sending module 36 is configured to send the voice information to a broadcast terminal for playing when the content of the voice information meets the broadcasting requirement.
Further, in the broadcast information playing device, the feature extraction module 31 is specifically configured to:
extracting MEL spectrum cepstrum features and Bottleneck features in the voice information;
calculating the weight coefficient of each dimension characteristic component of the MEL spectrum cepstrum feature, and carrying out weighted calculation on the MEL spectrum cepstrum feature according to the weight coefficient of each dimension characteristic component;
and carrying out feature fusion on the MEL spectrum cepstrum feature and the Bottleneck feature after weighted calculation to obtain acoustic features in the voice information.
Further, in the broadcast information playing device, the step of calculating the weight coefficient of each dimension feature component of the MEL spectrum cepstrum feature includes:
calculating contribution degrees of each dimension characteristic component of the MEL spectrum cepstrum characteristic to the speaker identity recognition rate respectively;
carrying out standardization processing on the contribution degree of each dimension characteristic component to the speaker identity recognition rate by adopting a min-max standardization method;
and determining the weight coefficient of each dimension characteristic component according to the contribution degree after the normalization processing.
The broadcast information playing device provided by the embodiment of the present invention has the same implementation principle and technical effects as the foregoing method embodiment; for brevity, where the device embodiment is not described in detail, reference may be made to the corresponding content of the foregoing method embodiment.
In another aspect, referring to fig. 4, an electronic device according to an embodiment of the present invention includes a processor 10, a memory 20, and a computer program 30 stored in the memory and capable of running on the processor, where the broadcast information playing method described above is implemented when the processor 10 executes the computer program 30.
The electronic device may be, but is not limited to, a personal computer, a server, or another computer device. The processor 10 may in some embodiments be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor or another data processing chip, and is used to execute program code or process data stored in the memory 20.
The memory 20 includes at least one type of readable storage medium, such as flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a magnetic memory, a magnetic disk or an optical disk. The memory 20 may in some embodiments be an internal storage unit of the electronic device, such as a hard disk of the electronic device. In other embodiments the memory 20 may instead be an external storage device of the electronic device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash Card, or the like. Further, the memory 20 may include both an internal storage unit and an external storage device of the electronic device. The memory 20 may be used not only to store application software installed in the electronic device and various types of data, but also to temporarily store data that has been output or is to be output.
Optionally, the electronic device may further comprise a user interface, which may include a display (Display), an input unit such as a keyboard (Keyboard), a network interface and a communication bus; optionally, the user interface may also include a standard wired interface and a wireless interface. In some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display may also be referred to as a display screen or display unit and is used to display information processed in the electronic device and to present a visual user interface. The network interface may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface), and is typically used to establish a communication connection between the device and other electronic devices. The communication bus is used to realize connection and communication between these components.
It should be noted that the structure shown in fig. 4 does not constitute a limitation of the electronic device, and in other embodiments the electronic device may comprise fewer or more components than shown, or may combine certain components, or may have a different arrangement of components.
The present invention also proposes a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a broadcast information playback method as described above.
Those of skill in the art will appreciate that the logic and/or steps represented in the flow diagrams or otherwise described herein, for example an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus or device (such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus or device and execute them). For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber device, and a portable Compact Disc Read-Only Memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, since the program can be captured electronically, for instance by optical scanning of the paper or other medium, and then compiled, interpreted or otherwise processed in a suitable manner if necessary before being stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, the steps may be implemented using any one, or a combination, of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The foregoing examples illustrate only a few embodiments of the invention and are described in detail herein without thereby limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.

Claims (10)

1. A broadcast information playing method, comprising:
acquiring voice information acquired by a microphone, and extracting acoustic features in the voice information;
inputting the acoustic features into a voiceprint recognition model to identify the current voice personnel;
judging whether the current voice personnel are personnel in a preset list according to the identification result;
when the current voice personnel are personnel in a preset list, sending the voice information to a broadcasting terminal for playing;
when the current voice personnel are not personnel in a preset list, extracting the content of the voice information, and analyzing to judge whether the content of the voice information meets the broadcasting requirement;
and when the content of the voice information meets the broadcasting requirement, sending the voice information to a broadcasting terminal for broadcasting.
2. The broadcast information playing method of claim 1, wherein the step of extracting acoustic features in the voice information comprises:
extracting MEL spectrum cepstrum features and Bottleneck features in the voice information;
calculating the weight coefficient of each dimension characteristic component of the MEL spectrum cepstrum feature, and carrying out weighted calculation on the MEL spectrum cepstrum feature according to the weight coefficient of each dimension characteristic component;
and carrying out feature fusion on the MEL spectrum cepstrum feature and the Bottleneck feature after weighted calculation to obtain acoustic features in the voice information.
3. The broadcast information playback method of claim 2, wherein the step of calculating weight coefficients for each dimensional feature component of the MEL spectral cepstrum feature comprises:
calculating contribution degrees of each dimension characteristic component of the MEL spectrum cepstrum characteristic to the speaker identity recognition rate respectively;
carrying out standardization processing on the contribution degree of each dimension characteristic component to the speaker identity recognition rate by adopting a min-max standardization method;
and determining the weight coefficient of each dimension characteristic component according to the contribution degree after the normalization processing.
4. The broadcast information playing method of claim 2, wherein the step of extracting the Bottleneck feature in the voice information includes:
pre-emphasis, framing and windowing are carried out on the voice information;
converting the processed voice information through FFT, and obtaining a corresponding frequency spectrum after taking an absolute value or a square value;
inputting the corresponding frequency spectrum into a Mel filter bank, and obtaining the Mel frequency spectrum output by the Mel filter bank;
taking logarithm of the MEL spectrum to obtain FBanks characteristics;
inputting the FBanks characteristics into a DNN model, and extracting node excitation values of a Bottleneck layer in the DNN model to obtain Bottleneck characteristics.
5. The broadcast information playing method of claim 1, wherein the step of extracting the contents of the voice information and analyzing to determine whether the contents of the voice information meet broadcasting requirements comprises:
identifying the content in the voice information through a voice identification algorithm, and matching with a sensitive word database;
judging whether the voice information contains sensitive words or not according to the matching result;
if not, determining that the voice information meets the broadcasting requirement.
6. A broadcast information playback apparatus, comprising:
the feature extraction module is used for acquiring voice information acquired by the microphone and extracting acoustic features in the voice information;
the identity recognition module is used for inputting the acoustic characteristics into a voiceprint recognition model so as to recognize the identity of the current voice personnel;
the first judging module is used for judging whether the current voice personnel are personnel in a preset list or not according to the identification result;
the first sending module is used for sending the voice information to a broadcasting terminal for playing when the current voice personnel are personnel in a preset list;
the second judging module is used for extracting the content of the voice information when the current voice personnel are not personnel in a preset list and analyzing the content to judge whether the content of the voice information meets the broadcasting requirement or not;
and the second sending module is used for sending the voice information to a broadcasting terminal for playing when the content of the voice information meets the broadcasting requirement.
7. The broadcast information playback apparatus of claim 6, wherein the feature extraction module is specifically configured to:
extracting MEL spectrum cepstrum features and Bottleneck features in the voice information;
calculating the weight coefficient of each dimension characteristic component of the MEL spectrum cepstrum feature, and carrying out weighted calculation on the MEL spectrum cepstrum feature according to the weight coefficient of each dimension characteristic component;
and carrying out feature fusion on the MEL spectrum cepstrum feature and the Bottleneck feature after weighted calculation to obtain acoustic features in the voice information.
8. The broadcast information playback apparatus of claim 7, wherein the step of calculating a weight coefficient for each dimensional feature component of the MEL spectral cepstrum feature comprises:
calculating contribution degrees of each dimension characteristic component of the MEL spectrum cepstrum characteristic to the speaker identity recognition rate respectively;
carrying out standardization processing on the contribution degree of each dimension characteristic component to the speaker identity recognition rate by adopting a min-max standardization method;
and determining the weight coefficient of each dimension characteristic component according to the contribution degree after the normalization processing.
9. A readable storage medium having stored thereon a computer program, wherein the program when executed by a processor implements the broadcast information playing method according to any one of claims 1 to 5.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the broadcast information playing method according to any one of claims 1 to 5 when executing the computer program.
CN202310202075.7A 2023-03-06 2023-03-06 Broadcast information playing method and device, readable storage medium and electronic equipment Active CN116072123B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310202075.7A CN116072123B (en) 2023-03-06 2023-03-06 Broadcast information playing method and device, readable storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310202075.7A CN116072123B (en) 2023-03-06 2023-03-06 Broadcast information playing method and device, readable storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN116072123A true CN116072123A (en) 2023-05-05
CN116072123B CN116072123B (en) 2023-06-23

Family

ID=86174992

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310202075.7A Active CN116072123B (en) 2023-03-06 2023-03-06 Broadcast information playing method and device, readable storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN116072123B (en)

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103458412A (en) * 2012-06-04 2013-12-18 百度在线网络技术(北京)有限公司 System and method for preventing phone fraud, mobile terminal and cloud terminal analysis server
CN106101819A (en) * 2016-06-21 2016-11-09 武汉斗鱼网络科技有限公司 A kind of live video sensitive content filter method based on speech recognition and device
CN106856541A (en) * 2016-11-30 2017-06-16 努比亚技术有限公司 A kind of terminal and method for secret protection
US20200005773A1 (en) * 2017-11-28 2020-01-02 International Business Machines Corporation Filtering data in an audio stream
CN108447490A (en) * 2018-02-12 2018-08-24 阿里巴巴集团控股有限公司 The method and device of Application on Voiceprint Recognition based on Memorability bottleneck characteristic
CN109039509A (en) * 2018-07-16 2018-12-18 广州辉群智能科技有限公司 A kind of method and broadcasting equipment of voice control broadcasting equipment
CN110827792A (en) * 2019-11-15 2020-02-21 广州视源电子科技股份有限公司 Voice broadcasting method and device
US20220198468A1 (en) * 2020-02-18 2022-06-23 Tencent Technology (Shenzhen) Company Limited Speech information communication management method and apparatus, storage medium, and device
CN111768789A (en) * 2020-08-03 2020-10-13 上海依图信息技术有限公司 Electronic equipment and method, device and medium for determining identity of voice sender thereof
US20220059077A1 (en) * 2020-08-19 2022-02-24 Sorenson Ip Holdings, Llc Training speech recognition systems using word sequences
US11165911B1 (en) * 2020-08-26 2021-11-02 Stereo App Limited Complex computing network for improving establishment and broadcasting of audio communication among mobile computing devices and for improving speaker-listener engagement using audio conversation control
CN113766256A (en) * 2021-02-09 2021-12-07 北京沃东天骏信息技术有限公司 Live broadcast wind control method and device
CN113315994A (en) * 2021-04-23 2021-08-27 北京达佳互联信息技术有限公司 Live broadcast data processing method and device, electronic equipment and storage medium
CN115346532A (en) * 2021-05-11 2022-11-15 中国移动通信集团有限公司 Optimization method of voiceprint recognition system, terminal device and storage medium
CN114613368A (en) * 2022-03-08 2022-06-10 广州国音智能科技有限公司 Cloud server, identity authentication method and system based on multiple devices
CN115240652A (en) * 2022-06-02 2022-10-25 福建新大陆通信科技股份有限公司 Emergency broadcast sensitive word recognition method
CN115719595A (en) * 2022-10-12 2023-02-28 厦门快商通科技股份有限公司 Voiceprint recognition method and system for purifying network environment during live broadcast

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Michał Raczyński: "Speech processing algorithm for isolated words recognition", 2018 IEEE *
Ding Senhua; Liu Chunjiang; Zhang Naiguang; Ma Yan: "Design of an emergency broadcasting system based on digital television" (一种基于数字电视的应急广播系统设计), Video Engineering (电视技术), no. 01 *

Also Published As

Publication number Publication date
CN116072123B (en) 2023-06-23

Similar Documents

Publication Publication Date Title
WO2021208287A1 (en) Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
WO2018166187A1 (en) Server, identity verification method and system, and a computer-readable storage medium
WO2018149077A1 (en) Voiceprint recognition method, device, storage medium, and background server
CN103475490B (en) A kind of auth method and device
EP1210711B1 (en) Sound source classification
JP2021527840A (en) Voiceprint identification methods, model training methods, servers, and computer programs
US20150112682A1 (en) Method for verifying the identity of a speaker and related computer readable medium and computer
US20070299671A1 (en) Method and apparatus for analysing sound- converting sound into information
JP2014502375A (en) Passphrase modeling device and method for speaker verification, and speaker verification system
CN109036382A (en) A kind of audio feature extraction methods based on KL divergence
WO2021151310A1 (en) Voice call noise cancellation method, apparatus, electronic device, and storage medium
CN112328994A (en) Voiceprint data processing method and device, electronic equipment and storage medium
CN113330511B (en) Voice recognition method, voice recognition device, storage medium and electronic equipment
US9947323B2 (en) Synthetic oversampling to enhance speaker identification or verification
CN110473552A (en) Speech recognition authentication method and system
CN109272991A (en) Method, apparatus, equipment and the computer readable storage medium of interactive voice
EP4233047A1 (en) Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium
CN111179940A (en) Voice recognition method and device and computing equipment
JP4717872B2 (en) Speaker information acquisition system and method using voice feature information of speaker
CN109545226A (en) A kind of audio recognition method, equipment and computer readable storage medium
Goh et al. Robust computer voice recognition using improved MFCC algorithm
CN111145761B (en) Model training method, voiceprint confirmation method, system, device and medium
CN116072123B (en) Broadcast information playing method and device, readable storage medium and electronic equipment
CN111429919A (en) Anti-sound crosstalk method based on conference recording system, electronic device and storage medium
JP6996627B2 (en) Information processing equipment, control methods, and programs

Legal Events

Code - Description
PB01 - Publication
SE01 - Entry into force of request for substantive examination
GR01 - Patent grant