WO2020233068A1 - Conference audio control method, system, device and computer readable storage medium - Google Patents


Info

Publication number
WO2020233068A1
Authority
WO
WIPO (PCT)
Prior art keywords
conference, audio, keywords, preset, word
Prior art date
Application number
PCT/CN2019/121711
Other languages
French (fr)
Chinese (zh)
Inventor
齐燕
Original Assignee
深圳壹账通智能科技有限公司
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2020233068A1

Classifications

    • G06F40/279 Handling natural language data; natural language analysis; recognition of textual entities
    • G06V40/161 Recognition of human faces in image or video data; detection, localisation, normalisation
    • G06V40/172 Recognition of human faces in image or video data; classification, e.g. identification
    • G10L15/26 Speech recognition; speech to text systems
    • G10L25/78 Speech or voice analysis; detection of presence or absence of voice signals
    • H04L12/1813 Arrangements for providing special services to substations for broadcast or conference, e.g. for computer conferences such as chat rooms

Abstract

The present application provides a conference audio control method, system, and device based on voice detection technology, as well as a computer-readable storage medium. The method includes: receiving conference audio, performing voice detection on the conference audio, and determining whether the conference audio contains user voice; if the conference audio contains user voice, extracting the user voice from the conference audio and converting the user voice into text data; and comparing the text data against preset conference keywords, and determining whether to output the conference audio according to the matching result of the text data and the conference keywords. The present application can automatically mute users who are not speaking, reduce manual operations, and improve efficiency.

Description

Conference audio control method, system, device, and computer-readable storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on May 21, 2019, with application number 201910432253.9 and invention title "Conference audio control method, system, device, and computer-readable storage medium", the entire contents of which are incorporated herein by reference.
Technical field
This application relates to the technical field of conference audio control, and in particular to a conference audio control method, system, device, and computer-readable storage medium.
Background
In current multi-party conference systems with multiple connected parties, whether each participant's audio is open usually has to be controlled manually. This requires a conference initiator to constantly watch for whoever is speaking and open that party's microphone. Such operation demands extensive manual control, offers a low degree of automation, and makes meetings inefficient.
Summary
The main purpose of this application is to provide a conference audio control method, system, device, and computer-readable storage medium, aiming to solve the technical problem that existing conference audio control systems have a low degree of intelligence.
To achieve the above objective, this application provides a conference audio control method, which includes the following steps:
receiving conference audio, performing voice detection on the conference audio, and determining whether the conference audio contains user voice;
if the conference audio contains user voice, extracting the user voice from the conference audio and converting the user voice into text data;
comparing the text data against preset conference keywords, and determining whether to output the conference audio according to the matching result of the text data and the conference keywords;
wherein the step of performing voice detection on the conference audio and determining whether the conference audio contains user voice includes:
extracting audio frames from the conference audio and obtaining the signal energy of the audio frames;
outputting a user mute prompt, collecting the background noise while no user voice is present, and obtaining the background noise energy;
calculating the preset energy threshold from the background noise energy using a preset threshold formula: E_rnew = (1 - p) * E_rold + p * E_silence, where E_rnew is the new threshold, E_rold is the old threshold, E_silence is the background noise energy, and p is a weighting value satisfying 0 < p < 1;
comparing the signal energy of the audio frame with the preset energy threshold; and
if the signal energy of the audio frame is greater than the preset energy threshold, determining that the audio frame is a speech frame.
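The per-frame decision in the steps above can be sketched as follows. This is a minimal illustration assuming a simple sum-of-squares energy measure; the function and variable names are illustrative, not taken from the application.

```python
def frame_energy(frame):
    """Sum-of-squares energy of one audio frame (a list of samples)."""
    return sum(s * s for s in frame)

def is_speech_frame(frame, energy_threshold):
    """A frame whose signal energy exceeds the preset threshold is a speech frame."""
    return frame_energy(frame) > energy_threshold

# A quiet "background noise" frame versus a louder "speech" frame
noise_frame = [0.01, -0.02, 0.01, 0.0]
speech_frame = [0.5, -0.6, 0.4, -0.5]
print(is_speech_frame(noise_frame, 0.01))   # False
print(is_speech_frame(speech_frame, 0.01))  # True
```

In practice the threshold would come from the background-noise calibration described in the same steps.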
In addition, to achieve the above objective, this application also provides a conference audio control system, which includes:
a voice detection module, which receives conference audio, performs voice detection on the conference audio, and determines whether the conference audio contains user voice;
a text conversion module, which, if the conference audio contains user voice, extracts the user voice from the conference audio and converts the user voice into text data; and
a matching output module, which compares the text data against preset conference keywords and determines whether to output the conference audio according to the matching result of the text data and the conference keywords.
The voice detection module is further configured to extract audio frames from the conference audio and obtain the signal energy of the audio frames; compare the signal energy of each audio frame with the preset energy threshold; and, if the signal energy of an audio frame is greater than the preset energy threshold, determine that the audio frame is a speech frame.
The voice detection module is further configured to output a user mute prompt, collect the background noise while no user voice is present, and obtain the background noise energy; and to calculate the preset energy threshold from the background noise energy using a preset threshold formula: E_rnew = (1 - p) * E_rold + p * E_silence, where E_rnew is the new threshold, E_rold is the old threshold, E_silence is the background noise energy, and p is a weighting value satisfying 0 < p < 1.
In addition, to achieve the above objective, this application also provides a conference audio control device, which includes a processor, a memory, and computer-readable instructions stored on the memory and executable by the processor, where the computer-readable instructions, when executed by the processor, implement the steps of the conference audio control method described above.
In addition, to achieve the above objective, this application also provides a computer-readable storage medium on which computer-readable instructions are stored, where the computer-readable instructions, when executed by a processor, implement the steps of the conference audio control method described above.
The details of one or more embodiments of this application are set forth in the accompanying drawings and the description below. Other features and advantages of this application will become apparent from the description, the drawings, and the claims.
Brief description of the drawings
FIG. 1 is a schematic structural diagram of the conference audio control device in the hardware operating environment involved in the embodiments of this application;
FIG. 2 is a schematic flowchart of an embodiment of the conference audio control method of this application;
FIG. 3 is a schematic diagram of the functional modules of an embodiment of the conference audio control system of this application.
The realization of the objectives, functional characteristics, and advantages of this application will be further described in conjunction with the embodiments and with reference to the accompanying drawings.
Detailed description
It should be understood that the specific embodiments described here are only used to explain this application and are not intended to limit it.
Please refer to FIG. 1, which is a schematic diagram of the hardware structure of the conference audio control device provided by this application.
The conference audio control device may be a PC, or a smart phone, tablet computer, portable computer, desktop computer, or similar device. Conference members participate in the conference through the conference audio control device, on which an audio/video capture apparatus may be installed, or to which external audio/video capture equipment may be connected. The conference audio control device may also be equipped with a display apparatus and an audio output apparatus for displaying the conference video and outputting the conference audio. Optionally, the conference audio control device may instead be a server device that connects conference terminals distributed at different addresses, receives the conference audio sent by the conference terminals, and outputs to the conference terminals the conference audio found suitable for output after analysis.
The conference audio control device may include components such as a processor 101 and a memory 201. In the conference audio control device, the processor 101 is connected to the memory 201, on which computer-readable instructions are stored; the processor 101 can call the computer-readable instructions stored in the memory 201 and implement the steps of the embodiments of the conference audio control method described below.
Those skilled in the art can understand that the structure of the conference audio control device shown in FIG. 1 does not constitute a limitation on the device, which may include more or fewer components than shown, a combination of certain components, or a different arrangement of components.
Based on the above structure, the following embodiments of the conference audio control method of this application are proposed.
This application provides a conference audio control method.
Referring to FIG. 2, FIG. 2 is a schematic flowchart of a first embodiment of the conference audio control method of this application.
In this embodiment, the conference audio control method includes the following steps.
Step S10: receive conference audio, perform voice detection on the conference audio, and determine whether the conference audio contains user voice.
As noted above, the conference audio control device may be a conference terminal device, i.e., the terminal device a conference member uses to participate in the conference. For example, when a conference member joins a corporate department meeting through a smart phone, that smart phone is the conference audio control device. The conference audio control device may also be a server device, i.e., a device that processes conference data remotely, where processing conference data may refer to transmitting the conference audio from one conference member to the terminal devices of the other members. For example, server device H is connected to conference members A, B, and C, who participate in the conference through three different conference terminal devices a, b, and c; device a transmits member A's audio to server device H, which then transmits it to conference terminal devices b and c.
In the explanation of the embodiments of the conference audio control method of this application, a conference terminal device is used as the conference audio control device by way of example, and hereinafter the conference audio control device may be referred to simply as the device.
In one embodiment, the conference audio refers to conference audio collected locally: an audio capture apparatus (recording apparatus) on the device, or external audio capture equipment connected to it, collects the audio signal of the space it is in and transmits it to the device, i.e., the device receives the local conference audio. For example, conference member A participates in the conference through device a; recording equipment L connected to device a collects the audio signal of the space where member A is located and transmits it to device a; the audio signal collected by recording equipment L is the conference audio in this embodiment. In this embodiment, before the transmittable conference audio is output directly, or indirectly through the server, to the terminals of the other conference members, the conference audio is analyzed and processed locally (analysis and processing refers to operations such as voice detection and text keyword detection), rather than outputting the collected conference audio over the network directly or indirectly to the other members' terminals. This avoids unnecessary network transmission of audio that does not need to be output to the other members, saves network bandwidth, increases the conference data transmission rate, and thus improves the real-time performance of conference data transmission.
In another embodiment, the conference audio refers to conference audio of other conference members transmitted remotely by the server to this device. For example, server device H is connected to conference members A, B, and C, who participate in the conference through three different devices a, b, and c; device a transmits member A's audio to server device H, which then transmits it to devices b and c; the audio of member A received by devices b and c is the conference audio in this embodiment. After the device receives the conference audio of other members remotely transmitted by the server, it performs voice detection, text keyword detection, and other processing and judgment operations on the received audio, and then determines whether or not to output it.
Performing voice detection on the conference audio means detecting whether user voice is present in it, which can be analyzed on the basis of differences in audio signal energy. The signal-to-noise ratio in a conference scene is usually high, so the audio energy corresponding to voice is high while that of background noise is low; by analyzing the energy distribution of the conference audio, it can be detected whether voice is present and how voice and noise are distributed. If the conference audio does not contain user voice, no subsequent operation is performed on it and it is not output.
Step S20: if the conference audio contains user voice, extract the user voice from the conference audio and convert the user voice into text data.
Since the background noise may also contain other people's voices, or the conference audio may contain speech unrelated to the content of the conference, this embodiment also filters noise by text content in order to obtain transmission audio with less noise and a better conference experience.
A speech-to-text operation may be performed on conference audio of a preset length to determine whether the speech is related to the conference; if not, it is likely background noise or some other sound that does not need to be transmitted, and the corresponding conference audio may not be transmitted. Optionally, the user voice segments in the conference audio may first be extracted by analyzing changes in the energy of the audio signal: a voice energy threshold corresponding to speech can be obtained, the audio signal energy at each moment is compared with this threshold, the audio segments whose signal energy is greater than or equal to the threshold are identified, and those segments are taken as the user voice segments. Next, the user voice segments are converted into text to obtain the corresponding text data. Finally, the text data corresponding to the user voice segments is compared with the preset conference keywords to determine whether the voice segments are related to the conference.
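The segment-extraction step just described, keeping the maximal runs whose energy is at or above the voice energy threshold, can be sketched as follows. This is a simplified illustration over a sequence of per-moment energy values; all names are illustrative.

```python
def extract_voice_segments(energies, voice_threshold):
    """Return (start, end) index pairs (end exclusive) of maximal runs
    whose audio signal energy is >= the voice energy threshold."""
    segments, start = [], None
    for i, e in enumerate(energies):
        if e >= voice_threshold and start is None:
            start = i                      # a voice segment begins here
        elif e < voice_threshold and start is not None:
            segments.append((start, i))    # the segment just ended
            start = None
    if start is not None:                  # a segment running to the end
        segments.append((start, len(energies)))
    return segments

# Per-moment energies with two bursts above the threshold of 1.0
print(extract_voice_segments([0.1, 2.0, 3.0, 0.2, 0.1, 4.0], 1.0))  # [(1, 3), (5, 6)]
```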
Converting a user voice segment into text data includes: dividing the segment into speech frames and extracting the acoustic feature of each frame, where the acoustic features may be MFCC (Mel-Frequency Cepstral Coefficients) features; inputting the acoustic features of each speech frame into an acoustic model, which outputs phonemes, where the acoustic model may be a hidden Markov model, a deep learning model, or a hybrid of the two; and combining the phonemes output by the acoustic model into text words, i.e., the text data corresponding to the user voice segment.
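The conversion pipeline just described (frames, acoustic features, acoustic model, phonemes, words) can be outlined as below. This is only a toy illustration: the mean-absolute-amplitude "feature" and the lookup-table "model" stand in for the MFCC features and the HMM or deep-learning acoustic model named in the text, and all names are hypothetical.

```python
def frame_signal(signal, frame_len):
    """Divide a voice segment into fixed-length speech frames."""
    return [signal[i:i + frame_len] for i in range(0, len(signal), frame_len)]

def toy_feature(frame):
    """Stand-in for MFCC extraction: mean absolute amplitude, rounded."""
    return round(sum(abs(s) for s in frame) / len(frame), 1)

def speech_to_text(signal, frame_len, acoustic_model, lexicon):
    """Map per-frame features to phonemes, then the phoneme sequence to a word."""
    phonemes = tuple(acoustic_model[toy_feature(f)]
                     for f in frame_signal(signal, frame_len))
    return lexicon.get(phonemes, "")

acoustic_model = {0.1: "h", 0.5: "i"}   # feature -> phoneme (toy lookup table)
lexicon = {("h", "i"): "hi"}            # phoneme sequence -> word
signal = [0.1, -0.1, 0.5, -0.5]
print(speech_to_text(signal, 2, acoustic_model, lexicon))  # hi
```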
Step S30: compare the text data against the preset conference keywords, and determine whether to output the conference audio according to the matching result of the text data and the conference keywords.
The text data is compared and matched against the preset conference keywords to determine whether the user voice segment is related to the conference, and hence whether it is necessary to output the conference audio.
The preset conference keywords may be pre-stored at a preset address locally or on the server. A keyword library may be preset, storing keyword sets corresponding to conferences on different topics; a conference member can select one or more target conference topics, which in turn determine the corresponding conference keywords. Optionally, conference keywords may also be entered or designated by a conference member with special permissions. In each conference, after the conference keywords are obtained for the first time, they are cached for quick retrieval and use in the subsequent audio control steps of that conference.
Since the text data consists of multiple words, it can be segmented into text words, and each text word is checked against the preset conference keywords for identity or similarity of meaning; if a text word is identical or similar in meaning to a preset conference keyword, that text word is successfully matched.
In one embodiment, as long as the text data contains any text word that successfully matches a preset conference keyword, the text data as a whole matches, i.e., the user voice segment is related to the conference and it is necessary to output the conference audio. In another embodiment, the text data matches only when the proportion of its words that match the conference keywords exceeds a preset value. For example, with a preset value of 1/50, suppose segmenting the text data yields 25 text words, 5 of which successfully match the preset conference keywords; the matching proportion is then 5/25 = 1/5 > 1/50, so the text data successfully matches the preset conference keywords.
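The proportion-based rule in the example above might be sketched as follows. Only exact word matching is shown; the similar-meaning check mentioned in the text is omitted, and the names are illustrative.

```python
def matches_conference(text_words, conference_keywords, preset_value=1 / 50):
    """Match when the share of text words hitting a preset conference
    keyword is greater than the preset value."""
    hits = sum(1 for w in text_words if w in conference_keywords)
    return hits / len(text_words) > preset_value

# The worked example: 5 matching words out of 25 -> 1/5 > 1/50
text_words = ["budget"] * 5 + ["the"] * 20
print(matches_conference(text_words, {"budget"}))  # True
```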
The text data is matched against the conference keywords, and the matching result determines whether the voice content in the conference audio is related to the conference: if related, the conference audio is output; if not, it is not output. In one embodiment, the device receives the local conference audio and, after the voice detection and text conversion steps of this embodiment, determines that the conference audio can be output; output here means transmitting the conference audio, directly or indirectly, to the terminals of the other conference members. In another embodiment, the conference audio is that of other conference members transmitted remotely by the server to this device; after it reaches the device and passes the voice detection and text conversion steps of this embodiment, the conference audio is determined to be outputtable, and output here means playing the conference audio on the local conference terminal.
In this embodiment, by receiving the conference audio, performing voice detection on it, and determining whether it contains user voice, noise that contains no user voice is prevented from being output, users who are not speaking can be muted automatically, background noise is removed, manual operations are reduced, and conference efficiency is improved. If the conference audio contains user voice, the user voice is extracted and converted into text data; the text data is compared against the preset conference keywords, and whether to output the conference audio is determined by the matching result, so that conference audio unrelated to the conference can be filtered out according to the voice content, reducing noise interference and wasted network bandwidth.
Further, based on the above embodiment, in a second embodiment of this application, the step in step S10 of performing voice detection on the conference audio and determining whether the conference audio contains user voice includes:
Step S11: extract audio frames from the conference audio and obtain the signal energy of the audio frames.
The conference audio may be divided into audio frames according to a preset sampling time of 2.5 ms to 60 ms, meaning that the amount of data in one 2.5 ms to 60 ms unit is taken as one audio frame. A piece of conference audio may be divided into multiple audio frames, and the subsequent energy comparison is performed on individual audio frames. The audio frames may be extracted from the conference audio sequentially in time order.
As for the signal energy of an audio frame, the sound energy at a location can be expressed as the average energy flowing through a unit area of the medium there per unit time, given by (P * w^2 * u * A^2) / 2, where P is the medium density, w the sound frequency, A the amplitude, and u the wave speed.
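The stated energy expression can be evaluated directly; a small sketch, with parameter values that are illustrative only:

```python
def sound_energy(density, frequency, wave_speed, amplitude):
    """Average energy flux per the formula (P * w^2 * u * A^2) / 2,
    with P the medium density, w the sound frequency, u the wave speed,
    and A the amplitude."""
    return density * frequency ** 2 * wave_speed * amplitude ** 2 / 2

print(sound_energy(2.0, 3.0, 4.0, 5.0))  # 900.0
```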
步骤S12,将所述音频帧的信号能量与预置的能量阈值进行大小比较;Step S12, comparing the signal energy of the audio frame with a preset energy threshold;
步骤S13,若所述音频帧的信号能量大于预置的能量阈值,则判定所述音频帧为语音帧。Step S13: If the signal energy of the audio frame is greater than a preset energy threshold, it is determined that the audio frame is a speech frame.
预置的能量阈值,指预先经实验确定的阈值,也可以是经验值,大于 该预置的能量阈值,则对应音频帧能量较高,该音频帧为语音帧,小于该预置的能量阈值,则对应音频帧能量较低,该音频帧为非语音帧。The preset energy threshold refers to the threshold determined by experiments in advance, or it can be an empirical value. If the energy threshold is greater than the preset energy threshold, the corresponding audio frame has a higher energy, and the audio frame is a speech frame, which is less than the preset energy threshold , The corresponding audio frame has lower energy, and the audio frame is a non-speech frame.
将音频帧的信号能量与预置的能量阈值进行大小比较,并根据大小比较结果分别对从会议音频中提取的所有音频帧进行语音帧与非语音帧的判定。The signal energy of the audio frame is compared with the preset energy threshold, and the speech frame and non-speech frame are judged respectively on all audio frames extracted from the conference audio according to the size comparison result.
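The frame-extraction and energy-comparison logic of steps S11–S13 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it assumes mono PCM samples at 16 kHz and uses the mean squared amplitude of a frame as its signal energy, a common practical stand-in for the physical energy formula above.

```python
def split_frames(samples, sample_rate=16000, frame_ms=20):
    """Split audio samples into frames of frame_ms milliseconds (2.5-60 ms range)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def frame_energy(frame):
    """Mean squared amplitude of one frame (proxy for signal energy)."""
    return sum(s * s for s in frame) / len(frame)

def classify_frames(samples, energy_threshold, sample_rate=16000, frame_ms=20):
    """Label each frame as a speech frame (True) or non-speech frame (False)."""
    return [frame_energy(f) > energy_threshold
            for f in split_frames(samples, sample_rate, frame_ms)]
```

Only frames flagged `True` would proceed to the speech-to-text stage described later.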
Optionally, before step S12, the method includes:
Step S14: output a user mute prompt, collect the background noise in a no-user-speech state, and obtain the background noise energy.
Before the conference starts, or at its beginning, the background noise energy can be collected from the conference audio in the no-user-speech state, and the corresponding preset energy threshold can then be calculated.
The user mute prompt is a prompt reminding conference members to remain silent and not speak; it may be output in voice or text form. Optionally, the mute prompt may specify how long to remain silent, such as "Please remain silent for 5 seconds", and a countdown may be output to remind conference members. Optionally, the mute prompt may be maintained until the background noise in the no-user-speech state has been collected. The no-user-speech state is the period, after the mute prompt is output, during which users are expected to remain silent. Optionally, to prevent user speech from being included in the background noise because a conference member failed to stay silent after the prompt, the audio captured in this state may be collected and subjected to voice detection; if speech is present, the user mute prompt is output again, and the background noise and its energy are re-collected.
Step S15: calculate the preset energy threshold based on the background noise energy and a preset threshold formula, the threshold formula being E_rnew = (1-p)·E_rold + p·E_silence, where E_rnew is the new threshold, E_rold is the old threshold, E_silence is the background noise energy, and p is a weighting value satisfying 0 < p < 1.
After the background noise energy is obtained, the preset energy threshold can be calculated from the background noise energy and the preset threshold formula. The threshold formula is stored at a preset address and only needs to be fetched from there when the threshold must be calculated; the calculated threshold may likewise be stored at a fixed address, from which it can be read directly whenever a speech determination is required, enabling fast voice detection.
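The threshold update of step S15 can be sketched directly from the formula. A minimal sketch, assuming the returned value simply replaces the stored (old) threshold each time background noise is re-collected:

```python
def update_energy_threshold(old_threshold, silence_energy, p=0.2):
    """E_rnew = (1 - p) * E_rold + p * E_silence, with weighting value 0 < p < 1."""
    if not 0 < p < 1:
        raise ValueError("weighting value p must satisfy 0 < p < 1")
    return (1 - p) * old_threshold + p * silence_energy
```

A small p keeps the threshold stable across re-calibrations; a larger p lets it track changes in the room's noise floor more quickly. The default p=0.2 here is illustrative; the patent only requires 0 < p < 1.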
In this embodiment, audio frames are extracted from the conference audio and their signal energy is obtained; the signal energy of each audio frame is compared with the preset energy threshold; and if the signal energy of an audio frame is greater than the preset energy threshold, the audio frame is determined to be a speech frame. Meanwhile, the preset energy threshold is calculated from the background noise energy in the no-user-speech state using the preset threshold formula. This makes it possible to determine whether an audio frame is a speech frame, and thereby to decide whether to perform the subsequent speech-to-text and output operations.
Further, based on the foregoing embodiments, in the third embodiment of the present application, before step S30 the method includes:
Step S31: obtain pre-stored conference materials, obtain a target text set based on the conference materials, perform word segmentation on the target texts in the target text set, and obtain the segmented target words.
Conference materials are the graphic, text, audio, and video materials related to the conference. They may be uploaded by conference members and stored at a preset materials address, or corresponding conference materials may be pre-stored for different conference topics.
Obtaining a target text set based on the conference materials means performing image-to-text and audio-to-text conversion on the graphic and audio/video materials in the conference materials to obtain their corresponding texts, which serve as the target text set for keyword extraction. All target texts in the set are then segmented, and the words obtained after segmentation are taken as the target words. Before the audio materials in the conference materials are converted into text data, they may be subjected to "noise reduction" processing; after meaningless filler words are removed from the text data, the text data is segmented.
Step S32: obtain the word features of the target words, and calculate a weight value for each target word based on its word features, where the word features include at least part of speech, word position, and word frequency.
Word features are extracted for each target word; the features include at least part of speech, word position, and word frequency. When extracting the part-of-speech feature of a target word, the word is compared with the words in different part-of-speech lexicons to determine which lexicon it belongs to; the part of speech corresponding to that lexicon is the part of speech of the target word. When extracting the word-position feature, the position of the target word within its text is obtained, which may be the title, first paragraph, last paragraph, first sentence, last sentence, and so on. When extracting the word-frequency feature, the total number of occurrences of the target word in the target text set and the total number of occurrences in its own text are counted.
Different parts of speech, word positions, and word frequencies correspond to different sub-weight values, which can be assigned in advance. Specifically, for part of speech, corresponding sub-weight values can be preset for each category: for example, nouns and verbs may have a sub-weight of 0.8, adjectives/adverbs 0.5, and other parts of speech 0.
For word position, a coefficient must be preset for each position, identifying how strongly words in different positions reflect the topic. A word appearing in the title reflects the topic better than one appearing elsewhere in the article (such as the beginning of a paragraph, the body, or the end of a paragraph), and a word at the beginning of a paragraph reflects the topic better than one at the end of a paragraph; words in the body carry the smallest weight. For example, if the title is assigned a coefficient of 0.8, the beginning of a paragraph 0.6, the end of a paragraph 0.5, and the body 0.2, then for a given word its position sub-weight value Y is:
Y = x1×0.8 + x2×0.6 + x3×0.5 + x4×0.2
where x1 is the number of times the word appears in the title, x2 the number of times it appears at the beginning of a paragraph, x3 the number of times it appears at the end of a paragraph, and x4 the number of times it appears in the body.
For word frequency, the sub-weight value of a word can be calculated as M = f/(1+f), where f is the word's frequency in an article. With this formula, the sub-weight rises gradually as the frequency increases, and as the frequency grows the formula converges toward 1: the more often a word appears, the more likely it is to be a keyword, but the growth in likelihood is not linear. When the frequency is very high the value essentially stabilizes, which matches the reality of language better than a linear formula.
After the sub-weight values corresponding to part of speech, word position, and word frequency are calculated, they can be summed to obtain the weight value of the target word.
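The weighting scheme of step S32 can be sketched using the example sub-weights given above (noun/verb 0.8, adjective/adverb 0.5; title 0.8, paragraph start 0.6, paragraph end 0.5, body 0.2; frequency sub-weight M = f/(1+f)). The part-of-speech tags and position counts below are illustrative inputs assumed for the sketch, not values fixed by the patent.

```python
POS_SUBWEIGHT = {"noun": 0.8, "verb": 0.8, "adjective": 0.5, "adverb": 0.5}
POSITION_COEF = {"title": 0.8, "para_start": 0.6, "para_end": 0.5, "body": 0.2}

def position_subweight(counts):
    """Y = x1*0.8 + x2*0.6 + x3*0.5 + x4*0.2 over per-position occurrence counts."""
    return sum(POSITION_COEF[pos] * n for pos, n in counts.items())

def frequency_subweight(f):
    """M = f / (1 + f): grows with frequency and converges toward 1."""
    return f / (1 + f)

def word_weight(pos_tag, position_counts, freq):
    """Weight value = sum of part-of-speech, position, and frequency sub-weights."""
    return (POS_SUBWEIGHT.get(pos_tag, 0.0)
            + position_subweight(position_counts)
            + frequency_subweight(freq))

def conference_keywords(words, threshold):
    """Keep target words whose weight exceeds the preset threshold (step S33)."""
    return [w for w, feats in words.items() if word_weight(*feats) > threshold]
```

For instance, a noun appearing once in the title and three times in the body with frequency 4 scores 0.8 + (0.8 + 0.6) + 4/5 = 3.0.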
Step S33: use the target words whose weight value is greater than a preset threshold as the preset conference keywords.
All target words whose weight value is greater than the preset threshold are taken as the preset conference keywords. A weight value above the preset threshold indicates that the corresponding target word is relatively important in the conference materials and can serve as a conference keyword. The preset threshold may be an empirical value.
In this embodiment, the pre-stored conference materials are segmented, word features are extracted from the segmented target words, and the weight value of each target word is calculated based on its word features, where the features include at least part of speech, word position, and word frequency; the target words whose weight value exceeds the preset threshold are used as the preset conference keywords. Conference keywords can thus be generated automatically from the conference materials. Compared with having conference members enter keywords manually, this embodiment yields more objective and comprehensive conference keywords, making the subsequent determination of whether user speech in the conference audio is related to the conference more accurate.
Further, based on the foregoing embodiments, in the fourth embodiment of the present application, the step in step S30 of comparing and matching the text data with the preset conference keywords includes:
Step S34: perform word segmentation on the text data to obtain the segmented utterance keywords.
After the text data is segmented, the segmented words are obtained. All of them may be taken as utterance keywords; alternatively, they may be classified by part of speech, with the nouns, gerunds, and verbs among them taken as the utterance keywords.
Step S35: compare the utterance keywords with the preset conference keywords, and determine whether the utterance keywords contain a conference keyword.
There may be multiple utterance keywords and multiple preset conference keywords. Each utterance keyword is compared with all of the conference keywords to determine whether it is identical to at least one conference keyword or identical/similar to one in meaning. In this embodiment, "containing" a conference keyword means being identical to a conference keyword or identical/similar to it in meaning.
Specifically, it is first determined whether an utterance keyword is identical to at least one conference keyword. If so, it can be determined that the utterance keywords contain a conference keyword. If it differs from all conference keywords, it is further determined whether the utterance keyword is identical/similar in meaning to at least one conference keyword; if so, the utterance keywords contain a conference keyword, and if its meaning differs from that of all conference keywords, it can be determined that the utterance keywords do not contain a conference keyword.
A corpus may be created in advance, storing words identical or similar in meaning to the conference keywords. When determining whether an utterance keyword is identical/similar in meaning to at least one conference keyword, the related words that are identical/similar in meaning to the conference keywords are obtained from the corpus, and the utterance keyword is compared with these related words; if the utterance keyword is identical to at least one related word, it can be determined that the utterance keyword is identical/similar in meaning to at least one conference keyword.
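The two-stage comparison of step S35 — an exact match first, then a meaning match via the pre-built corpus — can be sketched as follows. The synonym corpus here is a hypothetical in-memory dict assumed for illustration; a real system would query whatever lexical resource the corpus is stored in.

```python
def contains_conference_keyword(utterance_keywords, conference_keywords,
                                synonym_corpus):
    """True if any utterance keyword equals a conference keyword, or equals one
    of the related (same/similar meaning) words stored for a conference keyword."""
    kw_set = set(conference_keywords)
    for word in utterance_keywords:      # stage 1: identical keyword
        if word in kw_set:
            return True
    for word in utterance_keywords:      # stage 2: same/similar meaning
        for kw in conference_keywords:
            if word in synonym_corpus.get(kw, ()):
                return True
    return False
```

A `True` result corresponds to a successful match in step S36, and the conference audio would be output.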
Step S36: if the utterance keywords contain a conference keyword, the text data is successfully matched with the conference keywords.
If the utterance keywords contain a conference keyword, the text data matches the conference keywords successfully, and the conference audio can be output. Conversely, if the utterance keywords contain no conference keyword, the match fails, indicating that the user speech in the conference audio may be unrelated to the conference content, and the conference audio need not be output.
In this embodiment, as long as the utterance keywords contain a conference keyword, the text data matches the conference keywords successfully. This avoids missing important user speech in the conference audio because the matching requirement is set too high.
Further, based on the foregoing embodiments, in the fifth embodiment of the present application, the step in step S30 of determining whether to output the conference audio according to the matching result of the text data and the conference keywords includes:
Step S370: if the text data is successfully matched with the conference keywords, obtain a conference image.
After the text data is successfully matched with the conference keywords, whether to output the conference audio can be further determined based on image analysis. The conference image in this embodiment is the conference image at the source of the conference audio, that is, an image of the space where the conference members producing the audio are located. For example, if the conference audio is local audio captured by a local sound-capture device, the conference image is a local image; if the conference audio is audio from a remote space transmitted via the server network, the conference image is an image of the corresponding remote space. As another example, if the conference audio originates from conference member A, the conference image is an image of the space where member A is located.
Step S371: detect the face in the conference image, extract the lip features of the detected face, and determine from the lip features whether the face exhibits speaking features.
Face recognition is performed on the conference image to obtain the faces in it. A conference image may contain multiple faces, in which case lip-feature extraction and a speaking-feature determination are performed for each face; if at least one face exhibits speaking features, it can be determined that a face in the conference image exhibits speaking features. Based on the positional characteristics of facial features, image recognition can be performed directly on the face to locate the lips. The lip features can be input into a preset speech-judgment model, which determines from the lip features whether the face exhibits speaking features. The speech-judgment model can be trained using lip images labeled as speaking and non-speaking mouth shapes as positive and negative examples, respectively; once the optimal model parameters are obtained, the model containing those parameters is used to judge speaking based on lip features.
Step S372: if the face exhibits speaking features, determine that the conference audio is to be output.
If a face exhibits speaking features, a conference member in the space corresponding to the conference audio is speaking, so it can be determined that the conference audio contains a member's speech, and the conference audio needs to be output. If no face exhibits speaking features, no conference member in that space is speaking, so the conference audio should contain no member speech; any user speech present in the conference audio is likely noise, and it is determined that the conference audio is not output.
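The patent leaves the speech-judgment model unspecified beyond its training on speaking/non-speaking lip images. As a simplified stand-in for such a model, a mouth-opening ratio computed from lip landmarks is sometimes used as a speaking cue; the sketch below assumes lip landmarks are already available from a face detector, and the opening threshold of 0.3 is a hypothetical value, not one taken from the patent.

```python
def mouth_aspect_ratio(top_lip, bottom_lip, left_corner, right_corner):
    """Vertical mouth opening divided by mouth width, from (x, y) lip landmarks."""
    def dist(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    return dist(top_lip, bottom_lip) / dist(left_corner, right_corner)

def should_output_audio(faces, open_threshold=0.3):
    """Output the conference audio if at least one detected face appears to be
    speaking (steps S371-S372). `faces` holds one lip-landmark tuple per face."""
    return any(mouth_aspect_ratio(*f) > open_threshold for f in faces)
```

A trained classifier, as the patent describes, would replace the fixed ratio threshold with learned parameters, but the decision rule — output the audio when any face is judged to be speaking — stays the same.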
In this embodiment, image recognition is performed on the conference image corresponding to the conference audio, the lip features of the faces in the image are extracted, and whether each face exhibits speaking features — that is, whether the person is speaking — is determined from the lip features. If at least one face in the conference image exhibits speaking features, the conference audio can be output. In this way, whether the conference audio should be output is determined jointly from image features and audio features, yielding more accurate conference-audio screening results.
Optionally, after the step of detecting the face in the conference image in step S371, the method includes:
Step S373: perform frontal/profile recognition on the detected face.
A discrimination model for frontal/profile recognition can be preset and trained on face images labeled as frontal or profile until a model containing the optimal parameters is obtained. The detected face image can then be input into this discrimination model, which outputs the frontal/profile recognition result.
Step S374: if the face is frontal, perform the step of extracting the lip features of the detected face.
If the face is frontal, the conference member is facing the conference screen and participating attentively; moreover, in the frontal state the complete lips of the face can be detected. Therefore, to screen the conference audio that needs to be output with further accuracy, the step of extracting the lip features of the detected face can be performed to judge whether the person is speaking, that is, steps S371–S372 are executed.
Step S375: if the face is a profile, determine that the conference audio is not output.
If the face is a profile, the conference member may need to hold a private discussion with other members, so it is determined that the conference audio is not output. This enhances the flexibility of conference-audio screening and also offers good practicality for remote-conference scenarios.
In addition, the present application further provides a conference audio control system corresponding to the steps of the foregoing conference audio control method.
Referring to Fig. 3, Fig. 3 is a schematic diagram of the functional modules of the first embodiment of the conference audio control system of the present application.
In this embodiment, the conference audio control system of the present application includes:
a voice detection module 10, configured to receive conference audio, perform voice detection on the conference audio, and determine whether the conference audio contains user speech;
a text conversion module 20, configured to, if the conference audio contains user speech, extract the user speech from the conference audio and convert the user speech into text data; and
a matching output module 30, configured to compare and match the text data with preset conference keywords, and determine whether to output the conference audio according to the matching result of the text data and the conference keywords.
Further, the voice detection module 10 is also configured to extract audio frames from the conference audio and obtain the signal energy of the audio frames; compare the signal energy of each audio frame with the preset energy threshold; and, if the signal energy of an audio frame is greater than the preset energy threshold, determine that the audio frame is a speech frame.
Further, the voice detection module 10 is also configured to output a user mute prompt, collect the background noise in the no-user-speech state, and obtain the background noise energy; and calculate the preset energy threshold based on the background noise energy and the preset threshold formula, the threshold formula being E_rnew = (1-p)·E_rold + p·E_silence, where E_rnew is the new threshold, E_rold is the old threshold, E_silence is the background noise energy, and p is a weighting value satisfying 0 < p < 1.
Further, the conference audio control system further includes:
a conference keyword determination module, configured to obtain pre-stored conference materials, obtain a target text set based on the conference materials, perform word segmentation on the target texts in the target text set, and obtain the segmented target words; obtain the word features of the target words and calculate the weight value of each target word based on its word features, where the word features include at least part of speech, word position, and word frequency; and use the target words whose weight value is greater than the preset threshold as the preset conference keywords.
Further, the matching output module 30 is also configured to perform word segmentation on the text data to obtain the segmented utterance keywords; compare the utterance keywords with the preset conference keywords and determine whether the utterance keywords contain a conference keyword; and, if the utterance keywords contain a conference keyword, determine that the text data is successfully matched with the conference keywords.
Further, the matching output module 30 is also configured to obtain a conference image if the text data is successfully matched with the conference keywords; detect the face in the conference image, extract the lip features of the detected face, and determine from the lip features whether the face exhibits speaking features; and, if the face exhibits speaking features, determine that the conference audio is to be output.
Further, the matching output module 30 is also configured to perform frontal/profile recognition on the detected face; if the face is frontal, perform the step of extracting the lip features of the detected face; and, if the face is a profile, determine that the conference audio is not output.
The present application further proposes a computer-readable storage medium, which may be a non-volatile readable storage medium storing computer-readable instructions. The computer-readable storage medium may be the memory 201 in the conference audio control device of Fig. 1, or at least one of a ROM (Read-Only Memory)/RAM (Random Access Memory), a magnetic disk, and an optical disc. The computer-readable storage medium includes a number of instructions for causing a device with a processor (which may be a mobile phone, a computer, a server, a network device, or the conference audio control device in the embodiments of the present application, etc.) to execute the methods of the embodiments of the present application.
It should be noted that, herein, the terms "comprise", "include", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, or device comprising a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the existence of other identical elements in the process, method, or device that comprises the element.
The serial numbers of the foregoing embodiments of the present application are for description only and do not represent the superiority or inferiority of the embodiments.
Through the description of the foregoing embodiments, those skilled in the art can clearly understand that the methods of the foregoing embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation.
The above are only optional embodiments of the present application and do not thereby limit the patent scope of the present application. Any equivalent structural or equivalent process transformation made using the contents of the specification and drawings of the present application, or any direct or indirect application thereof in other related technical fields, is likewise included within the patent protection scope of the present application.

Claims (20)

  1. A conference audio control method, wherein the conference audio control method comprises the following steps:
    receiving conference audio, performing voice detection on the conference audio, and determining whether the conference audio contains user speech;
    if the conference audio contains user speech, extracting the user speech from the conference audio and converting the user speech into text data; and
    comparing the text data against preset conference keywords, and determining whether to output the conference audio according to the result of matching the text data against the conference keywords;
    wherein the step of performing voice detection on the conference audio and determining whether the conference audio contains user speech comprises:
    extracting audio frames from the conference audio and obtaining the signal energy of the audio frames;
    outputting a mute prompt to the user, capturing the background noise while no user speech is present, and obtaining the background noise energy;
    calculating the preset energy threshold from the background noise energy using a preset threshold formula, the threshold formula being E_rnew = (1-p)·E_rold + p·E_silence, where E_rnew is the updated threshold, E_rold is the previous threshold, E_silence is the background noise energy, and p is a weighting factor satisfying 0 < p < 1;
    comparing the signal energy of the audio frames against the preset energy threshold; and
    if the signal energy of an audio frame is greater than the preset energy threshold, determining that the audio frame is a speech frame.
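The energy-based voice detection recited in claim 1 can be sketched as follows. This is a minimal illustration, not the claimed implementation: the frame energy is taken as the mean squared PCM sample amplitude, and E_silence is assumed to be averaged over a few frames captured during the mute prompt.

```python
def frame_energy(frame):
    """Mean squared amplitude of one audio frame (a list of PCM samples)."""
    return sum(s * s for s in frame) / len(frame)

def update_threshold(e_old, e_silence, p=0.2):
    """Threshold formula from claim 1: E_rnew = (1-p)*E_rold + p*E_silence."""
    assert 0 < p < 1  # the claim requires the weighting factor to lie in (0, 1)
    return (1 - p) * e_old + p * e_silence

def is_speech_frame(frame, threshold):
    """A frame is classified as a speech frame when its energy exceeds the threshold."""
    return frame_energy(frame) > threshold

# Calibration: background noise captured while the user is muted.
noise_frames = [[10, -12, 8, -9], [11, -10, 9, -8]]
e_silence = sum(frame_energy(f) for f in noise_frames) / len(noise_frames)
threshold = update_threshold(e_old=e_silence, e_silence=e_silence)

print(is_speech_frame([400, -380, 410, -395], threshold))  # loud frame -> True
print(is_speech_frame([9, -11, 10, -8], threshold))        # noise-level frame -> False
```

The weighted update lets the threshold track slowly drifting background noise rather than being fixed at calibration time; p controls how quickly the old threshold is forgotten.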
  2. The conference audio control method of claim 1, wherein before the step of comparing the text data against the preset conference keywords, the method comprises:
    obtaining pre-stored conference materials, deriving a target text set from the conference materials, and performing word segmentation on the target texts in the target text set to obtain segmented target words;
    obtaining word features of the target words and calculating a weight value for each target word based on its word features, the word features including at least part of speech, word position, and word frequency; and
    using the target words whose weight values exceed a preset threshold as the preset conference keywords.
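The keyword-selection step of claim 2 can be sketched as below. The claim only requires that part of speech, word position, and word frequency all contribute to the weight; the specific coefficients (`POS_WEIGHT`, the title boost, the 0.05 cutoff) and the pre-segmented input format are illustrative assumptions.

```python
from collections import Counter

POS_WEIGHT = {"noun": 1.0, "verb": 0.6, "other": 0.2}  # assumed coefficients

def keyword_weights(tagged_words, title_words):
    """tagged_words: list of (word, pos) pairs from the segmented conference materials."""
    freq = Counter(w for w, _ in tagged_words)
    total = len(tagged_words)
    weights = {}
    for word, pos in tagged_words:
        tf = freq[word] / total                           # word frequency
        pos_w = POS_WEIGHT.get(pos, POS_WEIGHT["other"])  # part of speech
        loc_w = 2.0 if word in title_words else 1.0       # word position (title boost)
        weights[word] = tf * pos_w * loc_w
    return weights

def preset_keywords(tagged_words, title_words, threshold=0.05):
    """Keep the target words whose weight exceeds the preset threshold."""
    weights = keyword_weights(tagged_words, title_words)
    return {w for w, v in weights.items() if v > threshold}

docs = [("budget", "noun"), ("review", "noun"), ("the", "other"),
        ("budget", "noun"), ("discuss", "verb"), ("budget", "noun")]
print(preset_keywords(docs, title_words={"budget"}))
```

In this toy run the stop-word-like "the" scores below the cutoff and is dropped, while the content words survive as conference keywords; a production system would obtain the (word, pos) pairs from a segmenter such as jieba.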
  3. The conference audio control method of claim 1, wherein the step of comparing the text data against the preset conference keywords comprises:
    performing word segmentation on the text data to obtain segmented utterance keywords;
    comparing the utterance keywords against the preset conference keywords to determine whether the utterance keywords include any of the conference keywords; and
    if the utterance keywords include a conference keyword, determining that the text data matches the conference keywords.
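The matching step of claim 3 reduces to a set-intersection test once the text is segmented. A minimal sketch, using whitespace/regex tokenization as a stand-in for a real segmenter:

```python
import re

def segment(text):
    """Illustrative word segmentation; a real system would use a Chinese
    segmenter such as jieba, but simple tokenization suffices here."""
    return [w.lower() for w in re.findall(r"\w+", text)]

def matches_keywords(text_data, conference_keywords):
    """Claim 3: the text matches when its segmented utterance keywords
    include at least one preset conference keyword."""
    utterance_keywords = set(segment(text_data))
    return not utterance_keywords.isdisjoint(conference_keywords)

keywords = {"budget", "roadmap"}
print(matches_keywords("Let's move on to the budget review", keywords))  # True
print(matches_keywords("Unrelated small talk", keywords))                # False
```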
  4. The conference audio control method of claim 1, wherein the step of determining whether to output the conference audio according to the result of matching the text data against the conference keywords comprises:
    if the text data matches the conference keywords, acquiring a conference image;
    detecting a face in the conference image, extracting lip features of the detected face, and determining from the lip features whether the face exhibits speaking activity; and
    if the face exhibits speaking activity, determining that the conference audio is to be output.
  5. The conference audio control method of claim 4, wherein after the step of detecting a face in the conference image, the method comprises:
    performing frontal/profile recognition on the detected face;
    if the face is frontal, executing the step of extracting the lip features of the detected face; and
    if the face is in profile, determining that the conference audio is not to be output.
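The decision logic of claims 4 and 5 can be sketched as the following gate. The perception steps (face detection, frontal/profile classification, lip-feature measurement) are stubbed out as inputs — they would come from a computer-vision model — and the `lip_motion` score and its threshold are assumptions, since the claims do not fix a concrete lip-feature criterion.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FaceObservation:
    is_frontal: bool   # result of frontal/profile recognition (claim 5)
    lip_motion: float  # assumed lip-feature score, e.g. mouth-opening variance

LIP_MOTION_THRESHOLD = 0.3  # assumed cutoff for "exhibits speaking activity"

def should_output_audio(text_matched: bool, face: Optional[FaceObservation]) -> bool:
    if not text_matched:      # claim 4: keyword match is a precondition
        return False
    if face is None:          # no face detected in the conference image
        return False
    if not face.is_frontal:   # claim 5: a profile face suppresses output
        return False
    # claim 4: lip features must indicate speaking activity
    return face.lip_motion > LIP_MOTION_THRESHOLD

print(should_output_audio(True, FaceObservation(True, 0.8)))   # True
print(should_output_audio(True, FaceObservation(False, 0.8)))  # False
```

Ordering the checks this way means the image pipeline only runs after a keyword match, mirroring the "if the text data matches ... acquiring a conference image" structure of claim 4.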
  6. A conference audio control system, wherein the conference audio control system comprises:
    a voice detection module, configured to receive conference audio, perform voice detection on the conference audio, and determine whether the conference audio contains user speech;
    a text conversion module, configured to, if the conference audio contains user speech, extract the user speech from the conference audio and convert the user speech into text data; and
    a matching output module, configured to compare the text data against preset conference keywords and determine whether to output the conference audio according to the result of matching the text data against the conference keywords;
    the voice detection module being further configured to extract audio frames from the conference audio and obtain the signal energy of the audio frames, compare the signal energy of the audio frames against a preset energy threshold, and, if the signal energy of an audio frame is greater than the preset energy threshold, determine that the audio frame is a speech frame;
    the voice detection module being further configured to output a mute prompt to the user, capture the background noise while no user speech is present, obtain the background noise energy, and calculate the preset energy threshold from the background noise energy using a preset threshold formula, the threshold formula being E_rnew = (1-p)·E_rold + p·E_silence, where E_rnew is the updated threshold, E_rold is the previous threshold, E_silence is the background noise energy, and p is a weighting factor satisfying 0 < p < 1.
  7. The conference audio control system of claim 6, wherein the conference audio control system further comprises:
    a conference keyword determination module, configured to obtain pre-stored conference materials, derive a target text set from the conference materials, perform word segmentation on the target texts in the target text set to obtain segmented target words, obtain word features of the target words, and calculate a weight value for each target word based on its word features, the word features including at least part of speech, word position, and word frequency, and to use the target words whose weight values exceed a preset threshold as the preset conference keywords.
  8. The conference audio control system of claim 6, wherein the matching output module is further configured to perform word segmentation on the text data to obtain segmented utterance keywords, compare the utterance keywords against the preset conference keywords to determine whether the utterance keywords include any of the conference keywords, and, if the utterance keywords include a conference keyword, determine that the text data matches the conference keywords.
  9. The conference audio control system of claim 6, wherein the matching output module is further configured to acquire a conference image if the text data matches the conference keywords, detect a face in the conference image, extract lip features of the detected face, determine from the lip features whether the face exhibits speaking activity, and, if the face exhibits speaking activity, determine that the conference audio is to be output.
  10. The conference audio control system of claim 9, wherein the matching output module is further configured to perform frontal/profile recognition on the detected face, execute the step of extracting the lip features of the detected face if the face is frontal, and determine that the conference audio is not to be output if the face is in profile.
  11. A conference audio control device, wherein the conference audio control device comprises a processor, a memory, and computer-readable instructions stored on the memory and executable by the processor, the computer-readable instructions, when executed by the processor, implementing the following steps:
    receiving conference audio, performing voice detection on the conference audio, and determining whether the conference audio contains user speech;
    if the conference audio contains user speech, extracting the user speech from the conference audio and converting the user speech into text data; and
    comparing the text data against preset conference keywords, and determining whether to output the conference audio according to the result of matching the text data against the conference keywords;
    wherein the step of performing voice detection on the conference audio and determining whether the conference audio contains user speech comprises:
    extracting audio frames from the conference audio and obtaining the signal energy of the audio frames;
    outputting a mute prompt to the user, capturing the background noise while no user speech is present, and obtaining the background noise energy;
    calculating the preset energy threshold from the background noise energy using a preset threshold formula, the threshold formula being E_rnew = (1-p)·E_rold + p·E_silence, where E_rnew is the updated threshold, E_rold is the previous threshold, E_silence is the background noise energy, and p is a weighting factor satisfying 0 < p < 1;
    comparing the signal energy of the audio frames against the preset energy threshold; and
    if the signal energy of an audio frame is greater than the preset energy threshold, determining that the audio frame is a speech frame.
  12. The conference audio control device of claim 11, wherein the computer-readable instructions, when executed by the processor, further implement the following steps:
    obtaining pre-stored conference materials, deriving a target text set from the conference materials, and performing word segmentation on the target texts in the target text set to obtain segmented target words;
    obtaining word features of the target words and calculating a weight value for each target word based on its word features, the word features including at least part of speech, word position, and word frequency; and
    using the target words whose weight values exceed a preset threshold as the preset conference keywords.
  13. The conference audio control device of claim 11, wherein the computer-readable instructions, when executed by the processor, further implement the following steps:
    performing word segmentation on the text data to obtain segmented utterance keywords;
    comparing the utterance keywords against the preset conference keywords to determine whether the utterance keywords include any of the conference keywords; and
    if the utterance keywords include a conference keyword, determining that the text data matches the conference keywords.
  14. The conference audio control device of claim 11, wherein the computer-readable instructions, when executed by the processor, further implement the following steps:
    if the text data matches the conference keywords, acquiring a conference image;
    detecting a face in the conference image, extracting lip features of the detected face, and determining from the lip features whether the face exhibits speaking activity; and
    if the face exhibits speaking activity, determining that the conference audio is to be output.
  15. The conference audio control device of claim 14, wherein the computer-readable instructions, when executed by the processor, further implement the following steps:
    performing frontal/profile recognition on the detected face;
    if the face is frontal, executing the step of extracting the lip features of the detected face; and
    if the face is in profile, determining that the conference audio is not to be output.
  16. A computer-readable storage medium, wherein computer-readable instructions are stored on the computer-readable storage medium, the computer-readable instructions, when executed by a processor, implementing the following steps:
    receiving conference audio, performing voice detection on the conference audio, and determining whether the conference audio contains user speech;
    if the conference audio contains user speech, extracting the user speech from the conference audio and converting the user speech into text data; and
    comparing the text data against preset conference keywords, and determining whether to output the conference audio according to the result of matching the text data against the conference keywords;
    wherein the step of performing voice detection on the conference audio and determining whether the conference audio contains user speech comprises:
    extracting audio frames from the conference audio and obtaining the signal energy of the audio frames;
    outputting a mute prompt to the user, capturing the background noise while no user speech is present, and obtaining the background noise energy;
    calculating the preset energy threshold from the background noise energy using a preset threshold formula, the threshold formula being E_rnew = (1-p)·E_rold + p·E_silence, where E_rnew is the updated threshold, E_rold is the previous threshold, E_silence is the background noise energy, and p is a weighting factor satisfying 0 < p < 1;
    comparing the signal energy of the audio frames against the preset energy threshold; and
    if the signal energy of an audio frame is greater than the preset energy threshold, determining that the audio frame is a speech frame.
  17. The computer-readable storage medium of claim 16, wherein the computer-readable instructions, when executed by a processor, further implement the following steps:
    obtaining pre-stored conference materials, deriving a target text set from the conference materials, and performing word segmentation on the target texts in the target text set to obtain segmented target words;
    obtaining word features of the target words and calculating a weight value for each target word based on its word features, the word features including at least part of speech, word position, and word frequency; and
    using the target words whose weight values exceed a preset threshold as the preset conference keywords.
  18. The computer-readable storage medium of claim 16, wherein the computer-readable instructions, when executed by a processor, further implement the following steps:
    performing word segmentation on the text data to obtain segmented utterance keywords;
    comparing the utterance keywords against the preset conference keywords to determine whether the utterance keywords include any of the conference keywords; and
    if the utterance keywords include a conference keyword, determining that the text data matches the conference keywords.
  19. The computer-readable storage medium of claim 16, wherein the computer-readable instructions, when executed by a processor, further implement the following steps:
    if the text data matches the conference keywords, acquiring a conference image;
    detecting a face in the conference image, extracting lip features of the detected face, and determining from the lip features whether the face exhibits speaking activity; and
    if the face exhibits speaking activity, determining that the conference audio is to be output.
  20. The computer-readable storage medium of claim 17, wherein the computer-readable instructions, when executed by a processor, further implement the following steps:
    performing frontal/profile recognition on the detected face;
    if the face is frontal, executing the step of extracting the lip features of the detected face; and
    if the face is in profile, determining that the conference audio is not to be output.
PCT/CN2019/121711 2019-05-21 2019-11-28 Conference audio control method, system, device and computer readable storage medium WO2020233068A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910432253.9A CN110300001B (en) 2019-05-21 2019-05-21 Conference audio control method, system, device and computer readable storage medium
CN201910432253.9 2019-05-21

Publications (1)

Publication Number Publication Date
WO2020233068A1 true WO2020233068A1 (en) 2020-11-26

Family

ID=68027129

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/121711 WO2020233068A1 (en) 2019-05-21 2019-11-28 Conference audio control method, system, device and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN110300001B (en)
WO (1) WO2020233068A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112969000A (en) * 2021-02-25 2021-06-15 北京百度网讯科技有限公司 Control method and device of network conference, electronic equipment and storage medium
CN113051426A (en) * 2021-03-18 2021-06-29 深圳市声扬科技有限公司 Audio information classification method and device, electronic equipment and storage medium
CN113746822A (en) * 2021-08-25 2021-12-03 安徽创变信息科技有限公司 Teleconference management method and system
US11444795B1 (en) 2021-02-25 2022-09-13 At&T Intellectual Property I, L.P. Intelligent meeting assistant
CN115828907A (en) * 2023-02-16 2023-03-21 南昌航天广信科技有限责任公司 Intelligent conference management method, system, readable storage medium and computer equipment
CN116246633A (en) * 2023-05-12 2023-06-09 深圳市宏辉智通科技有限公司 Wireless intelligent Internet of things conference system

Families Citing this family (14)

Publication number Priority date Publication date Assignee Title
CN110300001B (en) * 2019-05-21 2022-03-15 深圳壹账通智能科技有限公司 Conference audio control method, system, device and computer readable storage medium
CN111314788A (en) * 2020-03-13 2020-06-19 广州华多网络科技有限公司 Voice password returning method and presenting method, device and equipment for voice gift
CN111510662B (en) * 2020-04-27 2021-06-22 深圳米唐科技有限公司 Network call microphone state prompting method and system based on audio and video analysis
CN111556279A (en) * 2020-05-22 2020-08-18 腾讯科技(深圳)有限公司 Monitoring method and communication method of instant session
CN111754990A (en) * 2020-06-24 2020-10-09 杨文龙 Voice chat cooperative processing method and device
CN111756939B (en) * 2020-06-28 2022-05-31 联想(北京)有限公司 Online voice control method and device and computer equipment
CN111753769A (en) * 2020-06-29 2020-10-09 歌尔科技有限公司 Terminal audio acquisition control method, electronic equipment and readable storage medium
CN111833876A (en) * 2020-07-14 2020-10-27 科大讯飞股份有限公司 Conference speech control method, system, electronic device and storage medium
CN112601045A (en) * 2020-12-10 2021-04-02 广州虎牙科技有限公司 Speaking control method, device, equipment and storage medium for video conference
CN112687272B (en) * 2020-12-18 2023-03-21 北京金山云网络技术有限公司 Conference summary recording method and device and electronic equipment
CN112687273B (en) * 2020-12-26 2024-04-16 科大讯飞股份有限公司 Voice transcription method and device
CN112765335B (en) * 2021-01-27 2024-03-08 上海三菱电梯有限公司 Voice call system
CN113505597A (en) * 2021-07-27 2021-10-15 随锐科技集团股份有限公司 Method, device and storage medium for extracting keywords in video conference
CN116110373B (en) * 2023-04-12 2023-06-09 深圳市声菲特科技技术有限公司 Voice data acquisition method and related device of intelligent conference system

Citations (6)

Publication number Priority date Publication date Assignee Title
US20070106514A1 (en) * 2005-11-08 2007-05-10 Oh Seung S Method of generating a prosodic model for adjusting speech style and apparatus and method of synthesizing conversational speech using the same
CN105405439A (en) * 2015-11-04 2016-03-16 科大讯飞股份有限公司 Voice playing method and device
CN105512348A (en) * 2016-01-28 2016-04-20 北京旷视科技有限公司 Method and device for processing videos and related audios and retrieving method and device
CN106531172A (en) * 2016-11-23 2017-03-22 湖北大学 Speaker voice playback identification method and system based on environmental noise change detection
CN107993665A (en) * 2017-12-14 2018-05-04 科大讯飞股份有限公司 Spokesman role determines method, intelligent meeting method and system in multi-conference scene
CN110300001A (en) * 2019-05-21 2019-10-01 深圳壹账通智能科技有限公司 Conference audio control method, system, equipment and computer readable storage medium

Family Cites Families (17)

Publication number Priority date Publication date Assignee Title
JP5094804B2 (en) * 2009-08-31 2012-12-12 シャープ株式会社 Conference relay device and computer program
US9601117B1 (en) * 2011-11-30 2017-03-21 West Corporation Method and apparatus of processing user data of a multi-speaker conference call
CN103581608B (en) * 2012-07-20 2019-02-01 Polycom 通讯技术(北京)有限公司 Spokesman's detection system, spokesman's detection method and audio/video conferencingasystem figureu
CN103137137B (en) * 2013-02-27 2015-07-01 华南理工大学 Eloquent speaker finding method in conference audio
US9595271B2 (en) * 2013-06-27 2017-03-14 Getgo, Inc. Computer system employing speech recognition for detection of non-speech audio
EP2999203A1 (en) * 2014-09-22 2016-03-23 Alcatel Lucent Conferencing system
CN105162611B (en) * 2015-10-21 2019-03-15 方图智能(深圳)科技集团股份有限公司 A kind of digital conference system and management control method
WO2017124293A1 (en) * 2016-01-19 2017-07-27 王晓光 Conference discussion method and system for video conference
CN107170452A (en) * 2017-04-27 2017-09-15 广东小天才科技有限公司 The Adding Way and device of a kind of electronic meeting
CN107276777B (en) * 2017-07-27 2020-05-29 苏州科达科技股份有限公司 Audio processing method and device of conference system
CN107679506A (en) * 2017-10-12 2018-02-09 Tcl通力电子(惠州)有限公司 Awakening method, intelligent artifact and the computer-readable recording medium of intelligent artifact
CN109036381A (en) * 2018-08-08 2018-12-18 平安科技(深圳)有限公司 Method of speech processing and device, computer installation and readable storage medium storing program for executing
CN108986826A (en) * 2018-08-14 2018-12-11 中国平安人寿保险股份有限公司 Automatically generate method, electronic device and the readable storage medium storing program for executing of minutes
CN109388701A (en) * 2018-08-17 2019-02-26 深圳壹账通智能科技有限公司 Minutes generation method, device, equipment and computer storage medium
CN109145853A (en) * 2018-08-31 2019-01-04 百度在线网络技术(北京)有限公司 The method and apparatus of noise for identification
CN109274922A (en) * 2018-11-19 2019-01-25 国网山东省电力公司信息通信公司 A kind of Video Conference Controlling System based on speech recognition
CN109547729A (en) * 2018-11-27 2019-03-29 平安科技(深圳)有限公司 A kind of call voice access video-meeting method and device

Cited By (9)

Publication number Priority date Publication date Assignee Title
CN112969000A (en) * 2021-02-25 2021-06-15 北京百度网讯科技有限公司 Control method and device of network conference, electronic equipment and storage medium
US11444795B1 (en) 2021-02-25 2022-09-13 At&T Intellectual Property I, L.P. Intelligent meeting assistant
CN113051426A (en) * 2021-03-18 2021-06-29 深圳市声扬科技有限公司 Audio information classification method and device, electronic equipment and storage medium
CN113746822A (en) * 2021-08-25 2021-12-03 安徽创变信息科技有限公司 Teleconference management method and system
CN113746822B (en) * 2021-08-25 2023-07-21 广州市昇博电子科技有限公司 Remote conference management method and system
CN115828907A (en) * 2023-02-16 2023-03-21 南昌航天广信科技有限责任公司 Intelligent conference management method, system, readable storage medium and computer equipment
CN115828907B (en) * 2023-02-16 2023-04-25 南昌航天广信科技有限责任公司 Intelligent conference management method, system, readable storage medium and computer device
CN116246633A (en) * 2023-05-12 2023-06-09 深圳市宏辉智通科技有限公司 Wireless intelligent Internet of things conference system
CN116246633B (en) * 2023-05-12 2023-07-21 深圳市宏辉智通科技有限公司 Wireless intelligent Internet of things conference system

Also Published As

Publication number Publication date
CN110300001B (en) 2022-03-15
CN110300001A (en) 2019-10-01

Similar Documents

Publication Publication Date Title
WO2020233068A1 (en) Conference audio control method, system, device and computer readable storage medium
CN110049270B (en) Multi-person conference voice transcription method, device, system, equipment and storage medium
US10552118B2 (en) Context based identification of non-relevant verbal communications
US9672829B2 (en) Extracting and displaying key points of a video conference
CN107910014B (en) Echo cancellation test method, device and test equipment
WO2020232865A1 (en) Meeting role-based speech synthesis method, apparatus, computer device, and storage medium
US9293133B2 (en) Improving voice communication over a network
JP4838351B2 (en) Keyword extractor
US8826210B2 (en) Visualization interface of continuous waveform multi-speaker identification
CN108346425B (en) Voice activity detection method and device and voice recognition method and device
CN110517689B (en) Voice data processing method, device and storage medium
US9390725B2 (en) Systems and methods for noise reduction using speech recognition and speech synthesis
US20070285505A1 (en) Method and apparatus for video conferencing having dynamic layout based on keyword detection
CN112102850B (en) Emotion recognition processing method and device, medium and electronic equipment
WO2019242414A1 (en) Voice processing method and apparatus, storage medium, and electronic device
US10366173B2 (en) Device and method of simultaneous interpretation based on real-time extraction of interpretation unit
WO2023040523A1 (en) Audio signal processing method and apparatus, electronic device, and storage medium
JP7255032B2 (en) voice recognition
CN111415128A (en) Method, system, apparatus, device and medium for controlling conference
CN109616116B (en) Communication system and communication method thereof
CN113345423B (en) Voice endpoint detection method, device, electronic equipment and storage medium
KR102378895B1 (en) Method for learning wake-word for speech recognition, and computer program recorded on record-medium for executing method therefor
US20230223033A1 (en) Method of Noise Reduction for Intelligent Network Communication
Kannan et al. Malayalam Isolated Digit Recognition using HMM and PLP cepstral coefficient
CN116052650A (en) Voice recognition method, device, storage medium and equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19929616

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 02-03-2022)

122 Ep: pct application non-entry in european phase

Ref document number: 19929616

Country of ref document: EP

Kind code of ref document: A1