CN111429919B - Crosstalk prevention method based on conference real recording system, electronic device and storage medium
- Publication number
- CN111429919B (application CN202010235796.4A)
- Authority
- CN
- China
- Prior art keywords
- speaker
- real
- crosstalk
- microphones
- voice
- Legal status: Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/18—Artificial neural networks; Connectionist approaches
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R29/00—Monitoring arrangements; Testing arrangements
- H04R29/004—Monitoring arrangements; Testing arrangements for microphones
- H04R29/005—Microphone arrays
Abstract
The invention relates to data processing technology and provides a crosstalk prevention method based on a conference real recording system, an electronic device and a storage medium. The method comprises: acquiring the voice information of a speaker in real time and inputting it into a pre-trained voiceprint recognition model to obtain the speaker's real-time voice features; judging whether pre-stored voice features of the speaker exist in a pre-built voiceprint library; when they exist, reading the speaker's voice features and the corresponding tag from the voiceprint library; obtaining the microphone corresponding to the speaker's voice features based on a pre-built mapping relation between each microphone and each speaker tag; detecting in real time whether crosstalk occurs among the plurality of microphones; and, when crosstalk occurs at any of the microphones, executing a crosstalk prevention processing operation on the microphones where crosstalk occurs. With the present invention, the microphone giving rise to crosstalk can be accurately detected, so that crosstalk prevention processing can be performed on it.
Description
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a crosstalk prevention method based on a conference real recording system, an electronic device, and a storage medium.
Background
During use of a conference real recording system, when the distance between microphones is too short or microphone sensitivity is too high, a speaker's sound can be picked up by other microphones, causing microphone crosstalk and seriously affecting the accuracy of the conference record. Existing conference real recording systems on the market cannot automatically detect and handle microphone crosstalk, mainly because they rely on microphone hardware to distinguish conference speakers: when microphone crosstalk occurs, especially when characteristics such as audio stream intensity are similar across microphones, the microphone actually in use cannot be determined, so crosstalk prevention processing cannot be performed.
Disclosure of Invention
In view of the above, the present invention provides a crosstalk prevention method based on a conference real recording system, an electronic device and a storage medium, aimed at solving the problem in the prior art that crosstalk prevention processing cannot be performed because the microphone generating the crosstalk cannot be automatically detected.
To achieve the above object, the present invention provides a crosstalk prevention method based on a conference real recording system, the method comprising:
The acquisition step: acquiring voice information of a speaker in real time, inputting the voice information into a pre-trained voiceprint recognition model, and obtaining real-time voice features of the speaker;
The judging step: judging, based on the real-time voice features of the speaker and using a preset judging rule, whether pre-stored voice features of the speaker exist in a pre-established voiceprint library, and, when the pre-stored voice features of the speaker exist in the voiceprint library, reading the pre-stored voice features of the speaker and the tag corresponding to the speaker from the voiceprint library; and
The executing step: obtaining, based on a pre-established mapping relation between each microphone and each speaker tag, the microphone corresponding to the real-time voice features of the speaker, detecting in real time whether crosstalk occurs among the plurality of microphones, and, when crosstalk occurs at any one of the plurality of microphones, executing a crosstalk prevention processing operation on the microphones with crosstalk.
Preferably, the judging step includes:
respectively calculating, using a first preset calculation rule, a first similarity value between the real-time voice features of the speaker and each pre-stored voice feature in the voiceprint library, and, when a first similarity value is greater than or equal to a preset threshold, determining the pre-stored voice features of the speaker and the tag corresponding to the speaker from the voiceprint library.
Preferably, the judging step further includes:
when all the first similarity values are smaller than the preset threshold, storing the real-time voice information, tag and voiceprint features of the speaker into the voiceprint library.
Preferably, the executing step includes:
converting the real-time voice information of the speaker into text information in real time based on a preset conversion rule, determining the number of responding microphones based on the speaker tags corresponding to the converted text information, and not executing the crosstalk prevention processing operation when the number of microphones is smaller than a preset value.
Preferably, the executing step further includes:
when the number of microphones is greater than the preset value, respectively calculating, using a second preset calculation rule, second similarity values between the text information corresponding to the respective microphones, and, when the second similarity values are all greater than or equal to a second preset threshold, executing the crosstalk prevention processing operation on the microphones with crosstalk; otherwise, not executing the crosstalk prevention processing operation.
To achieve the above object, the present invention also provides an electronic device comprising a memory and a processor, wherein the memory stores a crosstalk prevention program based on a conference real recording system which, when executed by the processor, implements the following steps:
The acquisition step: acquiring voice information of a speaker in real time, inputting the voice information into a pre-trained voiceprint recognition model, and obtaining real-time voice features of the speaker;
The judging step: judging, based on the real-time voice features of the speaker and using a preset judging rule, whether pre-stored voice features of the speaker exist in a pre-established voiceprint library, and, when the pre-stored voice features of the speaker exist in the voiceprint library, reading the pre-stored voice features of the speaker and the tag corresponding to the speaker from the voiceprint library; and
The executing step: obtaining, based on a pre-established mapping relation between each microphone and each speaker tag, the microphone corresponding to the real-time voice features of the speaker, detecting in real time whether crosstalk occurs among the plurality of microphones, and, when crosstalk occurs at any one of the plurality of microphones, executing a crosstalk prevention processing operation on the microphones with crosstalk.
Preferably, the judging step includes:
respectively calculating, using a first preset calculation rule, a first similarity value between the real-time voice features of the speaker and each pre-stored voice feature in the voiceprint library, and, when a first similarity value is greater than or equal to a preset threshold, determining the pre-stored voice features of the speaker and the tag corresponding to the speaker from the voiceprint library.
Preferably, the executing step includes:
converting the real-time voice information of the speaker into text information in real time based on a preset conversion rule, determining the number of responding microphones based on the speaker tags corresponding to the converted text information, and not executing the crosstalk prevention processing operation when the number of microphones is smaller than a preset value.
Preferably, the executing step further includes:
when the number of microphones is greater than the preset value, respectively calculating, using a second preset calculation rule, second similarity values between the text information corresponding to the respective microphones, and, when the second similarity values are all greater than or equal to a second preset threshold, executing the crosstalk prevention processing operation on the microphones with crosstalk; otherwise, not executing the crosstalk prevention processing operation.
To achieve the above object, the present invention further provides a computer-readable storage medium comprising a crosstalk prevention program based on a conference real recording system which, when executed by a processor, implements any step of the crosstalk prevention method based on a conference real recording system described above.
The crosstalk prevention method, electronic device and storage medium based on a conference real recording system provided by the invention acquire the voice information of a speaker in real time and input it into a voiceprint recognition model to obtain the speaker's voice features; judge whether the speaker's voice features exist in a voiceprint library and, when they exist, read the speaker's voice features and the corresponding tag; obtain the microphone corresponding to the speaker based on a pre-established mapping relation; detect in real time whether crosstalk occurs among the plurality of microphones; and execute a crosstalk prevention processing operation on any microphone where crosstalk occurs. Compared with the traditional approach of manually adjusting microphone sensitivity or manually turning microphones off, the invention can accurately detect the microphone generating crosstalk in real time and perform crosstalk prevention processing on it.
Drawings
FIG. 1 is a schematic diagram of an electronic device according to a preferred embodiment of the invention;
FIG. 2 is a schematic block diagram of a preferred embodiment of the crosstalk prevention program based on the conference real recording system in FIG. 1;
FIG. 3 is a flow chart of a crosstalk prevention method based on a conference real recording system according to a preferred embodiment of the present invention;
the achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
The present invention will be described in further detail below with reference to the drawings and embodiments, in order to make its objects, technical solutions and advantages more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
Referring to fig. 1, a schematic diagram of a preferred embodiment of an electronic device 1 according to the present invention is shown.
The electronic device 1 includes, but is not limited to: a memory 11, a processor 12, a display 13 and a network interface 14. The electronic device 1 connects to a network through the network interface 14 to obtain raw data. The network may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, Wi-Fi, or a telephony network.
The memory 11 includes at least one type of readable storage medium, such as flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk or optical disk. In some embodiments, the memory 11 may be an internal storage unit of the electronic device 1, such as a hard disk or memory of the electronic device 1. In other embodiments, the memory 11 may be an external storage device of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card equipped on the electronic device 1. Of course, the memory 11 may also comprise both an internal storage unit and an external storage device of the electronic device 1. In this embodiment, the memory 11 is generally used for storing the operating system and the various application software installed in the electronic device 1, such as the program code of the crosstalk prevention program 10 based on the conference real recording system. The memory 11 may further be used to temporarily store data that has been output or is to be output.
The processor 12 may, in some embodiments, be a central processing unit (CPU), controller, microcontroller, microprocessor or other data processing chip. The processor 12 is typically used to control the overall operation of the electronic device 1, such as control and processing related to data interaction or communication. In this embodiment, the processor 12 is configured to run the program code stored in the memory 11 or to process data, for example to run the program code of the crosstalk prevention program 10 based on the conference real recording system.
The display 13 may be referred to as a display screen or a display unit. The display 13 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an Organic Light-Emitting Diode (OLED) touch device, or the like in some embodiments. The display 13 is used for displaying information processed in the electronic device 1 and for displaying a visual work interface, for example displaying the results of data statistics.
The network interface 14 may optionally comprise a standard wired interface or a wireless interface (e.g., a Wi-Fi interface), and is typically used to establish a communication connection between the electronic device 1 and other electronic devices.
Fig. 1 shows only the electronic device 1 with components 11-14 and the crosstalk prevention program 10 based on the conference real recording system, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead.
Optionally, the electronic device 1 may further comprise a user interface, which may include a display, an input unit such as a keyboard, a standard wired interface and a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (OLED) touch device or the like. The display, which may also be called a display screen or display unit, is used to display information processed in the electronic device 1 and to display a visual user interface.
The electronic device 1 may further include Radio Frequency (RF) circuits, sensors and audio circuits, etc., which will not be described in detail herein.
In the above embodiment, the processor 12, when executing the crosstalk prevention program 10 based on the conference real recording system stored in the memory 11, may implement the following steps:
The acquisition step: acquiring voice information of a speaker in real time, inputting the voice information into a pre-trained voiceprint recognition model, and obtaining real-time voice features of the speaker;
The judging step: judging, based on the real-time voice features of the speaker and using a preset judging rule, whether pre-stored voice features of the speaker exist in a pre-established voiceprint library, and, when the pre-stored voice features of the speaker exist in the voiceprint library, reading the pre-stored voice features of the speaker and the tag corresponding to the speaker from the voiceprint library; and
The executing step: obtaining, based on a pre-established mapping relation between each microphone and each speaker tag, the microphone corresponding to the real-time voice features of the speaker, detecting in real time whether crosstalk occurs among the plurality of microphones, and, when crosstalk occurs at any one of the plurality of microphones, executing a crosstalk prevention processing operation on the microphones with crosstalk.
For a detailed description of the above steps, please refer to the block diagram of an embodiment of the crosstalk prevention program 10 based on the conference real recording system in fig. 2 and the flowchart of an embodiment of the crosstalk prevention method based on the conference real recording system in fig. 3 below.
In other embodiments, the crosstalk prevention program 10 based on the conference real recording system may be divided into a plurality of modules, which are stored in the memory 11 and executed by the processor 12 to accomplish the invention. A module in the present invention refers to a series of computer program instruction segments capable of performing a specified function.
Referring to FIG. 2, a block diagram of an embodiment of the crosstalk prevention program 10 based on the conference real recording system of FIG. 1 is shown. In this embodiment, the crosstalk prevention program 10 may be divided into an acquiring module 110, a judging module 120 and an executing module 130.
The acquiring module 110 is configured to acquire, in real time, voice information of a speaker, and input the voice information into a pre-trained voiceprint recognition model to obtain real-time voice characteristics of the speaker.
In this embodiment, the real-time voice information of a speaker in the conference may be acquired in real time by a sound collecting device, such as a terminal device with a recording function (e.g., a microphone) or a video recording device (e.g., a digital video camera). The audio format of the voice information may be mp3, wma, wav, etc. Specifically, when the speaker at one side of the terminal device starts speaking, the terminal device collects the voice information through its sound collecting device. In addition, voice endpoint detection can be used to distinguish speech signals from non-speech signals in the speaker's voice, remove invalid speech fragments and noise, and determine the start and end points of each valid speech segment, improving the subsequent matching accuracy between the voice and the offline database. After the speaker's voice information is acquired, it is input into a pre-trained voiceprint recognition model to obtain the voiceprint features of the voice information.
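As a concrete illustration, one simple energy-based approach to the endpoint detection mentioned above is sketched below in Python. It is a minimal sketch, not the patented implementation; the frame length and energy-ratio threshold are illustrative assumptions.

```python
import numpy as np

def detect_speech_endpoints(signal, sr, frame_ms=25, energy_ratio=0.1):
    """Mark frames whose short-time energy exceeds a fraction of the mean
    energy, then return the (start, end) sample index of each voiced run."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames.astype(np.float64) ** 2).sum(axis=1)
    voiced = energy > energy_ratio * energy.mean()
    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i                               # voiced run begins
        elif not v and start is not None:
            segments.append((start * frame_len, i * frame_len))
            start = None
    if start is not None:                           # run reaches the end
        segments.append((start * frame_len, n_frames * frame_len))
    return segments
```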
The training step of the voiceprint recognition model comprises the following steps:
A predetermined number of voice recordings are obtained from a predetermined voice database (e.g., the NIST SREs), for example approximately 64,000 recordings from 4,400 speakers collected between 2004 and 2010, together with conference reports and lecture audio of company members. The deep neural network of the x-vector model is trained with the acquired voice data, so that network parameters are learned that correctly distinguish the voiceprints of different speakers in the training set and can effectively recognize the voiceprint features of speakers outside the training set. Usable deep neural network models include, but are not limited to, feedforward DNN, CNN, LSTM and Transformer.
In this embodiment, the deep neural network model is described taking a feedforward DNN as an example. It comprises a Mel-frequency cepstral coefficient (MFCC) feature input layer, four network-in-network (NIN) hidden layers operating at the frame level, a statistics pooling layer, two embedding layers, and a final SoftMax output layer.
The input data of the input layer are processed MFCC feature vectors. MFCCs are cepstral parameters extracted in the Mel-scale frequency domain; the Mel scale describes the nonlinear character of human perception of sound frequency, and its relation to frequency can be expressed as

$$\mathrm{Mel}(f) = 2595\,\log_{10}\!\left(1 + \frac{f}{700}\right),$$

where f represents the speech frequency in Hz.
The basic flow for extracting MFCC feature vectors is: input continuous speech, pre-emphasis, framing, windowing, FFT, Mel filter bank, logarithm, and dynamic differential parameter extraction. The purpose of pre-emphasis is to boost the high-frequency part so that the spectrum of the signal becomes flat and the same signal-to-noise ratio can be used over the whole band from low to high frequency. At the same time, it compensates for the high-frequency components of the speech signal suppressed by the vocal system during phonation, eliminating the effects of the vocal cords and lips and highlighting the high-frequency formants. N sampling points are grouped into one observation unit, called a frame; typically N is 256 or 512, covering about 20-30 ms. Each frame is multiplied by a Hamming window to increase the continuity at its left and right ends. Each frame is then converted into an energy distribution in the frequency domain, from which speech features under different energy distributions are extracted.
In this embodiment, N = 512 is used, one frame lasts 25 ms and the sliding window 3 s; 20-dimensional MFCC features are extracted for each frame, non-speech frames are filtered out by an energy-based VAD, each sliding window reads in a 120-dimensional input speech feature vector, and the output dimension is 512.
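The extraction pipeline above can be sketched in Python with librosa. This is an illustrative sketch only: the 16 kHz sampling rate and 10 ms hop are assumptions not stated in the patent, while the 512-point FFT, 25 ms Hamming-windowed frame, pre-emphasis and 20 coefficients follow the embodiment.

```python
import librosa
import numpy as np

def extract_mfcc_features(wav_path, sr=16000):
    """20-dimensional MFCCs per 25 ms Hamming-windowed frame with a
    512-point FFT, as in this embodiment (N = 512, 20 coefficients)."""
    y, _ = librosa.load(wav_path, sr=sr)
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])        # pre-emphasis boosts the high band
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20,
                                n_fft=512,
                                hop_length=160,       # 10 ms hop (assumed)
                                win_length=400,       # 25 ms frame at 16 kHz
                                window="hamming")
    return mfcc.T                                     # shape: (frames, 20)
```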
Each NIN hidden layer is composed of several micro-network modules whose parameters are shared, reducing the number of network parameters the model must train; the hidden layers are connected through ReLU nonlinear activation functions.
In this embodiment, the feature vectors at the current time t and at neighbouring times are continuously fed into the input data layer, which holds the data of the five windows {t-2, t-1, t, t+1, t+2}; each window's data has input dimension 120 and output dimension 512. The window data {t-2, t, t+2} and {t-3, t, t+3} are spliced together as the inputs of the first and second hidden layers respectively, giving input dimension 1536 and output dimension 512; the inputs of the third and fourth NIN hidden layers are the current-window {t} data, with output dimensions 512 and 1500 respectively.
The statistics pooling layer receives the output of the final frame-level layer as input, aggregates the data over the input segment of duration T = 30 s, and computes its mean and standard deviation. These statistics are 1500-dimensional vectors, computed once per input segment. They are then passed to two further hidden layers, yielding embeddings of dimension 512 and 300 respectively, and finally through the SoftMax output layer. After training is complete the SoftMax output layer is no longer needed; excluding it, the network contains about 4.2 million parameters.
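A minimal PyTorch sketch of the architecture just described is given below. It is an illustrative approximation: the NIN micro-networks are simplified to 1-D convolutions over the frame axis, and the context-splicing scheme of the embodiment is only mimicked by the kernel sizes and dilations.

```python
import torch
import torch.nn as nn

class XVectorNet(nn.Module):
    """Frame-level layers -> statistics pooling (mean + std) ->
    two embedding layers (512, 300) -> SoftMax over K speakers."""
    def __init__(self, feat_dim=120, num_speakers=4400):
        super().__init__()
        self.frame_layers = nn.Sequential(            # stand-ins for the NIN layers
            nn.Conv1d(feat_dim, 512, kernel_size=5), nn.ReLU(),          # ~{t-2..t+2}
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),   # ~{t-3,t,t+3}
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        self.embed_a = nn.Linear(3000, 512)           # mean+std of 1500 dims each
        self.embed_b = nn.Linear(512, 300)
        self.out = nn.Linear(300, num_speakers)       # dropped after training

    def forward(self, x):                             # x: (batch, frames, feat_dim)
        h = self.frame_layers(x.transpose(1, 2))      # (batch, 1500, frames')
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        a = torch.relu(self.embed_a(stats))           # 512-dim embedding
        b = torch.relu(self.embed_b(a))               # 300-dim embedding
        return self.out(b)                            # speaker logits
```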
The model is trained with a multi-class cross-entropy function, classifying speakers from variable-length speech segments in the dataset. Assume there are K speakers and N training speech segments. Let $P(\mathrm{spkr}_k \mid x_{1:T}^{(n)})$ denote the probability that the n-th speech segment $x_{1:T}^{(n)}$, spanning the period T, belongs to the k-th speaker $\mathrm{spkr}_k$, and let $d_{nk}$ be an indicator function taking the value 1 if the n-th speech segment belongs to the k-th speaker and 0 otherwise. The classification objective function E is then

$$E = -\sum_{n=1}^{N}\sum_{k=1}^{K} d_{nk}\,\ln P\!\left(\mathrm{spkr}_k \mid x_{1:T}^{(n)}\right).$$
This objective function is optimized with stochastic gradient descent (SGD), the minibatch size being set to 64 and the initial learning rate to 0.008. Specifically, the 64,000 samples of the sample set are divided into 1,000 subsets of 64 samples each; the subsets are traversed cyclically, one gradient-descent parameter update is performed per subset, and traversing all the minibatches thus amounts to 1,000 gradient-descent iterations.
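The training loop below is a minimal PyTorch sketch of this procedure, using the stated minibatch size of 64 and learning rate of 0.008; the dataset object (yielding fixed-length feature tensors and integer speaker labels) is a hypothetical stand-in.

```python
import torch
from torch.utils.data import DataLoader

def train_xvector(model, dataset, epochs=1):
    """Multi-class cross-entropy training with SGD; one parameter
    update per 64-sample minibatch, as described above."""
    loader = DataLoader(dataset, batch_size=64, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.008)
    loss_fn = torch.nn.CrossEntropyLoss()       # implements E = -sum d_nk ln p_nk
    for _ in range(epochs):
        for feats, speaker_ids in loader:       # feats: (64, frames, 120)
            optimizer.zero_grad()
            loss = loss_fn(model(feats), speaker_ids)
            loss.backward()
            optimizer.step()
```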
In voiceprint recognition a large part of the model error comes from channel differences between speech segments, so channel compensation is required. In this embodiment, after the DNN-based network model has been trained and its SoftMax layer removed, the output embedding vector is the feature vector of the corresponding speaker's voice; a PLDA model is then connected at the back end to perform channel compensation on the speech segments.
The judging module 120 is configured to judge, based on the real-time voice features of the speaker and using a preset judging rule, whether pre-stored voice features of the speaker exist in a pre-established voiceprint library, and, when the pre-stored voice features of the speaker exist in the voiceprint library, to read the pre-stored voice features of the speaker and the tag corresponding to the speaker from the voiceprint library.
In this embodiment, based on the real-time voice features of the speaker, a preset judging rule is used to determine whether the speaker's voice features exist in the pre-established voiceprint library; when they exist, the speaker's voice features and the corresponding tag are extracted. The pre-established voiceprint library may contain the audio data, tags and feature vectors (generated by the x-vector network) of conference reports and lectures by company members, particularly company leaders. A feature vector is generated, using the voiceprint recognition model, from the voice of a speaker detected during the conference recording and is then score-compared with each voiceprint feature in the voiceprint library to judge whether the detected voice exists in the library.
Further, a first similarity value between the real-time voice features of the speaker and each pre-stored voice feature in the voiceprint library is calculated using a first preset calculation rule, and when a first similarity value is greater than or equal to the preset threshold, the pre-stored voice features of the speaker and the tag corresponding to the speaker are read from the voiceprint library.
Specifically, a PLDA model may be used to compare the similarity between the speaker's features and the voice features in the voiceprint library. PLDA is a model with four variables; the j-th voice of the i-th speaker can be represented as

$$x_{ij} = \mu + F h_i + G w_{ij} + \xi_{ij},$$

where μ is the mean of the training data, the matrix F represents the speaker subspace, G represents the scene subspace, the vectors $h_i$ and $w_{ij}$ are the corresponding subspace factors, each following a standard Gaussian distribution, and $\xi_{ij}$ is the residual. The first two terms relate only to the speaker, not to any specific utterance of the speaker; they are called the signal part and describe the differences between speakers. The last two terms describe the differences between different scenes for the same speaker and form the noise part. $h_i$ can be regarded as the feature representation of $x_{ij}$ in speaker space: in the scoring stage, the more alike the $h_i$ features of two voices, the greater the probability that the two voices belong to the same speaker. The log-likelihood ratio may be used for scoring:

$$\mathrm{score} = \ln \frac{p(\eta_1, \eta_2 \mid \mathcal{H}_s)}{p(\eta_1 \mid \mathcal{H}_d)\,p(\eta_2 \mid \mathcal{H}_d)},$$

where $\mathcal{H}_s$ and $\mathcal{H}_d$ denote the same-speaker and different-speaker hypothesis spaces, $\eta_1$ and $\eta_2$ are the two voice features, and p is the probability that the two voices come from the same feature space. The larger the score, the higher the similarity and the greater the probability that the voices belong to the same person.
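The scoring rule above can be sketched as follows. This is a simplified two-covariance form that, for brevity, folds the scene subspace G and the residual into a single within-class covariance `sigma`; that simplification is an assumption, not the patent's exact formulation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def plda_llr_score(eta1, eta2, F, sigma):
    """Log-likelihood ratio of H_s (same speaker) versus H_d (different
    speakers) for two mean-subtracted embeddings eta1, eta2.  F is the
    speaker subspace; sigma the within-class covariance (from training)."""
    ac = F @ F.T                                # across-class (speaker) covariance
    tot = ac + sigma                            # total covariance of one embedding
    joint = np.block([[tot, ac], [ac, tot]])    # joint covariance under H_s
    pair = np.concatenate([eta1, eta2])
    log_same = multivariate_normal.logpdf(pair, cov=joint)
    log_diff = (multivariate_normal.logpdf(eta1, cov=tot) +
                multivariate_normal.logpdf(eta2, cov=tot))
    return log_same - log_diff                  # larger score: more likely same person
```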
In one embodiment, when all the first similarity values are smaller than a preset threshold, real-time voice information, tags and voiceprint features of the speaker are stored in the voiceprint library.
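Combining the threshold comparison with this enrolment-on-miss behaviour gives a small sketch like the following, where the tag-naming scheme and the `score_fn` parameter are hypothetical placeholders.

```python
def match_or_enroll(embedding, voiceprint_db, score_fn, threshold):
    """Judging step: score the live embedding against every stored
    voiceprint; return the matching tag if the best first-similarity
    value reaches the threshold, otherwise enrol a new speaker."""
    if voiceprint_db:
        scores = {tag: score_fn(embedding, vec)
                  for tag, vec in voiceprint_db.items()}
        best_tag = max(scores, key=scores.get)
        if scores[best_tag] >= threshold:
            return best_tag
    new_tag = f"speaker_{len(voiceprint_db) + 1}"   # hypothetical tag scheme
    voiceprint_db[new_tag] = embedding              # all scores below threshold
    return new_tag
```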
The executing module 130 is configured to obtain, based on a pre-established mapping relation between each microphone and each speaker tag, the microphone corresponding to the real-time voice features of the speaker, to detect in real time whether crosstalk occurs among the plurality of microphones, and, when crosstalk occurs at any one of the plurality of microphones, to execute a crosstalk prevention processing operation on the microphones with crosstalk.
In this embodiment, the speaker's voiceprint features and the corresponding tag are obtained by comparing the voice detected during the conference recording with the voiceprint library. A mapping relation between microphones and speaker tags is established in advance in the conference real recording system; based on this mapping, a unique link between a speaker's voiceprint features and a microphone is established through the speaker tag. Whether crosstalk exists among the conference microphones is detected in real time; when crosstalk occurs, the microphone corresponding to the speaker can be determined simply by extracting the speaker's voiceprint features, and crosstalk prevention operations such as muting or sensitivity reduction are performed on the other microphones where crosstalk occurs.
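In code, this step reduces to keeping the speaker's mapped microphone and silencing the rest; `mute` below is a hypothetical device-control hook, not an API named in the patent.

```python
def resolve_crosstalk(responding_mics, speaker_tag, mic_of_tag, mute):
    """Executing step: keep the microphone mapped to the identified
    speaker and apply crosstalk prevention (here, muting) to every
    other responding microphone."""
    own_mic = mic_of_tag[speaker_tag]
    for mic in responding_mics:
        if mic != own_mic:
            mute(mic)        # could instead lower the mic's sensitivity
```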
Although a speaker's voiceprint features are unique, one speaker tag may be associated with several microphones. When microphone crosstalk occurs, especially when the intensity characteristics of the incoming audio streams are similar across microphones, the microphone where the crosstalk occurs cannot be determined from the mapping between speaker tags and microphones alone, so automatic crosstalk prevention cannot be performed on that basis.
Further, the real-time voice information of the speaker is converted into text information in real time based on a preset conversion rule, the number of responding microphones is determined based on the speaker tags corresponding to the converted text information, and when the number of microphones is smaller than the preset value, the crosstalk prevention processing operation is not executed.
The conference real recording system can convert the speaker's voice into text in real time using ASR technology and detect microphone crosstalk from the converted text. First, the number of responding microphones is determined from the speaker tags corresponding to the converted text; when only one microphone, or none, responds, no crosstalk prevention is needed.
In one embodiment, when the number of microphones is greater than the preset value, second similarity values between the text information corresponding to the respective microphones are calculated using a second preset calculation rule; when the second similarity values are all greater than or equal to the second preset threshold, the crosstalk prevention processing operation is executed on the microphones with crosstalk; otherwise, it is not executed.
When two or more microphones respond, it is judged whether their output text content is consistent. If not, two or more speakers are using their microphones, which is not crosstalk, and no crosstalk prevention processing is needed. If the output text content is consistent, microphone crosstalk has occurred and crosstalk prevention processing is required. Judging whether the output text content is consistent may comprise checking whether the Jaccard similarity of the output texts corresponding to different speaker tags exceeds the second preset threshold of 0.9. During crosstalk processing, the microphone used by the speaker must be distinguished from the microphones picking up crosstalk; at this point, the speaker's voice can be detected, the x-vector feature vector extracted, the microphone matched to the speaker determined with this feature vector, and the unmatched microphones turned off.
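A sketch of the consistency test follows. Token-level Jaccard over whitespace-split words is an assumption for illustration (for Chinese transcripts one would first segment words or use character n-grams), while the 0.9 threshold is taken from the embodiment.

```python
def jaccard(text_a, text_b):
    """Second similarity value between the ASR outputs of two microphones."""
    a, b = set(text_a.split()), set(text_b.split())
    return len(a & b) / len(a | b) if (a | b) else 0.0

def is_crosstalk(texts, threshold=0.9):
    """Crosstalk is flagged only when every pairwise similarity between
    the responding microphones' texts reaches the threshold."""
    pairs = [(i, j) for i in range(len(texts)) for j in range(i + 1, len(texts))]
    return bool(pairs) and all(jaccard(texts[i], texts[j]) >= threshold
                               for i, j in pairs)
```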
Referring to fig. 3, a flow chart of a crosstalk prevention method based on a real-meeting recording system according to a preferred embodiment of the present invention is shown.
Step S10, obtaining the voice information of the speaker in real time, and inputting the voice information into a pre-trained voiceprint recognition model to obtain the real-time voice characteristics of the speaker.
In this embodiment, the real-time voice information of a speaker in the conference may be acquired in real time by a sound collecting device, such as a terminal device with a recording function (e.g., a microphone) or a video recording device (e.g., a digital video camera). The audio format of the voice information may be mp3, wma, wav, etc. Specifically, when the speaker at one side of the terminal device starts speaking, the terminal device collects the voice information through its sound collecting device. In addition, voice endpoint detection can be used to distinguish speech signals from non-speech signals in the speaker's voice, remove invalid speech fragments and noise, and determine the start and end points of each valid speech segment, improving the subsequent matching accuracy between the voice and the offline database. After the speaker's voice information is acquired, it is input into a pre-trained voiceprint recognition model to obtain the voiceprint features of the voice information.
The training step of the voiceprint recognition model comprises the following steps:
A predetermined number of voice recordings are obtained from a predetermined voice database (e.g., the NIST SREs), for example approximately 64,000 recordings from 4,400 speakers collected between 2004 and 2010, together with conference reports and lecture audio of company members. The deep neural network of the x-vector model is trained with the acquired voice data, so that network parameters are learned that correctly distinguish the voiceprints of different speakers in the training set and can effectively recognize the voiceprint features of speakers outside the training set. Usable deep neural network models include, but are not limited to, feedforward DNN, CNN, LSTM and Transformer.
In this embodiment, the deep neural network model is described taking a feedforward DNN as an example. It comprises a Mel-frequency cepstral coefficient (MFCC) feature input layer, four network-in-network (NIN) hidden layers operating at the frame level, a statistics pooling layer, two embedding layers, and a final SoftMax output layer.
The input data of the input layer are processed MFCC feature vectors. MFCCs are cepstral parameters extracted in the Mel-scale frequency domain; the Mel scale describes the nonlinear character of human perception of sound frequency, and its relation to frequency can be expressed as

$$\mathrm{Mel}(f) = 2595\,\log_{10}\!\left(1 + \frac{f}{700}\right),$$

where f represents the speech frequency in Hz.
The basic flow for extracting MFCC feature vectors is: input continuous speech, pre-emphasis, framing, windowing, FFT, Mel filter bank, logarithm, and dynamic differential parameter extraction. The purpose of pre-emphasis is to boost the high-frequency part so that the spectrum of the signal becomes flat and the same signal-to-noise ratio can be used over the whole band from low to high frequency. At the same time, it compensates for the high-frequency components of the speech signal suppressed by the vocal system during phonation, eliminating the effects of the vocal cords and lips and highlighting the high-frequency formants. N sampling points are grouped into one observation unit, called a frame; typically N is 256 or 512, covering about 20-30 ms. Each frame is multiplied by a Hamming window to increase the continuity at its left and right ends. Each frame is then converted into an energy distribution in the frequency domain, from which speech features under different energy distributions are extracted.
In this embodiment, N = 512 is used, one frame lasts 25 ms and the sliding window 3 s; 20-dimensional MFCC features are extracted for each frame, non-speech frames are filtered out by an energy-based VAD, each sliding window reads in a 120-dimensional input speech feature vector, and the output dimension is 512.
Each NIN hidden layer is composed of several micro-network modules whose parameters are shared, reducing the number of network parameters the model must train; the hidden layers are connected through ReLU nonlinear activation functions.
In this embodiment, the feature vectors at the current time t and at neighbouring times are continuously fed into the input data layer, which holds the data of the five windows {t-2, t-1, t, t+1, t+2}; each window's data has input dimension 120 and output dimension 512. The window data {t-2, t, t+2} and {t-3, t, t+3} are spliced together as the inputs of the first and second hidden layers respectively, giving input dimension 1536 and output dimension 512; the inputs of the third and fourth NIN hidden layers are the current-window {t} data, with output dimensions 512 and 1500 respectively.
The statistics pooling layer receives the output of the final frame-level layer as input, aggregates the data over the input segment of duration T = 30 s, and computes its mean and standard deviation. These statistics are 1500-dimensional vectors, computed once per input segment. They are then passed to two further hidden layers, yielding embeddings of dimension 512 and 300 respectively, and finally through the SoftMax output layer. After training is complete the SoftMax output layer is no longer needed; excluding it, the network contains about 4.2 million parameters.
The model is trained with a multi-class cross-entropy function, classifying speakers from variable-length speech segments in the dataset. Assume there are K speakers and N training speech segments. Let $P(\mathrm{spkr}_k \mid x_{1:T}^{(n)})$ denote the probability that the n-th speech segment $x_{1:T}^{(n)}$, spanning the period T, belongs to the k-th speaker $\mathrm{spkr}_k$, and let $d_{nk}$ be an indicator function taking the value 1 if the n-th speech segment belongs to the k-th speaker and 0 otherwise. The classification objective function E is then

$$E = -\sum_{n=1}^{N}\sum_{k=1}^{K} d_{nk}\,\ln P\!\left(\mathrm{spkr}_k \mid x_{1:T}^{(n)}\right).$$
This objective function is optimized with stochastic gradient descent (SGD), the minibatch size being set to 64 and the initial learning rate to 0.008. Specifically, the 64,000 samples of the sample set are divided into 1,000 subsets of 64 samples each; the subsets are traversed cyclically, one gradient-descent parameter update is performed per subset, and traversing all the minibatches thus amounts to 1,000 gradient-descent iterations.
In voiceprint recognition a large part of the model error comes from channel differences between speech segments, so channel compensation is required. In this embodiment, after the DNN-based network model has been trained and its SoftMax layer removed, the output embedding vector is the feature vector of the corresponding speaker's voice; a PLDA model is then connected at the back end to perform channel compensation on the speech segments.
Step S20, judging, based on the real-time voice features of the speaker and using a preset judging rule, whether pre-stored voice features of the speaker exist in a pre-established voiceprint library, and reading, when the pre-stored voice features of the speaker exist in the voiceprint library, the pre-stored voice features of the speaker and the tag corresponding to the speaker from the voiceprint library.
In this embodiment, based on the real-time voice features of the speaker, a preset judging rule is used to determine whether the speaker's voice features exist in the pre-established voiceprint library; when they exist, the speaker's voice features and the corresponding tag are extracted. The pre-established voiceprint library may contain the audio data, tags and feature vectors (generated by the x-vector network) of conference reports and lectures by company members, particularly company leaders. A feature vector is generated, using the voiceprint recognition model, from the voice of a speaker detected during the conference recording and is then score-compared with each voiceprint feature in the voiceprint library to judge whether the detected voice exists in the library.
Further, a first similarity value between the real-time voice features of the speaker and each pre-stored voice feature in the voiceprint library is calculated using a first preset calculation rule, and when a first similarity value is greater than or equal to the preset threshold, the pre-stored voice features of the speaker and the tag corresponding to the speaker are read from the voiceprint library.
Specifically, a PLDA model may be used to compare the similarity between the speaker's features and the voice features in the voiceprint library. PLDA is a model with four variables; the j-th voice of the i-th speaker can be represented as

$$x_{ij} = \mu + F h_i + G w_{ij} + \xi_{ij},$$

where μ is the mean of the training data, the matrix F represents the speaker subspace, G represents the scene subspace, the vectors $h_i$ and $w_{ij}$ are the corresponding subspace factors, each following a standard Gaussian distribution, and $\xi_{ij}$ is the residual. The first two terms relate only to the speaker, not to any specific utterance of the speaker; they are called the signal part and describe the differences between speakers. The last two terms describe the differences between different scenes for the same speaker and form the noise part. $h_i$ can be regarded as the feature representation of $x_{ij}$ in speaker space: in the scoring stage, the more alike the $h_i$ features of two voices, the greater the probability that the two voices belong to the same speaker. The log-likelihood ratio may be used for scoring:

$$\mathrm{score} = \ln \frac{p(\eta_1, \eta_2 \mid \mathcal{H}_s)}{p(\eta_1 \mid \mathcal{H}_d)\,p(\eta_2 \mid \mathcal{H}_d)},$$

where $\mathcal{H}_s$ and $\mathcal{H}_d$ denote the same-speaker and different-speaker hypothesis spaces, $\eta_1$ and $\eta_2$ are the two voice features, and p is the probability that the two voices come from the same feature space. The larger the score, the higher the similarity and the greater the probability that the voices belong to the same person.
In one embodiment, when all the first similarity values are smaller than the preset threshold, the real-time voice information, tag and voiceprint features of the speaker are stored into the voiceprint library.
Step S30, obtaining, based on the pre-established mapping relation between each microphone and each speaker tag, the microphone corresponding to the real-time voice features of the speaker, detecting in real time whether crosstalk occurs among the plurality of microphones, and executing a crosstalk prevention processing operation on the microphones with crosstalk when crosstalk occurs at any one of the plurality of microphones.
In this embodiment, the speaker's voiceprint features and the corresponding tag are obtained by comparing the voice detected during the conference recording with the voiceprint library. A mapping relation between microphones and speaker tags is established in advance in the conference real recording system; based on this mapping, a unique link between a speaker's voiceprint features and a microphone is established through the speaker tag. Whether crosstalk exists among the conference microphones is detected in real time; when crosstalk occurs, the microphone corresponding to the speaker can be determined simply by extracting the speaker's voiceprint features, and crosstalk prevention operations such as muting or sensitivity reduction are performed on the other microphones where crosstalk occurs.
Although a speaker's voiceprint features are unique, one speaker tag may be associated with several microphones. When microphone crosstalk occurs, especially when the intensity characteristics of the incoming audio streams are similar across microphones, the microphone where the crosstalk occurs cannot be determined from the mapping between speaker tags and microphones alone, so automatic crosstalk prevention cannot be performed on that basis.
Further, the real-time voice information of the speaker is converted into text information in real time based on a preset conversion rule, the number of responding microphones is determined based on the speaker tags corresponding to the converted text information, and when the number of microphones is smaller than the preset value, the crosstalk prevention processing operation is not executed.
The conference real recording system can convert the speaker's voice into text in real time using ASR technology and detect microphone crosstalk from the converted text. First, the number of responding microphones is determined from the speaker tags corresponding to the converted text; when only one microphone, or none, responds, no crosstalk prevention is needed.
In one embodiment, when the number of microphones is greater than the preset value, second similarity values between the text information corresponding to the respective microphones are calculated using a second preset calculation rule; when the second similarity values are all greater than or equal to the second preset threshold, the crosstalk prevention processing operation is executed on the microphones with crosstalk; otherwise, it is not executed.
When two or more microphones respond, it is judged whether their output text content is consistent. If not, two or more speakers are using their microphones, which is not crosstalk, and no crosstalk prevention processing is needed. If the output text content is consistent, microphone crosstalk has occurred and crosstalk prevention processing is required. Judging whether the output text content is consistent may comprise checking whether the Jaccard similarity of the output texts corresponding to different speaker tags exceeds the second preset threshold of 0.9. During crosstalk processing, the microphone used by the speaker must be distinguished from the microphones picking up crosstalk; at this point, the speaker's voice can be detected, the x-vector feature vector extracted, the microphone matched to the speaker determined with this feature vector, and the unmatched microphones turned off.
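When crosstalk is confirmed, the final disambiguation just described (matching each responding microphone's audio to the stored voiceprints and keeping only the matching microphone) can be sketched as follows. Cosine similarity stands in for the PLDA score here, and `embed` and `mute` are hypothetical wrappers around the trained model and the device controls.

```python
import numpy as np

def keep_matching_microphone(mic_audio, voiceprint_db, embed, mute):
    """Extract the x-vector of each responding microphone's audio, keep
    the microphone whose embedding best matches a stored voiceprint,
    and turn the others off."""
    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    scores = {}
    for mic, audio in mic_audio.items():
        vec = embed(audio)                       # x-vector of this mic's audio
        scores[mic] = max(cosine(vec, ref) for ref in voiceprint_db.values())
    keep = max(scores, key=scores.get)           # best-matching microphone
    for mic in mic_audio:
        if mic != keep:
            mute(mic)
    return keep
```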
Furthermore, the present invention also proposes a computer-readable storage medium, which may be any one or any combination of several of a hard disk, a multimedia card, an SD card, a flash memory card, an SMC, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory, and the like. The computer readable storage medium includes a crosstalk prevention program 10 based on a real meeting recording system, and the crosstalk prevention program 10 based on the real meeting recording system realizes the following operations when being executed by a processor:
The acquisition step: acquiring voice information of a speaker in real time, and inputting the voice information into a pre-trained voiceprint recognition model to obtain a real-time voice feature of the speaker;
The judging step: judging, based on the real-time voice feature of the speaker and using a preset judging rule, whether a pre-stored voice feature of the speaker exists in a pre-established voiceprint library, and when the pre-stored voice feature of the speaker exists in the voiceprint library, reading the pre-stored voice feature of the speaker and the tag corresponding to the speaker from the voiceprint library (a sketch of this library lookup follows these steps); and
The processing step: obtaining, based on the pre-established mapping relationship between each microphone and each speaker tag, the microphone corresponding to the real-time voice feature of the speaker, detecting in real time whether crosstalk occurs among the plurality of microphones, and when crosstalk occurs at any of the microphones, performing a crosstalk prevention processing operation on the crosstalking microphone.
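As flagged in the judging step above, the lookup against the voiceprint library might take the following shape; the cosine-similarity scoring and the 0.8 threshold are assumptions, since the "preset judging rule" and "preset threshold" are not given numerically here.

```python
from typing import Optional

import numpy as np

FIRST_THRESHOLD = 0.8  # assumed value for the preset threshold

def lookup_speaker(realtime_feat: np.ndarray,
                   voiceprint_library: dict[str, np.ndarray]) -> Optional[str]:
    """Return the tag of the best-matching enrolled speaker, or None when
    every first similarity value falls below the preset threshold (in which
    case the new voiceprint would be enrolled into the library)."""
    def cos(u: np.ndarray, v: np.ndarray) -> float:
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    best_tag, best_score = None, -1.0
    for tag, stored_feat in voiceprint_library.items():
        score = cos(realtime_feat, stored_feat)
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag if best_score >= FIRST_THRESHOLD else None
```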
The embodiment of the computer-readable storage medium of the present invention is substantially the same as the embodiment of the crosstalk prevention method based on the conference real recording system and is not repeated here.
It should be noted that the foregoing serial numbers of the embodiments of the present invention are merely for description and do not indicate the relative merits of the embodiments. The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, apparatus, article, or method. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, apparatus, article, or method that comprises that element.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by software together with a necessary general-purpose hardware platform, or alternatively by hardware, the former being the preferred implementation in many cases. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) as described above, including instructions for causing a terminal device (which may be a mobile phone, a computer, an electronic device, a network device, or the like) to perform the methods of the embodiments of the present invention.
The foregoing description covers only preferred embodiments of the present invention and is not intended to limit the scope of the invention; any equivalent structure or equivalent process transformation made using the contents of this description, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present invention.
Claims (6)
1. A crosstalk prevention method based on a conference real recording system, applied to an electronic device, characterized in that the method comprises the following steps:
The acquisition step: acquiring voice information of a speaker in a conference in real time, and inputting the voice information into a pre-trained voiceprint recognition model to obtain a real-time voice feature of the speaker;
The judging step: judging, based on the real-time voice feature of the speaker and using a preset judging rule, whether a pre-stored voice feature of the speaker exists in a pre-established voiceprint library, and when the pre-stored voice feature of the speaker exists in the voiceprint library, reading the pre-stored voice feature of the speaker and the tag corresponding to the speaker from the voiceprint library; and
The processing step: obtaining, based on a pre-established mapping relationship between each microphone and each speaker tag, the microphone corresponding to the real-time voice feature of the speaker, and detecting in real time whether crosstalk occurs among a plurality of microphones in the conference, including: converting the real-time voice information of the speaker into text information based on a preset conversion rule, determining the speaker tags corresponding to the converted text information based on the mapping relationship so as to determine the number of responding microphones, judging that no crosstalk occurs and not performing a crosstalk prevention processing operation when the number of microphones is smaller than a preset value, or, when the number of microphones is greater than the preset value, calculating second similarity values between the text information corresponding to the respective microphones using a second preset calculation rule, judging that crosstalk occurs when the second similarity values are all greater than or equal to a second preset threshold and performing the crosstalk prevention processing operation on the crosstalking microphones, and otherwise not performing the crosstalk prevention processing operation.
2. The crosstalk prevention method based on a conference real recording system according to claim 1, wherein the judging step includes:
calculating, using a first preset calculation rule, a first similarity value between the real-time voice feature of the speaker and each pre-stored voice feature in the voiceprint library, and when a first similarity value is greater than or equal to a preset threshold, determining the pre-stored voice feature of the speaker and the tag corresponding to the speaker from the voiceprint library.
3. The crosstalk prevention method based on a conference real recording system according to claim 2, wherein the judging step further comprises:
when all the first similarity values are smaller than the preset threshold, storing the real-time voice information, the tag, and the voiceprint feature of the speaker into the voiceprint library.
4. An electronic device comprising a memory and a processor, characterized in that the memory stores a crosstalk prevention program based on a conference real recording system, and the crosstalk prevention program, when executed by the processor, implements the following steps:
The acquisition step: acquiring voice information of a speaker in a conference in real time, and inputting the voice information into a pre-trained voiceprint recognition model to obtain a real-time voice feature of the speaker;
The judging step: judging, based on the real-time voice feature of the speaker and using a preset judging rule, whether a pre-stored voice feature of the speaker exists in a pre-established voiceprint library, and when the pre-stored voice feature of the speaker exists in the voiceprint library, reading the pre-stored voice feature of the speaker and the tag corresponding to the speaker from the voiceprint library; and
The processing step: obtaining, based on a pre-established mapping relationship between each microphone and each speaker tag, the microphone corresponding to the real-time voice feature of the speaker, and detecting in real time whether crosstalk occurs among a plurality of microphones in the conference, including: converting the real-time voice information of the speaker into text information based on a preset conversion rule, determining the speaker tags corresponding to the converted text information based on the mapping relationship so as to determine the number of responding microphones, judging that no crosstalk occurs and not performing a crosstalk prevention processing operation when the number of microphones is smaller than a preset value, or, when the number of microphones is greater than the preset value, calculating second similarity values between the text information corresponding to the respective microphones using a second preset calculation rule, judging that crosstalk occurs when the second similarity values are all greater than or equal to a second preset threshold and performing the crosstalk prevention processing operation on the crosstalking microphones, and otherwise not performing the crosstalk prevention processing operation.
5. The electronic device of claim 4, wherein the judging step comprises:
calculating, using a first preset calculation rule, a first similarity value between the real-time voice feature of the speaker and each pre-stored voice feature in the voiceprint library, and when a first similarity value is greater than or equal to a preset threshold, determining the pre-stored voice feature of the speaker and the tag corresponding to the speaker from the voiceprint library.
6. A computer-readable storage medium, characterized in that the computer-readable storage medium includes a crosstalk prevention program based on a conference real recording system, and when the crosstalk prevention program is executed by a processor, the steps of the crosstalk prevention method based on a conference real recording system according to any one of claims 1 to 3 are implemented.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202010235796.4A (CN111429919B) | 2020-03-30 | 2020-03-30 | Crosstalk prevention method based on conference real recording system, electronic device and storage medium
Publications (2)

Publication Number | Publication Date
---|---
CN111429919A | 2020-07-17
CN111429919B | 2023-05-02
Legal Events

Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant