WO2020011085A1 - Crosstalk data detection method and electronic device - Google Patents

Crosstalk data detection method and electronic device

Info

Publication number: WO2020011085A1 (PCT/CN2019/094530)
Authority: WIPO (PCT)
Prior art keywords: audio data, data block, time difference, segment, audio
Application number: PCT/CN2019/094530
Other languages: English (en), French (fr)
Inventors: Xu Yunfeng (许云峰), Yu Tao (余涛)
Original Assignee: Alibaba Group Holding Limited (阿里巴巴集团控股有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Alibaba Group Holding Limited
Priority to JP2021500297A (published as JP2021531685A)
Publication of WO2020011085A1
Priority to US17/111,341 (granted as US11551706B2)

Classifications

    • G10L25/51 — Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/60 — Speech or voice analysis techniques specially adapted for measuring the quality of voice signals
    • G10L19/008 — Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • G10L19/0204 — Speech or audio coding/decoding using spectral analysis with subband decomposition
    • G10L25/03 — Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/06 — Extracted parameters being correlation coefficients
    • G10L25/15 — Extracted parameters being formant information
    • G10L25/21 — Extracted parameters being power information
    • H04M3/002 — Applications of echo suppressors or cancellers in telephonic connections
    • H04M3/568 — Conference facilities: audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants
    • H04R29/005 — Monitoring and testing arrangements for microphone arrays
    • H04B3/487 — Line transmission systems: testing crosstalk effects
    • H04M3/34 — Telephonic exchanges: testing for cross-talk
    • H04R2227/003 — Digital public address [PA] systems using, e.g. LAN or internet
    • H04R2227/009 — Signal processing in PA systems to enhance speech intelligibility

Definitions

  • This specification relates to the field of computer technology, and particularly to a crosstalk data detection method and an electronic device.
  • A microphone can be used to amplify the sound source, and multiple microphones on site can collect audio data for each speaker.
  • Crosstalk may occur when two or more microphones are in close proximity.
  • Embodiments of the present specification provide a crosstalk data detection method and an electronic device that can detect crosstalk data.
  • An embodiment of the present specification provides a crosstalk data detection method, including: receiving a first audio data block and a second audio data block, where the first audio data block and the second audio data block each include a plurality of audio data segments; calculating correlation coefficients between the audio data segments of the first audio data block and the audio data segments of the second audio data block to obtain a peak value of the correlation coefficients; using the time difference between the acquisition times of the audio data segment in the first audio data block and the audio data segment in the second audio data block corresponding to the peak as the reference time difference; using the time difference between the acquisition times of each audio data segment in the first audio data block and the corresponding audio data segment in the second audio data block as the audio segment time difference; and, when the audio segment time difference does not match the reference time difference, determining that the corresponding audio data segment of the first audio data block includes crosstalk data.
  • An embodiment of the present specification provides an electronic device including: a first sound sensing device for generating a first audio data block, the first audio data block including a plurality of audio data segments; a second sound sensing device for generating a second audio data block, the second audio data block including a plurality of audio data segments; and a processor configured to calculate correlation coefficients between the audio data segments of the first audio data block and the audio data segments of the second audio data block to obtain a peak value of the correlation coefficients, use the time difference between the acquisition times of the audio data segment in the first audio data block and the audio data segment in the second audio data block corresponding to the peak as the reference time difference, use the time difference between the acquisition times of each audio data segment of the first audio data block and the corresponding audio data segment in the second audio data block as the audio segment time difference, and, when the audio segment time difference does not match the reference time difference, determine that the corresponding audio data segment of the first audio data block includes crosstalk data.
  • An embodiment of the present specification provides a crosstalk data detection method, including: receiving a first audio data block and a second audio data block, where the first audio data block and the second audio data block each include a plurality of audio data segments; calculating correlation coefficients between the audio data segments of the first audio data block and the audio data segments of the second audio data block to obtain a peak value of the correlation coefficients; using the time difference between the acquisition times of the audio data segment in the first audio data block and the audio data segment in the second audio data block corresponding to the peak as the reference time difference; and sending the reference time difference, the first audio data block, and the second audio data block to a server, for the server to use the time difference between the acquisition times of each audio data segment of the first audio data block and the corresponding audio data segment in the second audio data block as the audio segment time difference and, when the audio segment time difference does not match the reference time difference, determine that the corresponding audio data segment of the first audio data block includes crosstalk data.
  • An embodiment of the present specification provides a method for detecting crosstalk data, including: receiving a first audio data block, a second audio data block, and a reference time difference, where the first audio data block and the second audio data block each include a plurality of audio data segments; using the time difference between the acquisition times of each audio data segment of the first audio data block and the corresponding audio data segment in the second audio data block as the audio segment time difference; and, when the audio segment time difference does not match the reference time difference, determining that the corresponding audio data segment of the first audio data block includes crosstalk data.
  • An embodiment of the present specification provides a crosstalk data detection method, including: receiving a first audio data block and a second audio data block, where the first audio data block and the second audio data block each include a plurality of audio data segments; calculating correlation coefficients between the audio data segments of the first audio data block and the audio data segments of the second audio data block to obtain a peak value of the correlation coefficients; and sending the peak, the first audio data block, and the second audio data block to a server, for the server to use the time difference between the acquisition times of the audio data segment in the first audio data block and the audio data segment in the second audio data block corresponding to the peak as the reference time difference, use the time difference between the acquisition times of each audio data segment of the first audio data block and the corresponding audio data segment in the second audio data block as the audio segment time difference, and, when the audio segment time difference does not match the reference time difference, determine that the corresponding audio data segment of the first audio data block includes crosstalk data.
  • An embodiment of the present specification provides a crosstalk data detection method, including: receiving a peak of correlation coefficients, a first audio data block, and a second audio data block provided by a client, where the peak is the peak value of the correlation coefficients between the audio data segments of the first audio data block and the audio data segments of the second audio data block; using the time difference between the acquisition times of the audio data segment in the first audio data block and the audio data segment in the second audio data block corresponding to the peak as the reference time difference; using the time difference between the acquisition times of each audio data segment of the first audio data block and the corresponding audio data segment in the second audio data block as the audio segment time difference; and, when the audio segment time difference does not match the reference time difference, determining that the corresponding audio data segment of the first audio data block includes crosstalk data.
  • An embodiment of the present specification provides a crosstalk data detection method, including: receiving a first audio data block and a second audio data block, where the first audio data block and the second audio data block each include a plurality of audio data segments; and sending the first audio data block and the second audio data block to a server, for the server to calculate correlation coefficients between the audio data segments of the first audio data block and the audio data segments of the second audio data block to obtain a peak value of the correlation coefficients, use the time difference between the acquisition times of the audio data segment in the first audio data block and the audio data segment in the second audio data block corresponding to the peak as the reference time difference, use the time difference between the acquisition times of each audio data segment of the first audio data block and the corresponding audio data segment in the second audio data block as the audio segment time difference, and, when the audio segment time difference does not match the reference time difference, determine that the corresponding audio data segment of the first audio data block includes crosstalk data.
  • FIG. 1 is a schematic diagram of a crosstalk data detection system provided by an embodiment of the present specification.
  • FIG. 2 is a schematic diagram of an application scenario of a crosstalk data detection system provided by an embodiment of the present specification in a debate competition scenario.
  • FIG. 3 is a schematic diagram of an audio data block transmission path provided by an embodiment of the present specification.
  • FIG. 4 is a schematic block diagram of a crosstalk data detection system provided by an embodiment of the present specification.
  • FIG. 5 is a schematic block diagram of a crosstalk data detection system provided by an embodiment of the present specification.
  • FIG. 6 is a schematic block diagram of a crosstalk data detection system provided by an embodiment of the present specification.
  • Please refer to FIG. 1 and FIG. 2.
  • In the debate competition scenario, four debaters sit on each of the pro and con sides at a long table on each side, and two microphones are placed on each long table to sense the debaters' voices; the sound sensed by the microphones is amplified by an amplifier.
  • an electronic device may be provided.
  • The electronic device may receive the audio data streams generated by microphone A and microphone B through a receiving module, and process the audio data streams to detect crosstalk data in them.
  • The electronic device receives the audio data stream generated from the sound sensed by microphone A; at the same time, microphone B can also generate an audio data stream according to the sound it senses.
  • The receiving module may have multiple data channels corresponding to the number of microphones: microphone A corresponds to data channel A, and microphone B corresponds to data channel B. In this scenario example, there may be a total of 8 microphones, and the electronic device may have 8 data channels. Further, the electronic device can receive the audio data stream input by each microphone into its data channel through Wi-Fi.
  • the receiving module may divide the audio data stream into audio data blocks. Specifically, the audio data stream in the data channel A may be divided into a first audio data block, and the audio data stream in the data channel B may be divided into a second audio data block.
  • The electronic device may take the audio data stream input from data channel A as the detection target, and detect whether crosstalk data exists in it according to the association between the audio data in data channel A and the audio data in data channel B.
  • the first audio data block and the second audio data block may be divided into several audio data segments in units of 1000 ms.
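  • For illustration, a minimal Python sketch of this segmentation step, assuming 16 kHz mono PCM samples held in a NumPy array (the sample rate and array layout are assumptions, not taken from this specification):

```python
import numpy as np

def split_into_segments(block: np.ndarray, sample_rate: int = 16000,
                        segment_ms: int = 1000) -> list:
    """Split an audio data block into fixed-length audio data segments (default 1000 ms)."""
    seg_len = sample_rate * segment_ms // 1000
    # Drop a trailing partial segment; padding would be an equally valid choice.
    n_segments = len(block) // seg_len
    return [block[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]

# Example: a 5.5-second block yields five one-second segments.
block = np.random.randn(16000 * 5 + 8000)
print(len(split_into_segments(block)))  # -> 5
```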
  • The coefficient calculation module of the electronic device may perform a Fourier transform on the first audio data block and the second audio data block respectively, and generate a cross-correlation function from the two Fourier-transformed audio data blocks.
  • the correlation between the audio data segment in the first audio data block and the audio data segment in the second audio data block may be calculated according to the cross-correlation function.
  • The correlation degree may be calculated separately between each audio data segment in the first audio data block and the audio data segments in the second audio data block, and the audio data segment in the second audio data block yielding the maximum calculated correlation degree is taken as the one corresponding to the audio data segment in the first audio data block. In this way, the maximum value can be regarded as the final correlation coefficient of the audio data segment in the first audio data block.
  • The correlation coefficients corresponding to the audio data segments in the audio data block can be obtained from the cross-correlation function calculation. Because two people may speak at the same time, the correlation coefficients corresponding to the audio data segments in the audio data block can have two peaks, 0.3 and 0.5. 0.3 may be determined as the first correlation coefficient, and 0.5 as the second correlation coefficient.
  • a threshold can be set, and the audio data segments are filtered according to the threshold to obtain valid data in the audio data block.
  • the threshold may be 0.1.
  • If a correlation coefficient is greater than 0.1, the similarity between the audio data segment in the first audio data block and the audio data segment in the second audio data block corresponding to that coefficient is considered high; the two audio data segments may originate from the same sound source, that is, they can be regarded as valid data.
  • If a correlation coefficient is less than 0.1, the similarity between the corresponding audio data segments in the first audio data block and the second audio data block is considered low, and the audio data segment in the first audio data block corresponding to that coefficient may be noise. In this scenario example, crosstalk detection may not be performed on audio data segments that are considered noise. Since the first correlation coefficient and the second correlation coefficient, 0.3 and 0.5, are both greater than 0.1, the audio data segments corresponding to them can be considered valid data.
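  • A minimal sketch of this validity filter, assuming each audio data segment's final correlation coefficient has already been computed; the 0.1 threshold is the one from this scenario, and the data structures are illustrative:

```python
VALID_THRESHOLD = 0.1  # threshold from the scenario above

def filter_valid_segments(correlations):
    """Return indices of segments whose correlation coefficient marks them as valid data."""
    return [i for i, c in enumerate(correlations) if c > VALID_THRESHOLD]

# Segments with coefficients 0.3 and 0.5 pass; 0.05 is treated as noise.
print(filter_valid_segments([0.3, 0.05, 0.5]))  # -> [0, 2]
```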
  • The coefficient calculation module may determine the audio data segment in the first audio data block corresponding to the first correlation coefficient as the first target audio data segment, and the audio data segment in the first audio data block corresponding to the second correlation coefficient as the second target audio data segment.
  • Correspondingly, the audio data segment in the second audio data block corresponding to the first correlation coefficient is determined as the first auxiliary audio data segment, and the audio data segment in the second audio data block corresponding to the second correlation coefficient as the second auxiliary audio data segment.
  • the time difference determination module of the electronic device may calculate a first time difference between the first target audio data segment and the first auxiliary audio data segment.
  • the first time difference may be 30 ms.
  • a second time difference between the second target audio data segment and the second auxiliary audio data segment is calculated.
  • the second time difference may be 60 ms.
  • the time difference determination module may determine the smaller one of the first time difference and the second time difference as the reference time difference and the other as the crosstalk time difference. That is, the reference time difference may be determined as 30 ms, and the crosstalk time difference may be determined as 60 ms.
  • The processing module of the electronic device determines the audio data segment in the second audio data block corresponding to each audio data segment in the first audio data block according to the correlation coefficients, and further calculates the audio segment time difference between each audio data segment in the first audio data block and the corresponding audio data segment in the second audio data block.
  • If the audio segment time difference corresponding to an audio data segment in the first audio data block is equal to 30 ms, that audio data segment is determined to be main audio data; if the audio segment time difference is equal to 60 ms, that audio data segment is crosstalk data.
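  • A sketch of the classification rule just described: the smaller peak time difference (30 ms) serves as the reference time difference, the larger (60 ms) as the crosstalk time difference, and each segment is labeled by which one its own audio segment time difference matches. Function and variable names are illustrative, not from this specification:

```python
def classify_segment(segment_time_diff_ms: float,
                     first_time_diff_ms: float = 30.0,
                     second_time_diff_ms: float = 60.0) -> str:
    """Label an audio data segment of the first block by its audio segment time difference."""
    reference = min(first_time_diff_ms, second_time_diff_ms)  # 30 ms in the scenario
    crosstalk = max(first_time_diff_ms, second_time_diff_ms)  # 60 ms in the scenario
    if segment_time_diff_ms == reference:
        return "main audio data"
    if segment_time_diff_ms == crosstalk:
        return "crosstalk data"
    return "unclassified"

print(classify_segment(30.0))  # -> main audio data
print(classify_segment(60.0))  # -> crosstalk data
```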
  • the electronic device may generate a first audio data block and a second audio data block according to the audio data streams input from the data channel A and the data channel B.
  • the electronic device may calculate a correlation coefficient between the audio data segment in the first audio data block and the audio data segment in the second audio data block.
  • The audio data segments in the first audio data block are filtered according to the correlation coefficients, and 150 audio data segments in the first audio data block are found to be valid data. Further, the electronic device obtains the correlation coefficients between the first audio data block and the second audio data block, which have a single peak of 0.4, and the time difference corresponding to this peak is 50 ms.
  • The electronic device calculates the smoothed energy of each audio data segment in the first audio data block and the second audio data block, and counts the audio data segments in the first audio data block whose smoothed energy is greater than that of the corresponding segment in the second audio data block; the statistical number is 5. A ratio threshold of 0.2 may be set: if the ratio of the statistical number to the quantity of valid data is greater than 0.2, the time difference corresponding to the peak of the correlation coefficient is determined to be the reference time difference; otherwise, it is determined to be the crosstalk time difference. Since the ratio of 5 to 150 is less than 0.2, the 50 ms time difference is determined to be the crosstalk time difference.
  • The electronic device then calculates the audio segment time difference corresponding to each audio data segment of the first audio data block, and determines that the corresponding audio data segment is crosstalk data when the calculated time difference is equal to 50 ms.
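  • A sketch of the single-peak decision in this scenario: count the valid audio data segments of the first block whose smoothed energy exceeds that of their counterparts in the second block, and compare the ratio of that count to the quantity of valid data against 0.2 (the figures 5, 150, and 0.2 come from the scenario; the smoothing itself is assumed to have been done already):

```python
def peak_time_diff_role(count_louder: int, n_valid: int,
                        ratio_threshold: float = 0.2) -> str:
    """Decide whether the peak's time difference is the reference or the crosstalk time difference."""
    if n_valid == 0:
        return "undetermined"
    return "reference" if count_louder / n_valid > ratio_threshold else "crosstalk"

# 5 of 150 valid segments are louder in the first block: 5/150 < 0.2,
# so the 50 ms peak time difference is taken as the crosstalk time difference.
print(peak_time_diff_role(5, 150))  # -> crosstalk
```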
  • the detected crosstalk data can be further removed, and the audio data block after the crosstalk removal is stored in a specified audio file to generate a clearer debate record.
  • the crosstalk data detection system may include a receiving module, a coefficient calculation module, a time difference determination module, and a processing module.
  • The crosstalk data detection system is described below in terms of its functional modules. The functional modules cooperate to implement the crosstalk data detection method, so the method can be understood with reference to the following functional modules and will not be described again separately.
  • the receiving module may receive a first audio data block and a second audio data block; wherein the first audio data block and the second audio data block include a plurality of audio data segments, respectively.
  • the receiving module may receive a first audio data block input from a first data channel and a second audio data block input from a second data channel.
  • the receiving module may be a receiving device or a communication module having a data interaction capability.
  • the receiving module may receive the first audio data block input by the first data channel and the second audio data block input by the second data channel in a wired manner.
  • The first audio data block input from the first data channel and the second audio data block input from the second data channel may also be received based on a network protocol such as HTTP, TCP/IP, or FTP, or through a wireless communication module such as a Wi-Fi module, ZigBee module, Bluetooth module, or Z-Wave module.
  • The receiving module may also be a software program interface, which can run in a processor with computing capabilities.
  • the receiving module may have multiple data channels corresponding to the number of sound sensing devices.
  • the sound sensing device may include a device capable of sensing sound to generate an audio data stream, and capable of inputting the audio data stream into a data channel.
  • the data channel may include a carrier for transmitting audio data blocks.
  • the data channel may be a physical channel or a logical channel.
  • the data channels may be different according to the transmission path of the audio data block. Specifically, for example, two microphones are provided, and a sound source can emit sound, which is sensed by the two microphones and generates an audio data stream, and a channel through which each microphone transmits the audio data stream may be referred to as a data channel.
  • The data channels can also be logically divided; that is, the audio data streams input by different microphones are processed separately rather than being mixed together.
  • the first audio data block may be generated according to an audio data stream in a first data channel.
  • the second audio data block may be generated according to an audio data stream in the second data channel.
  • the sound sensing device may generate a corresponding audio data stream according to the sensed sound.
  • the first audio data block and the second audio data block may correspond to different sound sensing devices. Since the spatial positions of the sound sensing devices may be different, the time of the audio data stream generated by different sound sensing devices sensing the sound emitted by the sound source may also be different.
  • The first audio data block and the second audio data block may each include multiple audio data segments.
  • the receiving module may divide the audio data stream of the first data channel and the audio data stream of the second data channel into data blocks according to a certain rule, and the divided data blocks may be the audio data blocks.
  • The audio data stream may be divided into audio data blocks according to duration or data amount. Specifically, for example, the audio data stream may be divided into one audio data block per 10 ms; of course, the audio data block is not limited to 10 ms. Alternatively, division can be made according to the amount of data, for example, each audio data block can be up to 1 MB. Alternatively, division is performed according to the continuity of the sound waveform represented by the audio data stream.
  • For example, endpoint detection may be performed: where a silent portion of a certain duration lies between two adjacent continuous waveforms, each continuous sound waveform is divided into one audio data block.
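  • A minimal sketch of the silence-based division just mentioned, using a simple mean-amplitude gate as a stand-in for real endpoint detection (the gate value and frame size are assumptions):

```python
import numpy as np

def split_on_silence(stream: np.ndarray, sample_rate: int = 16000,
                     frame_ms: int = 10, gate: float = 0.01) -> list:
    """Split an audio stream into blocks wherever a silent frame separates continuous waveforms."""
    frame_len = sample_rate * frame_ms // 1000
    blocks, current = [], []
    for start in range(0, len(stream) - frame_len + 1, frame_len):
        frame = stream[start:start + frame_len]
        if np.abs(frame).mean() > gate:   # voiced frame: extend the current block
            current.append(frame)
        elif current:                     # silent frame: close the open block
            blocks.append(np.concatenate(current))
            current = []
    if current:
        blocks.append(np.concatenate(current))
    return blocks
```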
  • the audio data block may include multiple audio data segments. Processing may be performed with the audio data segment as a basic unit.
  • the coefficient calculation module is configured to calculate a correlation coefficient between the audio data segment of the first audio data block and the audio data segment of the second audio data block to obtain a peak value of the correlation coefficient.
  • the correlation coefficient can be used to indicate the closeness of the relationship between audio data blocks.
  • The correlation coefficient can be used to indicate the degree of similarity between audio data blocks. The greater the value of the correlation coefficient, the more similar the audio data segments contained in the two audio data blocks; conversely, the smaller the value, the more different the audio data segments contained in the two audio data blocks.
  • The Fourier transform can be performed on the audio data segments in the audio data blocks according to the GCC-PHAT method (generalized cross-correlation with phase transform weighting).
  • a cross-correlation function may be generated according to the Fourier transformed audio data segment in the first audio data block and the audio data segment in the second audio data block to obtain the correlation coefficient.
  • The correlation coefficient may also be calculated by other methods, such as the basic cross-correlation method or the cross-power spectrum phase method.
  • Those skilled in the art may also adopt other modified schemes under the inspiration of the technical essence of this specification; as long as the functions and effects achieved are the same as or similar to those of this specification, they shall be covered within the protection scope of this application.
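  • For reference, a compact sketch of the GCC-PHAT computation for one pair of equal-length segments, following the standard textbook formulation (FFT of both segments, cross-power spectrum normalized by its magnitude, inverse FFT); the peak height can serve as the correlation coefficient and the peak lag as the time difference. This is generic GCC-PHAT, not code from this specification:

```python
import numpy as np

def gcc_phat(seg_a: np.ndarray, seg_b: np.ndarray, sample_rate: int = 16000):
    """Return (peak correlation, lag in seconds) between two equal-length segments."""
    n = 2 * len(seg_a)                      # zero-pad to avoid circular wrap-around
    A = np.fft.rfft(seg_a, n=n)
    B = np.fft.rfft(seg_b, n=n)
    cross = A * np.conj(B)
    cross /= np.abs(cross) + 1e-12          # PHAT weighting: keep phase, discard magnitude
    cc = np.fft.irfft(cross, n=n)
    cc = np.concatenate((cc[-(n // 2):], cc[:n // 2]))  # move zero lag to the center
    peak_idx = int(np.argmax(np.abs(cc)))
    lag_seconds = (peak_idx - n // 2) / sample_rate
    return float(np.abs(cc[peak_idx])), lag_seconds
```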
  • Two sound sources may emit sound, and the first sound sensing device and the second sound sensing device may each generate an audio data stream and input it into the corresponding first data channel and second data channel.
  • the time from when sound source A emits a sound to when the first sound sensing device senses the sound is time 1.
  • The time for the first sound sensing device, having sensed the sound emitted by sound source A, to input the audio data stream into the first data channel is time 2; the time from when sound source A emits a sound to when the second sound sensing device senses it is time 3; and the time for the second sound sensing device, having sensed the sound emitted by sound source A, to input the audio data stream into the second data channel is time 4.
  • The audio data streams formed by the sound from sound source A in the first data channel and the second data channel may be divided into audio data blocks including audio data segments, and the correlation coefficient may then be calculated by a method such as GCC-PHAT.
  • The time from when sound source B emits a sound to when the first sound sensing device senses it is time 5, and the time for the first sound sensing device to input the resulting audio data stream into the first data channel is time 2; the time from when sound source B emits a sound to when the second sound sensing device senses it is time 6, and the time for the second sound sensing device to input the resulting audio data stream into the second data channel is time 4.
  • Likewise, the audio data streams formed by the sound from sound source B in the first data channel and the second data channel can be divided into audio data blocks including audio data segments, and the correlation coefficient calculated by GCC-PHAT or other methods. Therefore, with two sound sources emitting sound in the space, two correlation coefficient peaks can be calculated.
  • Each sound sensing device may correspond to one user, which makes it possible to distinguish different users by their sound sensing devices. Furthermore, by processing the audio data stream input by each sound sensing device, an audio file corresponding to each user can finally be obtained, so that each audio file accurately represents that user's voice.
  • the time difference determining module may be configured to use the time difference between the acquisition time of the audio data segment in the first audio data block and the audio data segment in the second audio data block corresponding to the peak as the reference time difference.
  • the audio data segment in the first audio data block and the audio data segment in the second audio data block corresponding to the peak are most similar.
  • That is, of all segment pairs, they are the most likely to both include audio data originating from the same sound source.
  • the time difference of the audio data segment corresponding to the peak of the correlation coefficient can be used to characterize the time difference of the audio data of the first audio data block and the second audio data block originating from the same sound source.
  • the time difference may be used as a reference for determining whether the audio data in the first audio data block is crosstalk data. In this way, the time difference can be used as a reference time difference.
  • the acquisition time of the audio data segment may be the generation time of the audio data of the audio data segment in the audio sensing device, or the reception time of the audio data of the audio data segment received by the receiving module.
  • the distance between the first audio sensing terminal and the second audio sensing terminal is very close.
  • the first audio sensing terminal and the second audio sensing terminal respectively sense the voice of the user and generate audio data respectively. It may be assumed that a first audio sensing terminal senses the sound to generate a first audio data block, and a second audio sensing terminal senses the sound to generate a second audio data block.
  • The generation times of the first audio data block and the second audio data block are relatively close; however, because the distances between the user and the first audio sensing terminal and the second audio sensing terminal differ, the generation times are close but not identical.
  • A processing module is configured to use the time difference between the acquisition times of each audio data segment of the first audio data block and the corresponding audio data segment in the second audio data block as the audio segment time difference, and, when the audio segment time difference does not match the reference time difference, to determine that the corresponding audio data segment of the first audio data block includes crosstalk data.
  • the audio data segment in the first audio data block and the audio data segment in the second audio data block may be determined to correspond to each other according to whether they originate from the same sound source. Or, according to the above-mentioned introduction of the correlation coefficient, it is considered that the audio data segment in the first audio data block corresponding to the correlation coefficient corresponds to the audio data segment in the second audio data block.
  • an audio segment time difference between an audio data segment in a first audio data block and an audio data segment in a corresponding second audio data block may be calculated.
  • The audio segment time difference may be based on the time when the sound sensing device detects the sound wave and generates the corresponding audio data segment, or on the time when the sound sensing device inputs the audio data segment into the data channel.
  • the calculation method of the time difference may be the same as the calculation method of the reference time difference.
  • The mismatch may mean that the audio segment time difference is not equal to the reference time difference; alternatively, a second specified threshold may be set, and when the absolute value of the difference between the audio segment time difference and the reference time difference is greater than the second specified threshold, the audio segment time difference is determined not to match the reference time difference.
  • For example, if a second specified threshold of 0.002 is set, the audio segment time difference is 0.03, and the reference time difference is 0.035, then the absolute value of their difference, 0.005, is greater than the threshold, and the audio data segment can therefore be considered to include crosstalk data.
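  • A sketch of the mismatch test described above, using the 0.002 second specified threshold from the example (the example values are treated as seconds):

```python
def mismatches_reference(audio_seg_diff: float, reference_diff: float,
                         threshold: float = 0.002) -> bool:
    """True when the audio segment time difference does not match the reference time difference."""
    return abs(audio_seg_diff - reference_diff) > threshold

# 0.03 s vs. a 0.035 s reference differs by 0.005 s > 0.002 s,
# so the segment is considered to include crosstalk data.
print(mismatches_reference(0.03, 0.035))  # -> True
```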
  • different sound sources correspond to different sound sensing devices, and may have different time differences. See Figure 3.
  • The distance from sound source A to the first sound sensing device is shorter than its distance to the second sound sensing device, so time 1 is less than time 3. As a result, there is an audio segment time difference between the audio data segments originating from sound source A in the first data channel and the second data channel, for example, the difference between time 3 and time 1.
  • the audio segment time difference corresponds to the sound source A.
  • When the spatial positions of sound source A, the first sound sensing device, and the second sound sensing device are unchanged, the value of the audio segment time difference is also unchanged; the same applies to sound source B.
  • Among the audio data segments of the first data channel, a part may originate from sound source A and a part from sound source B; likewise, among the audio data segments of the second data channel, part may originate from sound source A and part from sound source B.
  • the audio segment time difference can be used to distinguish the sound source in the first data channel.
  • Taking, for example, sound source A as the target sound source of the first data channel, an audio data segment in the first data channel that originates from sound source B is crosstalk data; that is, crosstalk data is audio data originating from a sound source other than the target sound source.
  • When the audio segment time difference matches the reference time difference, the corresponding audio data segment in the first audio data block is derived from the sound source corresponding to the data channel where the first audio data block is located, and needs to be retained for further processing. When the audio segment time difference does not match the reference time difference, the corresponding audio data segment is considered not to be derived from that sound source and needs to be removed from the first audio data block.
  • Calculating the correlation coefficients between the audio data segments of the first audio data block and the audio data segments of the second audio data block to obtain the peak of the correlation coefficients may include: calculating the correlation coefficients between the audio data segments in the first audio data block and the audio data segments in the second audio data block to form a correlation coefficient set, and using the maximum value of the correlation coefficient set as the peak.
  • a correlation coefficient between an audio data segment of the first audio data block and an audio data segment of the second audio data block may be calculated to form a correlation coefficient set.
  • the peak of the correlation coefficient is selected from the correlation coefficient set.
  • the correlation between the audio data segment in the first audio data block and the audio data segment in the second audio data block may be calculated according to the cross-correlation function.
  • For each audio data segment in the first audio data block, the correlation degree may be calculated separately against the audio data segments in the second audio data block; the audio data segment in the second audio data block yielding the maximum calculated correlation degree is taken as the one corresponding to that audio data segment in the first audio data block, and the maximum value can be regarded as the correlation coefficient corresponding to the audio data segment in the first audio data block.
  • In this way, a correlation coefficient corresponding to each audio data segment in the first audio data block can be obtained, and the correlation coefficients establish the correspondence between audio data segments in the first audio data block and audio data segments in the second audio data block.
  • The peak may be the maximum value in the correlation coefficient set; alternatively, the correlation coefficients may be arranged in the order of their corresponding audio data segments so that they form a continuous distribution exhibiting crests and troughs, and the peaks may be the correlation coefficients represented by the crests.
  • the coefficient calculation module may calculate a peak of a correlation coefficient, and the number of the peaks may be two or more.
  • Accordingly, for the processing module, using the time difference between the acquisition times of the audio data segments in the first audio data block and in the second audio data block corresponding to the peaks as the reference time difference may include: calculating, for the two peaks respectively, the time differences between the acquisition times of the corresponding audio data segments in the first audio data block and in the second audio data block, namely a first time difference and a second time difference, and using the smaller of the first time difference and the second time difference as the reference time difference.
  • The two or more peaks may be peaks of the correlation coefficients of the first audio data block and the second audio data block obtained according to the cross-correlation function.
  • For example, a specified interval may be given in the correlation coefficient set, and two maxima within the specified interval may be used as the peaks.
  • Alternatively, a value in the correlation coefficient set may be taken as one peak, and another correlation coefficient that, after a certain data interval, is close to that value may be taken as another peak.
  • Two peaks may be selected from the correlation coefficient set, namely a first correlation coefficient and a second correlation coefficient. The audio data segments in the first audio data block corresponding to the first correlation coefficient and the second correlation coefficient are used as the first target audio data segment and the second target audio data segment, respectively, and the corresponding audio data segments in the second audio data block are used as the first auxiliary audio data segment and the second auxiliary audio data segment.
  • Where each sound sensing device corresponds to one user, the target audio data segment corresponding to the correlation coefficient with the smaller time difference tends to originate from the sound source corresponding to that sound sensing device. Therefore, the smaller of the calculated time differences may be used as the reference time difference.
  • The crosstalk data detection system may further use the larger of the first time difference and the second time difference as the crosstalk time difference. Accordingly, the processing module may determine that an audio data segment includes crosstalk data when its audio segment time difference matches the crosstalk time difference.
  • Matching between the audio segment time difference and the crosstalk time difference may mean that the audio segment time difference is equal to the crosstalk time difference; alternatively, a first specified threshold may be set, and when the absolute value of the difference between the audio segment time difference and the crosstalk time difference is smaller than the first specified threshold, the audio segment time difference can be considered to match the crosstalk time difference.
  • For example, the first specified threshold can be set to 0.008; assuming the audio segment time difference is 0.042 and the crosstalk time difference is 0.040, the absolute value of their difference, 0.002, is less than the first specified threshold, so it can be determined that the audio data segment includes crosstalk data.
  • By determining the crosstalk time difference and detecting the crosstalk data in the first audio data block according to it, an audio data segment is determined to be crosstalk data when its audio segment time difference matches the crosstalk time difference.
  • the system may further include a marking module.
  • The marking module is configured to mark the audio data segment in the first audio data block corresponding to a correlation coefficient as valid data when that correlation coefficient is greater than a set coefficient value. Accordingly, the processing module uses the time difference as the audio segment time difference only if the audio data segment is marked as valid data.
  • noise data in an audio data block can be eliminated by using a correlation coefficient.
  • Two sound sensing devices that are relatively close to each other can sense the sound of the same sound source to generate an audio data stream.
  • the audio data streams output by the two sound sensing devices are more related between the divided audio data segments.
  • In that case, the calculated correlation coefficient has a relatively large value. If, instead, the correlation coefficient between an audio data segment in the first audio data block and the corresponding audio data segment in the second audio data block is small, the two audio data segments can be considered to have little similarity and not to originate from the same sound source; alternatively, the audio data segment may be caused by noise in the electronic device itself.
  • By setting a coefficient value for the correlation coefficients, it is possible to separate audio data segments whose correlation coefficient is greater than or equal to the set coefficient value from audio data segments whose correlation coefficient is less than the set coefficient value. In this way, the audio data segments whose correlation coefficients are less than the set coefficient value can be treated as noise data and excluded from further calculation, which reduces the system's computational load.
  • The set coefficient value may be chosen by setting an empirical value directly in the program, or by analyzing the distribution of the correlation coefficients corresponding to the audio data segments in the audio data block and multiplying the average of the correlation coefficients by a factor less than 1, for example one third or one quarter, to obtain the set coefficient value.
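  • A sketch of the second way of choosing the set coefficient value, taking a fraction of the average correlation coefficient (the one-third factor comes from the text above; the sample values are illustrative):

```python
import numpy as np

def set_coefficient_value(correlations, fraction=1 / 3):
    """Derive the validity threshold as a fraction (e.g. 1/3 or 1/4) of the mean correlation."""
    return fraction * float(np.mean(correlations))

coeffs = [0.3, 0.05, 0.5, 0.4]
print(set_coefficient_value(coeffs))  # mean 0.3125 -> threshold ~0.104
```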
  • The coefficient calculation module may calculate the correlation coefficients between the audio data segments of the first audio data block and the audio data segments of the second audio data block, and the number of peaks of the correlation coefficients obtained may be one.
  • In this case, the time difference determination module may use the time difference between the acquisition times of the audio data segments in the first audio data block and in the second audio data block corresponding to the peak as the reference time difference only when the signal strength of the first audio data block is higher than the signal strength of the second audio data block.
  • Determining whether the signal strength of the first audio data block is higher than that of the second audio data block may include calculating the sound pressure values or the energy of the first audio data block and the second audio data block. A person skilled in the art, under the inspiration of the technical essence of this specification, may also calculate other attributes that reflect the signal strength of the two audio data blocks; as long as the attribute reflects signal strength in a manner the same as or similar to this specification, it shall be covered within the protection scope of this application. The signal strengths of the first audio data block and the second audio data block are compared, and if the signal strength of the first audio data block is greater, the time difference is determined as the reference time difference.
  • Calculating energy is taken as an example.
  • Determining that the energy of the first audio data block is greater than that of the corresponding second audio data block may include: calculating the energy of the audio data in the first audio data block and averaging it to obtain a first average value, and calculating the energy of the audio data in the second audio data block and averaging it to obtain a second average value.
  • The first average value and the second average value may then be compared: if the first average value is greater than the second average value, the energy of the first audio data block is determined to be greater than that of the corresponding second audio data block. A threshold may also be set, and only when the first average value minus the second average value is greater than the set threshold is the energy of the first audio data block determined to be greater than that of the corresponding second audio data block.
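  • A sketch of this energy comparison, assuming per-segment energy is the sum of squared samples (an assumption; the specification does not fix the energy measure) and using an optional set threshold as the margin:

```python
import numpy as np

def first_block_is_stronger(first_segments, second_segments, margin: float = 0.0) -> bool:
    """Compare the average per-segment energy of two audio data blocks."""
    def energy(seg):
        return float(np.sum(np.square(seg)))
    first_avg = np.mean([energy(s) for s in first_segments])
    second_avg = np.mean([energy(s) for s in second_segments])
    return first_avg - second_avg > margin  # margin = 0 reduces to a plain comparison
```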
  • Generally, the distance between a sound sensing device and its corresponding sound source is smaller than the distance between that device and other sound sources, and sound attenuates with distance after leaving the source. Thus, when a sound sensing device senses its corresponding sound source, the energy or sound pressure value of the generated audio data stream will be relatively large. If the signal strength of the first audio data block is weaker than that of the second audio data block, it can be understood that the current sound source corresponds to the data channel in which the second audio data block is located, that is, to the sound sensing device of the second audio data block.
  • In that case, the audio data included in the first audio data block may not originate from the sound source corresponding to the first data channel, or at least part of the audio data segments in the first audio data block do not originate from the sound source corresponding to the first data channel.
  • The correlation coefficients between the audio data segments in the first audio data block and the audio data segments in the second audio data block form a correlation coefficient set. The time difference determination module may also count the statistical number of correlation coefficients in the set that are greater than the set coefficient value; correspondingly, the time difference is determined as the reference time difference only when the signal strength of the first audio data block is higher than that of the second audio data block and the statistical number is greater than a set number threshold.
  • the audio data segment in the first audio data block may be distinguished from valid data or noise data according to a correlation coefficient.
  • the correlation coefficient in the correlation coefficient set may be compared with a set coefficient value.
  • If a correlation coefficient is larger than the set coefficient value, the audio data segment corresponding to it can be regarded as valid data.
  • If the statistical number is greater than the set number threshold, it can be understood that the amount of valid data in the audio data block exceeds the set number threshold. In some cases, if the statistical number is less than the set number threshold, the valid data in the audio data block can be considered very small, and further processing can be skipped to reduce the amount of calculation.
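  • A sketch of the combined gating condition: the peak's time difference is taken as the reference time difference only when the first block's signal strength is higher and the statistical number of valid segments exceeds the set number threshold (the threshold value of 10 is an assumption, not from this specification):

```python
def determine_reference(first_stronger: bool, correlations,
                        set_coeff: float = 0.1, number_threshold: int = 10) -> bool:
    """True if the peak's time difference should be taken as the reference time difference."""
    statistical_number = sum(1 for c in correlations if c > set_coeff)
    return first_stronger and statistical_number > number_threshold
```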
  • the time difference determination module may further determine the time difference to be a crosstalk time difference when the signal strength of the first audio data block is weaker than that of the second audio data block; correspondingly, the processing module determines that an audio data segment includes crosstalk data when its audio segment time difference matches the crosstalk time difference.
  • the signal strength of the first audio data block being weaker than that of the second audio data block may include: the energy of the first audio data block being less than the energy of the corresponding second audio data block, or the sound pressure value of the first audio data block being smaller than the sound pressure value of the corresponding second audio data block.
  • by setting a crosstalk time difference, the audio data segments of the first audio data block can be checked directly, so as to determine whether each audio data segment in the first audio data block originates from the sound source corresponding to the first data channel.
  • the crosstalk data detection system may include a client and a server.
  • the client may include an electronic device with data receiving and sending capabilities.
  • the client may include at least two sound sensing devices and a network communication unit.
  • the sound sensing device may be configured to sense a sound emitted by a sound source and generate corresponding audio data.
  • the sound sensing device may be a sound transducer, or a microphone fitted with such a transducer. The transducer converts sound into an electrical signal to obtain an audio data stream.
  • each sound sensing device may correspond to one data channel, through which it provides the audio data stream it generates to the network communication unit.
  • the at least two sound sensing devices may include a first sound sensing device and a second sound sensing device. Accordingly, the first sound sensing device may correspond to a first data channel, and the second sound sensing device may correspond to a second data channel.
  • the network communication unit includes a device that performs network data communication in accordance with a network communication protocol.
  • the network communication unit may receive audio data provided by the sound sensing device, and may also send the audio data to the server.
  • the network communication unit may send the received audio data to the server through the data channel.
  • the client may have relatively weak data processing capabilities; it may be an electronic device such as an IoT device.
  • the client may have a receiving module and a sending module.
  • the network communication unit of the client can implement the function of the sending module.
  • the server may include an electronic device with a certain computing processing capability, which may include a network communication unit, a processor, a memory, and the like.
  • the server may also refer to software running in the electronic device.
  • the server may also be a distributed server, a system in which multiple processors, network communication modules, and other components operate cooperatively, or a server cluster formed of several servers.
  • the server may also be implemented by using cloud computing technology. That is, the functional modules run by the server run using cloud computing technology.
  • the network communication unit may be a device that performs network data communication in accordance with a network communication protocol, and may be used to receive the audio data streams provided by the client.
  • the network communication unit may serve as the receiving module.
  • the server may have a receiving module, a coefficient calculation module, a time difference determination module, and the processing module.
  • the network communication unit may implement the function of the receiving module.
  • the processor may be implemented in any suitable manner. For example, the processor may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by that (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so on.
  • an embodiment of the present specification further provides a crosstalk data detection system.
  • the crosstalk detection system may include a client and a server.
  • the client may include at least two sound sensing devices, a processor, and a network communication unit.
  • the client may be a device with a certain processing capability.
  • the client may be a laptop computer or a smart terminal device.
  • the network communication unit may implement a receiving module, and the coefficient calculation module may be located in the processor.
  • the network communication unit may be a device that performs network data communication in accordance with a network communication protocol.
  • the processor of the server may run the time difference determination module and the processing module described above.
  • reference may be made to other embodiments for explanation.
  • alternatively, the client may run the coefficient calculation module and the time difference determination module, and send the reference time difference, the first audio data block, and the second audio data block to the server.
  • the server may run only the processing module.
  • An embodiment of the present specification also provides a crosstalk data detection system, shown as an interaction diagram.
  • the crosstalk detection system may include a client and a server.
  • the client may include at least two sound sensing devices and a processor.
  • the client may have strong processing capabilities.
  • the processor may run the coefficient calculation module, the time difference determination module, and the processing module. In this scenario, there is no need to interact with the server.
  • the audio data blocks processed by the processing module may be provided to the server.
  • the client may be, for example, a tablet computer, a notebook computer, a desktop computer, a workstation, or the like having higher performance.
  • An embodiment of the present specification further provides a computer storage medium storing a computer program which, when executed, implements: receiving a first audio data block and a second audio data block, each including a plurality of audio data segments; determining, according to the correlation coefficients between the first audio data block and the second audio data block, a target audio data segment in the first audio data block and an auxiliary audio data segment in the second audio data block, where at least part of the data in the target audio data segment and the auxiliary audio data segment originates from the same sound source and the correlation coefficient is used to indicate the degree of similarity between audio data segments; determining a reference time difference between the first audio data block and the second audio data block according to the target audio data segment and the auxiliary audio data segment; calculating the audio segment time difference between an audio data segment of the first audio data block and the corresponding audio data segment of the second audio data block; and, when the audio segment time difference does not match the reference time difference, determining that the corresponding audio data segment of the first audio data block includes crosstalk data.
  • the computer storage medium includes, but is not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), cache, a Hard Disk Drive (HDD), or a memory card.
  • An embodiment of the present specification further provides a computer storage medium storing a computer program which, when executed, implements: receiving a first audio data block and a second audio data block, each including a plurality of audio data segments; determining, according to the correlation coefficients between the first audio data block and the second audio data block, a target audio data segment in the first audio data block and an auxiliary audio data segment in the second audio data block, where at least part of the data in the target audio data segment and the auxiliary audio data segment originates from the same sound source; determining a reference time difference between the first audio data block and the second audio data block according to the target audio data segment and the auxiliary audio data segment; and sending the reference time difference, the first audio data block, and the second audio data block to a server, for the server to calculate the audio segment time difference between an audio data segment of the first audio data block and the corresponding audio data segment in the second audio data block and, when the audio segment time difference does not match the reference time difference, to determine that the corresponding audio data segment of the first audio data block includes crosstalk data.
  • An embodiment of the present specification also provides a computer storage medium storing a computer program which, when executed, implements: receiving a first audio data block, a second audio data block, and a reference time difference, where the first audio data block and the second audio data block each include a plurality of audio data segments; calculating the audio segment time difference between an audio data segment of the first audio data block and the corresponding audio data segment in the second audio data block; and, when the audio segment time difference does not match the reference time difference, determining that the corresponding audio data segment of the first audio data block includes crosstalk data.
  • An embodiment of the present specification further provides a computer storage medium storing a computer program which, when executed, implements: receiving a first audio data block and a second audio data block, each including a plurality of audio data segments; determining, according to the correlation coefficients between the first audio data block and the second audio data block, a target audio data segment in the first audio data block and an auxiliary audio data segment in the second audio data block, where at least part of the data in the target audio data segment and the auxiliary audio data segment originates from the same sound source; and sending the target audio data segment, the auxiliary audio data segment, the first audio data block, and the second audio data block to a server, for the server to determine a reference time difference between the first audio data block and the second audio data block according to the target audio data segment and the auxiliary audio data segment, to calculate the audio segment time difference between an audio data segment of the first audio data block and the corresponding audio data segment in the second audio data block, and, when the audio segment time difference does not match the reference time difference, to determine that the corresponding audio data segment of the first audio data block includes crosstalk data.
  • An embodiment of the present specification also provides a computer storage medium storing a computer program which, when executed, implements: receiving the target audio data segment, the auxiliary audio data segment, the first audio data block, and the second audio data block provided by a client, where the first audio data block and the second audio data block each include a plurality of audio data segments, the target audio data segment is selected from the first audio data block, and the auxiliary audio data segment is selected from the second audio data block; determining a reference time difference between the first audio data block and the second audio data block according to the target audio data segment and the auxiliary audio data segment; calculating the audio segment time difference between an audio data segment of the first audio data block and the corresponding audio data segment in the second audio data block; and, when the audio segment time difference does not match the reference time difference, determining that the corresponding audio data segment of the first audio data block includes crosstalk data.
  • An embodiment of the present specification further provides a computer storage medium storing a computer program which, when executed, implements: receiving a first audio data block and a second audio data block, each including a plurality of audio data segments; and sending the first audio data block and the second audio data block to a server, for the server to determine, according to the correlation coefficients between the first audio data block and the second audio data block, a target audio data segment in the first audio data block and an auxiliary audio data segment in the second audio data block, where at least part of the data in the target audio data segment and the auxiliary audio data segment originates from the same sound source, to determine a reference time difference between the first audio data block and the second audio data block according to the target audio data segment and the auxiliary audio data segment, to calculate the audio segment time difference between an audio data segment of the first audio data block and the corresponding audio data segment in the second audio data block, and, when the audio segment time difference does not match the reference time difference, to determine that the corresponding audio data segment of the first audio data block includes crosstalk data.
  • the terms "first" and "second" in the embodiments of the specification are only used to distinguish different data channels and audio data blocks; they do not limit the number of data channels or audio data blocks, which may be more than two.
  • This specification can be used in many general-purpose or special-purpose computer system environments or configurations.

Abstract

This specification discloses a crosstalk data detection method and an electronic device. The crosstalk data detection method can detect whether an audio data stream includes crosstalk data.

Description

Crosstalk data detection method and electronic device
This application claims priority to Chinese Patent Application No. 201810763010.9, filed on July 12, 2018 and entitled "Crosstalk Data Detection Method and Electronic Device", the entire contents of which are incorporated herein by reference.
Technical Field
This specification relates to the field of computer technology, and in particular to a crosstalk data detection method and an electronic device.
Background
In real life, people communicate and discuss matters together. In some scenarios, microphones may be used to amplify sound sources, and multiple on-site microphones may collect the audio data of each speaker. In some cases, when two or more microphones are very close to one another, crosstalk may occur.
Summary
Embodiments of this specification provide a crosstalk data detection method and an electronic device capable of detecting crosstalk data.
An embodiment of this specification provides a crosstalk data detection method, including: receiving a first audio data block and a second audio data block, where the first audio data block and the second audio data block each include a plurality of audio data segments; calculating correlation coefficients between the audio data segments of the first audio data block and the audio data segments of the second audio data block to obtain a peak of the correlation coefficients; taking, as a reference time difference, the difference between the acquisition times of the audio data segment in the first audio data block and the audio data segment in the second audio data block that correspond to the peak; taking, as an audio segment time difference, the difference between the acquisition times of an audio data segment of the first audio data block and the corresponding audio data segment in the second audio data block; and, when the audio segment time difference does not match the reference time difference, determining that the corresponding audio data segment of the first audio data block includes crosstalk data.
An embodiment of this specification provides an electronic device, including: a first sound sensing device configured to generate a first audio data block, the first audio data block including a plurality of audio data segments; a second sound sensing device configured to generate a second audio data block, the second audio data block including a plurality of audio data segments; and a processor configured to calculate correlation coefficients between the audio data segments of the first audio data block and the audio data segments of the second audio data block to obtain a peak of the correlation coefficients, to take as a reference time difference the difference between the acquisition times of the audio data segment in the first audio data block and the audio data segment in the second audio data block that correspond to the peak, to take as an audio segment time difference the difference between the acquisition times of an audio data segment of the first audio data block and the corresponding audio data segment in the second audio data block, and, when the audio segment time difference does not match the reference time difference, to determine that the corresponding audio data segment of the first audio data block includes crosstalk data.
An embodiment of this specification provides a crosstalk data detection method, including: receiving a first audio data block and a second audio data block, where the two blocks each include a plurality of audio data segments; calculating correlation coefficients between the audio data segments of the first audio data block and those of the second audio data block to obtain a peak of the correlation coefficients; taking, as a reference time difference, the difference between the acquisition times of the segments in the first and second audio data blocks corresponding to the peak; and sending the reference time difference, the first audio data block, and the second audio data block to a server, for the server to take as an audio segment time difference the difference between the acquisition times of an audio data segment of the first audio data block and the corresponding segment in the second audio data block and, when the audio segment time difference does not match the reference time difference, to determine that the corresponding audio data segment of the first audio data block includes crosstalk data.
An embodiment of this specification provides a crosstalk data detection method, including: receiving a first audio data block, a second audio data block, and a reference time difference, where the two blocks each include a plurality of audio data segments; taking, as an audio segment time difference, the difference between the acquisition times of an audio data segment of the first audio data block and the corresponding audio data segment in the second audio data block; and, when the audio segment time difference does not match the reference time difference, determining that the corresponding audio data segment of the first audio data block includes crosstalk data.
An embodiment of this specification provides a crosstalk data detection method, including: receiving a first audio data block and a second audio data block, where the two blocks each include a plurality of audio data segments; calculating correlation coefficients between the audio data segments of the first audio data block and those of the second audio data block to obtain a peak of the correlation coefficients; and sending the peak, the first audio data block, and the second audio data block to a server, for the server to take as a reference time difference the difference between the acquisition times of the segments in the first and second audio data blocks corresponding to the peak, to take as an audio segment time difference the difference between the acquisition times of an audio data segment of the first audio data block and the corresponding segment in the second audio data block, and, when the audio segment time difference does not match the reference time difference, to determine that the corresponding audio data segment of the first audio data block includes crosstalk data.
An embodiment of this specification provides a crosstalk data detection method, including: receiving a peak of correlation coefficients, a first audio data block, and a second audio data block provided by a client, where the peak is the peak of the correlation coefficients between the audio data segments of the first audio data block and those of the second audio data block; taking, as a reference time difference, the difference between the acquisition times of the segments in the first and second audio data blocks corresponding to the peak; taking, as an audio segment time difference, the difference between the acquisition times of an audio data segment of the first audio data block and the corresponding segment in the second audio data block; and, when the audio segment time difference does not match the reference time difference, determining that the corresponding audio data segment of the first audio data block includes crosstalk data.
An embodiment of this specification provides a crosstalk data detection method, including: receiving a first audio data block and a second audio data block, where the two blocks each include a plurality of audio data segments; and sending the first audio data block and the second audio data block to a server, for the server to calculate correlation coefficients between the audio data segments of the first audio data block and those of the second audio data block to obtain a peak of the correlation coefficients, to take as a reference time difference the difference between the acquisition times of the segments in the first and second audio data blocks corresponding to the peak, to take as an audio segment time difference the difference between the acquisition times of an audio data segment of the first audio data block and the corresponding segment in the second audio data block, and, when the audio segment time difference does not match the reference time difference, to determine that the corresponding audio data segment of the first audio data block includes crosstalk data.
As can be seen from the technical solutions provided by the above embodiments of this specification, by determining the reference time difference between the first audio data block and the second audio data block, crosstalk data can be detected according to the reference time difference. Because the delay of a sound is related to the spatial positions of the sound source and the microphones, the delay-based time difference can quite effectively detect whether an audio data block includes crosstalk data.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of this specification or in the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. Apparently, the drawings in the following description are merely some of the embodiments recorded in this specification, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a crosstalk data detection system provided by an embodiment of this specification;
FIG. 2 is a schematic diagram of an application of a crosstalk data detection system in a debate competition scenario, provided by an embodiment of this specification;
FIG. 3 is a schematic diagram of audio data block transmission paths provided by an embodiment of this specification;
FIG. 4 is a module diagram of a crosstalk data detection system provided by an embodiment of this specification;
FIG. 5 is a module diagram of a crosstalk data detection system provided by an embodiment of this specification;
FIG. 6 is a module diagram of a crosstalk data detection system provided by an embodiment of this specification.
Detailed Description
To enable those skilled in the art to better understand the technical solutions in this specification, the technical solutions in the embodiments of this specification are described clearly and completely below with reference to the accompanying drawings. Apparently, the described embodiments are merely some rather than all of the embodiments of this specification. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this specification without creative effort shall fall within the protection scope of this specification.
Referring to FIG. 1 and FIG. 2, in one example scenario, at a debate competition the affirmative and negative sides each have four debaters, seated at long tables on the two sides. Two microphones are placed on each long table to sense the sound made by the debaters, and a power amplifier amplifies the sound the microphones sense.
In this example scenario, debater A speaks into microphone A in front of him, saying: "I believe globalization benefits developing countries...". Because microphone A and microphone B are close together, microphone B can also sense the sound "I believe globalization benefits developing countries...". At the same time, debater B speaks into microphone B in front of him, saying: "Globalization benefits the development of trade...", and microphone A can likewise sense the sound "Globalization benefits the development of trade...". Both microphone A and microphone B can therefore generate their corresponding audio data streams from the sound they sense.
In this example scenario, an electronic device may be provided. Through a receiving module, the electronic device can receive the audio data streams generated by microphone A and microphone B, process the streams, and detect crosstalk data in them.
In this example scenario, while debater A says to microphone A "I believe globalization benefits developing countries..." and debater B says to microphone B "Globalization benefits the development of trade...", the electronic device receives the sound sensed by microphone A and generates an audio data stream; at the same time, microphone B likewise generates an audio data stream from the sound it senses. The receiving module may have multiple data channels corresponding to the number of microphones: microphone A corresponds to data channel A, and microphone B corresponds to data channel B. In this example scenario there may be eight microphones in total, so the electronic device may have eight data channels. Further, the electronic device may receive, over WIFI, the audio data streams that the microphones input on their data channels.
In this example scenario, the receiving module may divide the audio data streams into audio data blocks. Specifically, the audio data stream in data channel A may be divided to obtain a first audio data block, and the audio data stream in data channel B divided to obtain a second audio data block.
In this example scenario, the electronic device may take the audio data stream input on data channel A as the target and, based on the association between the audio data streams in data channel A and data channel B, detect whether crosstalk data exists in the audio data stream of data channel A.
In this example scenario, the first audio data block and the second audio data block may each be divided into a number of audio data segments in units of 1000 ms.
In this example scenario, the coefficient calculation module of the electronic device may apply a Fourier transform to the first audio data block and to the second audio data block, and generate a cross-correlation function from the transformed blocks. The correlation between the audio data segments in the first audio data block and those in the second audio data block can be computed from this cross-correlation function. Specifically, each audio data segment of the first audio data block may be correlated against the audio data segments of the second audio data block, and the maximum of the computed correlations identifies the segment of the second audio data block that corresponds to the segment of the first audio data block. That maximum can then be taken as the final correlation coefficient of that audio data segment of the first audio data block.
In this example scenario, correlation coefficients for the audio data segments in the audio data blocks can be obtained from the cross-correlation function. Because two people are speaking at roughly the same time, the correlation coefficients may exhibit two peaks, 0.3 and 0.5. The value 0.3 may be determined to be the first correlation coefficient and 0.5 the second correlation coefficient.
In this example scenario, a threshold may be set and used to filter the audio data segments to obtain the valid data in the audio data blocks. For example, the threshold may be 0.1: when a correlation coefficient is greater than 0.1, the audio data segment in the first audio data block and the corresponding segment in the second audio data block are considered highly similar, may originate from the same sound source, and can be regarded as valid data. When a correlation coefficient is less than 0.1, the similarity between the corresponding segments of the two blocks is low, and the segment of the first audio data block may be noise; in this example, crosstalk detection may be skipped for segments considered noise. Because the first correlation coefficient and the second correlation coefficient, 0.3 and 0.5, both exceed 0.1, the corresponding audio data segments can be regarded as valid data.
In this example scenario, the coefficient calculation module may determine the audio data segment of the first audio data block corresponding to the first correlation coefficient to be a first target audio data segment, and the segment corresponding to the second correlation coefficient to be a second target audio data segment; the segments of the second audio data block corresponding to the first and second correlation coefficients are determined to be the first and second auxiliary audio data segments, respectively.
In this example scenario, the time difference determination module of the electronic device may calculate a first time difference between the first target audio data segment and the first auxiliary audio data segment, for example 30 ms, and a second time difference between the second target audio data segment and the second auxiliary audio data segment, for example 60 ms.
In this example scenario, the time difference determination module may determine the smaller of the first time difference and the second time difference to be the reference time difference and the other to be the crosstalk time difference; that is, the reference time difference is 30 ms and the crosstalk time difference is 60 ms.
In this example scenario, the processing module of the electronic device determines, according to the correlation coefficients, the audio data segment in the second audio data block corresponding to each audio data segment in the first audio data block, and further calculates the audio segment time difference between each segment of the first audio data block and the corresponding segment of the second audio data block. When the audio segment time difference of a segment in the first audio data block equals 30 ms, that segment is determined to be primary audio data; when the audio segment time difference equals 60 ms, the segment is determined to be crosstalk data.
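The classification step at the end of this scenario reduces to comparing each segment's time difference against the two values just determined. The following short Python sketch only illustrates that comparison; the list of per-segment differences, the exact-equality test, and the function name are assumptions for the sketch, not the specification's implementation.

    # Minimal sketch: label each segment of the first audio data block by the
    # time difference to its matching segment in the second block, using the
    # reference (30 ms) and crosstalk (60 ms) time differences of this scenario.
    REFERENCE_DIFF_MS = 30
    CROSSTALK_DIFF_MS = 60

    def label_segments(segment_diffs_ms):
        labels = []
        for diff in segment_diffs_ms:
            if diff == REFERENCE_DIFF_MS:
                labels.append("primary")      # from the channel's own sound source
            elif diff == CROSSTALK_DIFF_MS:
                labels.append("crosstalk")    # from the neighboring source
            else:
                labels.append("unknown")
        return labels

    print(label_segments([30, 30, 60, 30, 60]))
    # ['primary', 'primary', 'crosstalk', 'primary', 'crosstalk']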
In another example scenario, debater B makes a statement, speaking into microphone B in front of him: "I believe globalization benefits developing countries...". Because microphone A and microphone B are close together, microphone A can also sense the sound "I believe globalization benefits developing countries...". Both microphones can therefore generate their corresponding audio data streams from the sound they sense, and the electronic device can generate a first audio data block and a second audio data block from the audio data streams input on data channel A and data channel B.
In this example scenario, the electronic device may calculate the correlation coefficients between the audio data segments in the first audio data block and those in the second audio data block, filter the segments of the first audio data block according to the coefficients, and find that 150 audio data segments in the first audio data block are valid data. Further, the electronic device finds that the correlation coefficients between the first and second audio data blocks have a single peak of 0.4, and the time difference corresponding to this peak is 50 ms.
In this example scenario, the electronic device calculates the smoothed energy of each audio data segment in the first and second audio data blocks, and counts the segments whose smoothed energy in the first audio data block exceeds the smoothed energy of the corresponding segment in the second audio data block; the count is 5. The electronic device may be configured so that, when the ratio of this count to the number of valid data segments is greater than 0.8, the time difference corresponding to the correlation peak is determined to be the reference time difference, and when the ratio is less than 0.2, the time difference corresponding to the peak is determined to be the crosstalk time difference. Since the ratio of 5 to 150 is less than 0.2, the 50 ms time difference is determined to be the crosstalk time difference.
In this example scenario, the electronic device calculates the time difference corresponding to each audio data segment of the first audio data block and, when the calculated time difference equals 50 ms, determines the corresponding voice data to be crosstalk data.
In this example scenario, other data channels may likewise be taken as targets, and crosstalk data detected in the audio data streams transmitted on those channels.
In this example scenario, over the course of the entire debate, the detected crosstalk data may further be removed, and the audio data blocks with crosstalk removed may be saved in a designated audio file to produce a relatively clear record of the debate.
Referring to FIG. 1, an embodiment of this specification provides a crosstalk data detection system. The crosstalk data detection system may include a receiving module, a coefficient calculation module, a time difference determination module, and a processing module. The crosstalk data detection system is described below in terms of its functional modules; when the system is run, it implements the crosstalk data detection method, so the method can be understood with reference to the functional modules below and is not described separately.
The receiving module may receive a first audio data block and a second audio data block, where the first audio data block and the second audio data block each include a plurality of audio data segments.
In this embodiment, the receiving module may receive the first audio data block input on a first data channel and the second audio data block input on a second data channel. Specifically, the receiving module may be a receiving device, or a communication module with data interaction capability. It may receive the first and second audio data blocks over a wired connection, or on the basis of network protocols such as HTTP, TCP/IP, or FTP, or through wireless communication modules such as a WIFI module, ZigBee module, Bluetooth module, or Z-wave module. Of course, the receiving module may also be a software program interface running in a processor with computing capability.
In this embodiment, the receiving module may have multiple data channels corresponding to the number of sound sensing devices. A sound sensing device may include any device, such as a microphone or a voice recorder, capable of sensing sound to generate an audio data stream and feeding that stream into a data channel. In this embodiment, a data channel may include the carrier over which audio data blocks are transmitted; it may be a physical channel or a logical channel, and may differ according to the transmission path of the audio data blocks. For example, with two microphones, a sound source emits sound that both microphones sense to generate audio data streams, and the channel over which each microphone transmits its stream may be called a data channel. A data channel may also be a logical division, meaning that the audio data stream input by each microphone is processed separately, rather than mixing the streams input by multiple microphones.
In this embodiment, the first audio data block may be generated from the audio data stream in the first data channel, and the second audio data block from the audio data stream in the second data channel. A sound sensing device may generate a corresponding audio data stream from the sound it senses. The first and second audio data blocks may correspond to different sound sensing devices. Because the devices may occupy different spatial positions, the times at which different devices sense the sound emitted by a source and generate audio data streams may also differ.
In this embodiment, the first audio data block and the second audio data block may each include a plurality of audio data segments. The receiving module may divide the audio data streams of the first and second data channels into data blocks according to certain rules, the resulting data blocks being the audio data blocks. The streams may be divided into audio data blocks by duration or by quantity. For example, the stream may be divided into one audio data block per 10 ms, although the block length is of course not limited to 10 ms. Alternatively, the division may be by data volume, for example at most 1 MB per block, or according to the continuity of the sound waveform represented by the stream: for example, endpoint detection may be performed, and where a silent portion lasting a certain duration lies between two consecutive waveforms, each continuous waveform is divided into one audio data block. An audio data block may include multiple audio data segments, and processing may take the audio data segment as the basic unit.
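As a rough illustration of the division just described, the Python sketch below splits a sampled stream into fixed-duration blocks and segments. The 10 ms block length is the example value above; the sample rate, the segment length, and the function name are assumptions of the sketch, and fixed duration is only one of the division rules mentioned.

    import numpy as np

    def split_stream(stream, fs, block_ms=10, segment_ms=1):
        # Divide a 1-D sample array into blocks of block_ms milliseconds,
        # each block further divided into audio data segments of segment_ms.
        block_len = int(fs * block_ms / 1000)
        seg_len = int(fs * segment_ms / 1000)
        blocks = []
        for start in range(0, len(stream) - block_len + 1, block_len):
            block = stream[start:start + block_len]
            segments = [block[i:i + seg_len]
                        for i in range(0, block_len - seg_len + 1, seg_len)]
            blocks.append(segments)
        return blocks

    fs = 16000                        # assumed sample rate
    stream = np.random.randn(fs)      # one second of dummy audio
    blocks = split_stream(stream, fs)
    print(len(blocks), "blocks,", len(blocks[0]), "segments per block")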
The coefficient calculation module is configured to calculate the correlation coefficients between the audio data segments of the first audio data block and the audio data segments of the second audio data block, and to obtain the peak of the correlation coefficients.
In this embodiment, a correlation coefficient may be used to express how closely related audio data blocks are, or how similar they are. The larger the correlation coefficient, the more similar the audio data segments contained in the two blocks; conversely, the smaller the correlation coefficient, the more different the segments.
In this embodiment, the audio data segments in the audio data blocks may each be Fourier-transformed according to the GCC-PHAT method (generalized cross-correlation with phase transform weighting). A cross-correlation function may be generated from the transformed segments of the first audio data block and the second audio data block to obtain the correlation coefficients. Of course, the correlation coefficients may also be computed by methods such as the basic cross-correlation method or the cross-power spectrum phase method. For obtaining the correlation coefficients, those skilled in the art, inspired by the technical essence of this specification, may also adopt other variations; as long as the functions and effects achieved are the same as or similar to those of this specification, they shall fall within the protection scope of this application.
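A compact GCC-PHAT computation is sketched below in Python/NumPy. This is the generic textbook formulation under names of our own choosing, offered only to make the phase-transform weighting concrete; it returns the lag of the correlation peak together with the correlation curve whose peaks the embodiments operate on.

    import numpy as np

    def gcc_phat(sig, ref, fs):
        # Generalized cross-correlation with phase transform weighting.
        # Returns (tau, cc): the delay of sig relative to ref in seconds,
        # and the cross-correlation curve whose maxima are the "peaks" above.
        n = len(sig) + len(ref)
        SIG = np.fft.rfft(sig, n=n)
        REF = np.fft.rfft(ref, n=n)
        r = SIG * np.conj(REF)
        r /= np.abs(r) + 1e-12          # PHAT weighting: keep phase, drop magnitude
        cc = np.fft.irfft(r, n=n)
        max_shift = n // 2
        cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
        shift = int(np.argmax(np.abs(cc))) - max_shift
        return shift / float(fs), cc

    fs = 16000
    ref = np.random.randn(fs // 10)                        # 100 ms of dummy audio
    sig = np.concatenate((np.zeros(80), ref))[:len(ref)]   # ref delayed by 5 ms
    tau, _ = gcc_phat(sig, ref, fs)
    print(round(tau * 1000, 1), "ms")                      # ~5.0 ms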
In this embodiment, referring to FIG. 3, within a given space two sound sources may emit sound, and a first sound sensing device and a second sound sensing device may each generate audio data streams and feed them into the corresponding first and second data channels. The time taken for the sound emitted by source A to be sensed by the first sound sensing device is time 1, and the time at which the first device, sensing the sound of source A, feeds an audio data stream into the first data channel is time 2; the time taken for the sound of source A to be sensed by the second device is time 3, and the time at which the second device feeds the corresponding stream into the second data channel is time 4. The audio data streams that source A forms in the first and second data channels may be divided into audio data blocks comprising audio data segments, and the correlation coefficients calculated by GCC-PHAT or similar methods. Likewise, the time taken for the sound emitted by source B to be sensed by the first device is time 5, with the first device feeding its stream into the first data channel at time 2; the time for source B's sound to be sensed by the second device is time 6, with the second device feeding its stream into the second data channel at time 4. The streams that source B forms in the two data channels may likewise be divided into audio data blocks comprising segments and their correlation coefficients calculated by GCC-PHAT or similar methods. With two sound sources emitting sound in the space, two such correlation coefficients can therefore be obtained.
In this embodiment, each sound sensing device may correspond to one user, so that different users can be distinguished by their sound sensing devices. The audio data stream input by each device is then processed separately, so that an audio file corresponding to each user can ultimately be obtained, with each file representing that user's voice relatively accurately.
The time difference determination module may be configured to take, as the reference time difference, the difference between the acquisition times of the audio data segment in the first audio data block and the audio data segment in the second audio data block that correspond to the peak.
In this embodiment, the audio data segment of the first audio data block and the segment of the second audio data block corresponding to the peak can be considered the most similar pair, or the pair containing the most audio data originating from the same sound source. The time difference of the segments corresponding to the correlation peak can therefore represent the time difference of the audio data in the first and second audio data blocks that originates from the same source, and can serve as the baseline for judging whether audio data in the first audio data block is crosstalk data. This time difference may accordingly be taken as the reference time difference.
In this embodiment, the acquisition time of an audio data segment may be the time at which its audio data was generated at the sound sensing device, or the time at which the receiving module received it. For example, suppose a first audio sensing terminal and a second audio sensing terminal are very close together. When a user speaks, the sound reaches the two terminals at nearly the same time, so both terminals sense the speech and generate audio data separately; say the first terminal generates the first audio data block from the sound and the second terminal generates the second audio data block. The generation times of the two blocks are then close but, because the user's distances to the two terminals differ, not identical.
The processing module is configured to take, as the audio segment time difference, the difference between the acquisition times of an audio data segment of the first audio data block and the corresponding audio data segment in the second audio data block, and, when the audio segment time difference does not match the reference time difference, to determine that the corresponding audio data segment of the first audio data block includes crosstalk data.
In this embodiment, whether an audio data segment of the first audio data block corresponds to a segment of the second audio data block may be determined by whether they originate from the same sound source; alternatively, following the correlation coefficients introduced above, the segments of the first and second audio data blocks associated with a given correlation coefficient may be considered to correspond.
In this embodiment, the audio segment time difference between an audio data segment of the first audio data block and the corresponding segment of the second audio data block may be calculated. The difference may be based on the times at which the sound sensing devices sensed the sound waves and generated the segments, on the times at which the devices fed the segments into the data channels, or on the times at which the receiving module received the segments. Specifically, the time difference should be calculated in the same way as the reference time difference.
In this embodiment, not matching may include the audio segment time difference being unequal to the reference time difference; alternatively, a second specified threshold may be set, and the audio segment time difference deemed not to match the reference time difference when the absolute value of their difference is greater than that threshold. For example, with a second specified threshold of 0.002, an audio segment time difference of 0.03, and a reference time difference of 0.035, the absolute value of the difference is 0.005, so the audio data segment can be considered to include crosstalk data.
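The mismatch test itself is just a threshold on an absolute difference, as the numeric example shows. A minimal sketch, assuming times expressed in seconds and the 0.002 threshold from the example:

    def diff_matches(segment_diff_s, reference_diff_s, threshold_s=0.002):
        # "Match" in the sense above: the absolute difference stays within
        # the second specified threshold.
        return abs(segment_diff_s - reference_diff_s) <= threshold_s

    # The example above: |0.03 - 0.035| = 0.005 > 0.002, so the values do not
    # match and the segment is taken to include crosstalk data.
    print(diff_matches(0.03, 0.035))   # False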
In this embodiment, different sound sources correspond to different time differences with respect to different sound sensing devices. Referring to FIG. 3, source A is spatially closer to the first sound sensing device than to the second sound sensing device, so time 1 is less than time 3. There is therefore an audio segment time difference between the segments originating from source A in the first data channel and in the second data channel, for example the difference between time 3 and time 1. This audio segment time difference corresponds to source A, and as long as the spatial positions of source A, the first sound sensing device, and the second sound sensing device remain unchanged, its value also remains unchanged. The same applies to source B. Among the audio data segments of the first data channel, some may originate from source A and some from source B; likewise, among the segments of the second data channel, some may originate from source A and some from source B. By calculating the audio segment time differences between corresponding segments of the first and second data channels, the segments of the first data channel originating from source A can be distinguished from those originating from source B. Crosstalk data can then be understood as follows: in the first data channel, the segments originating from source B are crosstalk data; that is, crosstalk data are audio data segments originating from sound sources other than the target sound source.
In this embodiment, when the audio segment time difference matches the reference time difference, the corresponding audio data segment of the first audio data block can be considered to originate from the sound source corresponding to the data channel of the first audio data block, so that segment should be retained for further processing. When the audio segment time difference does not match the reference time difference, the corresponding segment of the first audio data block does not originate from that sound source, so it should be removed from the first audio data block.
In one embodiment, calculating the correlation coefficients between the audio data segments of the first audio data block and those of the second audio data block and obtaining the peak may include: calculating the correlation coefficients between the segments of the first audio data block and the segments of the second audio data block to form a correlation coefficient set, and taking the maximum value in the set as the peak.
In this embodiment, the correlation coefficients between the audio data segments of the first audio data block and those of the second audio data block may be calculated to form a correlation coefficient set, from which the peak is selected. Specifically, the correlation between segments may be computed from the cross-correlation function: each segment of the first audio data block is correlated against the segments of the second audio data block, and the segment of the second block yielding the maximum correlation is taken to correspond to that segment of the first block. That maximum can be regarded as the final correlation coefficient of the segment of the first audio data block. In this way, each segment of the first block obtains a correlation coefficient, and the coefficients place each segment of the first block in correspondence with a segment of the second block.
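The per-segment pairing described here can be sketched as follows. A plain normalized correlation coefficient stands in for whichever correlation measure is actually used, and representing segments as equal-length NumPy arrays is an assumption of the sketch.

    import numpy as np

    def pair_segments(first_block, second_block):
        # For each segment of the first block, return (index, coefficient) of
        # the best-matching segment in the second block; the maximum coefficient
        # plays the role of the segment's "final correlation coefficient" above.
        # Segments are assumed to be 1-D arrays of equal length.
        pairs = []
        for seg_a in first_block:
            a = seg_a - seg_a.mean()
            coeffs = []
            for seg_b in second_block:
                b = seg_b - seg_b.mean()
                denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-12
                coeffs.append(float(np.dot(a, b) / denom))
            best = int(np.argmax(coeffs))
            pairs.append((best, coeffs[best]))
        return pairs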
In this embodiment, the peak may be the maximum value in the correlation coefficient set; alternatively, when the correlation coefficients are arranged in the order of their corresponding audio data segments, they present a continuous distribution with crests and troughs, and the peak may be the correlation coefficient represented by a crest.
In one embodiment, the coefficient calculation module may obtain two or more peaks of the correlation coefficients.
Correspondingly, the step of taking as the reference time difference the difference between the acquisition times of the audio data segments in the first and second audio data blocks corresponding to the peaks includes: separately calculating, for the two peaks, the differences between the acquisition times of the corresponding audio data segments in the first audio data block and in the second audio data block, namely a first time difference and a second time difference, the smaller of which is taken as the reference time difference.
In this embodiment, two or more peaks may be obtained from the cross-correlation function of the first and second audio data blocks. Alternatively, a specified interval may be given within the correlation coefficient set, with the two largest values in that interval taken as the peaks; or one value in the set may be taken as one peak and, after a certain data interval, a correlation coefficient approaching it in magnitude taken as another; or the largest and second-largest values in the set may be taken as the peaks.
In this embodiment, the presence of two or more correlation peaks may indicate that the audio data in the blocks originates from two or more sound sources. For example, two correlation peaks, a first correlation coefficient and a second correlation coefficient, are selected from the set. The segments of the first audio data block corresponding to the first and second correlation coefficients are taken as the first and second target audio data segments, and the corresponding segments of the second audio data block as the first and second auxiliary audio data segments. The time difference between each target segment and its corresponding auxiliary segment, that is, the audio segment time difference, can then be calculated. Further, in a scenario where each sound sensing device corresponds to one user, the distance between a sound sensing device and its corresponding user can be taken to be smaller than the distance between that device and other users. With multiple correlation peaks, the target audio data segment whose peak corresponds to the smaller time difference therefore tends to originate from the sound source corresponding to the sound sensing device, so the smaller of the calculated time differences may be taken as the reference time difference.
In one embodiment, the crosstalk data detection system may further take the larger of the first time difference and the second time difference as the crosstalk time difference. Correspondingly, when the audio segment time difference matches the crosstalk time difference, the processing module may determine that the audio data segment includes crosstalk data.
In this embodiment, the audio segment time difference matching the crosstalk time difference may include: the audio segment time difference being equal to the crosstalk time difference, or a first specified threshold being set and the two deemed to match when the absolute value of their difference is less than that threshold. For example, the first specified threshold may be set to 0.008; assuming an audio segment time difference of 0.042 and a crosstalk time difference of 0.040, the absolute value of the difference is 0.002, which is less than the first specified threshold, so the audio data segment can be determined to include crosstalk data.
In this embodiment, by determining the crosstalk time difference, crosstalk data in the first audio data block is detected according to it: when the audio segment time difference matches the crosstalk time difference, the audio data segment is determined to be crosstalk data.
In one embodiment, the system may further include a marking module configured to mark, when a correlation coefficient is greater than a set coefficient value, the audio data segment of the first audio data block corresponding to that coefficient as valid data. Correspondingly, the processing module takes the time difference as the audio segment time difference only when the audio data segment has been marked as valid data.
In this embodiment, noise data in the audio data blocks can be culled by means of the correlation coefficients. Two sound sensing devices that are close together will both sense the sound of the same source and produce audio data streams, so the audio data segments into which their streams are divided are fairly correlated, and the computed correlation coefficients take relatively large values. If the correlation coefficient between a segment of the first audio data block and the corresponding segment of the second audio data block is small, the two segments have little similarity and can be considered not to originate from the same sound source; the segment may instead be noise produced by the electronic device itself.
In this embodiment, by setting a coefficient value for the correlation coefficients, segments whose coefficients are greater than or equal to the set value are separated from segments whose coefficients are below it. Segments whose correlation coefficients fall below the set coefficient value can be treated as noise data and excluded from further processing, which reduces the computational load of the system.
In this embodiment, the set coefficient value may be established by directly programming an empirical value, or by analyzing the distribution of the correlation coefficients of the audio data segments in a block and multiplying the mean coefficient by a factor less than 1, for example one third or one quarter of the mean, to obtain the set value.
In one embodiment, the coefficient calculation module may calculate the correlation coefficients between the audio data segments of the first and second audio data blocks and obtain a single peak. Correspondingly, when taking as the reference time difference the difference between the acquisition times of the segments corresponding to the peak, the time difference determination module determines the time difference to be the reference time difference when the signal strength of the first audio data block is higher than the signal strength of the second audio data block.
In this embodiment, the signal strength of the first audio data block being higher than that of the second may involve calculating the sound pressure values or the energies of the two blocks; those skilled in the art, inspired by the technical essence of this specification, may also calculate other properties reflecting the signal strengths of the blocks, and as long as the signal strength is reflected in a manner the same as or similar to this specification, such variations shall fall within the protection scope of this application. The signal strengths of the two blocks are compared, and if that of the first audio data block is greater, the resulting time difference may be determined to be the reference time difference.
In this embodiment, taking the energy calculation as a specific example: the energy of the first audio data block being greater than that of the corresponding second audio data block may include calculating the energy of the audio data in the first audio data block and averaging it to obtain a first average value, and calculating the energy of the audio data in the second audio data block and averaging it to obtain a second average value. The first and second average values may be compared, and when the first exceeds the second, the energy of the first audio data block is determined to be greater than that of the corresponding second audio data block. Alternatively, a threshold may be set, and the energy of the first audio data block determined to be greater when the first average value minus the second average value exceeds the set threshold. Other methods by which those skilled in the art, inspired by the technical essence of this specification, determine that the energy of the first audio data block exceeds that of the corresponding second audio data block shall, as long as they determine the magnitude of the energy of the audio data in the blocks in a manner the same as or similar to this specification, fall within the protection scope of this application.
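Taking the energy route as the example, the comparison might look like the sketch below; the mean-square definition of a segment's energy and the optional margin parameter are assumptions of the sketch, with margin=0.0 reducing to the plain comparison of the two averages.

    import numpy as np

    def mean_energy(block):
        # Average energy over a block, with each segment's energy taken as the
        # mean of its squared samples (one reasonable definition, not the only one).
        return float(np.mean([np.mean(seg.astype(np.float64) ** 2)
                              for seg in block]))

    def first_block_louder(first_block, second_block, margin=0.0):
        # True when the first average minus the second exceeds the optional
        # threshold ("margin") described above.
        return mean_energy(first_block) - mean_energy(second_block) > margin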
In this embodiment, the distance between a sound sensing device and its corresponding sound source is generally smaller than the distance between that device and other sound sources. After a sound leaves its source it attenuates with distance to some degree, so the audio data stream a sound sensing device generates from its own corresponding source tends to exhibit relatively large energy or sound pressure values. In some cases the signal strength of the first audio data block is weaker than that of the second audio data block; this can be understood to mean that the current sound source corresponds to the data channel of the second audio data block, that is, to the sound sensing device that generated the second audio data block. It can be concluded that, relative to the first data channel, the audio data in the first audio data block may not originate from the sound source corresponding to the first data channel, or at least some of its audio data segments do not. By this analysis, the signal strengths of the first and second audio data blocks can be used to distinguish whether the segments of the first audio data block originate from the sound source corresponding to the first data channel.
In one embodiment, the correlation coefficients between the audio data segments of the first audio data block and those of the second audio data block form a correlation coefficient set, and the time difference determination module may further count the number of correlation coefficients in the set that are greater than the set coefficient value. Correspondingly, the time difference is determined to be the reference time difference only when the signal strength of the first audio data block is higher than that of the second audio data block and the counted number is greater than a set quantity threshold.
In this embodiment, the audio data segments of the first audio data block can be classified as valid data or noise data according to their correlation coefficients. Specifically, the coefficients in the correlation coefficient set may be compared with the set coefficient value; when a coefficient is greater than the set value, the corresponding audio data segment can be deemed valid data.
In this embodiment, the counted number being greater than the set quantity threshold can be understood to mean that the number of valid data segments in the audio data block exceeds that threshold. In some cases, if the counted number is less than the set quantity threshold, the audio data block can be considered to contain very little valid data, and further processing may be skipped to reduce the amount of computation.
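The counting step can be written directly; the set coefficient value and the set quantity threshold are left as parameters, and the example numbers are illustrative only.

    def enough_valid_data(coefficient_set, coeff_value, quantity_threshold):
        # Count coefficients above the set coefficient value; proceed with
        # further processing only when the count of "valid" segments exceeds
        # the set quantity threshold.
        valid = sum(1 for c in coefficient_set if c > coeff_value)
        return valid > quantity_threshold

    # E.g., with a 0.1 coefficient value and a quantity threshold of 3:
    print(enough_valid_data([0.3, 0.5, 0.05, 0.2, 0.4], 0.1, 3))   # True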
In one embodiment, the time difference determination module may further determine the time difference to be the crosstalk time difference when the signal strength of the first audio data block is weaker than that of the second audio data block. Correspondingly, when the audio segment time difference matches the crosstalk time difference, the processing module determines that the audio data segment includes crosstalk data.
In this embodiment, the signal strength of the first audio data block being weaker than that of the second audio data block may include: the energy of the first audio data block being less than that of the corresponding second audio data block, or the sound pressure value of the first audio data block being smaller than that of the corresponding second audio data block.
In this embodiment, by setting the crosstalk time difference, the audio data segments of the first audio data block can be checked directly, so as to judge whether a segment of the first audio data block originates from a sound source other than the one corresponding to the first data channel.
Referring to FIG. 4, an embodiment of this specification provides a crosstalk data detection system. The crosstalk data detection system may include a client and a server.
In this embodiment, the client may include an electronic device with data receiving and sending capabilities. The client may include at least two sound sensing devices and a network communication unit.
In this embodiment, a sound sensing device may be configured to sense the sound emitted by a sound source and generate corresponding audio data. Specifically, a sound sensing device may be a sound transducer, or a microphone fitted with one; the transducer converts sound into an electrical signal to obtain an audio data stream. Each sound sensing device may correspond to one data channel, through which it provides the audio data stream it generates to the network communication unit. Specifically, the at least two sound sensing devices may include a first sound sensing device and a second sound sensing device, corresponding respectively to a first data channel and a second data channel.
In this embodiment, the network communication unit includes a device that performs network data communication in accordance with a network communication protocol. It may receive the audio data provided by the sound sensing devices and may also send that audio data to the server through the data channels.
In this embodiment, the client may have relatively weak data processing capabilities and may be an electronic device such as an Internet of Things device. The client may have a receiving module and a sending module, and the client's network communication unit may implement the function of the sending module.
In this embodiment, the server may include an electronic device with a certain computing capability, which may have a network communication unit, a processor, a memory, and so on. Of course, the server may also refer to software running in such an electronic device. The server may also be a distributed server, a system in which multiple processors, network communication modules, and other components operate cooperatively, or a server cluster formed of several servers. The server may also be implemented using cloud computing technology, meaning that the functional modules run by the server operate on cloud computing technology.
In this embodiment, the network communication unit may be a device that performs network data communication in accordance with a network communication protocol, and may be used to receive the audio data streams provided by the client; it may serve as the receiving module.
In this embodiment, the server may have the receiving module, the coefficient calculation module, the time difference determination module, and the processing module, with the network communication unit implementing the function of the receiving module. For the specific content of the server's functional modules, reference may be made to the other embodiments.
In this embodiment, the processor may be implemented in any suitable manner. For example, the processor may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by that (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so on.
Referring to FIG. 5, an embodiment of this specification further provides a crosstalk data detection system. The crosstalk detection system may include a client and a server.
In this embodiment, the client may include at least two sound sensing devices, a processor, and a network communication unit. For the specific functions of the at least two sound sensing devices, see the other embodiments; they are not repeated here. The client may be a device with a certain processing capability, for example a notebook computer or a smart terminal device. The network communication unit may implement the receiving module, and the coefficient calculation module may be located in the processor. The network communication unit may be a device that performs network data communication in accordance with a network communication protocol.
In this embodiment, the processor of the server may run the time difference determination module and the processing module described above. For the specific implementation, reference may be made to the other embodiments.
Of course, referring to FIG. 6, in this embodiment the client may instead run the coefficient calculation module and the time difference determination module and send the reference time difference, the first audio data block, and the second audio data block to the server; the server may then run only the processing module.
An embodiment of this specification also provides a crosstalk data detection system, shown as an interaction diagram of the crosstalk data detection system. The crosstalk detection system may include a client and a server.
In this embodiment, the client may include at least two sound sensing devices and a processor. For the specific functions of the at least two sound sensing devices, see the other embodiments; they are not repeated here. The client may have relatively strong processing capabilities, and its processor may run the coefficient calculation module, the time difference determination module, and the processing module. In this scenario, no interaction with the server is needed; alternatively, the audio data blocks processed by the processing module may be provided to the server. Specifically, the client may be, for example, a higher-performance tablet computer, notebook computer, desktop computer, or workstation.
Of course, the above merely lists some electronic devices by way of example. With advances in science and technology, the performance of hardware devices may improve, so electronic devices that currently have weak data processing capabilities may come to possess better capabilities. The division of software modules among hardware devices in the above embodiments therefore does not limit this application. Those skilled in the art may further split the functions of the above software modules and place them to run in the client or the server accordingly; as long as the functions and effects achieved are the same as or similar to those of this specification, they shall fall within the protection scope of this application.
An embodiment of this specification further provides a computer storage medium storing a computer program which, when executed, implements: receiving a first audio data block and a second audio data block, where the two blocks each include a plurality of audio data segments; determining, according to the correlation coefficients between the first audio data block and the second audio data block, a target audio data segment in the first audio data block and an auxiliary audio data segment in the second audio data block, where at least part of the data in the target and auxiliary segments originates from the same sound source and the correlation coefficient is used to indicate the degree of similarity between audio data segments; determining a reference time difference between the first and second audio data blocks according to the target and auxiliary segments; calculating the audio segment time difference between an audio data segment of the first audio data block and the corresponding segment in the second audio data block; and, when the audio segment time difference does not match the reference time difference, determining that the corresponding audio data segment of the first audio data block includes crosstalk data.
In this embodiment, the computer storage medium includes, but is not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), cache, a Hard Disk Drive (HDD), or a memory card.
In this embodiment, for the specific functions implemented by the computer storage medium, reference may be made to the other embodiments.
An embodiment of this specification further provides a computer storage medium storing a computer program which, when executed, implements: receiving a first audio data block and a second audio data block, where the two blocks each include a plurality of audio data segments; determining, according to the correlation coefficients between the two blocks, a target audio data segment in the first audio data block and an auxiliary audio data segment in the second audio data block, where at least part of the data in the target and auxiliary segments originates from the same sound source; determining the reference time difference between the first and second audio data blocks according to the target and auxiliary segments; and sending the reference time difference, the first audio data block, and the second audio data block to a server, for the server to calculate the audio segment time difference between an audio data segment of the first block and the corresponding segment in the second block and, when the audio segment time difference does not match the reference time difference, to determine that the corresponding audio data segment of the first audio data block includes crosstalk data.
In this embodiment, for the specific functions implemented by the computer storage medium, reference may be made to the other embodiments.
An embodiment of this specification further provides a computer storage medium storing a computer program which, when executed, implements: receiving a first audio data block, a second audio data block, and a reference time difference, where the two blocks each include a plurality of audio data segments; calculating the audio segment time difference between an audio data segment of the first audio data block and the corresponding segment in the second audio data block; and, when the audio segment time difference does not match the reference time difference, determining that the corresponding audio data segment of the first audio data block includes crosstalk data.
In this embodiment, for the specific functions implemented by the computer storage medium, reference may be made to the other embodiments.
An embodiment of this specification further provides a computer storage medium storing a computer program which, when executed, implements: receiving a first audio data block and a second audio data block, where the two blocks each include a plurality of audio data segments; determining, according to the correlation coefficients between the two blocks, a target audio data segment in the first audio data block and an auxiliary audio data segment in the second audio data block, where at least part of the data in the target and auxiliary segments originates from the same sound source; and sending the target audio data segment, the auxiliary audio data segment, the first audio data block, and the second audio data block to a server, for the server to determine the reference time difference between the first and second audio data blocks according to the target and auxiliary segments, to calculate the audio segment time difference between an audio data segment of the first block and the corresponding segment in the second block, and, when the audio segment time difference does not match the reference time difference, to determine that the corresponding audio data segment of the first audio data block includes crosstalk data.
In this embodiment, for the specific functions implemented by the computer storage medium, reference may be made to the other embodiments.
An embodiment of this specification also provides a computer storage medium storing a computer program which, when executed, implements: receiving the target audio data segment, the auxiliary audio data segment, the first audio data block, and the second audio data block provided by a client, where the two blocks each include a plurality of audio data segments, the target audio data segment is selected from the first audio data block, and the auxiliary audio data segment is selected from the second audio data block; determining the reference time difference between the first and second audio data blocks according to the target and auxiliary segments; calculating the audio segment time difference between an audio data segment of the first block and the corresponding segment in the second block; and, when the audio segment time difference does not match the reference time difference, determining that the corresponding audio data segment of the first audio data block includes crosstalk data.
In this embodiment, for the specific functions implemented by the computer storage medium, reference may be made to the other embodiments.
An embodiment of this specification further provides a computer storage medium storing a computer program which, when executed, implements: receiving a first audio data block and a second audio data block, where the two blocks each include a plurality of audio data segments; and sending the first audio data block and the second audio data block to a server, for the server to determine, according to the correlation coefficients between the two blocks, a target audio data segment in the first audio data block and an auxiliary audio data segment in the second audio data block, where at least part of the data in the target and auxiliary segments originates from the same sound source, to determine the reference time difference between the first and second audio data blocks according to the target and auxiliary segments, to calculate the audio segment time difference between an audio data segment of the first block and the corresponding segment in the second block, and, when the audio segment time difference does not match the reference time difference, to determine that the corresponding audio data segment of the first audio data block includes crosstalk data.
In this embodiment, for the specific functions implemented by the computer storage medium, reference may be made to the other embodiments.
The above description of the various embodiments of this specification is provided to those skilled in the art for descriptive purposes. It is not intended to be exhaustive or to limit the invention to a single disclosed embodiment. As noted above, various alternatives and variations of this specification will be apparent to those skilled in the art. Therefore, although some embodiments have been discussed specifically, other embodiments will be apparent or may be derived relatively easily by those skilled in the art. This specification is intended to cover all alternatives, modifications, and variations of the invention discussed herein, as well as other embodiments falling within the spirit and scope of the above application.
In the embodiments of this specification, the terms "first" and "second" are only used to distinguish different data channels and audio data blocks; they do not limit the number of data channels or audio data blocks, which may be more than two.
From the description of the above embodiments, those skilled in the art can clearly understand that this application may be implemented by software plus a necessary general-purpose hardware platform. Based on such an understanding, the technical solutions of this application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. Such a computer software product may be stored in a storage medium, for example a ROM/RAM, magnetic disk, or optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the various embodiments, or in certain parts of the embodiments, of this application.
The embodiments in this specification are described progressively; for identical or similar parts between embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others.
This specification may be used in numerous general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, and distributed computing environments including any of the above systems or devices.
Although this specification has been depicted through embodiments, those of ordinary skill in the art will appreciate that this specification admits many variations and changes without departing from its spirit, and it is intended that the appended claims cover such variations and changes without departing from the spirit of this specification.

Claims (18)

  1. A crosstalk data detection method, comprising:
    receiving a first audio data block and a second audio data block, wherein the first audio data block and the second audio data block each comprise a plurality of audio data segments;
    calculating correlation coefficients between the audio data segments of the first audio data block and the audio data segments of the second audio data block to obtain a peak of the correlation coefficients;
    taking, as a reference time difference, the difference between the acquisition times of the audio data segment in the first audio data block and the audio data segment in the second audio data block that correspond to the peak;
    taking, as an audio segment time difference, the difference between the acquisition times of an audio data segment of the first audio data block and the corresponding audio data segment in the second audio data block; and
    when the audio segment time difference does not match the reference time difference, determining that the corresponding audio data segment of the first audio data block comprises crosstalk data.
  2. The method according to claim 1, wherein the step of calculating the correlation coefficients between the audio data segments of the first audio data block and the audio data segments of the second audio data block to obtain the peak of the correlation coefficients comprises:
    calculating the correlation coefficients between the audio data segments in the first audio data block and the audio data segments in the second audio data block to form a correlation coefficient set; and taking the maximum value in the correlation coefficient set as the peak.
  3. The method according to claim 1, wherein in the step of calculating the correlation coefficients between the audio data segments of the first audio data block and the audio data segments of the second audio data block to obtain the peak of the correlation coefficients, the number of peaks is two;
    correspondingly, the step of taking, as the reference time difference, the difference between the acquisition times of the audio data segments in the first audio data block and in the second audio data block that correspond to the peaks comprises:
    separately calculating, for the two peaks, the differences between the acquisition times of the corresponding audio data segments in the first audio data block and in the second audio data block, namely a first time difference and a second time difference, wherein the smaller of the first time difference and the second time difference is taken as the reference time difference.
  4. The method according to claim 3, further comprising:
    taking the larger of the first time difference and the second time difference as a crosstalk time difference;
    correspondingly, the method further comprises: when the audio segment time difference matches the crosstalk time difference, determining that the audio data segment comprises crosstalk data.
  5. The method according to claim 4, wherein the audio segment time difference matching the crosstalk time difference comprises:
    the audio segment time difference being equal to the crosstalk time difference; or
    the difference between the audio segment time difference and the crosstalk time difference being less than a first specified threshold.
  6. The method according to claim 1, wherein the audio segment time difference not matching the reference time difference comprises:
    the audio segment time difference being unequal to the reference time difference; or
    the difference between the audio segment time difference and the reference time difference being greater than a second specified threshold.
  7. The method according to claim 1, further comprising:
    when a correlation coefficient is greater than a set coefficient value, marking the audio data segment in the first audio data block corresponding to the correlation coefficient as valid data;
    correspondingly, in the step of taking, as the audio segment time difference, the difference between the acquisition times of the audio data segment of the first audio data block and the corresponding audio data segment in the second audio data block, the time difference is taken as the audio segment time difference only when the audio data segment is marked as valid data.
  8. The method according to claim 1, wherein in the step of calculating the correlation coefficients between the audio data segments of the first audio data block and the audio data segments of the second audio data block to obtain the peak of the correlation coefficients, the number of peaks is one;
    correspondingly, the step of taking, as the reference time difference, the difference between the acquisition times of the audio data segments in the first audio data block and in the second audio data block that correspond to the peak comprises:
    when the signal strength of the first audio data block is higher than the signal strength of the second audio data block, determining the time difference to be the reference time difference.
  9. The method according to claim 8, wherein the correlation coefficients between the audio data segments in the first audio data block and the audio data segments in the second audio data block form a correlation coefficient set, and the method further comprises:
    counting the number of correlation coefficients in the correlation coefficient set that are greater than a set coefficient value;
    correspondingly, the step of determining the time difference to be the reference time difference when the signal strength of the first audio data block is higher than the signal strength of the second audio data block comprises: determining the time difference to be the reference time difference only when the signal strength of the first audio data block is higher than the signal strength of the second audio data block and the counted number is greater than a set quantity threshold.
  10. The method according to claim 8, wherein the signal strength of the first audio data block being higher than the signal strength of the second audio data block comprises:
    the energy of the first audio data block being greater than the energy of the corresponding second audio data block; or the sound pressure value of the first audio data block being greater than the sound pressure value of the corresponding second audio data block.
  11. The method according to claim 8, further comprising:
    when the signal strength of the first audio data block is weaker than the signal strength of the second audio data block, determining the time difference to be a crosstalk time difference;
    correspondingly, when the audio segment time difference matches the crosstalk time difference, determining that the audio data segment comprises crosstalk data.
  12. The method according to claim 11, wherein the signal strength of the first audio data block being weaker than the signal strength of the second audio data block comprises: the energy of the audio data in the first audio data block being less than the energy of the audio data in the corresponding second audio data block; or the sound pressure value of the audio data in the first audio data block being smaller than the sound pressure value of the audio data in the corresponding second audio data block.
  13. An electronic device, comprising:
    a first sound sensing device configured to generate a first audio data block, the first audio data block comprising a plurality of audio data segments;
    a second sound sensing device configured to generate a second audio data block, the second audio data block comprising a plurality of audio data segments; and
    a processor configured to calculate correlation coefficients between the audio data segments of the first audio data block and the audio data segments of the second audio data block to obtain a peak of the correlation coefficients; to take, as a reference time difference, the difference between the acquisition times of the audio data segment in the first audio data block and the audio data segment in the second audio data block that correspond to the peak; to take, as an audio segment time difference, the difference between the acquisition times of an audio data segment of the first audio data block and the corresponding audio data segment in the second audio data block; and, when the audio segment time difference does not match the reference time difference, to determine that the corresponding audio data segment of the first audio data block comprises crosstalk data.
  14. A crosstalk data detection method, comprising:
    receiving a first audio data block and a second audio data block, wherein the first audio data block and the second audio data block each comprise a plurality of audio data segments;
    calculating correlation coefficients between the audio data segments of the first audio data block and the audio data segments of the second audio data block to obtain a peak of the correlation coefficients;
    taking, as a reference time difference, the difference between the acquisition times of the audio data segment in the first audio data block and the audio data segment in the second audio data block that correspond to the peak; and
    sending the reference time difference, the first audio data block, and the second audio data block to a server, for the server to take, as an audio segment time difference, the difference between the acquisition times of an audio data segment of the first audio data block and the corresponding audio data segment in the second audio data block, and, when the audio segment time difference does not match the reference time difference, to determine that the corresponding audio data segment of the first audio data block comprises crosstalk data.
  15. A crosstalk data detection method, comprising:
    receiving a first audio data block, a second audio data block, and a reference time difference, wherein the first audio data block and the second audio data block each comprise a plurality of audio data segments;
    taking, as an audio segment time difference, the difference between the acquisition times of an audio data segment of the first audio data block and the corresponding audio data segment in the second audio data block; and
    when the audio segment time difference does not match the reference time difference, determining that the corresponding audio data segment of the first audio data block comprises crosstalk data.
  16. A crosstalk data detection method, comprising:
    receiving a first audio data block and a second audio data block, wherein the first audio data block and the second audio data block each comprise a plurality of audio data segments;
    calculating correlation coefficients between the audio data segments of the first audio data block and the audio data segments of the second audio data block to obtain a peak of the correlation coefficients; and
    sending the peak, the first audio data block, and the second audio data block to a server, for the server to take, as a reference time difference, the difference between the acquisition times of the audio data segment in the first audio data block and the audio data segment in the second audio data block that correspond to the peak; to take, as an audio segment time difference, the difference between the acquisition times of an audio data segment of the first audio data block and the corresponding audio data segment in the second audio data block; and, when the audio segment time difference does not match the reference time difference, to determine that the corresponding audio data segment of the first audio data block comprises crosstalk data.
  17. A crosstalk data detection method, comprising:
    receiving a peak of correlation coefficients, a first audio data block, and a second audio data block provided by a client, wherein the peak is the peak of the correlation coefficients between the audio data segments of the first audio data block and the audio data segments of the second audio data block;
    taking, as a reference time difference, the difference between the acquisition times of the audio data segment in the first audio data block and the audio data segment in the second audio data block that correspond to the peak;
    taking, as an audio segment time difference, the difference between the acquisition times of an audio data segment of the first audio data block and the corresponding audio data segment in the second audio data block; and
    when the audio segment time difference does not match the reference time difference, determining that the corresponding audio data segment of the first audio data block comprises crosstalk data.
  18. A crosstalk data detection method, comprising:
    receiving a first audio data block and a second audio data block, wherein the first audio data block and the second audio data block each comprise a plurality of audio data segments; and
    sending the first audio data block and the second audio data block to a server, for the server to calculate correlation coefficients between the audio data segments of the first audio data block and the audio data segments of the second audio data block to obtain a peak of the correlation coefficients; to take, as a reference time difference, the difference between the acquisition times of the audio data segment in the first audio data block and the audio data segment in the second audio data block that correspond to the peak; to take, as an audio segment time difference, the difference between the acquisition times of an audio data segment of the first audio data block and the corresponding audio data segment in the second audio data block; and, when the audio segment time difference does not match the reference time difference, to determine that the corresponding audio data segment of the first audio data block comprises crosstalk data.
PCT/CN2019/094530 2018-07-12 2019-07-03 Crosstalk data detection method and electronic device WO2020011085A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2021500297A JP2021531685A (ja) 2018-07-12 2019-07-03 クロストークデータ検出方法および電子デバイス
US17/111,341 US11551706B2 (en) 2018-07-12 2020-12-03 Crosstalk data detection method and electronic device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810763010.9 2018-07-12
CN201810763010.9A CN110718237B (zh) 2018-07-12 2018-07-12 串音数据检测方法和电子设备

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/111,341 Continuation US11551706B2 (en) 2018-07-12 2020-12-03 Crosstalk data detection method and electronic device

Publications (1)

Publication Number Publication Date
WO2020011085A1 true WO2020011085A1 (zh) 2020-01-16

Family

ID=69141849

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/094530 WO2020011085A1 (zh) 2018-07-12 2019-07-03 Crosstalk data detection method and electronic device

Country Status (4)

Country Link
US (1) US11551706B2 (zh)
JP (1) JP2021531685A (zh)
CN (1) CN110718237B (zh)
WO (1) WO2020011085A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2613898A (en) * 2021-12-20 2023-06-21 British Telecomm Noise cancellation

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110718237B (zh) 2018-07-12 2023-08-18 阿里巴巴集团控股有限公司 Crosstalk data detection method and electronic device
CN113539269A (zh) * 2021-07-20 2021-10-22 上海明略人工智能(集团)有限公司 Audio information processing method, system, and computer-readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006039108A (ja) * 2004-07-26 2006-02-09 Nippon Hoso Kyokai <Nhk> Specific-speaker voice output device and specific-speaker determination program
WO2017064840A1 (ja) * 2015-10-16 2017-04-20 パナソニックIpマネジメント株式会社 Sound source separation device and sound source separation method
TW201732785A (zh) * 2016-01-18 2017-09-16 博姆雲360公司 Subband spatial and crosstalk cancellation for audio reproduction
CN107316651A (zh) * 2017-07-04 2017-11-03 北京中瑞智科技有限公司 Microphone-based audio processing method and device

Family Cites Families (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07336790A (ja) * 1994-06-13 1995-12-22 Nec Corp Microphone system
SE519981C2 (sv) * 2000-09-15 2003-05-06 Ericsson Telefon Ab L M Coding and decoding of signals from multiple channels
JP3750583B2 (ja) * 2001-10-22 2006-03-01 ソニー株式会社 Signal processing method and apparatus, and signal processing program
GB2391322B (en) * 2002-07-31 2005-12-14 British Broadcasting Corp Signal comparison method and apparatus
CN101346896B (zh) 2005-10-26 2012-09-05 日本电气株式会社 Echo suppression method and device
US8260613B2 (en) * 2007-02-21 2012-09-04 Telefonaktiebolaget L M Ericsson (Publ) Double talk detector
WO2010092913A1 (ja) * 2009-02-13 2010-08-19 日本電気株式会社 Multichannel acoustic signal processing method, system, and program
KR101670313B1 (ko) * 2010-01-28 2016-10-28 삼성전자주식회사 Signal separation system and method for automatically selecting a threshold for sound source separation
US20130156238A1 (en) 2011-11-28 2013-06-20 Sony Mobile Communications Ab Adaptive crosstalk rejection
EP2645362A1 (en) 2012-03-26 2013-10-02 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method for improving the perceived quality of sound reproduction by combining active noise cancellation and perceptual noise compensation
CN103268766B (zh) * 2013-05-17 2015-07-01 泰凌微电子(上海)有限公司 Dual-microphone speech enhancement method and device
US9794888B2 (en) 2014-05-05 2017-10-17 Isco International, Llc Method and apparatus for increasing performance of a communication link of a communication node
US10127006B2 (en) 2014-09-09 2018-11-13 Sonos, Inc. Facilitating calibration of an audio playback device
US9747906B2 (en) 2014-11-14 2017-08-29 The Nielson Company (Us), Llc Determining media device activation based on frequency response analysis
US9672805B2 (en) 2014-12-12 2017-06-06 Qualcomm Incorporated Feedback cancelation for enhanced conversational communications in shared acoustic space
US9747656B2 (en) 2015-01-22 2017-08-29 Digimarc Corporation Differential modulation for robust signaling and synchronization
EP3257236B1 (en) 2015-02-09 2022-04-27 Dolby Laboratories Licensing Corporation Nearby talker obscuring, duplicate dialogue amelioration and automatic muting of acoustically proximate participants
CN104810025B (zh) * 2015-03-31 2018-04-20 天翼爱音乐文化科技有限公司 Audio similarity detection method and device
US10839309B2 (en) * 2015-06-04 2020-11-17 Accusonus, Inc. Data training in multi-sensor setups
CN107040843B (zh) * 2017-03-06 2021-05-18 联想(北京)有限公司 Method for acquiring the same sound source through two microphones, and acquisition device
CN106997769B (zh) * 2017-03-25 2020-04-24 腾讯音乐娱乐(深圳)有限公司 Vibrato recognition method and device
CN110718237B (zh) 2018-07-12 2023-08-18 阿里巴巴集团控股有限公司 Crosstalk data detection method and electronic device


Also Published As

Publication number Publication date
CN110718237A (zh) 2020-01-21
CN110718237B (zh) 2023-08-18
US11551706B2 (en) 2023-01-10
US20210090589A1 (en) 2021-03-25
JP2021531685A (ja) 2021-11-18

Similar Documents

Publication Publication Date Title
WO2020011085A1 (zh) Crosstalk data detection method and electronic device
US10482890B2 (en) Determining media device activation based on frequency response analysis
CN110678922A (zh) Ducking and erasing audio from nearby devices
JP6116038B2 (ja) System and method for program identification
EP3493198B1 (en) Method and device for determining delay of audio
EP3350804B1 (en) Collaborative audio processing
JP2014523003A (ja) Audio signal processing
US10887034B2 (en) Methods and apparatus for increasing the robustness of media signatures
US20140125582A1 (en) Gesture Recognition Apparatus and Method of Gesture Recognition
US9774743B2 (en) Silence signatures of audio signals
CN110718238B (zh) Crosstalk data detection method, client, and electronic device
CN109314933B (zh) 具有多功率电平的、基于跳过相关的对称载波侦听
JP2022185114A (ja) Echo detection
US10204634B2 (en) Distributed suppression or enhancement of audio features
CN114788304A (zh) 用于减少环境噪声补偿系统中的误差的方法
WO2018160436A1 (en) Audio data transmission using frequency hopping
WO2022052965A1 (zh) Voice replay attack detection method, apparatus, medium, device, and program product
GB2609171A (en) Voice authentication device
US20230238008A1 (en) Audio watermark addition method, audio watermark parsing method, device, and medium
US11265650B2 (en) Method, client, and electronic device for processing audio signals
CN111343660A (zh) Application testing method and device
CN108924465A (zh) Method, apparatus, device, and storage medium for determining a speaker terminal in a video conference
Zhang et al. Speaker Orientation-Aware Privacy Control to Thwart Misactivation of Voice Assistants
CN112542178B (zh) Audio data processing method and apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19833855

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021500297

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19833855

Country of ref document: EP

Kind code of ref document: A1