CN113891152A - Audio playing control method and device, equipment, medium and product thereof - Google Patents


Info

Publication number
CN113891152A
Authority
CN
China
Prior art keywords
reference signal
audio data
audio
real-time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111146341.6A
Other languages
Chinese (zh)
Inventor
何鑫
苏嘉昌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Huaduo Network Technology Co Ltd
Original Assignee
Guangzhou Huaduo Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Huaduo Network Technology Co Ltd filed Critical Guangzhou Huaduo Network Technology Co Ltd
Priority to CN202111146341.6A
Publication of CN113891152A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/439 Processing of audio elementary streams
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/21 Server components or server architectures
    • H04N 21/218 Source of audio or video content, e.g. local disk arrays
    • H04N 21/2187 Live feed
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G10L 21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L 21/0224 Processing in the time domain
    • G10L 21/0232 Processing in the frequency domain
    • G10L 21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L 2021/02082 Noise filtering, the noise being echo, reverberation of the speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Databases & Information Systems (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The application relates to the technical field of audio signal processing and discloses an audio playing control method together with a corresponding apparatus, device, medium, and product. The method comprises the following steps: continuously acquiring real-time audio data generated by a local external input sound source during a live webcast; detecting a reference signal that was added in advance to the audio data of the local playback sound source and looped back through external playback, and determining the corresponding loopback delay value, the reference signal being a high-frequency signal outside the hearing band of the human ear; offsetting the audio data of the local playback sound source by the loopback delay value and mixing it with the real-time audio data to obtain corrected audio data; and outputting the corrected audio data as the audio stream of the live webcast. By calculating the audio loopback delay in real time, without any additional user operation, the method keeps the voice aligned with the accompaniment in live webcasting, remote karaoke, and live-singing scenarios, improving the user experience.

Description

Audio playing control method and device, equipment, medium and product thereof
Technical Field
The present application relates to the field of audio signal processing technologies, and in particular, to an audio playback control method and a corresponding apparatus, computer device, computer-readable storage medium, and computer program product.
Background
In karaoke and live-singing scenarios on a terminal device, the device plays a music accompaniment while the host sings along; the host's vocals and the accompaniment played by the device are then mixed and sent to the far end, as shown in fig. 1.
In fig. 1, the interval between the mixing time T4 and the acquisition time T3 is short, so T4 may be treated as T3. The host's voice is captured at time T3, while the corresponding accompaniment was played at time T1. If no delay compensation is applied during mixing, or the compensation value is inaccurate, the far end hears vocals that were sung along with the accompaniment of time T1 mixed with accompaniment from a different moment; that is, the voice and the accompaniment are out of sync, which degrades the user experience.
The delay compensation value is the audio loopback delay: the time interval (T3 - T1 in the figure) between a sound being played out by the terminal device and that sound being captured again by the device. The loopback delay differs across audio interfaces, platforms (Android/iOS/macOS/Windows), and vendor devices, so a technique is needed to calculate it. In general, when the compensation error is below 50 ms, the human ear cannot perceive the desynchronization between voice and accompaniment; the error of a loopback-delay calculation method is therefore typically kept within 50 ms.
Many methods exist in the prior art for estimating the loopback delay value, each with advantages and disadvantages. For example, both the Larsen test method and the delay measurement method in WebRTC must measure the delay in advance and cannot run in real time; they also suffer from inaccurate or failed detection in noisy environments, so their detection precision is low.
In view of this, the applicant has carried out the corresponding research with innovation in mind.
Disclosure of Invention
A primary object of the present application is to solve at least one of the above problems and provide an audio playback control method and a corresponding apparatus, computer device, computer readable storage medium, and computer program product.
To meet the various objectives of the present application, the following technical solutions are adopted:
an audio playing control method adapted to one of the objectives of the present application includes the following steps:
continuously acquiring real-time audio data generated by a local external input sound source in the live webcasting process;
detecting a reference signal which is added to the audio data of the local playing sound source in advance and looped back by the external playing, and determining a corresponding loopback delay value; the reference signal is a high-frequency signal outside the hearing frequency band of the human ear;
controlling the audio data of the local playback sound source to be offset by the loopback delay value and then mixed with the real-time audio data to obtain corrected audio data;
and outputting the corrected audio data as an audio stream of the network live broadcast.
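The mixing step above can be sketched in a few lines. This is an illustrative sketch, not the patent's implementation, and all names are assumptions: each captured vocal sample was sung against the accompaniment sample played one loopback delay earlier, so the accompaniment index is offset before mixing.

```python
def mix_with_compensation(accomp, vocals, delay_samples):
    """Mix captured vocals with the accompaniment the singer actually heard:
    vocal sample i corresponds to accompaniment sample i - delay_samples."""
    mixed = []
    # Skip the first delay_samples vocals: their accompaniment predates the buffer.
    for i in range(delay_samples, min(len(vocals), len(accomp) + delay_samples)):
        mixed.append(0.5 * (vocals[i] + accomp[i - delay_samples]))
    return mixed
```

If the vocals are simply the accompaniment looped back a few samples late, compensation realigns the two streams exactly.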
In a further embodiment, the method for continuously acquiring real-time audio data generated by a local external input sound source in a live webcasting process comprises the following steps:
acquiring audio data corresponding to a local playing sound source for playing;
continuously collecting external voice signals generated by an external input sound source;
converting the external voice signal into the real-time audio data.
In a further embodiment, detecting a reference signal that is externally looped back in audio data that is pre-added to the local playback audio source, and determining a corresponding loopback delay value includes the following steps:
constructing the reference signal;
adding the reference signal to the audio data of a local playing sound source at a first moment for playing out;
starting detection of the reference signal in the real-time audio data, and determining a second moment when the reference signal is detected;
and determining the difference value between the second time and the first time as the loopback delay value.
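These steps reduce to recording the injection time and subtracting it from the detection time. A minimal sketch, where the class and method names are illustrative assumptions:

```python
class LoopbackDelayEstimator:
    """Record the first moment (reference signal injected into the playback
    audio) and compute the loopback delay at the second moment (signal
    detected in the captured real-time audio)."""

    def __init__(self):
        self.inject_time = None

    def on_reference_injected(self, t):
        self.inject_time = t  # first moment

    def on_reference_detected(self, t):
        if self.inject_time is None:
            raise RuntimeError("reference signal was never injected")
        return t - self.inject_time  # loopback delay value
```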
In an extended embodiment, after initiating the detection of the reference signal in the real-time audio data, the method comprises the following steps:
tracking a noise signal along a time domain for the real-time audio data;
transforming the noise signal to a frequency domain to obtain corresponding noise energy spectrum data;
and detecting the reference signal according to the voice energy spectrum data mapped by the voice frame of the real-time audio data and the noise energy spectrum data.
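A hedged sketch of this noise-tracking idea, assuming an asymmetric exponential smoother (a common choice, not stated in the patent): the per-bin floor follows energy drops immediately but rises slowly, so the brief reference tones do not inflate the noise estimate against which they are compared.

```python
def track_noise(noise_spec, frame_spec, alpha=0.9):
    """Per-bin noise floor tracked over time: follow energy drops
    immediately, absorb energy rises slowly."""
    out = []
    for n, x in zip(noise_spec, frame_spec):
        out.append(x if x < n else alpha * n + (1 - alpha) * x)
    return out

def snr_per_bin(frame_spec, noise_spec, eps=1e-12):
    """Ratio of the current frame's energy spectrum to the tracked noise
    energy spectrum: the quantity the detector thresholds."""
    return [x / (n + eps) for x, n in zip(frame_spec, noise_spec)]
```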
In an embodiment, the detecting the reference signal according to the speech energy spectrum data mapped by the speech frame of the real-time audio data and the noise energy spectrum data comprises:
positioning the peak position of each frequency point according to the voice energy spectrum data mapped by each voice frame in the real-time audio data;
calculating the existence probability of the reference signal in each voice frame according to the voice energy and the noise energy corresponding to each frequency point;
and judging to detect the reference signal when the existence probability of a plurality of continuous voice frames meets a preset condition.
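The three steps above might be sketched as follows; the thresholds (energy ratio, minimum probability, run length) are illustrative assumptions, not values from the patent:

```python
def presence_probability(speech_energy, noise_energy, ratio=4.0):
    """Fraction of reference-frequency bins whose speech energy clearly
    exceeds the tracked noise energy in this frame."""
    hits = sum(1 for s, n in zip(speech_energy, noise_energy) if s > ratio * n)
    return hits / max(len(speech_energy), 1)

def reference_detected(frame_probs, p_min=0.8, consecutive=5):
    """Declare detection once `consecutive` successive frames all reach a
    presence probability of at least p_min."""
    run = 0
    for p in frame_probs:
        run = run + 1 if p >= p_min else 0
        if run >= consecutive:
            return True
    return False
```

Requiring a run of consecutive frames suppresses spurious single-frame peaks caused by noise.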
In a preferred embodiment, the reference signal comprises a plurality of single-frequency signals, and each single-frequency signal is arranged at equal intervals in a frequency domain.
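Such a multi-tone reference signal can be generated as a sum of sinusoids. The specific frequencies (18-20 kHz at a 48 kHz sample rate), amplitude, and duration below are illustrative assumptions, not values from the patent:

```python
import math

def build_reference_signal(freqs_hz=(18000, 19000, 20000), sample_rate=48000,
                           duration_s=0.05, amplitude=0.05):
    """Sum of single-frequency tones equally spaced in the frequency domain,
    near or above the upper edge of the audible band."""
    n = int(sample_rate * duration_s)
    return [amplitude * sum(math.sin(2 * math.pi * f * t / sample_rate)
                            for f in freqs_hz)
            for t in range(n)]
```

Equal spacing gives the tones a comb structure that is easy to match in the energy spectrum while each tone stays inaudible.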
In an extended embodiment, after initiating the detection of the reference signal in the real-time audio data, the method comprises the following steps:
in response to an event of failure to detect the reference signal, reconstructing the reference signal and initiating a secondary detection, the reconstructed reference signal being a high frequency signal outside the hearing band of the human ear having a frequency lower than the frequency of the previously constructed reference signal.
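A sketch of this fallback, assuming a fixed downward step and a floor below which the tones would become audible (both values are assumptions):

```python
def next_reference_freqs(freqs_hz, step_hz=500, floor_hz=16000):
    """On detection failure (e.g. the device's playback path low-pass filters
    the highest tones), shift the whole tone set down for a second attempt."""
    shifted = [f - step_hz for f in freqs_hz]
    if min(shifted) < floor_hz:
        raise ValueError("cannot lower reference frequencies further")
    return shifted
```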
An audio playback control apparatus adapted to one of the objects of the present application includes:
the audio acquisition module is used for continuously acquiring real-time audio data generated by a local external input sound source in the live network broadcast process; (ii) a
The loopback detection module is used for detecting a reference signal which is added to the audio data of the local playing sound source in advance and is looped back by external playing, and determining a corresponding loopback delay value; the reference signal is a high-frequency signal outside the hearing frequency band of the human ear;
the audio mixing correction module is used for controlling audio data of a local playing sound source to be mixed with the real-time audio data after being superposed with the loopback delay value so as to obtain corrected audio data;
and the audio stream pushing module is used for outputting the corrected audio data as an audio stream of the network live broadcast.
In a further embodiment, the audio acquisition module comprises:
the local play-out submodule is used for acquiring audio data corresponding to a local play sound source and playing the audio data out;
the local acquisition submodule is used for continuously acquiring external voice signals generated by an external input sound source;
and the voice conversion submodule is used for converting the external voice signal into the real-time audio data.
In a further embodiment, the loopback detection module comprises:
a signal construction sub-module for constructing the reference signal;
the signal presetting submodule is used for adding the reference signal to the audio data of the local playing sound source at a first moment for external playing;
the signal detection submodule is used for starting detection of the reference signal in the real-time audio data and determining a second moment when the reference signal is detected;
and the delay calculation submodule is used for determining the difference value between the second moment and the first moment as the loopback delay value.
In an expanded embodiment, the signal detection sub-module includes:
the noise tracking unit is used for tracking a noise signal along a time domain for the real-time audio data;
the noise transformation unit is used for transforming the noise signal to a frequency domain to obtain corresponding noise energy spectrum data;
and the signal measuring and calculating unit is used for detecting the reference signal according to the voice energy spectrum data mapped by the voice frame of the real-time audio data and the noise energy spectrum data.
In an embodied embodiment, the signal evaluation unit comprises:
the peak positioning subunit is used for positioning the peak position of each frequency point according to the voice energy spectrum data mapped by each voice frame in the real-time audio data;
the probability evaluation subunit is used for calculating the existence probability of the reference signal in each voice frame according to the voice energy and the noise energy corresponding to each frequency point;
and the signal judgment subunit is used for judging that the reference signal is detected when the existence probability of a plurality of continuous voice frames meets a preset condition.
In a preferred embodiment, the reference signal comprises a plurality of single-frequency signals, and each single-frequency signal is arranged at equal intervals in a frequency domain.
In an expanded embodiment, the signal detection sub-module further includes:
and the restart detection unit is used for reconstructing the reference signal and starting secondary detection in response to the event of failure of detecting the reference signal, wherein the reconstructed reference signal is a high-frequency signal outside the hearing frequency band of the human ear, and the frequency of the reconstructed reference signal is lower than that of the previously constructed reference signal.
The computer device comprises a central processing unit and a memory, wherein the central processing unit is used for calling and running a computer program stored in the memory to execute the steps of the audio playing control method.
A computer-readable storage medium stores, in the form of computer-readable instructions, a computer program implementing the audio playback control method; when the program is invoked by a computer, the steps of the method are executed.
A computer program product adapted to another object of the present application is provided, which comprises computer program/instructions, which when executed by a processor, implement the steps of the audio playback control method described in any of the embodiments of the present application.
Compared with the prior art, the application has the following advantages:
the application presets high-frequency signals which can not be identified by human ears in the audio data corresponding to the local playing sound source as reference signals, determines the loopback delay value of the loopback of the audio data after the audio data is played, then the loopback delay value is fed back to the audio mixing link at the acquisition side, the alignment of two paths of audio data corresponding to two types of local playing sound sources and external input sound sources is realized by utilizing the loopback delay value, the corrected audio data is obtained to carry out stream pushing output, so that the sound of the two types of local playing sound sources and the external input sound source are synchronized in time, the reference signal is detected when the real-time audio data is collected after the audio data of the local playing audio is added in the process of playing the local playing audio, so that various factors of the terminal equipment can be adapted to individualize and accurately determine the corresponding loopback delay value of the communication equipment, and therefore, the corrected audio data can achieve a more accurate sound source synchronization effect.
In the present application, the reference signal is an out-of-band high-frequency signal beyond the hearing range of the human ear, whereas ordinary voice signals are in-band. The reference signal therefore has distinctive characteristics: after being played out it does not affect the quality or audibility of the played voice or music, yet it is easily and accurately recognized by a computer program. This markedly improves the detection success rate, so the loopback delay value can be determined efficiently and accurately for correcting the audio data.
The present application can be applied to scenarios with online-entertainment properties, including but not limited to live video webcasting, remote karaoke, and remote network teaching. In these scenarios, the speaker outputs real-time audio data corrected with the loopback delay value obtained by the present application; after the corrected stream is transmitted to the remote users' terminal devices, listeners hear the various sound sources play in synchrony. From the listener's perspective, the singer's voice stays synchronized with the lyrics and background music in the video, improving the user experience.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic diagram of the principle of calculating the loopback delay value of an audio signal, where T1 to T4 represent different time instants, the acquisition thread is responsible for audio data acquisition, the playing thread is responsible for audio data conversion and playing, and the anchor indicates an external input sound source;
fig. 2 is a flowchart illustrating an exemplary embodiment of an audio playback control method according to the present application;
fig. 3 is a flowchart illustrating a process of determining a loopback delay value in an embodiment of the present application;
FIG. 4 is a flowchart illustrating a process of detecting a reference signal from audio data according to an embodiment of the present application;
FIG. 5 is a flowchart illustrating a process of detecting a reference signal according to a speech frame according to an embodiment of the present application;
FIG. 6 is a functional block diagram of an exemplary embodiment of an audio playback control apparatus of the present application;
fig. 7 is a schematic structural diagram of a computer device used in the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
As will be appreciated by those skilled in the art, "client," "terminal," and "terminal device" as used herein include both devices that are wireless signal receivers, which are devices having only wireless signal receivers without transmit capability, and devices that are receive and transmit hardware, which have receive and transmit hardware capable of two-way communication over a two-way communication link. Such a device may include: cellular or other communication devices such as personal computers, tablets, etc. having single or multi-line displays or cellular or other communication devices without multi-line displays; PCS (Personal Communications Service), which may combine voice, data processing, facsimile and/or data communication capabilities; a PDA (Personal Digital Assistant), which may include a radio frequency receiver, a pager, internet/intranet access, a web browser, a notepad, a calendar and/or a GPS (Global Positioning System) receiver; a conventional laptop and/or palmtop computer or other device having and/or including a radio frequency receiver. As used herein, a "client," "terminal device" can be portable, transportable, installed in a vehicle (aeronautical, maritime, and/or land-based), or situated and/or configured to operate locally and/or in a distributed fashion at any other location(s) on earth and/or in space. The "client", "terminal Device" used herein may also be a communication terminal, a web terminal, a music/video playing terminal, such as a PDA, an MID (Mobile Internet Device) and/or a Mobile phone with music/video playing function, and may also be a smart tv, a set-top box, and the like.
The hardware referred to by names such as "server", "client", and "service node" is essentially an electronic device with the capabilities of a personal computer: a hardware device having the components required by the von Neumann architecture, such as a central processing unit (comprising an arithmetic unit and a controller), memory, input devices, and output devices. A computer program is stored in the memory; the central processing unit loads the program from external storage into internal memory and runs it, executing the program's instructions and interacting with the input and output devices to complete a specific function.
It should be noted that the concept of "server" as referred to in this application can be extended to the case of a server cluster. According to the network deployment principle understood by those skilled in the art, the servers should be logically divided, and in physical space, the servers may be independent from each other but can be called through an interface, or may be integrated into one physical computer or a set of computer clusters. Those skilled in the art will appreciate this variation and should not be so limited as to restrict the implementation of the network deployment of the present application.
One or more technical features of the present application, unless expressly specified otherwise, may be deployed to a server for implementation by a client remotely invoking an online service interface provided by a capture server for access, or may be deployed directly and run on the client for access.
Various data referred to in the present application may be stored either remotely on a server or in a local terminal device, unless expressly specified otherwise, as long as the data is suitable for being called by the technical solution of the present application.
The person skilled in the art will know this: although the various methods of the present application are described based on the same concept so as to be common to each other, they may be independently performed unless otherwise specified. In the same way, for each embodiment disclosed in the present application, it is proposed based on the same inventive concept, and therefore, concepts of the same expression and concepts of which expressions are different but are appropriately changed only for convenience should be equally understood.
The embodiments disclosed herein can be flexibly constructed by cross-combining related technical features of the embodiments, unless a mutual-exclusion relationship between those features is expressly stated, as long as the combination does not depart from the inventive spirit of the present application and can meet the needs of the prior art or remedy its deficiencies. Those skilled in the art will appreciate variations thereof.
The audio playing control method of the present application can be applied to terminal devices in an offline environment as well as to terminal devices in scenarios where a server supports instant communication, including but not limited to instant messaging, live webcasting, online customer service, remote karaoke, live remote singing, and remote voice teaching. As a basic audio-processing technique, it has a wide range of application. The method can be implemented by programming as a computer program product deployed to run on a terminal device.
Referring to fig. 2 and fig. 1, in an exemplary embodiment of an audio playback control method of the present application, the method includes the following steps:
step S1100, continuously acquiring real-time audio data generated by a local external input sound source in the network live broadcast process;
in an exemplary application scenario for facilitating understanding of the following description, in a webcast room, a terminal device where a host user is located plays a song and includes background music accompanying the song, where the background music may be played by the host user calling a local music file or played by streaming from a remote server, and when the background music is played, corresponding lyrics are displayed on a display, and the host user in the webcast room holds a microphone to focus on the lyrics to guide a high song, and audio data corresponding to the background music is used as audio data corresponding to a local play sound source generated by a local player, and is locally played, and is also transmitted to a server supporting operation of the webcast room, and the live broadcast is listened to by the user in the webcast room.
Such a scene involves at least two sound sources, which generate two corresponding audio streams. The first sound source is the terminal device playing the background music, i.e., the local playback sound source; the second is the external voice signal received by the microphone of the terminal device, i.e., the external input sound source. The audio data generated by the first sound source is the first audio data, and that generated by the second sound source is the second audio data. On the one hand, through an acquisition thread, the terminal device mixes the two audio streams into mixed audio data and pushes it to the live room; on the other hand, through a playback thread, the audio data of the local playback sound source is mixed with the audio data of any other sound sources and played out locally.
In other exemplary application scenarios, such as the remote speech teaching scenario described above, the anchor user may use other background audio instead of background music, but it still essentially constitutes a local playing sound source.
The terminal device plays and collects audio data continuously and synchronously, which those skilled in the art can implement flexibly. In a preferred embodiment, this can be carried out according to the following steps:
step S1110, obtaining audio data corresponding to the local sound source for playing:
as described above, the audio data corresponding to the local playing sound source may be called from a local file of the local terminal device, and may be played by a player, or may be obtained by pulling a stream from a remote server.
Specifically, when the audio data of the local playing sound source is obtained, the audio data can be collected from a sound card of the local terminal device, or the audio data can be obtained by directly decoding a corresponding audio file.
Step S1120, continuously collecting an external voice signal generated by an external input sound source:
the sound card of the terminal equipment can continuously collect external voice signals, so that the voice signals input by input equipment such as a microphone can be collected through the sound card in the same way.
Step S1130, converting the external voice signal into the real-time audio data:
The sound card samples and converts the collected voice signal to form the corresponding second audio data, i.e., the real-time audio data of the external input sound source. In this alternative embodiment, the external input sound source therefore mainly takes the form of voice input sampled by the sound card of the terminal device itself.
The external input sound source is not limited to an external input form such as a microphone, and may also be understood as a form in which another device transmits an external voice signal in an electrical signal form or directly transmits corresponding audio data, and the technical solution of the present application may be applied as long as there is a need for calculating a loopback delay value.
Step S1200, detecting a reference signal which is added to the audio data of the local playing sound source in advance and is looped back by the external playing, and determining a corresponding loopback delay value; the reference signal is a high-frequency signal outside the hearing frequency band of the human ear:
to detect the delay value generated by loopback after the first audio data is played out by the terminal device, namely the loopback delay value, the service logic on the playing side presets a reference signal in the first audio data, i.e., the audio data corresponding to the local playing sound source. The time taken for the echo of the played reference signal to appear in the real-time audio data is then measured and determined as the loopback delay value, which the acquisition side uses to correct the audio data and align the audio data of the multiple sound sources.
Because the loopback delay value mainly depends on the hardware of the terminal device, and is relatively fixed for the same device, once obtained it can be stored locally on the terminal device and called directly thereafter.
The reference signal, self-defined by the present application, is constructed as a signal with a certain regularity and uniqueness, so that it is easy to distinguish from the signal content of the audio data of the local playing sound source. In the present application, the reference signal is defined as a high-frequency signal outside the band corresponding to the hearing range of the human ear. The human ear perceives frequencies between roughly 20 Hz and 20,000 Hz, with individual variation, and out-of-band signals are not perceptible. Therefore, in the application scenarios the present application adapts to, selecting a high-frequency signal outside the hearing band as the reference signal fully accounts for sound-quality requirements: it neither damages the voice content of any sound source nor disturbs the user.
In a preferred embodiment of the present application, the reference signal is configured to include a plurality of single-frequency signals spaced at equal intervals in the frequency domain; for example, two adjacent single-frequency signals may be separated by one, two, or three sampling-resolution units. This form is relatively easy to detect, and the corresponding detection algorithm is computationally efficient and takes little time.
In an alternative embodiment, the single-frequency signals of the reference signal may instead be arranged at unequal intervals, and the reference signal may similarly take other forms, as long as it can subsequently be detected by a corresponding algorithm in the service logic on the acquisition side.
Step S1300, controlling the audio data of the local playing sound source to be delayed by the loopback delay value and then mixed with the real-time audio data, obtaining corrected audio data:
the loopback delay value corresponds, in the functional block diagram shown in fig. 1, to the difference between T3 and T1. It is used to apply delay compensation on the acquisition side to the audio data corresponding to the local playing sound source, so that the local playing sound source and the external input sound source are aligned at the acquisition-side mixing stage. After the corrected audio data obtained by mixing is pushed, parsed, and played by a remote device, the sounds of the different sources play essentially synchronously in the time domain, or are at least perceived as synchronous.
Therefore, after the loopback delay value is determined, the service logic on the acquisition side is responsible for delay compensation: in the acquisition-side mixing stage, the loopback delay value is superposed onto the audio data generated by the local playing sound source before it is mixed with the audio data generated by the external input sound source, thereby correcting the alignment relationship among the multiple sound sources and obtaining the corrected audio data.
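As an illustrative sketch only (not code from the application, and with assumed function and parameter names), the delay compensation and mixing on the acquisition side can be expressed as follows: the locally played samples are shifted by the measured loopback delay value and then additively mixed with the collected real-time audio.

```python
def mix_with_compensation(local_audio, mic_audio, delay_ms, sample_rate=48000):
    # Shift the local playing sound source by the measured loopback
    # delay so it lines up with the real-time audio from the microphone.
    delay = int(sample_rate * delay_ms / 1000)
    shifted = [0.0] * delay + list(local_audio)
    n = max(len(shifted), len(mic_audio))
    shifted += [0.0] * (n - len(shifted))
    mic = list(mic_audio) + [0.0] * (n - len(mic_audio))
    # Simple additive mix, clipped to the normalized sample range.
    return [max(-1.0, min(1.0, a + b)) for a, b in zip(shifted, mic)]
```

In a real pipeline the shift would be applied to a streaming buffer rather than whole lists; this only illustrates the alignment arithmetic.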
Step S1400, outputting the corrected audio data as an audio stream of the network live broadcast:
because the reference signal constructed by the present application has distinctive characteristics, it is easier to detect, so the loopback delay value can be determined efficiently and accurately. Consequently, after the corrected audio data is converted into a voice signal and played, the synchronization and harmony between the different sound sources are clearly perceptible, and the sound quality is better.
In the present application, in order to detect the reference signal, the audio data of the local playing sound source should be played out through a loudspeaker, so that after the acquisition side collects the real-time audio data, the reference signal can be obtained from the echo signal within it. Of course, once the loopback delay value has been determined and stored, it can be called directly thereafter, and playing out the audio data of the local playing sound source is no longer required.
For an instant messaging scene such as a live broadcast room exemplified in the present application, the corrected audio data needs to be included in a live broadcast stream as an audio stream and transmitted to a server of the live broadcast room, and then pushed by the server to a terminal device of an audience user of the live broadcast room for playing. It can be understood that the audience user plays and outputs according to the corrected audio data, and the playing effect of the alignment of the multiple sound sources can be obtained.
In an extended embodiment of the present application, in order to ensure the success rate of detecting the reference signal, the implementation environment may be checked in advance, specifically by performing the following steps:
detecting whether the terminal equipment is simultaneously in an audio data acquisition state and a play-out state, where the play-out state means the local terminal equipment has enabled a loudspeaker, such as a speaker box, from which echo signals can conveniently be collected, rather than playing through a device such as an earphone;

detecting whether the sampling performed by the terminal equipment uses a preset sampling rate, for example a preset rate of 44.1 kHz or 48 kHz;

and detecting whether the terminal equipment has an audio loopback path corresponding to the local playing sound source; if so, the reference signal is preset and detected over this audio loopback path.
When the above checks confirm that all the corresponding conditions are satisfied, i.e., the terminal device is simultaneously in the acquisition state and the play-out state, samples at the preset sampling rate, and forms an audio loopback path through play-out, the process of detecting the reference signal can be carried out.
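The three environment checks above can be condensed into a small guard function. The following is a hedged sketch with assumed names and an assumed set of supported rates, not an API from the application:

```python
def can_measure_loopback(is_capturing, speaker_active, sample_rate,
                         supported_rates=(44100, 48000)):
    # The reference signal is only injected when the device is capturing,
    # playing out through a loudspeaker (not earphones, which would break
    # the audio loopback path), and sampling at a preset rate.
    return is_capturing and speaker_active and sample_rate in supported_rates
```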
In the method, the playing thread presets a high-frequency signal that the human ear cannot perceive, as a reference signal, in the audio data of the local playing sound source. The loopback delay value of that audio data after play-out is determined from the reference signal and fed back to the mixing stage of the acquisition thread, where it is used to align the two types of sound sources and obtain corrected audio data for stream pushing. When the corrected audio data is played on the terminal devices of audience users in the live broadcast room, the multiple sound sources remain essentially synchronized in time, or at least are perceived by users as coordinated and synchronous. Because the reference signal is added in real time by the playing thread to the audio data of the local playing sound source, and detected by the acquisition thread from the echo contained in the real-time audio data, the method adapts to the particular factors of each terminal device and determines its loopback delay value individually and accurately, so the corrected audio data achieves a more precise sound-source synchronization effect.
In the present application, the reference signal is an out-of-band high-frequency signal beyond the hearing band that the human ear can perceive, while ordinary voice signals are in-band, so the reference signal has its own distinctive characteristics. After it is played out, it does not affect the sound quality or audibility of the played voice or music, yet it is easily and accurately identified by a computer program. The success rate of detection is thus markedly improved, and the loopback delay value can be determined efficiently and accurately for correcting the audio data.
The present application can be applied to application scenarios with online entertainment properties, including but not limited to network live broadcast, remote karaoke, remote voice teaching, and remote singing live broadcast. In such scenarios, when the audio corrected on the basis of the loopback delay value obtained by the present application is output to remote terminal devices, listeners obtain the effect of multiple sound sources playing synchronously: from the listener's perspective, the singer's voice is perceived as staying synchronized with the background audio or background lyrics, which improves the user experience.
Referring to fig. 3, in a further embodiment, the step S1200 of detecting a reference signal added in advance to the audio data of the local playing sound source and looped back through external playing, and determining a corresponding loopback delay value, includes the following steps:
step S1210, constructing the reference signal:
the reference signal is configured as a plurality of high-frequency single-frequency signals, for example three, with the frequency point of each outside the hearing band perceivable by the human ear, as follows:
ref(t) = α0 · [sin(2πf0t) + sin(2πf1t) + sin(2πf2t)]
where f0, f1, f2 are the reference signal frequencies, with:
f1=f0+2Δf
f2=f0+4Δf
Δf is the frequency resolution, i.e., at the current audio sampling rate fs, with the number of DFT (discrete Fourier transform) points being N:

Δf = fs/N
α0 is the amplitude of the reference signal; for 16-bit quantized audio, for example, the reference value of α0 is 8192. The duration of each single-frequency signal is typically 200 ms.
In a preferred embodiment, considering the existence of the audio loopback path, in order to better detect the added single-frequency signals on the acquisition side, a mute signal ref_silence(t) lasting a period of time, for example 2 seconds, may be added on the playing side before the reference signal, so as to delay the start of detection on the acquisition side. To this end, in the service logic of the playing side, the reference signal can be expressed as:
ref_s0(t) = ref_silence(t), 0 ≤ t < ts;  ref(t − ts), t ≥ ts

where ts is the duration of the mute segment (e.g., 2 s).
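Putting the formulas above together, the playing-side signal — a mute segment followed by three high-frequency tones spaced 2Δf apart — might be generated as in the following sketch. The function name, the default f0, and the decision to sum the three tones at equal amplitude are illustrative assumptions, not details fixed by the application:

```python
import math

def build_reference_signal(fs=48000, n_dft=1024, f0=20500.0,
                           amplitude=8192, tone_ms=200, silence_ms=2000):
    # Frequency resolution of the N-point DFT; tones sit 2*df apart.
    df = fs / n_dft
    freqs = [f0, f0 + 2 * df, f0 + 4 * df]        # f0, f1, f2
    silence = [0] * int(fs * silence_ms / 1000)   # ref_silence(t)
    n_tone = int(fs * tone_ms / 1000)             # e.g. 200 ms of tones
    tone = [int(amplitude * sum(math.sin(2 * math.pi * f * t / fs)
                                for f in freqs))
            for t in range(n_tone)]
    return silence + tone                         # ref_s0(t)
```

With fs = 48 kHz, tones near 20.5 kHz stay below the Nyquist limit while remaining above the audible band.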
step S1220, adding the reference signal to the audio data of the local sound source at the first time for playing out;
assume the audio data of the local playing sound source to be played by the terminal equipment is render(t). Mixing render(t) with the reference signal ref_s0(t) and playing the result completes the operation of adding the reference signal in real time; correspondingly, the time at which this occurs is marked as the first moment.
Step S1230, starting detection of the reference signal in the real-time audio data, and determining a second time when the reference signal is detected:
after the playing thread plays out the reference signal along with the audio data of the local playing sound source, the acquisition thread begins detecting the reference signal in real time. It is understood that detection requires an algorithm designed to match the characteristics of the reference signal, so that those characteristics can be accurately identified and the presence of the reference signal confirmed; the specific implementation of such algorithms may vary. The implementation and application of the algorithm for the reference signal of this embodiment are disclosed further in later embodiments and are not detailed here.
After the reference signal has been detected by means of these corresponding algorithms, the time at which it was detected can be determined and marked as the second moment.
And step S1240, determining the difference value between the second moment and the first moment as the loopback delay value.
Referring again to fig. 1, since the first moment is the time at which the playing side adds the reference signal to the outgoing audio data of the local playing sound source, and the second moment is the time at which the acquisition side detects the reference signal, the difference between the two is the actual time the reference signal takes to traverse the audio loopback path. The difference between the second moment and the first moment is therefore determined as the loopback delay value.
This embodiment discloses, by way of example, a method of constructing the reference signal: it is defined as a plurality of single-frequency signals arranged at equal intervals in the frequency domain, giving it a distinctive character that makes it easier to identify. Because the reference signal lies outside the band the human ear can perceive, it causes no audible interference with the sound of the sources. In addition, the construction fully considers the audio loopback requirement: a mute signal is placed before the reference signal so that the single-frequency signals arrive after it, giving the acquisition side sufficient time to wait for the reference signal and further improving the detection success rate.
Referring to fig. 4, in an expanded embodiment, after the step S1230 starts the detection of the reference signal in the real-time audio data, the method includes the following steps:
step S2310, tracking the noise signal along the time domain for the real-time audio data:
as mentioned above, the reference signal is configured as a high-frequency signal and is therefore susceptible to interference from high-frequency noise. To prevent such noise from affecting detection of the reference signal, the noise signal in the real-time audio data must be tracked on the acquisition side. Those skilled in the art can realize noise tracking with various common algorithms; the present application recommends the MCRA family, in particular the IMCRA algorithm, for frequency-point noise tracking. IMCRA tracks the minimum value of a frequency point in the time domain and is well known to those skilled in the art.
Step S2320, the noise signal is transformed to a frequency domain, and corresponding noise energy spectrum data is obtained:
so that the energy of the noise signal can serve as a reference, the noise signal in the real-time audio data needs to be transformed into the frequency domain to obtain the corresponding noise energy spectrum data. The common approach to computing frequency-domain energy is the FFT, which calculates the energy values of all frequency points, so in this embodiment the FFT can be applied to obtain the noise energy spectrum data corresponding to the noise signal.
In an embodiment optimized on this basis, since the reference signal customized in the foregoing embodiments contains only a few single-frequency signals, only the energy values of the few frequency points added on the playing side need to be computed, for example the 3 frequency points disclosed above. In this case a full FFT is needlessly expensive, with complexity O(N log N). An optimized embodiment therefore recommends the Goertzel algorithm for computing the frequency-domain energy, with complexity O(N), greatly reducing the computation; the estimated frequency-point noise is denoted λ(fi), i = 0, 1, 2.
It will be appreciated that one skilled in the art may transform the noise signal from the time domain to the frequency domain in a variety of ways to obtain corresponding noise energy spectral data.
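As a hedged illustration of the recommended approach, the Goertzel recurrence below computes the energy of a single frequency bin in O(N), versus O(N log N) for a full FFT. This is the textbook formulation, not code from the application:

```python
import math

def goertzel_energy(samples, target_freq, fs):
    # Map the target frequency to the nearest DFT bin k for an
    # N-point analysis window, then run the second-order recurrence.
    n = len(samples)
    k = round(n * target_freq / fs)
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    # Squared magnitude of the k-th DFT bin.
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2
```

Only the three reference frequency points need this evaluation, which is why it is cheaper than a full FFT here.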
Step S2330, detecting the reference signal according to the speech energy spectrum data mapped by the speech frame of the real-time audio data and the noise energy spectrum data:
accordingly, each speech frame in the real-time audio data is also transformed from the time domain to the frequency domain using a fourier transform algorithm to obtain speech energy spectrum data for determining its corresponding energy value. On the basis, the reference signal can be detected according to the noise energy spectrum data and the voice energy spectrum data.
Referring to fig. 5, an embodiment corresponding to the reference signal specifically defined in the previous embodiment continues to provide a detailed process for detecting the reference signal, which specifically includes the following steps:
step S2331, according to the voice energy spectrum data mapped by each voice frame in the real-time audio data, positioning the peak position of each frequency point:
when ref_s0(t) is added on the playing side, the acquisition side starts detection.
For the speech energy spectrum data mapped from a speech frame of the real-time audio data, the acquisition side first judges whether the current frequency point is a peak, expressed as Ppeak(i):

Ppeak(i) = 1, if E(fi) > E(fi − Δf) and E(fi) > E(fi + Δf); otherwise Ppeak(i) = 0
where E(fi) is the energy value of the current frequency point calculated by the Goertzel algorithm; because the added reference tones are spaced Δf apart, the energy at the currently detected frequency should be a peak relative to the frequency points above and below it.
Step S2332, calculating the existence probability of the reference signal in each voice frame according to the voice energy and the noise energy corresponding to each frequency point:
The existence probability Pf(i) of the reference signal is calculated from the energy of each frequency point and the energy of the background noise, using the following formula:

Pf(i) = min(1, max(0, (log E(fi) − log λ(fi)) / (log Emax(fi) − log λ(fi))))

where Emax(fi) denotes the energy threshold at which the reference signal is considered present in the current frame, i.e., when the energy of the reference signal exceeds the preset threshold, the corresponding speech frame is judged with high confidence to contain the reference signal. The threshold is generally taken as Emax(fi) = 1.8·E(fi), and can be adjusted flexibly according to the actual situation. The term log E(fi) − log λ(fi) can be understood as the signal-to-noise ratio of the current frequency point.
A frame may contain the reference signal when it satisfies Ppeak(i) = 1, so the two features are combined to obtain the reference-signal existence probability of the current frame:

P = Σi Wf(i) · Ppeak(i) · Pf(i), i = 0, 1, 2
where Wf(i) is the weight, subject to:

Σi Wf(i) = 1

Generally Wf(i) = 1/3, i = 0, 1, 2, is taken. In particular, if some devices collect weak signals at high frequencies, the weights at those frequencies can be adjusted appropriately.
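Combining the peak test with a saturating log-SNR term, one plausible per-frame computation is sketched below. The clamped-ratio form and all names are assumptions layered on the description above (with Emax(fi) = 1.8·E(fi), the denominator becomes snr + log 1.8, so the probability saturates toward 1 as the SNR grows); this is not the application's exact formula:

```python
import math

def frame_presence_probability(bin_energies, noise_floor,
                               weights=(1 / 3, 1 / 3, 1 / 3)):
    # bin_energies[i] = (E(fi - df), E(fi), E(fi + df)) for tone i;
    # noise_floor[i] = tracked noise estimate lambda(fi).
    total = 0.0
    for (e_lo, e, e_hi), lam, w in zip(bin_energies, noise_floor, weights):
        p_peak = 1.0 if (e > e_lo and e > e_hi) else 0.0  # local peak test
        snr = math.log(e) - math.log(lam)                 # log-SNR at the bin
        # Emax(fi) = 1.8 * E(fi) gives denominator snr + log(1.8).
        p_freq = 0.0 if snr <= 0 else min(1.0, snr / (snr + math.log(1.8)))
        total += w * p_peak * p_freq
    return total
```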
Step S2333, when the probability of existence of a plurality of consecutive speech frames satisfies a preset condition, determining that the reference signal is detected:
in an exemplary embodiment, the playing side adds the reference signal for 200 ms in the time domain, so a time-domain factor is introduced during detection on the acquisition side. The Inter-frame Probability Persistence (IPP) is defined as follows:

IPP = (1/T) · Σt=1..T P(t)

where T is the number of frames over which the inter-frame probability is maintained; generally T = 3 is taken, and when IPP > 0.8 it is considered certain that the reference signal has been detected. The interval between the time at which the reference signal is first detected, i.e., the second moment, and the time at which ref_s0(t) was first added, i.e., the first moment, is the audio loopback delay.
Generally, once the reference signal is detected, the playing side stops adding it; in the ideal case observed in practice, the playing side needs to add roughly 3 frames of the reference signal. With this detection method, the measured detection error is within 20 ms.
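The inter-frame persistence test can be sketched as a sliding average over the per-frame probabilities, with T = 3 and the 0.8 threshold stated above; the function name and return convention are illustrative:

```python
def detect_reference(frame_probs, window=3, threshold=0.8):
    # IPP: mean presence probability over the last `window` frames.
    # Detection fires at the first frame where IPP exceeds the threshold.
    for end in range(window, len(frame_probs) + 1):
        ipp = sum(frame_probs[end - window:end]) / window
        if ipp > threshold:
            return end - 1  # index of the frame where detection fired
    return None             # reference signal not found in this stream
```

The returned frame index, converted to time, corresponds to the second moment; subtracting the first moment yields the loopback delay value.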
As this embodiment shows, once detection of the reference signal is started, it is carried out on the basis of the noise energy and the speech energy: the several single-frequency signals are detected by probability estimation, and inter-frame persistence ensures the reference signal is confirmed across multiple speech frames. The given algorithm is simple to implement and computationally efficient, and it keeps the detection error within 20 ms, which is very accurate.
When the reference signal is added on the playing side, the service logic on the acquisition side starts detecting it. The number of detection attempts can be set flexibly, for example two; once the reference signal is detected, the corresponding loopback delay value can be determined and stored for later calls. If the reference signal is not detected over consecutive attempts, so that the loopback delay value cannot be computed, the reference-signal reconstruction mechanism may be enabled.
In an extended embodiment, after the step S1230 starts the detection of the reference signal in the real-time audio data, the following steps may be performed in response to the event that the detection of the reference signal in the step S2330 fails, specifically, in response to the event that the reference signal in the step S2333 fails to be detected:
step S2340, in response to an event that the detection of the reference signal fails, reconstructing the reference signal and starting secondary detection, wherein the reconstructed reference signal is a high-frequency signal outside the hearing frequency band of the human ear, and the frequency of the reconstructed reference signal is lower than that of the previously constructed reference signal.
The frequency of the reconstructed reference signal is kept below the frequency points of the previous reference signal mainly for a hardware reason: the energy of high-frequency signals collected by current terminal devices is weak. The frequency, and even the amplitude, of the reference signal therefore needs to be adjusted appropriately so that the service logic on the acquisition side can detect it.
As an example, the reconstructed reference signal may be expressed as:
ref1(t) = α1 · [sin(2πf10t) + sin(2πf11t) + sin(2πf12t)]
wherein:
f1i = f10 + 2i·Δf, i = 0, 1, 2
in general, α10,f1i<fi. In practice, generally, α is taken1=10922,f1iThe frequency of (2) is around 19 KHz.
After the new reference signal is reconstructed, the detection procedure can be executed again according to the process disclosed in the foregoing embodiments; by changing the reference signal, the new reference signal can be detected successfully and the loopback delay value thus determined.
In practical application scenarios, the present application calculates the audio loopback delay in real time without any additional user operation, solving the alignment of voice and accompaniment in remote karaoke and live singing scenarios and improving the user experience of network live broadcast. Once the audio loopback delay value is calculated in real time, voice and accompaniment can be aligned more strictly than in other comparable products. The following table compares the audio loopback delay value calculated 5 times by the technical solution of the present application on 4 mobile terminal devices with manual measurement results:
[Table: audio loopback delay values calculated five times on each of four mobile terminal devices, compared with manual measurements; the table image is not reproduced in this text.]
in the table, the Android models OP x R x N3 and V x xpl x 6 were both subjected to capture-and-playback tests using the JAVA API under media sound. As the table shows, the detection error of the present application is controlled within 20 ms.
Referring to fig. 6, the audio playing control apparatus provided by the present application is adapted to a functional deployment of the audio playing control method of the present application, and includes:
the audio acquisition module 1100 is configured to continuously acquire real-time audio data generated by a local external input sound source in a live webcast process;
a loopback detection module 1200, configured to detect a reference signal added in advance to the audio data of the local playing sound source and looped back through external playing, and to determine a corresponding loopback delay value, the reference signal being a high-frequency signal outside the hearing frequency band of the human ear;
the audio mixing correction module 1300 is configured to control the audio data of the local playing sound source to be delayed by the loopback delay value and then mixed with the real-time audio data, so as to obtain corrected audio data;
and the audio stream pushing module 1400 is configured to output the corrected audio data as an audio stream of live webcasting.
In a further embodiment, the audio capture module 1100 comprises:
the local play-out submodule is used for acquiring the audio data corresponding to the local playing sound source and playing it out;
the local acquisition submodule is used for continuously acquiring external voice signals generated by an external input sound source;
and the voice conversion submodule is used for converting the external voice signal into the real-time audio data.
In a further embodiment, the loopback detection module 1200 includes:
a signal construction sub-module for constructing the reference signal;
the signal presetting submodule is used for adding the reference signal to the audio data of the local playing sound source at a first moment for external playing;
the signal detection submodule starts the detection of the reference signal in the real-time audio data and determines a second moment when the reference signal is detected;
and the delay calculation submodule is used for determining the difference value between the second moment and the first moment as the loopback delay value.
In an expanded embodiment, the signal detection sub-module includes:
the noise tracking unit is used for tracking a noise signal along a time domain for the real-time audio data;
the noise transformation unit is used for transforming the noise signal to a frequency domain to obtain corresponding noise energy spectrum data;
and the signal measuring and calculating unit is used for detecting the reference signal according to the voice energy spectrum data mapped by the voice frame of the real-time audio data and the noise energy spectrum data.
In an embodied embodiment, the signal evaluation unit comprises:
the peak positioning subunit is used for positioning the peak position of each frequency point according to the voice energy spectrum data mapped by each voice frame in the real-time audio data;
the probability evaluation subunit is used for calculating the existence probability of the reference signal in each voice frame according to the voice energy and the noise energy corresponding to each frequency point;
and the signal judgment subunit is used for judging that the reference signal is detected when the existence probability of a plurality of continuous voice frames meets a preset condition.
In a preferred embodiment, the reference signal comprises a plurality of single-frequency signals, and the single-frequency signals are arranged at equal intervals in the frequency domain.
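Such a reference signal, several equally spaced single-frequency tones placed above the audible band, can be constructed as follows; the base frequency, spacing, tone count, duration, and amplitude are illustrative assumptions, not values specified by the patent:

```python
import numpy as np

def build_reference(base_hz: float = 18000.0, spacing_hz: float = 500.0,
                    count: int = 4, sr: int = 48000,
                    duration_s: float = 0.05,
                    amplitude: float = 0.01) -> np.ndarray:
    """Sum of `count` single-frequency tones spaced `spacing_hz` apart,
    starting above the typical audible band, at low amplitude so the
    tone is unobtrusive when mixed into the playback signal."""
    t = np.arange(int(sr * duration_s)) / sr
    freqs = base_hz + spacing_hz * np.arange(count)
    return amplitude * sum(np.sin(2 * np.pi * f * t) for f in freqs)
```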
In an expanded embodiment, the signal detection sub-module further includes:
and the restart detection unit is used for reconstructing the reference signal and starting a secondary detection in response to an event of failing to detect the reference signal, wherein the reconstructed reference signal is still a high-frequency signal outside the hearing band of the human ear, but its frequency is lower than that of the previously constructed reference signal.
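The fallback rule, rebuilding the reference at a lower yet still inaudible frequency after a failed detection, might look like this sketch; the step size and the audible-band limit are assumptions:

```python
def next_reference_hz(previous_hz: float, step_hz: float = 500.0,
                      audible_limit_hz: float = 16000.0):
    """After a detection failure, lower the reference frequency by one
    step; refuse to drop into the audible band (return None when no
    inaudible frequency remains)."""
    candidate = previous_hz - step_hz
    return candidate if candidate > audible_limit_hz else None
```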
In order to solve the above technical problem, an embodiment of the present application further provides a computer device. Fig. 7 is a schematic diagram of the internal structure of the computer device. The computer device includes a processor, a computer-readable storage medium, a memory, and a network interface connected by a system bus. The computer-readable storage medium of the computer device stores an operating system, a database, and computer-readable instructions; the database can store control information sequences, and the computer-readable instructions, when executed by the processor, can cause the processor to implement an audio playing control method. The processor of the computer device provides computing and control capabilities and supports the operation of the whole computer device. The memory of the computer device may store computer-readable instructions which, when executed by the processor, cause the processor to perform the audio playing control method of the present application. The network interface of the computer device is used for connecting to and communicating with a terminal. Those skilled in the art will appreciate that the architecture shown in Fig. 7 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
In this embodiment, the processor is configured to execute the specific functions of each module and its sub-modules in Fig. 6, and the memory stores the program codes and various data required for executing those modules or sub-modules. The network interface is used for data transmission to and from a user terminal or a server. The memory stores the program codes and data required for executing all modules/sub-modules of the audio playing control device of the present application, and the server can call these program codes and data to execute the functions of all the sub-modules.
The present application also provides a storage medium storing computer-readable instructions, which, when executed by one or more processors, cause the one or more processors to perform the steps of the audio playback control method of any of the embodiments of the present application.
The present application also provides a computer program product comprising computer program/instructions which, when executed by one or more processors, implement the steps of the audio playback control method as described in any of the embodiments of the present application.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments of the present application can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when the computer program is executed, the processes of the embodiments of the methods can be included. The storage medium may be a computer-readable storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
In summary, the present application estimates the loopback delay value of an audio signal using a preset reference signal that is a high-frequency signal outside the human hearing band. The loopback delay value can thus be estimated more accurately and efficiently and used to correct the audio signal, achieving playback alignment of multiple sound sources, and the approach has broad application prospects.
Those skilled in the art will appreciate that the various operations, methods, and steps in the processes, acts, or solutions discussed in this application can be interchanged, modified, combined, or deleted, and that other steps, measures, or schemes in those operations, methods, or flows, including those known in the prior art, can likewise be altered, rearranged, decomposed, combined, or deleted.
The foregoing describes only some embodiments of the present application. It should be noted that those skilled in the art can make several modifications and refinements without departing from the principle of the present application, and such modifications and refinements shall also fall within the protection scope of the present application.

Claims (10)

1. An audio playing control method is characterized by comprising the following steps:
continuously acquiring real-time audio data generated by a local external input sound source in the live webcasting process;
detecting a reference signal which is added to the audio data of the local playing sound source in advance and looped back by the external playing, and determining a corresponding loopback delay value; the reference signal is a high-frequency signal outside the hearing frequency band of the human ear;
controlling audio data of a local playing sound source to be mixed with the real-time audio data after being superposed with the loopback delay value to obtain corrected audio data;
and outputting the corrected audio data as an audio stream of the network live broadcast.
2. The audio playing control method of claim 1, wherein the step of continuously acquiring real-time audio data generated by a local external input sound source during the live webcasting process comprises the steps of:
acquiring audio data corresponding to a local playing sound source for playing;
continuously collecting external voice signals generated by an external input sound source;
converting the external voice signal into the real-time audio data.
3. The audio playing control method of claim 1, wherein the step of detecting a reference signal, which is added in advance to the audio data of the local playing sound source and looped back by the external playing, and determining a corresponding loopback delay value comprises the steps of:
constructing the reference signal;
adding the reference signal to the audio data of a local playing sound source at a first moment for playing out;
starting detection of the reference signal in the real-time audio data, and determining a second moment when the reference signal is detected;
and determining the difference value between the second time and the first time as the loopback delay value.
4. The audio playing control method of claim 3, wherein after the detection of the reference signal in the real-time audio data is initiated, the method comprises the following steps:
tracking a noise signal along a time domain for the real-time audio data;
transforming the noise signal to a frequency domain to obtain corresponding noise energy spectrum data;
and detecting the reference signal according to the voice energy spectrum data mapped by the voice frame of the real-time audio data and the noise energy spectrum data.
5. The audio playing control method according to claim 4, wherein the step of detecting the reference signal according to the speech energy spectrum data mapped by the speech frame of the real-time audio data and the noise energy spectrum data comprises the steps of:
positioning the peak position of each frequency point according to the voice energy spectrum data mapped by each voice frame in the real-time audio data;
calculating the existence probability of the reference signal in each voice frame according to the voice energy and the noise energy corresponding to each frequency point;
and judging to detect the reference signal when the existence probability of a plurality of continuous voice frames meets a preset condition.
6. The audio playing control method according to any one of claims 1 to 5, wherein the reference signal comprises a plurality of single-frequency signals, and the single-frequency signals are arranged at equal intervals in the frequency domain.
7. The audio playing control method according to any one of claims 3 to 5, wherein after the detection of the reference signal in the real-time audio data is started, the method comprises the following steps:
in response to an event of failure to detect the reference signal, reconstructing the reference signal and initiating a secondary detection, the reconstructed reference signal being a high frequency signal outside the hearing band of the human ear having a frequency lower than the frequency of the previously constructed reference signal.
8. A computer device comprising a central processor and a memory, characterized in that the central processor is adapted to invoke execution of a computer program stored in the memory to perform the steps of the method according to any one of claims 1 to 7.
9. A computer-readable storage medium, characterized in that it stores, in the form of computer-readable instructions, a computer program implementing the method of any one of claims 1 to 7, and the computer program, when invoked by a computer, performs the steps of the corresponding method.
10. A computer program product comprising computer program/instructions, characterized in that the computer program/instructions, when executed by a processor, implement the steps of the method as claimed in any one of claims 1 to 7.
CN202111146341.6A 2021-09-28 2021-09-28 Audio playing control method and device, equipment, medium and product thereof Pending CN113891152A (en)

Publications (1)

Publication Number Publication Date
CN113891152A (en) 2022-01-04


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination