CN114822568A - Audio playing method, device, equipment and computer readable storage medium

Info

Publication number
CN114822568A
Authority
CN
China
Prior art keywords
signal
audio signal
audio
target
voice
Prior art date
Legal status
Pending
Application number
CN202210632201.8A
Other languages
Chinese (zh)
Inventor
李新林
马连群
吴宜安
Current Assignee
Shenzhen Skyworth RGB Electronics Co Ltd
Original Assignee
Shenzhen Skyworth RGB Electronics Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Skyworth RGB Electronics Co Ltd filed Critical Shenzhen Skyworth RGB Electronics Co Ltd
Priority to CN202210632201.8A
Publication of CN114822568A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003 Changing voice quality, e.g. pitch or formants
    • G10L 21/007 Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques characterised by the analysis technique

Abstract

The invention discloses an audio playing method, device and equipment and a computer readable storage medium, and belongs to the technical field of audio and video playing. The method comprises the steps of: monitoring whether a preset characteristic image exists in a currently output video picture; if so, acquiring the sounding position information of the preset characteristic image; acquiring an original audio signal corresponding to the video picture, and adjusting the original audio signal according to the sounding position information to obtain a target audio signal with a reconstructed sound field position; and outputting the target audio signal for audio playing. The invention solves the technical problems that the voice position cannot be accurately restored during audio playback and that the sense of voice presence is poor, and achieves the technical effect of improving the sense of presence and the recognizability of the voice during audio playback.

Description

Audio playing method, device, equipment and computer readable storage medium
Technical Field
The present invention relates to the field of audio and video processing technologies, and in particular, to an audio playing method, apparatus, device, and computer-readable storage medium.
Background
With rising living standards, people place higher demands on the performance and functions of the entertainment products they use in daily life. When watching video programs on playing devices such as tablet computers and televisions, viewers expect an increasingly realistic audio-visual experience, which in turn raises the requirements on the audio quality of these devices.
In film and television programs, dialogue between characters makes up a large share of most scenes. Rendering the speaker's position accurately, so that the voice the audience hears matches the position of the speaker on the screen, enhances the audience's sense of presence and greatly improves the user experience. At present, most playing devices produce sound through a left and a right loudspeaker. Two-channel stereo sources are restored well, but for mono sources or two-channel sources with weak stereo separation, the playback effect is not ideal. In addition, virtual surround sound technology can improve or create a virtual stereo image, but because it relies on a fixed algorithm, its sound localization is not accurate enough.
Therefore, the prior art suffers from the technical problems that the voice position cannot be accurately restored during audio playback and the sense of voice presence is poor.
Disclosure of Invention
The invention mainly aims to provide an audio playing method, device and equipment and a computer readable storage medium, aiming at solving the technical problems that the voice position cannot be accurately restored when audio is played and the voice presence is poor.
In order to achieve the above object, the present invention provides an audio playing method, including the following steps:
monitoring whether a preset characteristic image exists in a currently output video picture;
if so, acquiring the sounding position information of the preset characteristic image;
acquiring an original audio signal corresponding to the video picture, and adjusting the original audio signal according to the sounding position information to obtain a target audio signal after the sound field position is reconstructed;
and outputting the target audio signal for audio playing.
Optionally, the step of obtaining the sound emission position information of the preset feature image includes:
acquiring distance information of the preset feature image according to the size of the preset feature image;
and acquiring the coordinate information of the sound production position of the preset characteristic image, and taking the coordinate information of the sound production position and the distance information as the sound production position information.
Optionally, the step of adjusting the original audio signal according to the utterance position information to obtain a target audio signal with a reconstructed sound field position includes:
adjusting the voice signal in the original audio signal according to the sounding position information to obtain a target voice signal with a reconstructed sound field position;
and mixing the target voice signal with the background sound signal in the original audio signal to obtain the target audio signal with the reconstructed sound field position.
Optionally, the step of adjusting the speech signal in the original audio signal according to the utterance position information to obtain a target speech signal after the sound field position is reconstructed includes:
respectively acquiring a first coefficient and a second coefficient according to the sounding position coordinate information and the distance information;
adjusting parameters of the voice signal according to the first coefficient to obtain a left channel voice enhancement signal;
adjusting the parameters of the voice signal according to the second coefficient to obtain a right channel voice enhancement signal;
and taking the left channel voice enhancement signal and the right channel voice enhancement signal as target voice signals after the sound field position is rebuilt.
Optionally, before the step of adjusting the speech signal in the original audio signal according to the utterance position information to obtain a target speech signal after reconstructing a sound field position, the method further includes:
and separating the original audio signal to obtain the background sound signal and the voice signal.
Optionally, the preset feature image is an image in which a person's lips are open, and the step of monitoring whether the preset feature image exists in the currently output video picture comprises the following steps:
extracting the video picture in the currently output video data at intervals of preset duration;
and identifying a face image in the video picture, so as to monitor whether an image in which a person's lips are open exists in the face image.
Optionally, the step of outputting the target audio signal for audio playing includes:
sending the target audio signal to a power amplifier to convert the target audio signal into a corresponding analog signal;
and driving a corresponding loudspeaker through the analog signal so as to play audio.
In addition, the present invention also provides an audio playing device, comprising:
the judging module is used for monitoring whether a preset characteristic image exists in a currently output video picture;
the acquisition module is used for acquiring the sounding position information of the preset characteristic image if the preset characteristic image exists;
the adjusting module is used for acquiring an original audio signal corresponding to the video picture, adjusting the original audio signal according to the sounding position information and obtaining a target audio signal after the sound field position is reconstructed;
and the playing module is used for outputting the target audio signal to play audio.
Optionally, the apparatus further comprises:
and the separation module is used for separating the original audio signal to obtain the background sound signal and the voice signal.
For the steps implemented when the functional modules of the audio playing device of the present invention run, reference may be made to the audio playing method of the present invention; they are not described again here.
In addition, the present invention also provides an audio playback apparatus, including: a memory, a processor and an audio playback program stored on the memory and executable on the processor, the audio playback program being configured to implement the steps of the audio playback method as described above.
In addition, the present invention also provides a computer readable storage medium, wherein an audio playing program is stored on the computer readable storage medium, and when being executed by a processor, the audio playing program realizes the steps of the audio playing method.
The method comprises the steps of monitoring whether a preset characteristic image exists in a currently output video picture; if so, acquiring the sounding position information of the preset characteristic image; acquiring an original audio signal corresponding to the video picture, and adjusting the original audio signal according to the sounding position information to obtain a target audio signal after the sound field position is reconstructed; and outputting the target audio signal for audio playing.
According to the method and the device, the original audio signal is adjusted according to the sounding position information so that its sound field position is reconstructed to obtain the target audio signal, and audio playback is then carried out, so that when a user watches a video, the perceived voice position is consistent with the sounding position in the video picture being watched. This solves the technical problems that the voice position cannot be accurately restored during audio playback and that the sense of voice presence is poor, and improves the sense of presence and the recognizability of the voice during playback, thereby improving the user's viewing experience.
Drawings
Fig. 1 is a schematic structural diagram of an audio playing device in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating an audio playing method according to an embodiment of the present invention;
FIG. 3 is a schematic flowchart illustrating an embodiment of an audio playing method according to the present invention;
FIG. 4 is a diagram illustrating a workflow of functional modules according to an embodiment of an audio playing method;
fig. 5 is a schematic diagram of a functional module structure of an audio playback device according to the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope herein. The word "if" as used herein may be interpreted as "upon", "when", or "in response to determining", depending on the context. The terms "or", "and/or", "including at least one of the following", and the like, as used herein, are to be construed as inclusive, meaning any one or any combination.
It should be understood that, although the steps in the flowcharts in the embodiments of the present application are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not bound to a strict order and may be performed in other orders. Moreover, at least some of the steps in the figures may include multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different times, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In order to more clearly understand the technical features, objects, and effects of the present invention, embodiments of the present invention will now be described with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 is a schematic structural diagram of an audio playing device in a hardware operating environment according to an embodiment of the present invention. The audio playing device can be an electronic device which can play audio and video, such as a television, a mobile phone, a tablet computer and the like. It should be noted that the audio playing device of the present invention does not include a device that is used for playing music only, such as a recorder.
As shown in fig. 1, the audio playback apparatus may include: a processor 1001 such as a Central Processing Unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to implement connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and optionally may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The memory 1005 may be a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as a disk memory. The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the configuration shown in fig. 1 does not constitute a limitation of the audio playback device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a storage medium, may include therein an operating system, a data storage module, a network communication module, a user interface module, and an audio playback program.
In the audio playback device shown in fig. 1, the network interface 1004 is mainly used for data communication with other devices; the user interface 1003 is mainly used for data interaction with a user; the processor 1001 and the memory 1005 in the audio playing device of the present invention may be disposed in the audio playing device, and the audio playing device calls the audio playing program stored in the memory 1005 through the processor 1001 and performs the following operations:
monitoring whether a preset characteristic image exists in a currently output video picture;
if so, acquiring the sounding position information of the preset characteristic image;
acquiring an original audio signal corresponding to the video picture, and adjusting the original audio signal according to the sounding position information to obtain a target audio signal after the sound field position is reconstructed;
and outputting the target audio signal for audio playing.
Further, the processor 1001 may be configured to call an audio playing program stored in the memory 1005, and further perform the following operations:
acquiring distance information of the preset feature image according to the size of the preset feature image;
and acquiring the coordinate information of the sound production position of the preset characteristic image, and taking the coordinate information of the sound production position and the distance information as the sound production position information.
Further, the processor 1001 may be configured to call an audio playing program stored in the memory 1005, and further perform the following operations:
adjusting the voice signal in the original audio signal according to the sounding position information to obtain a target voice signal with a reconstructed sound field position;
and mixing the target voice signal with the background sound signal in the original audio signal to obtain the target audio signal with the reconstructed sound field position.
Further, the processor 1001 may be configured to call an audio playing program stored in the memory 1005, and further perform the following operations:
respectively acquiring a first coefficient and a second coefficient according to the sounding position coordinate information and the distance information;
adjusting parameters of the voice signal according to the first coefficient to obtain a left channel voice enhancement signal;
adjusting the parameters of the voice signal according to the second coefficient to obtain a right channel voice enhancement signal;
and taking the left channel voice enhancement signal and the right channel voice enhancement signal as target voice signals after the sound field position is rebuilt.
Further, the processor 1001 may be configured to call an audio playing program stored in the memory 1005, and further perform the following operations:
and separating the original audio signal to obtain the background sound signal and the voice signal.
Further, the processor 1001 may be configured to call an audio playing program stored in the memory 1005, and further perform the following operations:
extracting the video picture in the currently output video data at intervals of preset duration;
and identifying a face image in the video picture, so as to monitor whether an image in which a person's lips are open exists in the face image.
Further, the processor 1001 may be configured to call an audio playing program stored in the memory 1005, and further perform the following operations:
sending the target audio signal to a power amplifier to convert the target audio signal into a corresponding analog signal;
and driving a corresponding loudspeaker through the analog signal so as to play audio.
To improve the viewing experience, high requirements are placed on the image quality and sound effect of playing equipment when video programs are played. When the video sound source is two-channel stereo, the playing device plays audio through the left and right groups of loudspeakers and a good stereo effect can generally be achieved. However, when the video sound source is mono or has weak stereo separation, relying only on the left and right groups of loudspeakers leads to a mismatch between the heard sound and the on-screen sounding position, low recognizability, and a poor sense of voice presence. For this situation, the prior art proposes improving and creating virtual stereo based on virtual surround technology, but this method uses a fixed algorithm and cannot localize sound accurately. Therefore, the prior art suffers from the technical problems that the voice position cannot be accurately restored during audio playback and the sense of voice presence is poor.
In order to solve the above technical problem, the present invention provides an audio playing method, including: monitoring whether a preset characteristic image exists in a currently output video picture; if yes, acquiring the sound production position information of the preset feature image; acquiring an original audio signal corresponding to the video picture, and adjusting the original audio signal according to the sounding position information to obtain a target audio signal after the sound field position is reconstructed; and outputting the target audio signal for audio playing.
The method comprises the steps of acquiring sound production position information of a preset characteristic image, adjusting an original audio signal according to the sound production position information, reconstructing a sound field position of the original audio signal to obtain a target audio signal, and then playing audio. Because the sound field position of the audio signal is reconstructed according to the sound production position information, when the user watches the video, the perceived voice position is consistent with the sound production position in the seen video picture. The technical problems that the voice position cannot be accurately restored when audio is played and the voice presence sense is poor are solved, the voice presence sense and the recognition degree when the audio is played are improved, and therefore the watching experience of a user is improved.
Referring to fig. 2, fig. 2 is a schematic flowchart illustrating an audio playing method according to an embodiment of the present invention.
In this embodiment, the audio playing method includes:
step S10, it is monitored whether there is a preset feature image in the currently output video frame.
In this embodiment, the execution main body is an electronic device capable of playing video, for example, a television, a mobile phone, a tablet computer, and the like. The audio playing method can be applied to the playing process of videos which contain characters and make the characters send out speaking sounds, so that the positions of the characters and the speakers in the video pictures are consistent when the audio is played. The preset feature image refers to an image that can represent a person speaking or other sound emitting position, and may be, for example, an image in which lips of the person open or an image including subtitles and the person.
The currently output video picture refers to a video picture being played on the device, and whether a preset characteristic image exists in the currently output video picture is detected, for example, whether an image of a person speaking or making a sound exists in the currently played video picture is detected.
Specifically, for example, a video picture of a currently playing video is extracted, and whether a preset feature image exists in the video picture is identified.
And step S20, if yes, acquiring the sound production position information of the preset characteristic image.
In this embodiment, if there is a preset feature image indicating that a person utters a sound in a video image being played at this time, the utterance position information of the preset feature image is acquired. The sound production position information is the position information of the preset characteristic image in the current video picture or on the display screen, and is used for reconstructing the sound field information of the original audio signal, so that the voice produced during audio playing is consistent with the position of the speaker in the video picture.
It should be noted that, because the video picture usually depicts a three-dimensional scene, the sounding position information usually includes position coordinate information on the two-dimensional plane of the preset feature image in the video picture, that is, where the sounding position lies in the up, down, left, and right directions of the picture; it also includes distance information in the third dimension, that is, the relative position between the sounding position and the viewer, for example, whether the speaker is speaking from somewhere near or far in the video picture, or from its left or right side.
And step S30, acquiring an original audio signal corresponding to the video picture, and adjusting the original audio signal according to the sounding position information to obtain a target audio signal with a reconstructed sound field position.
In this embodiment, the original audio signal, that is, the audio signal in the video source corresponding to the video picture, may be obtained by collecting the input audio signal at the front end of the device. It should be noted that, because speech usually lasts for a certain period of time, the original audio signal also needs to be acquired or collected continuously. For example, the original audio signals corresponding to the video pictures are continuously collected and stored in a buffer.
The target audio signal is the audio signal obtained by adjusting the original audio signal according to the sounding position information, that is, the audio signal whose sound field position has been reconstructed. Here the sound field position is the position and direction of the sound source. Because the human brain perceives the sound field position from the time difference and volume difference of the sound it hears, the sound field position can be reconstructed by adjusting parameters of the original audio signal such as amplitude, frequency, and phase, yielding the target audio signal.
The original audio signal is adjusted according to the sounding position information so that its sound field position coincides with the sounding position. It should be noted that the stereo effect of sound arises from the differences with which a sound source at a given position reaches the left ear and the right ear; it is therefore usually created through differences between the left and right audio signals. In other words, the target audio signal needs to include at least left and right channel audio signals in order for its sound field position to be consistent with the sounding position of the preset feature image.
And step S40, outputting the target audio signal for audio playing.
In this embodiment, the target audio signal with the reconstructed sound field position is output to a power amplifier of the playback device, and then audio playback is performed through a speaker. Since the target audio signal is obtained on the basis of reconstructing the sound field position of the original audio signal, at this time, when the audio is played, the sound field position is consistent with the sound production position of the preset characteristic image in the video picture, that is, the human voice heard by the user is sent from the position of the corresponding speaker in the video picture.
Optionally, in step S40, outputting the target audio signal for audio playback includes:
step S41, sending the target audio signal to a power amplifier to convert the target audio signal into a corresponding analog signal.
In this embodiment, a power amplifier refers to an amplifier that can deliver the maximum power output needed to drive a load (e.g., a speaker) under a given distortion-rate condition. The power amplifier converts the received target audio signal into an analog signal.
In a specific implementation, if the target audio signal includes left and right channel audio signals, there are two corresponding power amplifiers, used respectively to convert the left and right channel audio signals into the corresponding analog signals that drive the corresponding speakers.
And step S42, driving a corresponding loudspeaker through the analog signal to play audio.
In this embodiment, the audio playing can be realized by driving the corresponding speaker to sound through the analog signal. For example, the audio signals of the left and right channels in the target audio signal are used to drive the corresponding left and right speakers to sound, so as to realize audio playing.
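For illustration only (not part of the disclosed embodiments), a minimal Python sketch of this playback step is given below. It assumes a desktop test setup where the two enhanced channels are written directly to the audio output through the third-party sounddevice library; the real device instead routes the converted analog signals through power amplifiers and speakers, and the function name and sample rate are the editor's assumptions.

```python
import numpy as np
import sounddevice as sd  # third-party audio output library (assumed available)

def play_target_audio(left: np.ndarray, right: np.ndarray, sample_rate: int = 48000) -> None:
    """Interleave the left/right channel enhanced signals and play them as stereo audio."""
    stereo = np.stack([left, right], axis=1).astype(np.float32)  # shape (N, 2): one column per channel
    stereo = np.clip(stereo, -1.0, 1.0)                          # guard against clipping distortion
    sd.play(stereo, samplerate=sample_rate)
    sd.wait()                                                    # block until playback has finished
```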
In the embodiment, the sound production position information of the preset characteristic image is acquired, the original audio signal is adjusted according to the sound production position information, and after the sound field position of the original audio signal is reconstructed to obtain the target audio signal, audio playing is performed. Because the sound field position of the audio corresponding to the video is reconstructed according to the sound production position information in the video picture, the perceived voice position can be consistent with the sound production position in the seen video picture when the user watches the video. The technical problems that the voice position cannot be accurately restored when audio is played and the voice presence sense is poor are solved, and the voice presence sense and the recognition degree when the audio is played are improved, so that the watching experience of a user is improved, and meanwhile, the video playing effect is also improved.
Further, in another embodiment of the audio playing method of the present invention, in step S20, the acquiring the sound-emitting position information of the preset feature image includes:
step S21, obtaining distance information of the preset feature image according to the size of the preset feature image.
In this embodiment, the distance information is the apparent distance between the preset feature image and the viewer at the moment the preset feature image is detected in the currently output video picture. For example, when the preset feature image is an image in which a person's lips are open, the distance information indicates whether the speaking person appears near to or far from the viewer in the current scene, that is, whether the speaker is speaking close by or far away.
Specifically, the distance of the sounding position can be estimated from the size of the recognized preset feature image and normalized to the near-to-far range (1, 10) to obtain the required distance information of the preset feature image.
Step S22, acquiring the sound emission position coordinate information of the preset feature image, and using the sound emission position coordinate information and the distance information as the sound emission position information.
In this embodiment, the sounding position coordinate information is the planar position of the sounding position of the preset feature image within the video picture. For example, when the preset feature image is an image in which a person's lips are open, the sounding position coordinate information is the vertical and horizontal position of the lips in the video picture, for instance whether the person is speaking on the left or the right side of the screen.
Specifically, the coordinates of the sounding position of the preset feature image are acquired and normalized to a screen range of (0, 100) to obtain the sounding position coordinate information. For example, when the preset feature image is an image in which a person's lips are open, orthogonal [x, y] coordinates are used to mark the position of the lips, and these coordinates are then normalized to obtain the sounding position coordinate information.
The sounding position information may be the three-dimensional coordinates obtained from the sounding position coordinate information and the distance information, for example, sounding position information (x, y, z) obtained from sounding position coordinate information (x, y) and distance information z. In the subsequent step, adjusting the original audio signal according to the sounding position information therefore actually means adjusting it according to the sounding position coordinate information and the distance information.
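As a minimal sketch of the two normalization steps above (not part of the disclosed embodiments), the snippet below turns a detected lip bounding box into (0, 100) screen coordinates and a (1, 10) near-to-far distance value; the bounding-box input, the lip-height thresholds, and the linear mapping are illustrative assumptions by the editor.

```python
import numpy as np

def sounding_position(lip_box, frame_w, frame_h, min_lip_h=8, max_lip_h=200):
    """Convert a lip bounding box (x, y, w, h) in pixels into sounding position
    information (x, y, z): screen coordinates in 0..100 plus a 1..10 distance."""
    bx, by, bw, bh = lip_box
    # Centre of the lips in normalized screen coordinates (0..100).
    x = 100.0 * (bx + bw / 2.0) / frame_w
    y = 100.0 * (by + bh / 2.0) / frame_h
    # Larger lips on screen mean the speaker is closer; map lip height onto the
    # 1..10 near-to-far scale (thresholds and linearity are illustrative).
    bh = float(np.clip(bh, min_lip_h, max_lip_h))
    z = 1.0 + 9.0 * (max_lip_h - bh) / (max_lip_h - min_lip_h)
    return x, y, z
```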
In this embodiment, the sounding position information is obtained by acquiring the sounding position coordinate information and the distance information of the preset image features, so that the original audio signal can subsequently be adjusted accurately to obtain the target audio signal with the reconstructed sound field position.
Further, in another embodiment of the audio playing method of the present invention, in step S30, the adjusting the original audio signal according to the utterance position information to obtain a target audio signal after reconstructing a sound field position includes:
and step S31, adjusting the voice signal in the original audio signal according to the sound production position information to obtain a target voice signal with a reconstructed sound field position.
In this embodiment, the original audio signal generally includes a background sound signal and a speech signal. The speech signal refers to the human speech in the audio signal, for example a conversation between characters; the background sound signal refers to the other sounds besides speech, such as background music or inserted music. The invention applies only to the playback of video in which characters are present and utter speech, so that the position of the character's voice during audio playback is consistent with the position of the speaker in the video picture, that is, the utterance position of the character is restored. Therefore, only the speech signal in the original audio signal needs to be adjusted according to the sounding position information to obtain the target speech signal with the reconstructed sound field position.
Optionally, in step S31, before the step of adjusting the speech signal in the original audio signal according to the utterance position information to obtain the target speech signal after reconstructing the sound field position, the method further includes:
step S01, separating the original audio signal to obtain the background sound signal and the speech signal.
In this embodiment, since only the parameters of the speech signal in the original audio signal need to be adjusted, the background sound signal in the original audio signal needs to be separated from the speech signal.
Specifically, the extraction of the speech signal may be implemented by using algorithms such as Kalman filtering (Kalman), least mean square algorithm (LMS), and Recursive Least Squares (RLS), so as to separate the original audio signal to obtain the background sound signal and the speech signal.
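The exact separation algorithm is not detailed in the text, so the following is only a rough sketch of the idea behind an LMS-style separation: a normalized LMS filter predicts the right channel from the left, the predictable (correlated) part is taken as a crude estimate of the centred dialogue, and the prediction error as the background estimate. The function name, filter length, and step size are assumptions, and this stand-in is far simpler than the Kalman/LMS/RLS extraction the embodiment refers to.

```python
import numpy as np

def nlms_separate(left, right, taps=32, mu=0.5, eps=1e-8):
    """Rough speech/background split with a normalized LMS adaptive filter."""
    n = len(right)
    w = np.zeros(taps)          # adaptive filter weights
    x_buf = np.zeros(taps)      # delay line fed by the left channel
    voice = np.zeros(n)
    background = np.zeros(n)
    for i in range(n):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = left[i]
        y = w @ x_buf                                   # correlated (dialogue-like) component
        e = right[i] - y                                # uncorrelated (background-like) residue
        w += mu * e * x_buf / (x_buf @ x_buf + eps)     # NLMS weight update
        voice[i] = y
        background[i] = e
    return voice, background
```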
For better understanding, assume for example that the preset feature image is an image in which a person's lips are open. As shown in fig. 3, a video picture is first captured and the position of the lips is identified; the input original audio signal is then collected and separated into a speech signal and a background sound signal, that is, the speech data and background sound data in the picture are separated; the speech signal is then adjusted according to the lip position information, that is, the stereoscopic impression of the speech in the picture is enhanced; the adjusted speech signal is mixed with the previously separated background sound signal to obtain the target audio signal, and finally the audio is played through the loudspeakers.
According to the embodiment, the background sound signal and the voice signal are obtained by separating the original audio signal, so that the voice signal can be independently adjusted, the voice position can be restored, and the voice stereoscopic impression can be enhanced.
Optionally, in step S31, adjusting the speech signal in the original audio signal according to the utterance position information to obtain a target speech signal after reconstructing a sound field position, including:
step S311, a first coefficient and a second coefficient are respectively obtained according to the sounding position coordinate information and the distance information.
In this embodiment, the target speech signal includes a left channel speech enhancement signal and a right channel speech enhancement signal. The first coefficient is used for adjusting the parameters of the voice signal to obtain a left channel voice enhancement signal; the second coefficient is used for adjusting the parameters of the voice signal to obtain a right channel voice enhancement signal.
Specifically, suppose the sounding position coordinate information is [X_m, Y_m], the distance information is H_m, the first coefficient is α_L, and the second coefficient is α_R. The left and right channel coefficients α_L and α_R can then be calculated from [X_m, Y_m] and H_m respectively. (The specific calculation formulas appear as images, Figure BDA0003677698610000121 and Figure BDA0003677698610000122, in the original publication and are not reproduced in this text.)
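Because the coefficient formulas themselves are only available as images, the snippet below is purely an illustrative stand-in and not the patented calculation: it derives left and right coefficients from the normalized sounding position using a constant-power pan law plus a simple distance attenuation. All constants, the pan law, and the attenuation law are the editor's assumptions.

```python
import numpy as np

def channel_coefficients(x_m, y_m, h_m):
    """Illustrative α_L / α_R from sounding position [X_m, Y_m] (0..100) and distance H_m (1..10)."""
    pan = float(np.clip(x_m / 100.0, 0.0, 1.0))   # 0 = far left of screen, 1 = far right
    theta = pan * np.pi / 2.0
    distance_gain = 1.0 / np.sqrt(h_m)            # a farther speaker sounds quieter
    alpha_l = np.cos(theta) * distance_gain       # constant-power pan: α_L² + α_R² stays constant
    alpha_r = np.sin(theta) * distance_gain
    # y_m (vertical position) is ignored here; it could drive a height cue on suitable layouts.
    return alpha_l, alpha_r
```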
step S312, adjusting the parameters of the speech signal according to the first coefficient to obtain a left channel speech enhancement signal.
In this embodiment, parameters such as amplitude, frequency, and phase of the speech signal may be adjusted according to the first coefficient to obtain the left channel enhanced speech signal.
Specifically, assume the speech signal is f_1(t). A fast Fourier transform is applied to f_1(t) to move it into the frequency domain, the result is multiplied by the row vector of frequency coefficient functions, and an inverse fast Fourier transform yields the final left channel enhanced speech signal. Here the frequency coefficient function of the left channel is α_L; each frequency has a different coefficient, derived from an exponential sound-attenuation model plus experimental corrections. Specifically:
f_1L(t) = IFFT([α_L(f_0) α_L(f_1) ... α_L(f_n)] * FFT(f_1(t)))
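The step "transform to the frequency domain, multiply by per-frequency coefficients, transform back" can be sketched as follows. Since the actual coefficient functions are not reproduced in the text, the exponential-decay curve built around the scalar coefficients below is only an assumed placeholder, and the function names are the editor's.

```python
import numpy as np

def apply_frequency_coefficients(speech, alpha_per_bin):
    """f_out(t) = IFFT(alpha .* FFT(f_in(t))); alpha_per_bin has len(speech)//2 + 1 entries."""
    spectrum = np.fft.rfft(speech)
    return np.fft.irfft(spectrum * alpha_per_bin, n=len(speech))

def enhance_channels(speech, alpha_l, alpha_r, decay=0.0005):
    """Build assumed per-bin coefficient curves from the scalar α_L / α_R and
    return the left and right channel enhanced speech signals."""
    bins = np.arange(len(speech) // 2 + 1)
    coeff_l = alpha_l * np.exp(-decay * bins)   # illustrative exponential-decay coefficient curve
    coeff_r = alpha_r * np.exp(-decay * bins)
    return (apply_frequency_coefficients(speech, coeff_l),
            apply_frequency_coefficients(speech, coeff_r))
```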
And step S313, adjusting the parameters of the voice signal according to the second coefficient to obtain a right channel voice enhancement signal.
In this embodiment, the right channel enhanced speech signal can be obtained by adjusting the parameters of the speech signal, such as amplitude, frequency, and phase, according to the second coefficient.
Specifically, assume the speech signal is f_1(t). A fast Fourier transform is applied to f_1(t) to move it into the frequency domain, the result is multiplied by the row vector of frequency coefficient functions, and an inverse fast Fourier transform yields the final right channel enhanced speech signal. Here the frequency coefficient function of the right channel is α_R; each frequency has a different coefficient, derived from an exponential sound-attenuation model plus experimental corrections. Specifically:
f_1R(t) = IFFT([α_R(f_0) α_R(f_1) ... α_R(f_n)] * FFT(f_1(t)))
step S314, using the left channel speech enhancement signal and the right channel speech enhancement signal as the target speech signals after the sound field position is reconstructed.
In this embodiment, since the position of the speech is mainly determined according to the sound intensity, the phase and the time difference, to reconstruct the sound field position of the speech signal, it is necessary to perform different adjustments on the parameters of the speech signal to obtain different speech signals of the left and right channels, and then simulate and restore the position of the speech according to the sound intensity, the phase and the time difference of the speech signals of the left and right channels.
Therefore, the target speech signal with the reconstructed sound field position needs to include both signals: the left channel speech enhancement signal and the right channel speech enhancement signal are taken together as the target speech signal.
In this embodiment, different coefficients are calculated from the sounding position coordinate information and the distance information, so that the parameters of the speech signal are adjusted differently for each channel, thereby obtaining different left and right channel speech enhancement signals, that is, the target speech signal with the reconstructed sound field position.
Step S32, mixing the target voice signal with the background sound signal in the original audio signal to obtain the target audio signal with the reconstructed sound field position.
In this embodiment, the background sound signal in the original audio signal is left unprocessed and is directly mixed with the speech signal whose sound field position has been reconstructed, which yields the target audio signal with the reconstructed sound field position.
Specifically, as shown in fig. 4, after the speech signal and the background sound signal in the original audio signal are separated, the first and second coefficients are obtained from the sounding position information and the speech signal is adjusted with each of them, that is, the stereoscopic impression of the speech is enhanced, yielding the left and right channel speech enhancement signals (the left and right channel speech audio in the figure). The background sound signal is then mixed with the left and right channel speech enhancement signals respectively, so that the target audio signal containing the left and right channel speech enhancement signals is obtained, and the final audio playback is performed through the left and right speakers respectively.
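A minimal sketch of this mixing step is given below; the function name and gain values are assumptions, and the background is assumed to be a single signal mixed identically into both channels.

```python
import numpy as np

def mix_target_audio(voice_left, voice_right, background, voice_gain=1.0, bg_gain=1.0):
    """Mix the untouched background sound with each enhanced voice channel and
    return the two-channel target audio signal, clipped to [-1, 1]."""
    left = voice_gain * voice_left + bg_gain * background
    right = voice_gain * voice_right + bg_gain * background
    return np.clip(np.stack([left, right], axis=1), -1.0, 1.0)
```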
In this embodiment, the speech signal in the original audio signal is adjusted independently and the adjusted speech signal is then mixed with the background sound to obtain the target audio signal, which achieves a more accurate reconstruction of the sound field position of the speech signal.
Further, in another embodiment of the audio playing method of the present invention, the step S10 of monitoring whether a preset feature image exists in the currently output video frame includes:
and step S11, extracting the video pictures in the currently output video data at preset time intervals.
In this embodiment, in order to accurately restore the voice position, the utterance position in the video frame needs to be obtained first, and therefore, the preset feature image may be set as an image in which the lips of the person are open, that is, an image representing possible utterance of the person.
To monitor whether a preset characteristic image exists in a video picture, a video picture to be identified needs to be acquired first. Because speaking and sounding are a continuous process when a video is played, and the video picture does not change too much in the process, the video picture is extracted only at intervals of preset duration for the sake of simplicity and reduction of computation workload, and whether a preset characteristic image exists in the video picture is monitored. The preset duration may be self-defined, for example, the preset duration may be 50 milliseconds to 300 milliseconds.
In addition, after the video picture is obtained, the extracted video picture can be normalized to a standard size. On one hand, this facilitates the subsequent recognition process; on the other hand, it avoids the excessive memory usage that an oversized video picture would cause.
Specifically, for example, every 50 milliseconds, a video picture is captured through a screenshot function of the playing device, and the video picture is normalized to a standard size for recognizing preset image features.
Step S12, recognizing a face image in the video frame to monitor whether there is an image with open lips in the face image.
In this embodiment, to improve recognition accuracy, existing face recognition technology, which is mature and highly accurate, is used to recognize whether a face image exists in the video picture; it is then further recognized whether an image in which a person's lips are open exists in the face image, that is, whether a preset feature image exists in the video picture.
Specifically, the face image can be identified using algorithms or classifiers such as Fisherfaces, PCA, and SVM. After a face image is identified, the lower half of the face image is taken and it is judged whether an image with open lips exists, that is, whether a preset feature image exists in the video picture.
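For illustration only, the following sketch uses an OpenCV Haar cascade as a stand-in for the Fisherfaces/PCA/SVM classifiers named above: it normalizes a sampled frame to a standard size, detects the largest face, and returns the lower half of the face box as the candidate lip region. How the frame is sampled (e.g. every 50 ms via the device's screenshot function) and the final "lips open" decision, which would require a trained classifier, are left out; the function name and sizes are assumptions.

```python
import cv2

FACE_CASCADE = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def find_lip_region(frame, std_size=(640, 360)):
    """Resize a captured video frame, detect a face, and return the lower half
    of the face box as the candidate lip region (x, y, w, h), or None."""
    frame = cv2.resize(frame, std_size)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = FACE_CASCADE.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # keep the largest detected face
    # The lower half of the face box roughly contains the mouth/lips; a trained
    # "lips open" classifier (e.g. an SVM) would then be applied to this region.
    return (x, y + h // 2, w, h // 2)
```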
In this embodiment, video pictures are acquired at preset time intervals and, based on image recognition technology, it is identified whether an image in which a person's lips are open exists in the video picture, so as to judge whether the preset image feature exists and, if so, adjust the original audio signal to obtain the target audio signal for audio playback. Because open lips indicate that a person is producing speech, the position of the corresponding audio signal only needs to be restored at that moment, that is, the original audio signal is adjusted only when needed, which reduces the workload and the amount of computation.
Further, an embodiment of the present invention further provides an audio playing device, as shown in fig. 5, the audio playing device of the present invention includes:
the judging module 10 is configured to monitor whether a preset feature image exists in a currently output video picture;
the obtaining module 20 is configured to obtain the sound production position information of the preset feature image if the preset feature image exists;
the adjusting module 30 is configured to obtain an original audio signal corresponding to the video frame, adjust the original audio signal according to the sounding position information, and obtain a target audio signal after a sound field position is reconstructed;
and the playing module 40 is configured to output the target audio signal for audio playing.
Preferably, the obtaining module is further configured to:
acquiring distance information of the preset feature image according to the size of the preset feature image;
and acquiring the coordinate information of the sound production position of the preset characteristic image, and taking the coordinate information of the sound production position and the distance information as the sound production position information.
Preferably, the adjusting module is further configured to:
adjusting the voice signal in the original audio signal according to the sounding position information to obtain a target voice signal with a reconstructed sound field position;
and mixing the target voice signal with the background sound signal in the original audio signal to obtain the target audio signal with the reconstructed sound field position.
Preferably, the adjusting module is further configured to:
respectively acquiring a first coefficient and a second coefficient according to the sounding position coordinate information and the distance information;
adjusting parameters of the voice signal according to the first coefficient to obtain a left channel voice enhancement signal;
adjusting the parameters of the voice signal according to the second coefficient to obtain a right channel voice enhancement signal;
and taking the left channel voice enhancement signal and the right channel voice enhancement signal as target voice signals after the sound field position is rebuilt.
Preferably, the apparatus further comprises:
and the separation module is used for separating the original audio signal to obtain the background sound signal and the voice signal.
Preferably, the obtaining module is further configured to:
extracting the video picture in the currently output video data at intervals of preset duration;
and identifying a face image in the video picture, so as to monitor whether an image in which a person's lips are open exists in the face image.
Preferably, the playing module is further configured to:
sending the target audio signal to a power amplifier to convert the target audio signal into a corresponding analog signal;
and driving a corresponding loudspeaker through the analog signal so as to play audio.
The steps implemented when the functional modules of the audio playing apparatus of the present invention are operated can refer to the embodiments of the audio playing method of the present invention, and are not described herein again.
Further, an embodiment of the present invention further provides an audio playing device, where the audio playing device includes: the audio playing program is configured to implement the steps of the audio playing method provided by the above embodiments, and specific implementation steps may refer to the above embodiments and are not described herein in detail.
Further, an embodiment of the present invention further provides a computer-readable storage medium, where an audio playing program is stored on the computer-readable storage medium, and when the audio playing program is executed by a processor, the steps of the audio playing method provided in the foregoing embodiment are implemented, and specific implementation steps may refer to the foregoing embodiment, which is not described herein again.
The apparatus, the audio playing device, and the computer-readable storage medium provided in the embodiments of the present invention are used to implement the audio playing method provided in the embodiments, and solve the technical problems that the voice position cannot be accurately restored during audio playback and that the sense of voice presence is poor.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. An audio playing method, characterized in that the audio playing method comprises the following steps:
monitoring whether a preset characteristic image exists in a currently output video picture;
if so, acquiring the sounding position information of the preset characteristic image;
acquiring an original audio signal corresponding to the video picture, and adjusting the original audio signal according to the sounding position information to obtain a target audio signal with a reconstructed sound field position;
and outputting the target audio signal for audio playing.
2. The audio playback method according to claim 1, wherein the step of obtaining the utterance position information of the preset feature image includes:
acquiring distance information of the preset feature image according to the size of the preset feature image;
and acquiring the coordinate information of the sound production position of the preset characteristic image, and taking the coordinate information of the sound production position and the distance information as the sound production position information.
3. The audio playing method according to claim 2, wherein the step of adjusting the original audio signal according to the utterance position information to obtain a target audio signal after reconstructing the sound field position comprises:
adjusting the voice signal in the original audio signal according to the sounding position information to obtain a target voice signal with a reconstructed sound field position;
and mixing the target voice signal with the background sound signal in the original audio signal to obtain the target audio signal with the reconstructed sound field position.
4. The audio playing method according to claim 3, wherein the step of adjusting the speech signal in the original audio signal according to the utterance position information to obtain a target speech signal with a reconstructed sound field position comprises:
respectively acquiring a first coefficient and a second coefficient according to the sounding position coordinate information and the distance information;
adjusting parameters of the voice signal according to the first coefficient to obtain a left channel voice enhancement signal;
adjusting the parameters of the voice signal according to the second coefficient to obtain a right channel voice enhancement signal;
and taking the left channel voice enhancement signal and the right channel voice enhancement signal as target voice signals after the sound field position is rebuilt.
5. The audio playing method according to claim 3, wherein before the step of adjusting the speech signal in the original audio signal according to the utterance position information to obtain the target speech signal after reconstructing the sound field position, the method further comprises:
and separating the original audio signal to obtain the background sound signal and the voice signal.
6. The audio playing method according to any one of claims 1 to 5, wherein the preset feature image is an image in which a person's lips are open, and the step of monitoring whether the preset feature image exists in the currently output video picture comprises the following steps:
extracting the video picture in the currently output video data at intervals of preset duration;
and identifying a face image in the video picture, so as to monitor whether an image in which a person's lips are open exists in the face image.
7. The audio playback method according to any one of claims 1 to 5, wherein the step of outputting the target audio signal for audio playback comprises:
sending the target audio signal to a power amplifier to convert the target audio signal into a corresponding analog signal;
and driving a corresponding loudspeaker through the analog signal so as to play audio.
8. An audio playback apparatus, comprising:
the judging module is used for monitoring whether a preset characteristic image exists in a currently output video picture;
the acquisition module is used for acquiring the sounding position information of the preset characteristic image if the preset characteristic image exists;
the adjusting module is used for acquiring an original audio signal corresponding to the video picture, adjusting the original audio signal according to the sounding position information and obtaining a target audio signal after the sound field position is reconstructed;
and the playing module is used for outputting the target audio signal to play audio.
9. An audio playback apparatus, characterized in that the apparatus comprises: memory, a processor and an audio playback program stored on the memory and executable on the processor, the audio playback program being configured to implement the steps of the audio playback method as claimed in any one of claims 1 to 7.
10. A computer-readable storage medium, having stored thereon an audio playback program which, when executed by a processor, implements the steps of the audio playback method according to any one of claims 1 to 7.
CN202210632201.8A 2022-06-02 2022-06-02 Audio playing method, device, equipment and computer readable storage medium Pending CN114822568A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210632201.8A CN114822568A (en) 2022-06-02 2022-06-02 Audio playing method, device, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210632201.8A CN114822568A (en) 2022-06-02 2022-06-02 Audio playing method, device, equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN114822568A (en) 2022-07-29

Family

ID=82520687

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210632201.8A Pending CN114822568A (en) 2022-06-02 2022-06-02 Audio playing method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114822568A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115002401A (en) * 2022-08-03 2022-09-02 广州迈聆信息科技有限公司 Information processing method, electronic equipment, conference system and medium
CN115002401B (en) * 2022-08-03 2023-02-10 广州迈聆信息科技有限公司 Information processing method, electronic equipment, conference system and medium
CN116320144A (en) * 2022-09-23 2023-06-23 荣耀终端有限公司 Audio playing method and electronic equipment
CN116320144B (en) * 2022-09-23 2023-11-14 荣耀终端有限公司 Audio playing method, electronic equipment and readable storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination