WO2020048034A1 - Method, apparatus, device and storage medium for realizing co-location of sound and image - Google Patents

Method, apparatus, device and storage medium for realizing co-location of sound and image

Info

Publication number
WO2020048034A1
WO2020048034A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound
image
currently playing
sound source
video
Prior art date
Application number
PCT/CN2018/120528
Other languages
English (en)
French (fr)
Inventor
赵新科
Original Assignee
深圳创维-RGB电子有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳创维-RGB电子有限公司
Publication of WO2020048034A1 publication Critical patent/WO2020048034A1/zh

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302Content synchronisation processes, e.g. decoder synchronisation
    • H04N21/4307Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4394Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4398Processing of audio elementary streams involving reformatting operations of audio signals
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/4402Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/440218Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by transcoding between formats or standards, e.g. from MPEG-2 to MPEG-4
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/442Monitoring of processes or resources, e.g. detecting the failure of a recording device, monitoring the downstream bandwidth, the number of times a movie has been viewed, the storage space available from the internal hard disk
    • H04N21/44213Monitoring of end-user related data

Definitions

  • Embodiments of the present disclosure relate to the technical field of smart TVs, for example, to a method, an apparatus, a device, and a storage medium for realizing co-location of sound and image.
  • At present, when electronic display products such as large-size LCD TVs play videos, the video image is presented through the display screen, while the video sound is emitted through speakers arranged elsewhere on the TV. Because the video sound and the corresponding video image are not played at the same position, the playback effect of the video is poor and the user's sense of presence when watching the video is weak.
  • the present disclosure provides a method, an apparatus, a device, and a storage medium for realizing co-location of sound and image.
  • through the method, the same-position presentation of sound and image can be effectively realized and the playback effect of video improved.
  • an embodiment of the present disclosure provides a method for realizing co-location of sound and image.
  • the method includes:
  • a control signal is generated according to the position information of the sound-emitting source on the video display screen, so as to control, according to the control signal, a sound reproduction element corresponding to the position information to emit sound;
  • the preset image feature database is constructed in advance according to the currently playing video.
  • performing image recognition and sound recognition on the currently playing video to obtain the image features and sound features corresponding to the currently playing video includes:
  • An image recognition interface is called for image recognition based on the image data to obtain image features corresponding to the image data
  • a sound recognition interface is called for sound recognition based on the sound data to obtain a sound feature corresponding to the sound data.
  • the sound reproduction elements are provided independently according to pre-divided partitions of the video display screen;
  • the number of the partitions is set according to the size of the display screen.
  • the sound reproduction element includes a speaker.
  • generating a control signal according to the position information of the sound-emitting source on the video display screen, so as to control, according to the control signal, the sound reproduction element corresponding to the position information to emit sound, includes: decoding the sound data through Dolby Atmos (ATMOS) to obtain an Integrate Interface of Sound (IIS) audio signal, and controlling, according to the IIS audio signal and the control signal, the power amplifier corresponding to the position information to operate so as to drive the speaker corresponding to the position information to emit sound.
  • determining, based on the image features, that a sound-emitting source exists in the currently playing video includes: matching the image features for similarity against image features in the preset image feature database, and determining that a sound-emitting source exists in the currently playing video when the matched similarity reaches a set threshold.
  • the image features in the preset image feature database include at least one of the following: human body form features and animal form features.
  • determining, based on the sound features, that an audio source matching the sound-emitting source exists in the currently playing video includes: comparing the sound features with the model features of a pre-established model sound-emitting source, and determining that a matching audio source exists if there is a model feature consistent with the sound features and the model sound-emitting source corresponding to that model feature is the same as the sound-emitting source existing in the currently playing video.
  • controlling the sound reproduction element corresponding to the position information to emit sound includes: when the amplitude of the sound emitted by the sound reproduction element exceeds a set upper limit, reducing the sound emission gain of the sound reproduction element; and when the amplitude does not exceed a set lower limit, increasing the sound emission gain of the sound reproduction element.
  • the currently playing video is obtained by sampling the video being played according to a preset number of sampling times per unit time.
  • the video display screen includes a liquid crystal display screen of a preset size.
  • an embodiment of the present disclosure provides an apparatus for realizing co-location of sound and image.
  • the apparatus includes:
  • a recognition module configured to separately perform image recognition and sound recognition on a currently playing video to obtain image features and sound features corresponding to the currently playing video;
  • an obtaining module configured to, when it is determined based on the image features that a sound-emitting source exists in the currently playing video, obtain, from a preset image feature database based on the image features, position information of the sound-emitting source of the currently playing video on a video display screen;
  • a control module configured to, when it is determined based on the sound features that an audio source matching the sound-emitting source exists in the currently playing video, generate a control signal according to the position information of the sound-emitting source on the video display screen, so as to control, according to the control signal, the sound reproduction element corresponding to the position information to emit sound;
  • the preset image feature database is constructed in advance according to the currently playing video.
  • an embodiment of the present disclosure provides an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the above method.
  • an embodiment of the present disclosure provides a storage medium including computer-executable instructions that implement the foregoing method when executed by a computer processor.
  • FIG. 1a is a schematic flowchart of a method for realizing co-location of sound and image provided by Embodiment 1 of the present disclosure;
  • FIG. 1b is a schematic flowchart of another method for realizing co-location of sound and image provided by Embodiment 1 of the present disclosure;
  • FIG. 2 is a schematic flowchart of still another method for realizing co-location of sound and image provided by Embodiment 1 of the present disclosure;
  • FIG. 3 is a schematic flowchart of a method for realizing co-location of sound and image provided by Embodiment 2 of the present disclosure;
  • FIG. 4 is a schematic diagram of partitions of a display screen provided in Embodiment 2 of the present disclosure;
  • FIG. 5 is a schematic flowchart of controlling, according to the control signal, a corresponding power amplifier to operate so as to drive a speaker to emit sound, provided in Embodiment 2 of the present disclosure;
  • FIG. 6 is a schematic structural diagram of an apparatus for realizing co-location of sound and image provided in Embodiment 3 of the present disclosure;
  • FIG. 7 is a schematic structural diagram of an electronic device according to Embodiment 4 of the present disclosure.
  • FIG. 1a is a schematic flowchart of a method for achieving co-location of sound and image according to Embodiment 1 of the present disclosure.
  • the method for realizing co-location of sound and image provided by this embodiment can be applied to electronic products with a large-size display screen, such as TV products of 65 inches and above.
  • the method for realizing co-location of sound and image is suitable for the playback of videos whose sound features have an obvious directional property.
  • a video whose sound features have an obvious directional property is, for example, a video containing a person who speaks, quarrels, or sings; a video containing an animal that makes a cry; or a video containing an object that makes a knocking sound (such as forging iron or welding) or the sound of an object being shattered (such as the sound of glass or of ceramic products such as bowls), that is, a video with obvious sound features and an identifiable source of the sound.
  • during the playback of a video whose sound has no obvious directional property, the method for realizing co-location of sound and image provided by this embodiment cannot be performed normally.
  • the method for realizing co-location of sound and image may be performed by an apparatus for realizing co-location of sound and image.
  • the apparatus may be implemented by software and/or hardware, and is generally integrated in an electronic device having a large-size display screen.
  • the method for realizing co-location of sound and image is used to improve the video playback effect and to enhance the user's sense of presence and immersion when watching a video. As shown in FIG. 1a, the method includes the following steps:
  • Step 10: Perform image recognition and sound recognition separately on the currently playing video to obtain image features and sound features corresponding to the currently playing video.
  • Step 20: When it is determined based on the image features that a sound-emitting source exists in the currently playing video, obtain, from a preset image feature database based on the image features, position information of the sound-emitting source on the video display screen.
  • Step 30: When it is determined based on the sound features that an audio source matching the sound-emitting source exists in the currently playing video, generate a control signal according to the position information of the sound-emitting source on the video display screen, so as to control, according to the control signal, the sound reproduction element corresponding to the position information to emit sound.
  • the preset image feature database is constructed in advance according to the currently playing video.
  • FIG. 1b is a schematic flowchart of another method for realizing co-location of sound and image provided by Embodiment 1 of the present disclosure. As shown in FIG. 1b, the method includes:
  • Step 110: Perform image recognition and sound recognition separately on the currently playing video to obtain image features and sound features corresponding to the currently playing video.
  • the image features refer to features of persons, animals, or other objects and articles contained in the currently playing video.
  • the object may be, for example, a wooden bench or a wooden table, and the article may be, for example, a ceramic product such as a bowl, a cup, or a teapot.
  • as long as it is a feature of a thing contained in the currently playing video, it falls within the scope of the image features.
  • the sound features refer to audio features contained in the currently playing video, such as singing, speech, animal cries, or the sound of an article being broken.
  • the currently playing video is obtained by periodically sampling the video being played; for example, the video being played is sampled twice (that is, a preset number of sampling times) per second (that is, per unit time), and the video obtained by each sampling is the currently playing video. That is, the currently playing video is obtained by sampling the video being played according to a preset number of sampling times per unit time.
  • performing image recognition and sound recognition on the currently playing video to obtain image features and sound features corresponding to the currently playing video includes:
  • An image recognition interface is called for image recognition based on the image data to obtain image features corresponding to the image data
  • a sound recognition interface is called for sound recognition based on the sound data to obtain a sound feature corresponding to the sound data.
  • the decoding of the currently playing video may be performed using a mature decoding algorithm in the art, which is not described in this embodiment.
  • the image recognition interface is Baidu's face recognition system, which can effectively recognize image features in the currently playing video.
  • the sound recognition interface is a program module that extracts information capable of reflecting the characteristics of a sound, such as amplitude or frequency; this program module can effectively extract the sound features in the currently playing video.
  • Step 120: Determine, based on the image features, whether a sound-emitting source exists in the currently playing video. If a sound-emitting source exists in the currently playing video, proceed to step 130; if no sound-emitting source exists in the currently playing video, return to step 110.
  • the sound-emitting source refers to the source of a sound in the currently playing video, such as a person or thing emitting the sound.
  • determining, based on the image features, that a sound-emitting source exists in the currently playing video includes: matching the image features for similarity against image features in the preset image feature database, and determining that a sound-emitting source exists when the matched similarity reaches a set threshold.
  • the image features in the preset image feature database include human body form features and/or animal form features, and may also include the form features of objects and articles.
  • the human body form features may refer to the mouth shape of a person when making a sound;
  • the animal form features may refer to the mouth shape of an animal when making a sound;
  • the form features of an object or article may refer to the posture of the object or article when it produces a sound, such as a knocking or rubbing posture between objects, or the posture of an article being broken.
  • the essence of the image feature is an identifier of the sound-emitting source present in the currently playing video.
  • the image features in the preset image feature database can be obtained by learning the currently playing video in advance through an autonomous-learning function based on artificial-intelligence technology, while the positions of these image features on the display screen of the current video playback device are marked. Therefore, in the autonomous-learning process, the screen size information of the electronic device configured to play the current video must also be added. Considering cost, the electronic device that plays the current video may consider only the smart TVs configured with a 65-inch liquid crystal display screen, which are currently popular on the market.
  • since the currently playing video can be obtained by periodically sampling the video being played, setting the set threshold allows some images with inconspicuous image features to be filtered out, that is, the video data obtained from the current sampling is discarded and the next sampling data is awaited. The method for realizing co-location of sound and image thus reduces the occupation of system resources and improves the accuracy of determining whether a sound-emitting source exists in the currently playing video.
  • Step 130: Obtain, from a preset image feature database based on the image features, position information of the sound-emitting source of the currently playing video on a video display screen.
  • the video display screen refers to the display screen of an electronic device configured to play a video.
  • the preset image feature database is constructed in advance according to the currently playing video, and stores the correspondence between the sound-emitting source of the currently playing video and its position information on the video display screen; through this correspondence, the position information of the sound-emitting source on the video display screen can be found.
  • Step 140: Determine, based on the sound features, whether an audio source matching the sound-emitting source exists in the currently playing video. If such an audio source exists, proceed to step 150; if not, return to step 110.
  • determining, based on the sound features, that an audio source matching the sound-emitting source exists in the currently playing video includes: comparing the sound features with the model features of a pre-established model sound-emitting source, and determining that a matching audio source exists if there is a model feature consistent with the sound features and the corresponding model sound-emitting source is the same as the sound-emitting source in the currently playing video.
  • the model features of the model sound-emitting source are constructed in advance according to the currently playing video, and the sound-emitting sources existing in the currently playing video, together with the sound features corresponding to those sources, are stored in the model sound-emitting source. For example, if the sound-emitting source is a person and the corresponding sound feature is singing, it means that the person corresponding to the sound-emitting source in the currently playing video is singing; if the sound-emitting source is a person and the corresponding sound feature is a dog's bark, it means that the person corresponding to the sound-emitting source in the currently playing video is imitating a dog's bark; if the sound-emitting source is a glass vase and the corresponding sound feature is the sound of glass breaking, it means that the glass vase corresponding to the sound-emitting source in the currently playing video made the sound of breaking glass.
  • when both the sound features identified from the currently playing video and the determined sound-emitting source match the model features and the corresponding model sound-emitting source, it means that a sound feature with an obvious directional property, that is, an audio source, exists in the currently playing video.
  • the sound features refer to audio features contained in the currently playing video, such as singing, speech, animal cries, or the sound of an article being broken.
  • Step 150: Generate a control signal according to the position information of the sound-emitting source on the video display screen to control the sound reproduction element corresponding to the position information to emit sound.
  • the sound reproduction element includes a speaker, and the sound reproduction elements are provided independently according to pre-divided partitions of the video display screen;
  • the number of partitions is set according to the size of the display screen.
  • the video display screen may be divided in advance into a specific number of small areas, each voiced by an independent sound reproduction element. By determining in which small area the sound-emitting source is located, the sound reproduction element of the corresponding small area is controlled to emit sound, thereby achieving the purpose of co-locating the sound with the image and giving the user watching the video the on-the-spot effect that the sound is emitted by the sound-emitting source (a minimal sketch of one possible position-to-partition mapping is given after the examples below).
  • for example, if the content of the currently playing video is "the queen is scolding a servant", the viewer can feel that the scolding words are emitted exactly from the queen's mouth, which gives the viewer a stronger sense-of-presence experience and enhances the viewer's immersion.
  • if the content of the currently playing video is "a bird flies through the woods and makes pleasant birdsong", the viewer feels that the birdsong is emitted by a bird at a certain position on the display screen, giving the viewer a stronger sense-of-presence experience.
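  • For illustration only (an editorial sketch, not part of the original disclosure): one possible way to map a sound-emitting source's on-screen position to one of six pre-divided partitions. The 2-column-by-3-row layout matching the six speaker positions described later, and the pixel coordinates, are assumptions.

```python
def partition_for_position(x, y, screen_w=3840, screen_h=2160):
    """Return a sound-field index 1..6 for a pixel position (x, y).

    Fields 1/2/3 are the upper/middle/lower left areas and fields 4/5/6
    the upper/middle/lower right areas of the screen.
    """
    column = 0 if x < screen_w / 2 else 1  # 0 = left half, 1 = right half
    row = min(int(y / (screen_h / 3)), 2)  # 0 = top, 1 = middle, 2 = bottom
    return column * 3 + row + 1            # 1..6

# Example: a face detected near the lower-left corner falls in sound field 3.
assert partition_for_position(200, 2000) == 3
```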
  • the method for realizing co-location of sound and image recognizes the sound-emitting source and the corresponding sound features in the currently playing video through image recognition and sound recognition; when a sound feature with an obvious directional property exists, the position information of the sound-emitting source in the currently playing video on the video display screen is obtained, and the sound reproduction element at the sound-emitting source is controlled to emit sound according to that position information, thereby realizing the co-location of sound and image, giving the feeling that the video sound is emitted from the corresponding sound-emitting source and enhancing the viewer's sense of presence and immersion.
  • this embodiment provides another schematic flowchart of a method for achieving co-location of sound and image.
  • the method includes:
  • Step 210: The video starts playing.
  • Step 220: Perform video sampling on the video being played.
  • considering the occupation of system resources and the frame rate of the video, this embodiment samples the video being played at a sampling frequency of twice per second, which minimizes the occupation of system resources while ensuring that the method for realizing co-location of sound and image is not affected and that no sound-emitting source with an obvious directional property in the video is missed.
  • Step 230: Perform video decoding on the sampled video.
  • video decoding is performed on the sampled video to obtain image data and sound data in the video, respectively.
  • the video decoding of the sampled video may be performed using a mature decoding algorithm in the art, which is not described in this embodiment.
  • Step 231: Obtain the sound data in the video.
  • Step 240: Obtain the image data in the video.
  • Step 250: Perform image recognition according to the image data to obtain image features.
  • an image recognition operation can be performed by calling an image recognition interface.
  • the image recognition interface is Baidu's face recognition system, which can effectively recognize image features in the currently playing video.
  • Step 260: Match the image features against the image features in an image database.
  • the image database is constructed in advance according to the currently playing video, and stores the image features of the sound-emitting sources existing in the currently playing video.
  • Step 270: Confirm whether matching data is obtained. If matching data is obtained, perform step 280; if no matching data is obtained, discard the current sampling data and perform the next sampling.
  • the essence of confirming whether matching data is obtained is to determine whether data matching the image features exists in the image database. If such data exists, perform step 280 to carry out sound recognition based on the sound data and obtain sound features; if no such data exists, discard the current sampling data and perform the next sampling.
  • Step 280: Perform sound recognition according to the sound data to obtain sound features.
  • Step 290: Match the sound features against the sound features in a sound database.
  • the sound database is constructed in advance according to the currently playing video, and stores the feature data of the sounds emitted by the sound-emitting sources in the currently playing video.
  • Step 2100: Confirm whether matching data is obtained. If matching data is obtained, perform step 2110; if no matching data is obtained, discard the current sampling data and perform the next sampling.
  • the essence of confirming whether matching data is obtained is to determine whether data matching the sound features exists in the sound database.
  • the sound field control information is control information for controlling the speaker at the position of the sound-emitting source on the current display screen to emit sound.
  • Step 2110: Output the sound field control information according to the position information of the sound-emitting source on the video display screen, so as to control the corresponding sound field to emit sound.
  • by performing image recognition on the sampled video data, the purpose of determining whether a sound-emitting source with an obvious directional property exists in the video data is achieved.
  • when a sound-emitting source exists, sound feature recognition continues to be performed on the video data.
  • when there is a sound feature matching the sound-emitting source, the speaker at the position of the sound-emitting source on the display screen is controlled to emit sound, realizing the co-location of sound and image, improving the video playback effect, and bringing the viewer a stronger sense-of-presence experience.
  • FIG. 3 is a schematic flowchart of a method for realizing co-location of sound and image provided by Embodiment 2 of the present disclosure. On the basis of the above embodiments, this embodiment describes the implementation process of sound reproduction for the sound-emitting source. As shown in FIG. 3, the method includes:
  • Step 310: Decode the currently playing video to obtain image data and sound data corresponding to the currently playing video, respectively.
  • Step 320: Call an image recognition interface to perform image recognition based on the image data to obtain image features corresponding to the image data, and call a sound recognition interface to perform sound recognition based on the sound data to obtain sound features corresponding to the sound data.
  • Step 330: Determine, based on the image features, whether a sound-emitting source exists in the currently playing video. If a sound-emitting source exists, proceed to step 340; if not, return to step 310.
  • Step 340: Obtain, from a preset image feature database based on the image features, position information of the sound-emitting source of the currently playing video on a video display screen.
  • Step 350: Determine, based on the sound features, whether an audio source matching the sound-emitting source exists in the currently playing video. If such an audio source exists, proceed to step 360; if not, return to step 310.
  • Step 360: Generate a control signal according to the position information of the sound-emitting source on the video display screen.
  • the ability of the display screen of the electronic device configured to play the video to emit sound by partition is a basic premise of the method for realizing co-location of sound and image provided by the embodiments of the present disclosure; only when sound-emitting components are installed at the corresponding positions of the display screen can a sound effect with a sense of presence be realized. However, since an audio source has a region-size attribute, absolute co-location of image and sound is impossible. If a virtual sound algorithm were used to make the sound appear to come from the display screen, video image recognition and sound-field virtualization would have to be applied in real time, which would occupy considerable Central Processing Unit (CPU) resources.
  • the display screen is therefore divided in advance into a specific number of partitions, and an independent sound field is virtualized for each partition; the independent sound field of each partition is realized by configuring an independent speaker for that partition.
  • the video display screen includes a liquid crystal display screen of a preset size. Exemplarily, the preset size may be 65 inches.
  • FIG. 4 is a schematic diagram of the partitions of a display screen. As shown in FIG. 4, in order to save system resources while reflecting the co-location effect, a display screen of 65 inches and above is divided into 6 partitions of equal area, corresponding respectively to 6 virtual sound fields.
  • the 6 virtual sound fields are sound field 1, sound field 2, sound field 3, sound field 4, sound field 5, and sound field 6; each virtual sound field is realized by an independent speaker, and the corresponding 6 speakers are installed at the upper left, middle left, lower left, upper right, middle right, and lower right positions of the display screen.
  • the two speakers of sound field 1 and sound field 2 are driven by a first power amplifier to restore the sound of those two sound fields; the two speakers of sound field 3 and sound field 4 are driven by a second power amplifier to restore the sound of those two sound fields; and the two speakers corresponding to sound field 5 and sound field 6 are driven by a third power amplifier to restore the sound of those two sound fields.
  • an audio source refers to a video signal carrying sound information.
  • the audio source can be obtained by decoding the video.
  • the audio source decoded from the video, that is, the sound data, can be further decoded to separate out sound data in multiple directions.
  • there are many sound decoding methods, for example ATMOS decoding and Digital Theater System (DTS) decoding, but only ATMOS decoding can decode two-channel sound into 8-channel sound.
  • the decoded sounds in six directions are obtained, that is, the sound signals in the six directions of sound field 1, sound field 2, sound field 3, sound field 4, sound field 5, and sound field 6.
  • a schematic flowchart of decoding the audio source and controlling, according to the control signal, the power amplifier corresponding to the position information to drive a speaker to emit sound may be seen in FIG. 5; the method includes:
  • Step 510: Acquire an audio source.
  • the audio source in the video can be obtained by decoding the video.
  • Step 520: Decode the audio source through an ATMOS chip.
  • the ATMOS chip is configured in the electronic device playing the video, and the IIS audio signal contains the control logic for controlling the first power amplifier, the second power amplifier, and the third power amplifier.
  • Step 530: Obtain an IIS audio signal.
  • Step 540: Send the sound field control information to the IIS audio signal.
  • the sound field control information is control information for controlling the sound field at the position of the sound-emitting source on the current display screen to emit sound; the speaker at that position is driven to operate by the power amplifier at that position.
  • the purpose of sending the sound field control information to the IIS audio signal is to encode the sound field control information into the IIS audio signal.
  • the IIS audio signal is a digital signal in which the sound signals of multiple directions in the video are modulated, and the sound field control information contains the position information of the sound field to be triggered.
  • when the IIS audio signal is restored to an analog signal, the sound field control information is used to select which direction of the sound in the IIS audio signal is restored; therefore, the sound field control information can be encoded into the IIS audio signal and restored together with it as an analog signal.
  • the process of restoring the sound in the video is: decoding the IIS audio signal to obtain the sounds corresponding to the six directions of sound field 1, sound field 2, sound field 3, sound field 4, sound field 5, and sound field 6, and using the sound field control information obtained by decoding to control the power amplifier at the position to operate so as to drive the corresponding speaker to emit sound, thereby restoring the sound of the direction at that position.
  • for example, if the sound field control information is the control information that triggers sound field 3, then when the IIS audio signal is restored, only the sound signal in the direction of sound field 3 is restored, and no sound signal is delivered to the other sound field areas.
  • Step 370: Decode the sound data through ATMOS to obtain an IIS audio signal.
  • Step 380: Control, according to the IIS audio signal and the control signal, the power amplifier corresponding to the position information to operate so as to drive the corresponding speaker to emit sound.
  • the speaker emits sound, and the position at which the sound is presented is basically consistent with the position of the person emitting the sound, which brings the viewer a strong sense-of-presence experience.
  • controlling the sound reproduction element corresponding to the position information to emit sound includes:
  • when the amplitude of the sound emitted by the sound reproduction element exceeds a set upper limit, reducing the sound emission gain of the sound reproduction element; and when the amplitude of the sound emitted by the sound reproduction element does not exceed the set lower limit, increasing the sound emission gain of the sound reproduction element.
  • a dynamic sound-amplitude adjustment technique, that is, a professional sound-effect algorithm, is used to control the sound amplitude within a set range: when the sound amplitude is lower than the set lower limit, the gain of the speaker corresponding to the position of the sound-emitting source is increased; when the sound amplitude exceeds the set upper limit, the gain of that speaker is reduced, so that the volume of the video is within the set range at all times.
  • This embodiment provides a method for realizing the co-location of sound and image.
  • FIG. 6 is a schematic structural diagram of an apparatus for realizing co-location of sound and image provided in Embodiment 3 of the present disclosure; as shown in FIG. 6, the apparatus includes: a recognition module 610, an obtaining module 620, and a control module 630;
  • the recognition module 610 is configured to perform image recognition and sound recognition on the currently playing video to obtain the image features and sound features corresponding to the currently playing video.
  • the obtaining module 620 is configured to, when it is determined based on the image features that a sound-emitting source exists in the currently playing video, obtain, from a preset image feature database based on the image features, the position information of the sound-emitting source of the currently playing video on the video display screen; the control module 630 is configured to, when it is determined based on the sound features that an audio source matching the sound-emitting source exists in the currently playing video, generate a control signal according to the position information of the sound-emitting source on the video display screen, so as to control, according to the control signal, the sound reproduction element corresponding to the position information to emit sound; wherein the preset image feature database is constructed in advance according to the currently playing video.
  • the recognition module 610 is configured to decode the currently playing video to obtain the image data and sound data corresponding to the currently playing video, respectively; call an image recognition interface to perform image recognition based on the image data to obtain image features corresponding to the image data; and call a sound recognition interface to perform sound recognition based on the sound data to obtain sound features corresponding to the sound data.
  • the sound reproduction elements are provided independently according to pre-divided partitions of the video display screen;
  • the number of partitions is set according to the size of the display screen.
  • the control module 630 is configured to decode the sound data through ATMOS to obtain an IIS audio signal, and to control, according to the IIS audio signal and the control signal, the power amplifier corresponding to the position information to operate so that the speaker corresponding to the position information is driven to emit sound.
  • the obtaining module 620 includes a sound-emitting-source determination sub-module 640; the sound-emitting-source determination sub-module 640 is configured to determine, based on the image features, that a sound-emitting source exists in the currently playing video.
  • the sound-emitting-source determination sub-module 640 includes:
  • a matching unit configured to match the image features for similarity against the image features in a preset image feature database;
  • a determining unit configured to determine that a sound-emitting source exists in the currently playing video when the matched similarity reaches a set threshold;
  • the image features in the preset image feature database include human body form features and/or animal form features.
  • the control module 630 includes an audio-source determination sub-module 650; the audio-source determination sub-module 650 is configured to determine, based on the sound features, that an audio source matching the sound-emitting source exists in the currently playing video.
  • the audio-source determination sub-module 650 is configured to compare the sound features with the model features of a pre-established model sound-emitting source, and to determine that an audio source matching the sound-emitting source exists in the currently playing video if there is a model feature consistent with the sound features and the model sound-emitting source corresponding to that model feature is the same as the sound-emitting source existing in the currently playing video.
  • the control module 630 is further configured to reduce the sound emission gain of the sound reproduction element when the amplitude of the sound emitted by the sound reproduction element exceeds a set upper limit, and to increase the sound emission gain of the sound reproduction element when the amplitude of the sound emitted by the sound reproduction element does not exceed the set lower limit.
  • the apparatus for realizing co-location of sound and image recognizes the sound-emitting source and the corresponding sound features in the currently playing video through image recognition and sound recognition; when a sound feature with an obvious directional property exists, the position information of the sound-emitting source in the currently playing video on the video display screen is obtained, and the sound reproduction element at the sound-emitting source is controlled to emit sound according to that position information, thereby realizing the co-location of sound and image, giving the feeling that the video sound is emitted from the corresponding sound-emitting source and enhancing the viewer's sense of presence and immersion.
  • the technical solution of the present disclosure realizes the presentation of sound and image at the same position, so that the user watching the video feels that the position of the video sound is basically consistent with the position of the object emitting the sound in the video, which improves the playback effect of the video and the user experience.
  • FIG. 7 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present disclosure.
  • the electronic device includes: a processor 770, a memory 771, and a computer program stored on the memory 771 and executable on the processor 770.
  • the number of processors 770 may be one or more; one processor 770 is taken as an example in FIG. 7. When the processor 770 executes the computer program, the method for realizing co-location of sound and image described in the foregoing embodiments is implemented.
  • the electronic device may further include an input device 772 and an output device 773.
  • the processor 770, the memory 771, the input device 772, and the output device 773 may be connected through a bus or other methods. In FIG. 7, the connection through the bus is taken as an example.
  • the memory 771, as a computer-readable storage medium, may be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the method for realizing co-location of sound and image in the embodiments of the present disclosure (for example, the recognition module 610, the obtaining module 620, and the control module 630 in the apparatus for realizing co-location of sound and image).
  • the processor 770 executes the multiple functional applications and data processing of the electronic device by running the software programs, instructions, and modules stored in the memory 771, that is, implements the above-mentioned method for realizing co-location of sound and image.
  • the memory 771 may include a storage program area and a storage data area, where the storage program area may store an operating system and application programs required for at least one function; the storage data area may store data created according to the use of the terminal, and the like.
  • the memory 771 may include a high-speed random access memory, and may further include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage device.
  • the memory 771 may include memory remotely disposed with respect to the processor 770, and these remote memories may be connected to the electronic device / storage medium through a network. Examples of the above network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
  • the input device 772 may be configured to receive inputted numeric or character information and generate key signal inputs related to user settings and function control of the electronic device.
  • the output device 773 may include a display device such as a display screen.
  • Embodiment 5 of the present disclosure further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform a method for realizing co-location of sound and image.
  • the method includes:
  • performing image recognition and sound recognition separately on a currently playing video to obtain image features and sound features corresponding to the currently playing video;
  • in a case where it is determined based on the image features that a sound-emitting source exists in the currently playing video, obtaining, from a preset image feature database based on the image features, position information of the sound-emitting source on a video display screen;
  • in a case where it is determined based on the sound features that an audio source matching the sound-emitting source exists in the currently playing video, generating a control signal according to the position information of the sound-emitting source on the video display screen, so as to control, according to the control signal, a sound reproduction element corresponding to the position information to emit sound;
  • the preset image feature database is constructed in advance according to the currently playing video.
  • a storage medium containing computer-executable instructions provided by the embodiments of the present disclosure is not limited to the method operations described above, and may also perform related operations in the method for realizing co-location of sound and image provided by any embodiment of the present disclosure.
  • from the above description of the embodiments, the present disclosure may be implemented by software plus general-purpose hardware, and may also be implemented by hardware. Based on this understanding, the part of the technical solution of the present disclosure that is essential, or that contributes to the related art, may be embodied in the form of a software product.
  • the computer software product may be stored in a computer-readable storage medium, such as a computer floppy disk, a read-only memory (ROM), a random access memory (RAM), a flash memory (FLASH), a hard disk, or an optical disk, and includes multiple instructions for enabling a computer device (which may be a personal computer, a storage medium, or a network device, etc.) to perform the method described in one or more embodiments of the present disclosure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Social Psychology (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Databases & Information Systems (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Stereophonic System (AREA)

Abstract

Disclosed herein is a method for realizing co-location of sound and image, including: performing image recognition and sound recognition separately on a currently playing video to obtain image features and sound features corresponding to the currently playing video; in a case where it is determined, based on the image features, that a sound-emitting source exists in the currently playing video, obtaining position information of the sound-emitting source of the currently playing video on a video display screen; and in a case where it is determined, based on the sound features, that an audio source matching the sound-emitting source exists in the currently playing video, generating a control signal according to the position information of the sound-emitting source on the video display screen, so as to control a sound reproduction element corresponding to the position information to emit sound. Also disclosed herein are an apparatus, a device, and a storage medium for realizing co-location of sound and image.

Description

Method, apparatus, device and storage medium for realizing co-location of sound and image
This application claims priority to Chinese patent application No. 201811043120.4 filed with the Chinese Patent Office on September 7, 2018, the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present disclosure relate to the technical field of smart TVs, for example, to a method, an apparatus, a device, and a storage medium for realizing co-location of sound and image.
Background
At present, when electronic display products such as large-size LCD TVs play a video, the video image is presented through the display screen, while the video sound is emitted through speakers arranged elsewhere on the TV. Since the video sound and the corresponding video image are not played at the same position, the playback effect of the video is poor and the user's sense of presence when watching the video is weak.
Summary
The present disclosure provides a method, an apparatus, a device, and a storage medium for realizing co-location of sound and image; through the method, the same-position presentation of sound and image is effectively realized and the playback effect of video is improved.
In an embodiment, an embodiment of the present disclosure provides a method for realizing co-location of sound and image, the method including:
performing image recognition and sound recognition separately on a currently playing video to obtain image features and sound features corresponding to the currently playing video;
in a case where it is determined, based on the image features, that a sound-emitting source exists in the currently playing video, obtaining, from a preset image feature database based on the image features, position information of the sound-emitting source on a video display screen;
in a case where it is determined, based on the sound features, that an audio source matching the sound-emitting source exists in the currently playing video, generating a control signal according to the position information of the sound-emitting source on the video display screen, so as to control, according to the control signal, a sound reproduction element corresponding to the position information to emit sound;
wherein the preset image feature database is constructed in advance according to the currently playing video.
In an embodiment, performing image recognition and sound recognition separately on the currently playing video to obtain the image features and sound features corresponding to the currently playing video includes:
decoding the currently playing video to obtain image data and sound data corresponding to the currently playing video, respectively;
calling an image recognition interface to perform image recognition based on the image data to obtain image features corresponding to the image data, and calling a sound recognition interface to perform sound recognition based on the sound data to obtain sound features corresponding to the sound data.
In an embodiment, the sound reproduction elements are provided independently according to pre-divided partitions of the video display screen;
wherein the number of the partitions is set according to the size of the display screen.
In an embodiment, the sound reproduction element includes a speaker.
In an embodiment, generating the control signal according to the position information of the sound-emitting source on the video display screen so as to control, according to the control signal, the sound reproduction element corresponding to the position information to emit sound includes:
decoding the sound data through Dolby Atmos (Atmosphere, ATMOS) to obtain an Integrate Interface of Sound (IIS) audio signal;
controlling, according to the IIS audio signal and the control signal, a power amplifier corresponding to the position information to operate so as to drive a speaker corresponding to the position information to emit sound.
In an embodiment, determining, based on the image features, that a sound-emitting source exists in the currently playing video includes:
matching the image features for similarity against image features in a preset image feature database;
in a case where the matched similarity reaches a set threshold, determining that a sound-emitting source exists in the currently playing video;
wherein the image features in the preset image feature database include at least one of: human body form features and animal form features.
In an embodiment, determining, based on the sound features, that an audio source matching the sound-emitting source exists in the currently playing video includes:
comparing the sound features with model features of a pre-established model sound-emitting source;
in a case where there is a model feature consistent with the sound features and the model sound-emitting source corresponding to the model feature is the same as the sound-emitting source existing in the currently playing video, determining that an audio source matching the sound-emitting source exists in the currently playing video.
In an embodiment, controlling the sound reproduction element corresponding to the position information to emit sound includes:
in a case where the amplitude of the sound emitted by the sound reproduction element exceeds a set upper limit, reducing the sound emission gain of the sound reproduction element;
in a case where the amplitude of the sound emitted by the sound reproduction element does not exceed the set lower limit, increasing the sound emission gain of the sound reproduction element.
In an embodiment, the currently playing video is obtained by sampling the video being played according to a preset number of sampling times per unit time.
In an embodiment, the video display screen includes a liquid crystal display screen of a preset size.
In an embodiment, an embodiment of the present disclosure provides an apparatus for realizing co-location of sound and image, the apparatus including:
a recognition module configured to perform image recognition and sound recognition separately on a currently playing video to obtain image features and sound features corresponding to the currently playing video;
an obtaining module configured to, in a case where it is determined based on the image features that a sound-emitting source exists in the currently playing video, obtain, from a preset image feature database based on the image features, position information of the sound-emitting source of the currently playing video on a video display screen;
a control module configured to, in a case where it is determined based on the sound features that an audio source matching the sound-emitting source exists in the currently playing video, generate a control signal according to the position information of the sound-emitting source on the video display screen, so as to control, according to the control signal, a sound reproduction element corresponding to the position information to emit sound;
wherein the preset image feature database is constructed in advance according to the currently playing video.
In an embodiment, an embodiment of the present disclosure provides an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the above method.
In an embodiment, an embodiment of the present disclosure provides a storage medium containing computer-executable instructions which, when executed by a computer processor, implement the above method.
Brief Description of the Drawings
FIG. 1a is a schematic flowchart of a method for realizing co-location of sound and image according to Embodiment 1 of the present disclosure;
FIG. 1b is a schematic flowchart of another method for realizing co-location of sound and image according to Embodiment 1 of the present disclosure;
FIG. 2 is a schematic flowchart of still another method for realizing co-location of sound and image according to Embodiment 1 of the present disclosure;
FIG. 3 is a schematic flowchart of a method for realizing co-location of sound and image according to Embodiment 2 of the present disclosure;
FIG. 4 is a schematic diagram of partitions of a display screen according to Embodiment 2 of the present disclosure;
FIG. 5 is a schematic flowchart of controlling, according to the control signal, a corresponding power amplifier to operate so as to drive a speaker to emit sound according to Embodiment 2 of the present disclosure;
FIG. 6 is a schematic structural diagram of an apparatus for realizing co-location of sound and image according to Embodiment 3 of the present disclosure;
FIG. 7 is a schematic structural diagram of an electronic device according to Embodiment 4 of the present disclosure.
Detailed Description
Embodiment 1
FIG. 1a is a schematic flowchart of a method for realizing co-location of sound and image according to Embodiment 1 of the present disclosure. The method for realizing co-location of sound and image provided by this embodiment can be applied to electronic products with a large-size display screen, for example, TV products of 65 inches and above. When the display screen is small, the distance between the sound reproduction system and the video image (that is, the sound-emitting source) is short, so the acoustic effect of co-locating sound and image cannot be prominently perceived. The method for realizing co-location of sound and image is suitable for the playback of videos whose sound features have an obvious directional property. In an embodiment, a video whose sound features have an obvious directional property is, for example, a video containing a person who speaks, quarrels, or sings; a video containing an animal that makes a cry; or a video containing an object that makes a knocking sound (for example, forging iron or electric welding) or the sound of an object being shattered (for example, the sound of glass or of ceramic products such as bowls being broken), that is, a video with obvious sound features and an identifiable source of the sound. During the playback of a video that has sound but whose sound has no obvious directional property, the method for realizing co-location of sound and image provided by this embodiment cannot be executed normally. For example, for a video containing only background music, since the background music has no obvious sound-emitting source, that is, no obvious directional property, the method provided by this embodiment cannot be applied to improve the playback effect; such a video is presented as ordinary sound, and no co-location operation of sound and image is performed. The method for realizing co-location of sound and image may be executed by an apparatus for realizing co-location of sound and image; the apparatus may be implemented by software and/or hardware, and is generally integrated in an electronic device having a large-size display screen. The method for realizing co-location of sound and image is used to improve the video playback effect and to enhance the user's sense of presence and immersion when watching a video. As shown in FIG. 1a, the method includes the following steps:
Step 10: Perform image recognition and sound recognition separately on a currently playing video to obtain image features and sound features corresponding to the currently playing video.
Step 20: In a case where it is determined based on the image features that a sound-emitting source exists in the currently playing video, obtain, from a preset image feature database based on the image features, position information of the sound-emitting source on a video display screen.
Step 30: In a case where it is determined based on the sound features that an audio source matching the sound-emitting source exists in the currently playing video, generate a control signal according to the position information of the sound-emitting source on the video display screen, so as to control, according to the control signal, a sound reproduction element corresponding to the position information to emit sound.
Here, the preset image feature database is constructed in advance according to the currently playing video.
FIG. 1b is a schematic flowchart of another method for realizing co-location of sound and image according to Embodiment 1 of the present disclosure. As shown in FIG. 1b, the method includes:
Step 110: Perform image recognition and sound recognition separately on a currently playing video to obtain image features and sound features corresponding to the currently playing video.
Here, the image features refer to features of persons, animals, or other objects and articles contained in the currently playing video; the object may be, for example, a wooden bench or a wooden table, and the article may be, for example, a ceramic product such as a bowl, a cup, or a teapot. As long as it is a feature of a thing contained in the currently playing video, it falls within the scope of the image features. The sound features refer to audio features contained in the currently playing video, such as singing, speech, animal cries, or the sound of an article being broken.
The currently playing video is obtained by periodically sampling the video being played; for example, the video being played is sampled twice (that is, a preset number of sampling times) per second (that is, per unit time), and the video obtained by each sampling is the currently playing video. That is, the currently playing video is obtained by sampling the video being played according to a preset number of sampling times per unit time.
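For illustration only (this sketch is an editorial addition, not part of the original disclosure), the periodic sampling described above might look as follows in Python; the player object and its is_playing()/grab_snippet() methods are hypothetical stand-ins for the playback device's interfaces:

```python
import time

SAMPLES_PER_SECOND = 2  # the preset number of sampling times per unit time

def sample_playing_video(player, handle_sample):
    """Periodically grab a short snippet of the video being played.

    Each grabbed snippet serves as the "currently playing video" that the
    later recognition steps operate on.
    """
    interval = 1.0 / SAMPLES_PER_SECOND
    while player.is_playing():
        handle_sample(player.grab_snippet())  # image frame + audio chunk
        time.sleep(interval)
```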
Exemplarily, performing image recognition and sound recognition separately on the currently playing video to obtain the image features and sound features corresponding to the currently playing video includes:
decoding the currently playing video to obtain image data and sound data corresponding to the currently playing video, respectively;
calling an image recognition interface to perform image recognition based on the image data to obtain image features corresponding to the image data, and calling a sound recognition interface to perform sound recognition based on the sound data to obtain sound features corresponding to the sound data.
Here, the decoding of the currently playing video may be performed by using a decoding algorithm mature in the art, which is not described in detail in this embodiment. In an embodiment, the image recognition interface is Baidu's face recognition system, which can effectively recognize the image features in the currently playing video. In an embodiment, the sound recognition interface is a program module that extracts, from a sound, information capable of reflecting the characteristics of the sound, such as amplitude or frequency; through this program module, the sound features in the currently playing video can be effectively extracted.
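As a minimal sketch (with injected callables, since the disclosure names the decoder and the two recognition interfaces only abstractly), the recognition step could be organized as:

```python
def recognize(sample, decode_video, image_api, sound_api):
    """Decode one sampled snippet and run both recognitions on it.

    decode_video: splits the snippet into image data and sound data
                  (any mature codec, as noted in the text).
    image_api:    the image recognition interface, e.g. a face recognition
                  service returning mouth-shape features.
    sound_api:    the sound recognition interface, e.g. a module extracting
                  amplitude/frequency information.
    All three callables are hypothetical placeholders.
    """
    image_data, sound_data = decode_video(sample)
    return image_api(image_data), sound_api(sound_data)
```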
Step 120: Determine, based on the image features, whether a sound-emitting source exists in the currently playing video. If a sound-emitting source exists in the currently playing video, proceed to step 130; if no sound-emitting source exists in the currently playing video, return to step 110. Here, the sound-emitting source refers to the source of a sound in the currently playing video, for example, a person or thing that emits the sound.
Exemplarily, determining, based on the image features, that a sound-emitting source exists in the currently playing video includes:
matching the image features for similarity against image features in a preset image feature database;
in a case where the matched similarity reaches a set threshold, determining that a sound-emitting source exists in the currently playing video;
wherein the image features in the preset image feature database include human body form features and/or animal form features, and may further include the form features of objects and articles. The human body form features may refer to the mouth shape of a person when making a sound; the animal form features may refer to the mouth shape of an animal when making a sound; and the form features of an object or article may refer to the posture of the object or article when it produces a sound, for example, a knocking or rubbing posture between objects, or the posture of an article being broken. The essence of the image feature is an identifier of a sound-emitting source present in the currently playing video. The image features in the preset image feature database may be obtained by learning the currently playing video in advance through an autonomous-learning function based on artificial-intelligence technology, while the positions of the image features in the preset image feature database on the display screen of the current video playback device are marked; therefore, in the autonomous-learning process, the screen size information of the electronic device configured to play the current video must also be added. Considering cost, the electronic device that plays the current video may consider only the smart TVs configured with a 65-inch liquid crystal display screen, which are relatively popular on the market at present.
Since the currently playing video can be obtained by periodically sampling the video being played, setting the set threshold allows some images with inconspicuous image features to be filtered out, that is, the video data obtained from the current sampling is discarded and the next sampling data is awaited; this reduces the occupation of system resources by the method for realizing co-location of sound and image and improves the accuracy of determining whether a sound-emitting source exists in the currently playing video.
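The similarity matching with a set threshold might be sketched as follows, assuming image features are numeric vectors and using cosine similarity; the representation, the metric, and the threshold value are assumptions rather than disclosed details:

```python
import math

SET_THRESHOLD = 0.85  # the "set threshold"; the value is illustrative

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def find_sound_source(feature, preset_db):
    """preset_db: list of (stored_feature, position_info) pairs.

    Returns the position information of the best match whose similarity
    reaches the threshold, or None, in which case the current sample is
    discarded and the next sampling is awaited.
    """
    best = max(preset_db, key=lambda entry: cosine(feature, entry[0]), default=None)
    if best is not None and cosine(feature, best[0]) >= SET_THRESHOLD:
        return best[1]  # position on the video display screen
    return None
```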
If no sound-emitting source exists in the currently playing video, it means that the currently playing video has no sound feature with an obvious directional property and the co-located playback effect of sound and image cannot be reflected; therefore, no co-location operation of sound and image is performed on the currently playing video, the video is simply played according to the conventional video playback flow, and the sound in the video is played through all the sound channels of the current video playback device.
Step 130: Obtain, from a preset image feature database based on the image features, position information of the sound-emitting source of the currently playing video on a video display screen.
Here, the video display screen refers to the display screen of the electronic device configured to play the video. The preset image feature database is constructed in advance according to the currently playing video, and stores the correspondence between the sound-emitting source of the currently playing video and its position information on the video display screen; through this correspondence, the position information of the sound-emitting source on the video display screen can be found.
Step 140: Determine, based on the sound features, whether an audio source matching the sound-emitting source exists in the currently playing video. If an audio source matching the sound-emitting source exists in the currently playing video, proceed to step 150; if no audio source matching the sound-emitting source exists in the currently playing video, return to step 110.
Exemplarily, determining, based on the sound features, that an audio source matching the sound-emitting source exists in the currently playing video includes:
comparing the sound features with model features of a pre-established model sound-emitting source;
in a case where there is a model feature consistent with the sound features and the model sound-emitting source corresponding to the model feature is the same as the sound-emitting source existing in the currently playing video, determining that an audio source matching the sound-emitting source exists in the currently playing video.
The model features of the model sound-emitting source are constructed in advance according to the currently playing video; the sound-emitting sources existing in the currently playing video, together with the sound features corresponding to those sources, are stored in the model sound-emitting source. For example, if the sound-emitting source is a person and the corresponding sound feature is singing, it means that the person corresponding to the sound-emitting source in the currently playing video is singing; if the sound-emitting source is a person and the corresponding sound feature is a dog's bark, it means that the person corresponding to the sound-emitting source in the currently playing video is imitating a dog's bark; if the sound-emitting source is a glass vase and the corresponding sound feature is the sound of glass breaking, it means that the glass vase corresponding to the sound-emitting source in the currently playing video made the sound of breaking glass.
When both the sound features recognized from the currently playing video and the determined sound-emitting source match the model features and the corresponding model sound-emitting source, it means that a sound feature with an obvious directional property, that is, an audio source, exists in the currently playing video.
Here, the sound features refer to audio features contained in the currently playing video, such as singing, speech, animal cries, or the sound of an article being broken.
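A minimal sketch of this audio-source check follows; the string labels stand in for the richer model features and are purely illustrative:

```python
# (sound-emitting source, sound feature) pairs stored in the model
# sound-emitting source, per the examples in the text.
MODEL_SOUND_SOURCES = {
    ("person", "singing"),
    ("person", "dog bark"),           # a person imitating a dog
    ("glass vase", "glass breaking"),
}

def audio_source_matches(detected_source, sound_feature):
    """True only if the feature is consistent with a model feature AND the
    model's source is the same as the source determined from the image."""
    return (detected_source, sound_feature) in MODEL_SOUND_SOURCES

assert audio_source_matches("person", "singing")
assert not audio_source_matches("glass vase", "singing")
```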
Step 150: Generate a control signal according to the position information of the sound-emitting source on the video display screen, so as to control the sound reproduction element corresponding to the position information to emit sound.
Here, the sound reproduction element includes a speaker, and the sound reproduction elements are provided independently according to pre-divided partitions of the video display screen;
the number of the partitions is set according to the size of the display screen.
In an embodiment, the video display screen may be divided in advance into a specific number of small areas, and each small area is voiced by an independent sound reproduction element. By determining in which small area the sound-emitting source is located, the sound reproduction element of the corresponding small area is controlled to emit sound, thereby achieving the purpose of co-locating the sound with the image and giving the user watching the video the on-the-spot effect that the sound is emitted by the sound-emitting source. For example, suppose the content of the currently playing video is "the queen is scolding a servant"; through the method for realizing co-location of sound and image provided by this embodiment, the viewer feels that the scolding words are emitted exactly from the queen's mouth, which gives the viewer a stronger sense-of-presence experience and improves the viewer's immersion. Suppose the content of the currently playing video is "a bird flies through the woods and makes pleasant birdsong"; through the method for realizing co-location of sound and image provided by this embodiment, the viewer feels that the birdsong is emitted by a bird at a certain position on the display screen, giving the viewer a stronger sense-of-presence experience.
The method for realizing co-location of sound and image provided by this embodiment recognizes the sound-emitting source and the corresponding sound features in the currently playing video through image recognition and sound recognition; when a sound feature with an obvious directional property exists in the currently playing video, the position information of the sound-emitting source in the currently playing video on the video display screen is obtained, and the sound reproduction element at the sound-emitting source is controlled to emit sound according to the position information, thereby realizing the co-location of sound and image, giving the user the feeling that the video sound is emitted from the corresponding sound-emitting source and improving the viewer's sense of presence and immersion.
On the basis of the above technical solution, this embodiment provides a schematic flowchart of another method for realizing co-location of sound and image. As shown in FIG. 2, the method includes:
Step 210: The video starts playing.
Step 220: Perform video sampling on the video being played.
Considering the occupation of system resources and the frame rate of the video, this embodiment samples the video being played at a sampling frequency of twice per second, which minimizes the occupation of system resources while ensuring that the method for realizing co-location of sound and image is not affected and that no sound-emitting source with an obvious directional property in the video is missed.
Step 230: Perform video decoding on the sampled video.
In an embodiment, video decoding is performed on the sampled video in order to obtain the image data and the sound data in the video, respectively.
Here, the video decoding of the sampled video may be performed by using a decoding algorithm mature in the art, which is not described in detail in this embodiment.
Step 231: Obtain the sound data in the video.
Step 240: Obtain the image data in the video.
Step 250: Perform image recognition according to the image data to obtain image features.
In an embodiment, the image recognition operation may be performed by calling an image recognition interface; exemplarily, the image recognition interface is Baidu's face recognition system, which can effectively recognize the image features in the currently playing video.
Step 260: Match the image features against the image features in an image database.
Here, the image database is constructed in advance according to the currently playing video, and stores the image features of the sound-emitting sources existing in the currently playing video.
Step 270: Confirm whether matching data is obtained. If matching data is obtained, execute step 280; if no matching data is obtained, discard the current sampling data and perform the next sampling.
Here, the essence of confirming whether matching data is obtained is to determine whether data matching the image features exists in the image database. If data matching the image features exists in the image database, execute step 280 of performing sound recognition according to the sound data to obtain sound features; if no data matching the image features exists in the image database, discard the current sampling data and perform the next sampling.
Step 280: Perform sound recognition according to the sound data to obtain sound features.
Step 290: Match the sound features against the sound features in a sound database.
Here, the sound database is constructed in advance according to the currently playing video, and stores the feature data of the sounds emitted by the sound-emitting sources in the currently playing video.
Step 2100: Confirm whether matching data is obtained. If matching data is obtained, execute step 2110; if no matching data is obtained, discard the current sampling data and perform the next sampling.
Here, the essence of confirming whether matching data is obtained is to determine whether data matching the sound features exists in the sound database.
In an embodiment, the sound field control information is control information for controlling the speaker at the position of the sound-emitting source on the current display screen to emit sound.
Step 2110: Output the sound field control information according to the position information of the sound-emitting source on the video display screen, so as to control the corresponding sound field to emit sound.
By performing image recognition on the sampled video data, the purpose of determining whether a sound-emitting source with an obvious directional property exists in the video data is achieved; when a sound-emitting source exists, sound feature recognition continues to be performed on the video data; and when there is a sound feature matching the sound-emitting source, the speaker at the position of the sound-emitting source on the display screen is controlled to emit sound, thereby realizing the co-location of sound and image, improving the video playback effect, and bringing the viewer a stronger sense-of-presence experience.
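Tying the FIG. 2 flow together, one possible loop is sketched below; every helper is one of the hypothetical sketches given earlier (or, for output_sound_field_control, a stand-in for step 2110), so this is an illustration of the flow rather than the disclosed implementation:

```python
def playback_loop(player, deps):
    """Run the FIG. 2 flow once per sampled snippet.

    deps bundles the hypothetical helpers: decode_video, image_api,
    sound_api, find_sound_source (image-database lookup), sound_db_match
    (sound-database lookup) and output_sound_field_control (step 2110).
    Sampling pacing is omitted for brevity.
    """
    while player.is_playing():
        sample = player.grab_snippet()                      # steps 210-220
        image_data, sound_data = deps.decode_video(sample)  # steps 230-240
        image_feature = deps.image_api(image_data)          # step 250
        position = deps.find_sound_source(image_feature, deps.image_db)
        if position is None:                                # step 270: no match,
            continue                                        # discard, next sample
        sound_feature = deps.sound_api(sound_data)          # step 280
        if not deps.sound_db_match(sound_feature):          # step 2100: no match,
            continue                                        # discard, next sample
        deps.output_sound_field_control(position)           # step 2110
```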
Embodiment 2
FIG. 3 is a schematic flowchart of a method for realizing co-location of sound and image according to Embodiment 2 of the present disclosure. On the basis of the above embodiments, this embodiment describes the implementation process of sound reproduction for the sound-emitting source. As shown in FIG. 3, the method includes:
Step 310: Decode the currently playing video to obtain image data and sound data corresponding to the currently playing video, respectively.
Step 320: Call an image recognition interface to perform image recognition based on the image data to obtain image features corresponding to the image data, and call a sound recognition interface to perform sound recognition based on the sound data to obtain sound features corresponding to the sound data.
Step 330: Determine, based on the image features, whether a sound-emitting source exists in the currently playing video. If a sound-emitting source exists in the currently playing video, proceed to step 340; if no sound-emitting source exists in the currently playing video, return to step 310.
Step 340: Obtain, from a preset image feature database based on the image features, position information of the sound-emitting source of the currently playing video on a video display screen.
Step 350: Determine, based on the sound features, whether an audio source matching the sound-emitting source exists in the currently playing video. If an audio source matching the sound-emitting source exists, proceed to step 360; if not, return to step 310.
Step 360: Generate a control signal according to the position information of the sound-emitting source on the video display screen.
In an embodiment, the ability of the display screen of the electronic device configured to play the video to emit sound by partition is the basic premise for implementing the method for realizing co-location of sound and image provided by the embodiments of the present disclosure; only when sound-emitting components are installed at the corresponding positions of the display screen can a sound effect with a sense of presence be realized. However, since an audio source has a region-size attribute, absolute co-location of image and sound is impossible. If a virtual sound algorithm were used to make the sound appear to come from the display screen, video image recognition and sound-field virtualization would have to be applied in real time, which would occupy considerable Central Processing Unit (CPU) resources. In order to save system resources while still reflecting the effect of co-locating sound and image, in this embodiment the display screen is divided in advance into a specific number of partitions, an independent sound field is virtualized for each partition, and the independent sound field of each partition is realized by configuring an independent speaker for that partition. In an embodiment, the video display screen includes a liquid crystal display screen of a preset size; exemplarily, the preset size may be 65 inches.
FIG. 4 is a schematic diagram of the partitions of a display screen. As shown in FIG. 4, in order to save system resources while reflecting the co-location effect of sound and image, a display screen of 65 inches and above is divided into 6 partitions of equal area, corresponding respectively to 6 virtual sound fields: sound field 1, sound field 2, sound field 3, sound field 4, sound field 5, and sound field 6. Each virtual sound field is realized by an independent speaker, and the corresponding 6 speakers are installed at six positions of the display screen: upper left, middle left, lower left, upper right, middle right, and lower right. The two speakers of sound field 1 and sound field 2 are driven by a first power amplifier to restore the sound of those two sound fields; the two speakers of sound field 3 and sound field 4 are driven by a second power amplifier to restore the sound of those two sound fields; and the two speakers corresponding to sound field 5 and sound field 6 are driven by a third power amplifier to restore the sound of those two sound fields.
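For illustration, the FIG. 4 layout can be captured in a small table; the identifiers are invented for this sketch, while the position/amplifier assignments follow the description above:

```python
# sound field -> (speaker position, driving power amplifier), per FIG. 4
SOUND_FIELDS = {
    1: ("upper left",   "first power amplifier"),
    2: ("middle left",  "first power amplifier"),
    3: ("lower left",   "second power amplifier"),
    4: ("upper right",  "second power amplifier"),
    5: ("middle right", "third power amplifier"),
    6: ("lower right",  "third power amplifier"),
}

def amplifier_for_field(field):
    """Return which power amplifier drives the speaker of a sound field."""
    return SOUND_FIELDS[field][1]

# Example: sound field 3 (lower left) is driven by the second amplifier.
assert amplifier_for_field(3) == "second power amplifier"
```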
音源是指具备声音信息的视频信号,所述音源可通过对视频进行解码获取,从视频中解码出的音源,即声音数据通过解码,可以从声音数据中分离出多个方向的声音数据,声音解码的方式有很多,例如ATMOS解码、数字化影院系统(Digital Theater System,DTS)解码等,但只有采用ATMOS解码才能将双声道的声音解码成8声道的声音。本实施例中,获取解码出来的6个方向的声音,即声场1、声场2、声场3、声场4、声场5和声场6六个方向的声音信号。由于这6个方向的声音信号都是调制在一个IIS信号中,因此,可以将上述第一功率放大器、第二功率放大器和第三功率放大器的驱动功能连接到同一个IIS信号进行解码。对音源进行解码并根据所述控制信号控制与所述位置信息对应的功率放大器工作以驱动扬声器发声的流程示意图可参见图5所示,所述方法包括:
步骤510,获取音源。
在一实施例中,可以通过对视频进行解码获取视频中的音源。
步骤520,通过ATMOS芯片对所述音源解码。
其中,所述ATMOS芯片配置在播放所述视频的电子设备中,所述IIS音频 信号包含有对所述第一功率放大器、第二功率放大器和第三功率放大器进行控制的控制逻辑。
步骤530,得到IIS音频信号。
Step 540: send the sound-field control information into the IIS audio signal.
Here, the sound-field control information is the control information that drives the sound field at the sound source's position on the current display screen; the power amplifier at that position drives the loudspeaker at that position.
The purpose of sending the sound-field control information into the IIS audio signal is to encode it into that signal. The IIS audio signal is a digital signal into which the sound signals of multiple directions in the video are modulated, while the sound-field control information carries the position of the sound field to be triggered. When the IIS audio signal is restored to an analog signal, the sound-field control information determines which direction's sound in the IIS audio signal is restored; the control information can therefore be encoded into the IIS audio signal and restored to an analog signal together with it.
In one embodiment, the sound in the video is restored as follows: the IIS audio signal is decoded to obtain the sound of the six directions corresponding to sound fields 1 through 6, and the decoded sound-field control information operates the power amplifier at the indicated position, which in turn drives the corresponding loudspeaker, restoring the sound of the direction at that position. For example, if the sound-field control information triggers sound field 3, then when the IIS audio signal is restored only the sound signal in the direction of sound field 3 is restored, and no sound signal is delivered to the other sound-field regions.
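A sketch of this selection step, assuming the per-field channels have already been demultiplexed from the IIS audio signal (the data layout and the play/mute hooks are hypothetical):

    def render_fields(channels, control_field, play, mute):
        """Restore only the triggered sound field; silence the rest.

        channels maps a field index (1-6) to its decoded audio samples;
        control_field is the field carried by the sound-field control
        information; play and mute are caller-supplied output hooks.
        """
        for field, samples in channels.items():
            if field == control_field:
                play(field, samples)
            else:
                mute(field)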
Step 370: decode the sound data with ATMOS to obtain an IIS audio signal.
Step 380: according to the IIS audio signal and the control signal, operate the power amplifier corresponding to the position information to drive the corresponding loudspeaker.
For example, if image recognition of the currently playing video locates the sound source in the region of sound field 3 of the current display, the loudspeaker of sound field 3 is driven while the loudspeakers of the other sound fields are turned off. With only the loudspeaker of sound field 3 sounding, the position of the sound essentially coincides with the position of the person producing it, giving the viewer a strong sense of presence.
In one embodiment, controlling the sound reproduction element corresponding to the position information to produce sound includes:
lowering the element's output gain when the amplitude of the sound it emits exceeds a set upper limit, and raising the element's output gain when that amplitude falls below a set lower limit.
When the video contains no sound source with a clear directional attribute, the video's sound is presented by the loudspeakers of all six sound fields together. If the picture then suddenly cuts to a source with a clear directional attribute, only the loudspeaker at the source's position should sound, which would cause an abrupt change in loudness and an unpleasant viewing experience. Dynamic amplitude adjustment is therefore applied: a professional audio-effect algorithm keeps the amplitude within a set range, raising the gain of the loudspeaker at the source's position when the amplitude drops below the lower limit and lowering it when the amplitude exceeds the upper limit, so that the video's volume stays within the set range at all times.
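A minimal sketch of such an amplitude clamp, with the limits and step size as assumed values rather than figures from the disclosure:

    UPPER_LIMIT = 0.9   # assumed normalized amplitude ceiling
    LOWER_LIMIT = 0.3   # assumed normalized amplitude floor
    GAIN_STEP = 0.05    # assumed adjustment step per control tick

    def adjust_gain(measured_amplitude: float, gain: float) -> float:
        """Nudge the speaker gain so the amplitude stays within [LOWER, UPPER]."""
        if measured_amplitude > UPPER_LIMIT:
            gain -= GAIN_STEP
        elif measured_amplitude < LOWER_LIMIT:
            gain += GAIN_STEP
        return max(0.0, min(1.0, gain))  # keep the gain in a sane range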
In the method for co-locating sound and image provided by this embodiment, the display screen of the electronic device playing the video is divided in advance into a specific number of partitions, each equipped with its own loudspeaker so that each partition has an independent virtual sound field. This saves system resources while still delivering the playback effect of co-located sound and image.
Embodiment 3
FIG. 6 is a schematic structural diagram of an apparatus for co-locating sound and image provided by Embodiment 3 of the present disclosure. Referring to FIG. 6, the apparatus includes a recognition module 610, an acquisition module 620, and a control module 630.
The recognition module 610 is configured to perform image recognition and sound recognition on the currently playing video to obtain the image features and sound features corresponding to that video. The acquisition module 620 is configured to, when it is determined from the image features that the currently playing video contains a sound source, obtain, from a preset image feature database and based on the image features, the position information of the sound source of the currently playing video on the video display screen. The control module 630 is configured to, when it is determined from the sound features that the currently playing video contains an audio source matching the sound source, generate a control signal according to the position information of the sound source on the video display screen, so as to control, according to the control signal, the sound reproduction element corresponding to the position information to produce sound. The preset image feature database is constructed in advance from the currently playing video.
In one embodiment, the recognition module 610 is configured to decode the currently playing video to obtain, separately, its corresponding image data and sound data; to call an image recognition interface on the image data to obtain the corresponding image features; and to call a sound recognition interface on the sound data to obtain the corresponding sound features.
In one embodiment, the sound reproduction elements are arranged independently according to partitions into which the video display screen is divided in advance;
the number of partitions is set according to the size of the display screen.
In one embodiment, the control module 630 is configured to decode the sound data with ATMOS to obtain an IIS audio signal, and, according to the IIS audio signal and the control signal, to operate the power amplifier corresponding to the position information to drive the loudspeaker corresponding to the position information.
In one embodiment, the acquisition module 620 includes a sound source determination sub-module 640 configured to determine from the image features that the currently playing video contains a sound source.
In one embodiment, the sound source determination sub-module 640 includes:
a matching unit configured to match the image features against the image features in the preset image feature database for similarity; and
a determination unit configured to determine that the currently playing video contains a sound source when the matched similarity reaches a set threshold;
where the image features in the preset image feature database include human body shape features and/or animal shape features.
In one embodiment, the control module 630 includes an audio source determination sub-module 650 configured to determine from the sound features that the currently playing video contains an audio source matching the sound source.
In one embodiment, the audio source determination sub-module 650 is configured to compare the sound features with the model features of pre-established model sound sources; if a model feature consistent with the sound features exists and the model sound source corresponding to that model feature is the same as the sound source present in the currently playing video, it determines that the currently playing video contains an audio source matching the sound source.
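A sketch of this two-condition check (the record shapes and the consistency predicate are assumptions):

    def audio_source_matches(sound_features, models, visual_source_id,
                             features_consistent):
        """Return True if some model's features match the extracted sound
        features AND that model belongs to the visually recognized source.

        models maps a model source id to its stored model features;
        features_consistent is a caller-supplied comparison predicate.
        """
        for model_source_id, model_features in models.items():
            if (features_consistent(sound_features, model_features)
                    and model_source_id == visual_source_id):
                return True
        return False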
In one embodiment, the control module 630 is further configured to lower the output gain of the sound reproduction element when the amplitude of the sound it emits exceeds the set upper limit, and to raise the output gain when that amplitude falls below the set lower limit.
In the apparatus for co-locating sound and image provided by this embodiment, image recognition and sound recognition identify the sound source in the currently playing video and its corresponding sound features. When the currently playing video contains a sound feature with a clear directional attribute, the position of the sound source on the video display screen is obtained and the sound reproduction element at that position is driven accordingly. Sound and image are thereby co-located, the video's sound appears to come from the corresponding source, and the viewer's sense of presence and immersion improves.
The above product can execute the method provided by any embodiment of the present disclosure and has the functional modules and effects corresponding to that method. For technical details not exhaustively described in this embodiment, refer to the method provided by any embodiment of the present disclosure.
The technical solution of the present disclosure presents sound and image at the same position, so that the viewer perceives the position of the video's sound as essentially coinciding with the position of the object producing it, improving playback quality and user experience.
Embodiment 4
FIG. 7 is a schematic structural diagram of an electronic device provided by Embodiment 4 of the present disclosure. As shown in FIG. 7, the electronic device includes a processor 770, a memory 771, and a computer program stored in the memory 771 and runnable on the processor 770. There may be one or more processors 770; FIG. 7 takes one processor 770 as an example. When the processor 770 executes the computer program, it implements the method for co-locating sound and image described in the above embodiments. As shown in FIG. 7, the electronic device may further include an input device 772 and an output device 773. The processor 770, memory 771, input device 772, and output device 773 may be connected by a bus or otherwise; FIG. 7 takes a bus connection as an example.
As a computer-readable storage medium, the memory 771 can store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the method for co-locating sound and image in the embodiments of the present disclosure (for example, the recognition module 610, acquisition module 620, and control module 630 of the apparatus). By running the software programs, instructions, and modules stored in the memory 771, the processor 770 executes the electronic device's various functional applications and data processing, i.e., implements the method for co-locating sound and image described above.
The memory 771 may include a program storage area and a data storage area: the program storage area can store the operating system and the application required by at least one function, and the data storage area can store data created through use of the terminal. In addition, the memory 771 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. In some examples, the memory 771 may include memory arranged remotely from the processor 770, and such remote memory may be connected to the electronic device/storage medium over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 772 can be configured to receive input numeric or character information and to generate key-signal input related to the user settings and function control of the electronic device. The output device 773 may include display devices such as a display screen.
Embodiment 5
Embodiment 5 of the present disclosure further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, perform a method for co-locating sound and image, the method including:
performing image recognition and sound recognition on a currently playing video to obtain image features and sound features corresponding to the currently playing video;
when it is determined from the image features that the currently playing video contains a sound source, obtaining, from a preset image feature database and based on the image features, position information of the sound source of the currently playing video on a video display screen;
when it is determined from the sound features that the currently playing video contains an audio source matching the sound source, generating a control signal according to the position information of the sound source on the video display screen, so as to control, according to the control signal, a sound reproduction element corresponding to the position information to produce sound;
where the preset image feature database is constructed in advance from the currently playing video.
Of course, the computer-executable instructions of the storage medium provided by the embodiments of the present disclosure are not limited to the method operations described above; they can also execute related operations of the method for co-locating sound and image provided by any embodiment of the present disclosure.
From the above description of the embodiments, those skilled in the art will clearly understand that the present disclosure can be implemented with software and general-purpose hardware, or with hardware alone. On this understanding, the technical solution of the present disclosure, in essence or in the part contributing over the related art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (ROM), random access memory (RAM), flash memory (FLASH), hard disk, or optical disc, and includes multiple instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the method described in one or more embodiments of the present disclosure.

Claims (15)

  1. A method for co-locating sound and image, comprising:
    performing image recognition and sound recognition on a currently playing video to obtain image features and sound features corresponding to the currently playing video;
    when it is determined from the image features that the currently playing video contains a sound source, obtaining, from a preset image feature database and based on the image features, position information of the sound source on a video display screen;
    when it is determined from the sound features that the currently playing video contains an audio source matching the sound source, generating a control signal according to the position information of the sound source on the video display screen, so as to control, according to the control signal, a sound reproduction element corresponding to the position information to produce sound;
    wherein the preset image feature database is constructed in advance from the currently playing video.
  2. The method of claim 1, wherein performing image recognition and sound recognition on the currently playing video to obtain the image features and sound features corresponding to the currently playing video comprises:
    decoding the currently playing video to obtain, separately, image data and sound data corresponding to the currently playing video; and
    calling an image recognition interface on the image data to obtain image features corresponding to the image data, and calling a sound recognition interface on the sound data to obtain sound features corresponding to the sound data.
  3. The method of claim 2, wherein the sound reproduction elements are arranged independently according to partitions into which the video display screen is divided in advance;
    wherein the number of partitions is set according to the size of the display screen.
  4. The method of claim 1 or 3, wherein the sound reproduction element comprises a loudspeaker.
  5. The method of claim 4, wherein generating the control signal according to the position information of the sound source on the video display screen, so as to control, according to the control signal, the sound reproduction element corresponding to the position information to produce sound, comprises:
    decoding the sound data by Dolby Atmos (ATMOS) to obtain an integrated audio interface (IIS) audio signal; and
    operating, according to the IIS audio signal and the control signal, a power amplifier corresponding to the position information to drive a loudspeaker corresponding to the position information.
  6. The method of claim 1, wherein determining from the image features that the currently playing video contains a sound source comprises:
    matching the image features against image features in the preset image feature database for similarity; and
    determining that the currently playing video contains a sound source when the matched similarity reaches a set threshold;
    wherein the image features in the preset image feature database comprise at least one of: human body shape features and animal shape features.
  7. The method of claim 1, wherein determining from the sound features that the currently playing video contains an audio source matching the sound source comprises:
    comparing the sound features with model features of pre-established model sound sources; and
    if a model feature consistent with the sound features exists and the model sound source corresponding to the model feature is the same as the sound source present in the currently playing video, determining that the currently playing video contains an audio source matching the sound source.
  8. The method of claim 1, wherein controlling the sound reproduction element corresponding to the position information to produce sound comprises:
    lowering an output gain of the sound reproduction element when an amplitude of the sound emitted by the sound reproduction element exceeds a set upper limit; and
    raising the output gain of the sound reproduction element when the amplitude of the sound emitted by the sound reproduction element falls below a set lower limit.
  9. The method of any one of claims 1-8, wherein the currently playing video is obtained by sampling the video being played at a preset number of sampling times per unit time.
  10. The method of any one of claims 1-8, wherein the video display screen comprises a liquid crystal display of a preset size.
  11. An apparatus for co-locating sound and image, comprising:
    a recognition module configured to perform image recognition and sound recognition on a currently playing video to obtain image features and sound features corresponding to the currently playing video;
    an acquisition module configured to, when it is determined from the image features that the currently playing video contains a sound source, obtain, from a preset image feature database and based on the image features, position information of the sound source of the currently playing video on a video display screen; and
    a control module configured to, when it is determined from the sound features that the currently playing video contains an audio source matching the sound source, generate a control signal according to the position information of the sound source on the video display screen, so as to control, according to the control signal, a sound reproduction element corresponding to the position information to produce sound;
    wherein the preset image feature database is constructed in advance from the currently playing video.
  12. The apparatus of claim 11, wherein the acquisition module comprises a sound source determination sub-module configured to:
    match the image features against image features in the preset image feature database for similarity; and
    determine that the currently playing video contains a sound source when the matched similarity reaches a set threshold;
    wherein the image features in the preset image feature database comprise at least one of: human body shape features and animal shape features.
  13. The apparatus of claim 11, wherein the control module comprises an audio source determination sub-module configured to:
    compare the sound features with model features of pre-established model sound sources; and
    if a model feature consistent with the sound features exists and the model sound source corresponding to the model feature is the same as the sound source present in the currently playing video, determine that the currently playing video contains an audio source matching the sound source.
  14. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the computer program, implements the method of any one of claims 1-10.
  15. A storage medium containing computer-executable instructions which, when executed by a computer processor, implement the method of any one of claims 1-10.
PCT/CN2018/120528 2018-09-07 2018-12-12 Method, apparatus, device, and storage medium for co-locating sound and image WO2020048034A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811043120.4A CN109194999B (zh) 2018-09-07 2018-09-07 Method, apparatus, device, and medium for co-locating sound and image
CN201811043120.4 2018-09-07

Publications (1)

Publication Number Publication Date
WO2020048034A1 (zh)

Family

ID=64915471

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/120528 WO2020048034A1 (zh) 2018-09-07 2018-12-12 Method, apparatus, device, and storage medium for co-locating sound and image

Country Status (2)

Country Link
CN (1) CN109194999B (zh)
WO (1) WO2020048034A1 (zh)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109862293B (zh) * 2019-03-25 2021-01-12 深圳创维-Rgb电子有限公司 终端喇叭的控制方法、设备及计算机可读存储介质
US10922047B2 (en) 2019-03-25 2021-02-16 Shenzhen Skyworth-Rgb Electronic Co., Ltd. Method and device for controlling a terminal speaker and computer readable storage medium
CN110460863A (zh) * 2019-07-15 2019-11-15 北京字节跳动网络技术有限公司 基于显示位置的音视频处理方法、装置、介质和电子设备
CN111417064B (zh) * 2019-12-04 2021-08-10 南京智芯胜电子科技有限公司 一种基于ai识别的音画随行控制方法
CN113724628A (zh) * 2020-05-25 2021-11-30 苏州佳世达电通有限公司 影音系统
CN113810837B (zh) * 2020-06-16 2023-06-06 京东方科技集团股份有限公司 一种显示装置的同步发声控制方法及相关设备
CN116158091A (zh) * 2020-06-29 2023-05-23 海信视像科技股份有限公司 显示设备及屏幕发声方法
CN111836083B (zh) * 2020-06-29 2022-07-08 海信视像科技股份有限公司 显示设备及屏幕发声方法
CN112135226B (zh) * 2020-08-11 2022-06-10 广东声音科技有限公司 Y轴音频再生方法以及y轴音频再生系统
CN115442549B (zh) * 2021-06-01 2024-09-17 Oppo广东移动通信有限公司 电子设备的发声方法及电子设备
CN116266874A (zh) * 2021-12-17 2023-06-20 华为技术有限公司 视频播放中协同播放音频的方法及通信系统
CN117014785A (zh) * 2022-04-27 2023-11-07 华为技术有限公司 一种音频播放方法及相关装置
CN114827686A (zh) * 2022-05-09 2022-07-29 维沃移动通信有限公司 录制数据处理方法、装置及电子设备
WO2023230886A1 (zh) * 2022-05-31 2023-12-07 京东方科技集团股份有限公司 音频控制方法、控制装置、驱动电路以及可读存储介质
CN115002401B (zh) * 2022-08-03 2023-02-10 广州迈聆信息科技有限公司 一种信息处理方法、电子设备、会议系统及介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6829018B2 (en) * 2001-09-17 2004-12-07 Koninklijke Philips Electronics N.V. Three-dimensional sound creation assisted by visual information
CN101370107A (zh) * 2005-10-17 2009-02-18 索尼株式会社 图像显示装置、方法和程序
CN104036789A (zh) * 2014-01-03 2014-09-10 北京智谷睿拓技术服务有限公司 多媒体处理方法及多媒体装置
CN104270552A (zh) * 2014-08-29 2015-01-07 华为技术有限公司 一种声像播放方法及装置
CN105979470A (zh) * 2016-05-30 2016-09-28 北京奇艺世纪科技有限公司 全景视频的音频处理方法、装置和播放系统

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101459797B (zh) * 2007-12-14 2012-02-01 深圳Tcl新技术有限公司 Sound localization method and system
CN102480671B (zh) * 2010-11-26 2014-10-08 华为终端有限公司 Audio processing method and apparatus in video communication
US8311973B1 * 2011-09-24 2012-11-13 Zadeh Lotfi A Methods and systems for applications for Z-numbers
CN103413511A (zh) * 2013-07-17 2013-11-27 安伟建 Voice guide system
US9282399B2 * 2014-02-26 2016-03-08 Qualcomm Incorporated Listen to people you recognize
CN106346491A (zh) * 2016-10-25 2017-01-25 塔米智能科技(北京)有限公司 Intelligent member-service robot system based on face information
CN107705796A (zh) * 2017-09-19 2018-02-16 深圳市金立通信设备有限公司 Audio data processing method, terminal, and computer-readable medium
CN108419141B (zh) * 2018-02-01 2020-12-22 广州视源电子科技股份有限公司 Method and apparatus for adjusting subtitle position, storage medium, and electronic device

Also Published As

Publication number Publication date
CN109194999A (zh) 2019-01-11
CN109194999B (zh) 2021-07-09


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
  Ref document number: 18932471
  Country of ref document: EP
  Kind code of ref document: A1
NENP Non-entry into the national phase
  Ref country code: DE
122 Ep: pct application non-entry in european phase
  Ref document number: 18932471
  Country of ref document: EP
  Kind code of ref document: A1