WO2023193803A1 - Volume control method and apparatus, storage medium and electronic device - Google Patents

Volume control method and apparatus, storage medium and electronic device

Info

Publication number
WO2023193803A1
WO2023193803A1 (PCT/CN2023/087019)
Authority
WO
WIPO (PCT)
Prior art keywords
area
target person
image frame
virtual microphone
mouth
Prior art date
Application number
PCT/CN2023/087019
Other languages
English (en)
French (fr)
Inventor
朱长宝
Original Assignee
南京地平线机器人技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 南京地平线机器人技术有限公司 filed Critical 南京地平线机器人技术有限公司
Publication of WO2023193803A1

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/165Management of the audio stream, e.g. setting of volume, audio stream path

Definitions

  • the present disclosure relates to artificial intelligence technology, and in particular, to a volume control method, device, storage medium and electronic device.
  • singing systems are no longer limited to using a traditional physical microphone for singing; a virtual microphone formed by a gesture or by holding another object can also be used.
  • in a traditional singing system that uses a physical microphone, the playback volume of the vocals can usually be adjusted through the sound pickup device or volume adjustment device on the microphone.
  • Embodiments of the present disclosure provide a volume control method, device, storage medium and electronic device.
  • a first aspect of the present disclosure provides a volume control method, including: acquiring a sequence of image frames of a spatial area that include the persons in the spatial area; determining, based on each image frame in the image frame sequence, the virtual microphone area and the target person in each image frame; determining, based on each image frame, the mouth area of the target person in each image frame; determining, based on the mouth area of the target person and the virtual microphone area in each image frame, the distance between the target person's mouth area and the virtual microphone area; acquiring the voice signal in the spatial area and determining the vocal audio of the target person based on the voice signal; and adjusting the playback volume of the target person's vocal audio according to the distance between the target person's mouth area and the virtual microphone area.
  • a second aspect of the present disclosure provides a volume control system, including: a voice collection device located in a spatial area, an image collection device, an audio playback device, and a controller, wherein the audio playback device is used to play audio under the control of the controller, and the controller is used to execute the method proposed in the embodiment of the first aspect of the present disclosure.
  • a third aspect of the present disclosure provides a volume control device, including: a first acquisition module for acquiring a sequence of image frames of a spatial area that include the persons in the spatial area; a first determination module for determining, based on each image frame in the image frame sequence, the virtual microphone area and the target person in each image frame; a second determination module for determining, based on each image frame, the mouth area of the target person in each image frame; a third determination module for determining, based on the mouth area of the target person and the virtual microphone area in each image frame, the distance between the target person's mouth area and the virtual microphone area; a second acquisition module for acquiring the voice signal in the spatial area and determining the vocal audio of the target person based on the voice signal; and a volume adjustment module for adjusting the playback volume of the target person's vocal audio according to the distance between the target person's mouth area and the virtual microphone area.
  • a fourth aspect of the disclosure provides a computer-readable storage medium, the storage medium stores a computer program, and the computer program is used to execute the method proposed in the embodiment of the first aspect of the disclosure.
  • a fifth aspect of the present disclosure provides an electronic device, the electronic device comprising: a processor; a memory for storing instructions executable by the processor; and the processor being used to read the executable instructions from the memory and execute them to implement the method proposed in the embodiment of the first aspect of the present disclosure.
  • in the embodiments of the present disclosure, the change in the distance between the target person's mouth and the virtual microphone is detected, and the playback volume of the target person's vocal audio is adjusted in time according to the detected change, which achieves simple and fast adjustment of the vocal playback volume and thereby improves the user's singing experience.
  • Figure 1 is a scene diagram to which this disclosure is applicable
  • Figure 2 is a schematic flowchart of a volume control method provided by an exemplary embodiment of the present disclosure
  • FIG. 3 is a schematic flowchart of step S202 provided by an exemplary embodiment of the present disclosure.
  • Figure 4 is a schematic flowchart of step S203 provided by an exemplary embodiment of the present disclosure.
  • Figure 5 is a schematic diagram of facial key points in an image frame provided by an exemplary embodiment of the present disclosure.
  • Figure 6 is a schematic flowchart of step S204 provided by an exemplary embodiment of the present disclosure.
  • Figure 7 is a schematic flowchart of step S205 provided by an exemplary embodiment of the present disclosure.
  • Figure 8 is a schematic structural diagram of a volume control system provided by an exemplary embodiment of the present disclosure.
  • Figure 9 is a schematic structural diagram of a volume control device provided by an exemplary embodiment of the present disclosure.
  • FIG. 10 is a structural diagram of an electronic device provided by an exemplary embodiment of the present disclosure.
  • in implementing the present disclosure, the inventor found that, since the virtual microphone in a singing system that uses a virtual microphone has no sound collection device or volume adjustment device, the singing system cannot adjust the vocal playback volume through the virtual microphone, resulting in a poor user experience.
  • a singing system using a virtual microphone may include a voice collection device, an image collection device, an audio playback device and a controller. The voice collection device, image collection device and audio playback device are communicatively connected to the controller.
  • the image acquisition device can be a monocular camera, a binocular camera or a TOF (Time of Flight) camera, etc.
  • the voice acquisition device can be a microphone or a microphone array, etc.
  • the audio playback device can be a loudspeaker or speaker equipment, etc.
  • the controller can be a computing platform or a server, etc.
  • the present disclosure can obtain an image frame sequence in a spatial area through an image acquisition device.
  • the speech signal in the spatial area is obtained through the speech collection device.
  • the collected voice signal and image frame sequence are sent to the controller.
  • the controller processes the image frame sequence and voice signal to obtain the distance between the target person's mouth area and the virtual microphone.
  • the playback volume of the target person's vocal audio is obtained based on that distance, and the audio playback device is controlled to play the target person's vocal audio at the playback volume.
  • FIG. 2 is a schematic flowchart of a volume control method provided by an exemplary embodiment of the present disclosure. This embodiment can be applied to electronic devices, as shown in Figure 2, including the following steps:
  • Step S201 Obtain a sequence of image frames in the spatial area including people in the spatial area.
  • the space area can be a space for singing, for example, the space area can be an interior space of a vehicle, an interior space of a mini KTV private room, etc.
  • the video in the spatial area can be collected through an image acquisition device provided in the spatial area, the image frames that include the people in the spatial area are then identified through image recognition technology, and those image frames are arranged in chronological order to obtain the image frame sequence.
  • when identifying people in the spatial area, if a specific part of a human body (for example, the face, head or torso) is recognized in an image frame, it can be determined that the image frame includes a person in the spatial area.
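  • as an illustrative sketch only (the detector, device index and function names below are assumptions, not part of the disclosure), the frame-sequence acquisition described above might look as follows in Python, with contains_person standing in for the person-recognition step:

```python
import cv2  # OpenCV video capture


def collect_image_frame_sequence(camera_index, contains_person, max_frames=100):
    """Read video from the image acquisition device and keep, in chronological
    order, only the frames in which a person in the spatial area is detected.
    `contains_person` is a placeholder for the image-recognition step
    (e.g. a face/head/torso detector)."""
    cap = cv2.VideoCapture(camera_index)
    frames = []
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if contains_person(frame):
            frames.append(frame)
    cap.release()
    return frames
```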
  • Step S202 Based on each image frame in the image frame sequence, determine the virtual microphone area and target person in each image frame.
  • image recognition technology is used to identify each image frame in the image frame sequence, and the target person and virtual microphone area in each image frame are determined.
  • the virtual microphone can be a preset gesture or a handheld object (for example, a water bottle or a mobile phone, etc.).
  • the target person is the person singing in the space area.
  • when an image frame includes multiple people, each person needs to be identified to determine whether he or she is the target person.
  • it should be noted that the recognition of each image frame in step S201 may be rough image recognition, whose purpose is only to determine whether the image frame contains a person in the spatial area.
  • the image recognition method used in step S202 has higher accuracy than that used in step S201, because it needs to determine the target person and the virtual microphone area in each image frame so that subsequent steps can perform further processing based on them.
  • Step S203 Based on each image frame in the image frame sequence, determine the mouth area of the target person in each image frame.
  • after the target person in each image frame is identified through step S202, image recognition technology is used on the target person in each image frame to determine the mouth area of the target person in each image frame.
  • the target person in each image frame can be identified through a neural network trained to identify the mouth area to obtain the mouth area of the target person in each image frame.
  • the neural network can be, for example, Faster R-CNN (Faster Region-based Convolutional Neural Network) or YOLO (You Only Look Once).
  • alternatively, the facial key points of the target person in each image frame can be determined through a neural network trained for face recognition, the mouth key points of the target person in each image frame are then determined from those facial key points, and the mouth area of the target person in each image frame is determined from the mouth key points.
  • as another example, the target person in each image frame can be identified through a trained neural network for face recognition to obtain the facial image of the target person in each image frame, and whether the mouth area in that facial image is occluded can then be detected.
  • when occlusion is detected, a preset position of the facial image, for example its lower part, can be determined as the mouth area, thereby obtaining the mouth area of the target person; when there is no occlusion, the mouth area of the target person can be determined through a neural network or the like.
  • Step S204 Based on the mouth area of the target person and the virtual microphone area in each image frame, determine the distance between the mouth area of the target person and the virtual microphone area.
  • based on the virtual microphone area determined in step S202 and the mouth area of the target person determined in step S203, the distance between the mouth area of the target person and the virtual microphone area in each image frame is calculated.
  • the distance between the target person's mouth area and the virtual microphone area is then determined from these per-frame distances.
  • for example, a first preset point of the mouth area and a second preset point of the virtual microphone area in each image frame can be obtained first, such as taking the center point of the lower-lip area of the mouth area as the first preset point and the top of the virtual microphone area as the second preset point; the distance between the two preset points in each image frame then gives the distance between the target person's mouth area and the virtual microphone area.
  • the distance between the mouth area of the target person and the virtual microphone area can be the Euclidean distance, Manhattan distance, Chebyshev distance, Minkowski distance or Mahalanobis distance, etc.
  • the determined distance between the target person's mouth area and the virtual microphone area may be the per-frame distance in each image frame, or may be a final distance determined from the per-frame distances between the target person's mouth area and the virtual microphone area in each image frame.
  • Step S205 Obtain the voice signal in the spatial area, and determine the vocal audio of the target person based on the voice signal.
  • a voice collection device is provided in the space area.
  • the audio signals in the space area are collected through the voice collection device installed in the space area.
  • the audio signal includes a speech signal and a noise signal, and the speech signal includes the human voice audio of people inside the space area.
  • vocal separation can be performed on the audio signals collected by the voice collection device through technologies such as audio noise reduction to obtain the speech signal.
  • the target person in each image frame is determined according to step S202 and the position of the target person in the spatial area is determined; based on that position, the sound zone corresponding to each vocal audio in the speech signal is determined through sound-zone localization technology and a correspondence between vocal audio and sound zones is established; the sound zone corresponding to the target person is determined from the target person's position and the positions of the sound zones; and the target person's vocal audio is determined and extracted according to the target person's sound zone and the correspondence between vocal audio and sound zones.
  • Step S206 Adjust the playback volume of the target person's vocal audio according to the distance between the target person's mouth area and the virtual microphone area.
  • according to the correspondence between the playback volume and the distance between the target person's mouth area and the virtual microphone area, the playback volume of the vocal audio of the target person in the spatial area is determined, and the audio playback device is controlled to play the target person's vocal audio at the determined playback volume.
  • for example, the correspondence between the playback volume and the distance between the target person's mouth area and the virtual microphone area can be preset so that each distance corresponds to a playback volume; for instance, distances of 5 cm, 10 cm and 15 cm may correspond to playback volumes of 20 dB (decibels), 15 dB and 10 dB, respectively.
  • a formula that maps each distance to a playback volume can also be set, and the playback volume corresponding to a distance is calculated from the formula.
  • the above example is only used to illustrate this embodiment; in actual application, the correspondence can be set according to actual needs.
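  • as an illustration only, the table-style correspondence above can be implemented as a simple lookup with linear interpolation between breakpoints; the breakpoints and the function name distance_to_volume_db below are assumptions chosen to mirror the 5 cm / 10 cm / 15 cm example and are not part of the disclosure:

```python
import bisect

# Preset correspondence between distance (m) and playback volume (dB),
# mirroring the 5 cm -> 20 dB, 10 cm -> 15 dB, 15 cm -> 10 dB example.
DISTANCES_M = [0.05, 0.10, 0.15]
VOLUMES_DB = [20.0, 15.0, 10.0]


def distance_to_volume_db(distance_m):
    """Map the mouth-to-virtual-microphone distance to a playback volume (dB)."""
    if distance_m <= DISTANCES_M[0]:
        return VOLUMES_DB[0]
    if distance_m >= DISTANCES_M[-1]:
        return VOLUMES_DB[-1]
    # Linear interpolation between the two surrounding breakpoints.
    i = bisect.bisect_left(DISTANCES_M, distance_m)
    d0, d1 = DISTANCES_M[i - 1], DISTANCES_M[i]
    v0, v1 = VOLUMES_DB[i - 1], VOLUMES_DB[i]
    return v0 + (v1 - v0) * (distance_m - d0) / (d1 - d0)


print(distance_to_volume_db(0.12))  # 13.0 dB for a 12 cm distance
```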
  • for example, one playback volume can be determined from the whole image frame sequence, and the target person's vocal audio is played at that volume; in this case the final distance between the target person's mouth area and the virtual microphone area is determined from the per-frame distances, and the playback volume is determined from that final distance.
  • alternatively, the playback volume corresponding to each image frame can be determined from the distance between the target person's mouth area and the virtual microphone area in that image frame, and the target person's vocal audio is played at the per-frame playback volume.
  • in the embodiments of the present disclosure, the playback volume of the target person's vocal audio is determined from the distance between the target person's mouth area and the virtual microphone area in the image frame sequence, and the vocal audio is played at that volume, thereby achieving simple and fast control of the vocal playback volume through the virtual microphone and improving the user's singing experience.
  • in one embodiment of the present disclosure, as shown in Figure 3, step S202 may include the following steps:
  • Step S2021 identify each image frame in the image frame sequence, and determine the image area of the handheld virtual microphone in each image frame.
  • Step S2022 Determine the virtual microphone area in each image frame based on the image area of the handheld virtual microphone in each image frame, and determine the person holding the virtual microphone in each image frame as the target person in each image frame.
  • when the virtual microphone is a handheld object, the image area of the handheld virtual microphone in each image frame can be identified through a neural network trained for region-of-interest recognition, the image area of the handheld virtual microphone in each image frame is extracted, and that image area is then recognized to obtain the virtual microphone area in each image frame.
  • when the virtual microphone is a preset gesture, the hand area with the preset gesture in each image frame can be identified through the neural network, and that hand area is then determined as the virtual microphone area of each image frame.
  • the neural network can be a convolutional neural network (CNN) or a Faster R-CNN.
  • when there is a single person in the spatial area, if the image area of a handheld virtual microphone is recognized in an image frame, the person in that image frame is determined as the target person of that frame.
  • when there are multiple people in the spatial area, a correspondence between each person in each image frame and his or her hand area is established; the hand area holding the virtual microphone (or, for a gesture-type virtual microphone, the hand area with the preset gesture) is determined from the image area in each image frame, and the person corresponding to that hand area is determined as the target person of each image frame.
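  • one simple way to realize the correspondence between each person and the handheld virtual-microphone area, sketched below under the assumption that both are available as rectangular detection boxes (the overlap criterion and the function names are illustrative assumptions, not the method of the disclosure), is to assign the virtual microphone to the person whose box overlaps it most:

```python
def box_overlap_area(a, b):
    """Overlap area of two boxes given as (x1, y1, x2, y2)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)


def pick_target_person(person_boxes, mic_box):
    """Return the index of the person whose detection box overlaps the
    handheld virtual-microphone area the most, or None if nobody overlaps."""
    if not person_boxes:
        return None
    overlaps = [box_overlap_area(p, mic_box) for p in person_boxes]
    best = max(range(len(person_boxes)), key=lambda i: overlaps[i])
    return best if overlaps[best] > 0.0 else None
```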
  • in one embodiment of the present disclosure, as shown in Figure 4, step S203 may include the following steps:
  • Step S2031 Obtain the mouth key points of the target person in each image frame.
  • the facial key points of the target person in each image frame can be determined through a trained neural network for identifying facial key points.
  • the neural network can be a convolutional neural network, a fast regional convolutional neural network, or YOLO, etc.
  • the facial key points include mouth key points, eye key points, nose key points and facial contour key points, and in each image frame the mouth key points can be determined from the facial key points.
  • Figure 5 shows a schematic diagram of the facial key points of the target person in one image frame. As shown in Figure 5, there are 68 key points on the face, each corresponding to a serial number; the mouth key points of the target person in the image frame are obtained from the correspondence between serial numbers and facial positions.
  • in Figure 5, the key points with serial numbers 49 to 68 are the mouth key points.
  • the key points of the mouth of the target person in each image frame can also be determined directly through the neural network.
  • Step S2032 Determine the mouth area of the target person in each image frame based on the key points of the target person's mouth in each image frame.
  • Each mouth key point has position information, and the position information may be the coordinate value of the mouth key point.
  • the mouth area of the target person in each image frame can be determined based on the position information of the mouth key points in that image frame; for example, a bounding detection box that encloses the mouth key points can be formed from their position information, and the area of that bounding box is determined as the mouth area.
  • the bounding detection frame can be a rectangle or other shapes.
  • in the embodiments of the present disclosure, the mouth key points of the target person in each image frame are determined first, and the mouth area of the target person is then determined from the mouth key points, which provides an implementation for determining the target person's mouth area quickly and accurately.
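  • as shown in the sketch below, determining the mouth area from the mouth key points can be as simple as taking a bounding box over them; the 68-point layout with mouth key points at serial numbers 49 to 68 comes from Figure 5, while the small margin and the function name are illustrative assumptions:

```python
import numpy as np

# In the 68-point facial landmark layout of Figure 5, the mouth key points
# are serial numbers 49-68, i.e. indices 48-67 when counting from zero.
MOUTH_IDX = slice(48, 68)


def mouth_area_from_landmarks(landmarks_68, margin=2):
    """Return (x1, y1, x2, y2), a rectangular bounding box enclosing the
    mouth key points of the target person in one image frame."""
    pts = np.asarray(landmarks_68, dtype=float)[MOUTH_IDX]  # shape (20, 2)
    x1, y1 = pts.min(axis=0) - margin
    x2, y2 = pts.max(axis=0) + margin
    return x1, y1, x2, y2
```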
  • in one embodiment of the present disclosure, as shown in Figure 6, step S204 may include the following steps:
  • Step S2041 Determine the first preset identification point of the mouth area of the target person in each image frame.
  • in each image frame, a neural network or the like can be used to obtain the mouth key points of the target person's mouth area, and any one of those mouth key points can be used as the first preset identification point of the target person's mouth area in that image frame.
  • for example, the upper-lip center position, lower-lip center position, mouth-corner position, center position, upper-lip top position or lower-lip top position of the target person's mouth area in each image frame can be used as the first preset identification point.
  • the first preset identification point of the mouth area in each image frame is the same point.
  • the point corresponding to the center position of the upper lip of the target person's mouth area in each image frame is used as the first preset identification point of the target person's mouth area in each image frame.
  • Step S2042 Determine the second preset identification point of the virtual microphone area in each image frame.
  • any position of the virtual microphone area in each image frame can be used as the second preset identification point of the virtual microphone area.
  • the vertex position, the center position, the center position of the upper region, the center position of the lower region, etc. of the virtual microphone area in each image frame can be determined as the second preset point.
  • the second preset identification point of the virtual microphone area in each image frame is the same point.
  • the point corresponding to the center position of the upper area of the virtual microphone area in each image frame is determined as the second preset identification point of the virtual microphone area in each image frame.
  • Step S2043 Determine the distance between the target person's mouth area and the virtual microphone area based on the first preset identification point and the second preset identification point in each image frame.
  • in each image frame, the distance between the first preset identification point and the second preset identification point is determined according to their coordinate values.
  • for example, one playback volume can be determined from the image frame sequence and the target person's vocal audio played at that volume; in this case, the average of the distances between the first and second preset identification points over the image frames can be determined as the final distance between the target person's mouth area and the virtual microphone area, or the per-frame distances between the first and second preset identification points can be input into a trained neural network for determining distance to obtain the final distance; the playback volume of the target person's vocal audio is then determined from that final distance.
  • alternatively, the distance between the target person's mouth area and the virtual microphone area in each image frame can be determined from the first and second preset identification points of that image frame, the playback volume for each image frame is determined from the per-frame distance, and the target person's vocal audio is played at the per-frame playback volume.
  • in one embodiment of the present disclosure, step S2041 includes: for the mouth area in each image frame, determining, based on the mouth area or the mouth key points of the target person, the center point of the target person's mouth area as the first preset identification point of the target person's mouth area.
  • in each image frame, the center point of the target person's mouth area can be determined from the coordinate values of the vertices of the detection box that bounds the mouth area, from the external contour data of the mouth area, or from the coordinate values of the mouth key points within the mouth area; the center point of the mouth area is then determined as the first preset identification point of the target person's mouth area.
  • step S2042 includes: for the virtual microphone area in each image frame, based on the virtual microphone area, determine the center point of the virtual microphone area as the second preset identification point of the virtual microphone area.
  • in each image frame, the center point of the virtual microphone area can be determined from the coordinate values of the vertices of the bounding detection box of the virtual microphone area, or from the external contour data of the virtual microphone area; the center point of the virtual microphone area is then determined as the second preset identification point of the virtual microphone area.
  • as an example, in one image frame, the coordinate value of the first preset identification point of the target person's mouth area and the coordinate value of the second preset identification point of the virtual microphone area are obtained, and the distance between the target person's mouth area and the virtual microphone area in that image frame is calculated according to formula (1), where (x1, y1, z1) is the coordinate value of the first preset identification point, (x2, y2, z2) is the coordinate value of the second preset identification point, and d is the distance between the mouth area and the virtual microphone area.
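  • formula (1) itself is not reproduced in this text; assuming it takes the Euclidean form (one of the distance types listed above), the per-frame distance and the averaged final distance could be computed as in the following sketch (the function names are illustrative assumptions):

```python
import math


def preset_point_distance(p1, p2):
    """Euclidean distance between the first preset identification point
    p1 = (x1, y1, z1) and the second preset identification point p2 = (x2, y2, z2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))


def final_distance(per_frame_points):
    """Average the per-frame distances over the image frame sequence to obtain
    the final distance between the mouth area and the virtual microphone area."""
    dists = [preset_point_distance(p1, p2) for p1, p2 in per_frame_points]
    return sum(dists) / len(dists)
```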
  • in one embodiment of the present disclosure, as shown in Figure 7, step S205 may include the following steps:
  • Step S2051 Perform speech separation based on the speech signal to obtain the vocal audio information of the people in the spatial area.
  • the human voice audio information of the person in the space area includes: the human voice audio and the corresponding sound zone of the human voice audio.
  • acoustic noise reduction is performed on the audio signals collected by the audio collection device to obtain the speech signals of the people in the spatial area, and the sound zone of each person's vocal audio in the spatial area is determined based on sound source localization technology.
  • the acoustic noise reduction of the audio signal can include: first obtaining a reference signal and performing acoustic feedback processing on the audio signal based on the reference signal to eliminate howling in the audio signal, where the acoustic feedback processing can be performed through a howling suppression algorithm and the reference signal is the playback signal of the audio playback device used to play the vocal audio; then performing noise reduction processing on the acoustic-feedback-processed audio signal to eliminate the noise and obtain a clean speech signal of the people in the spatial area.
  • the speech signal includes the vocal audio of all people in the spatial area; spectral subtraction and the OMLSA (Optimally-Modified Log-Spectral Amplitude) algorithm can be used to perform the noise reduction processing on the audio signal.
  • the sound zone corresponding to each person's vocal audio can be determined through sound source positioning technology, and the corresponding relationship between the vocal audio and the sound zone can be established.
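  • a minimal sketch of spectral-subtraction style noise reduction is given below, assuming that the first half second of the collected signal contains only noise; the howling suppression and OMLSA processing mentioned above are not shown, and the parameter choices are illustrative assumptions:

```python
import numpy as np
from scipy.signal import stft, istft


def spectral_subtraction(audio, sr, noise_seconds=0.5, floor=0.05):
    """Basic magnitude spectral subtraction: estimate the noise spectrum from
    the first `noise_seconds` of the signal, subtract it from every frame,
    and reconstruct the signal with the noisy phase."""
    f, t, spec = stft(audio, fs=sr, nperseg=512)          # hop = 256 samples
    mag, phase = np.abs(spec), np.angle(spec)
    noise_frames = max(1, int(noise_seconds * sr / 256))
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    clean_mag = np.maximum(mag - noise_mag, floor * mag)   # spectral floor
    _, clean = istft(clean_mag * np.exp(1j * phase), fs=sr, nperseg=512)
    return clean
```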
  • Step S2052 Based on the target person in each image frame, determine the location of the target person in each image frame.
  • the target person in each image frame is determined according to step S202, and the position of the target person in each image frame is obtained.
  • the area image of the target person in each image frame can be extracted, and the area image of the target person in each image frame is input into a trained neural network to obtain the position of the target person in each image frame.
  • Step S2053 Determine the target person's vocal audio based on the target person's location and vocal audio information in each image frame.
  • the final position of the target person can be determined through the position of the target person in each image frame.
  • the positions of the target person in each image frame can be summed and averaged to obtain the final position of the target person in the spatial area.
  • the position of the target person in each image frame can also be input into a neural network to obtain the final position of the target person in the spatial area.
  • according to the final position of the target person and the positions of the sound zones in the vocal audio information, the sound zone corresponding to the target person is determined; according to the correspondence between vocal audio and sound zones, the vocal audio corresponding to that sound zone is extracted, which is the target person's vocal audio.
  • in the embodiments of the present disclosure, the vocal audio of the people in the spatial area and the sound zone corresponding to each vocal audio are obtained from the speech signal, and the vocal audio of the target person in the spatial area is then determined based on the position of the target person in the spatial area and on the vocal audio and corresponding sound zones of the people in the spatial area, which achieves fast and accurate determination of the target person's vocal audio.
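  • under the assumption that the per-frame target-person positions and the sound-zone positions are available as coordinate tuples, the averaging and nearest-zone matching described above could be sketched as follows (the function name and data layout are illustrative assumptions):

```python
def nearest_sound_zone(target_positions, zone_positions):
    """Average the per-frame positions of the target person to get the final
    position, then return the index of the closest sound zone; the vocal audio
    mapped to that zone is the target person's vocal audio."""
    n = len(target_positions)
    final = tuple(sum(c) / n for c in zip(*target_positions))

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(zone_positions)), key=lambda k: dist2(final, zone_positions[k]))
```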
  • in one embodiment of the present disclosure, step S206 includes: based on the preset correspondence between distance and playback volume, adjusting the playback volume of the target person's vocal audio according to the distance between the target person's mouth area and the virtual microphone area.
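  • the description elsewhere in this publication gives one example of such a correspondence: a reference distance of 5 cm at which the volume is left unadjusted, with v = 20·log10(0.05/d), where v is the playback-volume adjustment in dB and d is the mouth-to-virtual-microphone distance in meters; a minimal sketch of that mapping (the function name is an assumption) is:

```python
import math

REFERENCE_DISTANCE_M = 0.05  # at 5 cm the playback volume is not adjusted


def playback_volume_adjustment_db(distance_m):
    """v = 20 * log10(0.05 / d): positive (louder) when the mouth is closer
    than 5 cm to the virtual microphone, negative when it is farther away."""
    return 20.0 * math.log10(REFERENCE_DISTANCE_M / distance_m)
```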
  • the method further includes: mixing the target person's vocal audio and the accompaniment audio, and playing them at the playback volume through an audio playback device in the spatial area.
  • the vocal audio of the target person in the spatial area is mixed with the accompaniment audio to obtain the mixed accompaniment-vocal audio, and the target person's vocal audio within the mix is played through the audio playback device in the spatial area at the playback volume of the target person's vocal audio.
  • the accompaniment audio in the mix can be played through the audio playback device in the spatial area at a preset playback volume, or can be played after being adjusted to follow the playback volume of the vocal audio.
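  • a minimal sketch of the mixing step is shown below, assuming floating-point audio buffers in [-1, 1] and a vocal gain in dB derived from the playback volume; keeping the accompaniment at its preset level is one of the two options described above, and the function name is an assumption:

```python
import numpy as np


def mix_vocal_and_accompaniment(vocal, accompaniment, vocal_gain_db):
    """Scale the target person's vocal audio by the playback volume (in dB),
    add the accompaniment audio, and clip to the valid range [-1, 1]."""
    gain = 10.0 ** (vocal_gain_db / 20.0)
    n = min(len(vocal), len(accompaniment))
    mixed = gain * np.asarray(vocal[:n]) + np.asarray(accompaniment[:n])
    return np.clip(mixed, -1.0, 1.0)
```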
  • Any volume control method provided by the embodiments of the present disclosure can be executed by any appropriate device with data processing capabilities, including but not limited to: terminal devices and servers.
  • any of the volume control methods provided by the embodiments of the present disclosure can be executed by the processor.
  • the processor executes any of the volume control methods mentioned in the embodiments of the present disclosure by calling corresponding instructions stored in the memory. No further details will be given below.
  • Figure 8 is a structural block diagram of a volume control system in an embodiment of the present disclosure. As shown in Figure 8, it includes: a voice collection device, an image collection device, an audio playback device, and a controller located in the spatial area.
  • the audio playback device is used to play audio under the control of the controller, and the controller is used to execute the volume control method described above.
  • in one embodiment of the present disclosure, the image acquisition device is used to collect the image frame sequence in the spatial area, the audio acquisition device is used to collect the voice signal in the spatial area, and the controller is used to process the image frame sequence and the voice signal to obtain the playback volume of the vocal audio of the target person in the spatial area and to control the audio playback device to play the target person's vocal audio at that playback volume.
  • Figure 9 is a structural block diagram of a volume control device in an embodiment of the present disclosure.
  • the volume control device includes: a first acquisition module 100 , a first determination module 101 , a second determination module 102 , a third determination module 103 , a second acquisition module 104 , and a volume adjustment module 105 .
  • the first acquisition module 100 is used to acquire a sequence of image frames in a spatial area including persons in the spatial area;
  • the first determination module 101 is configured to determine the virtual microphone area and target person in each image frame based on each image frame in the image frame sequence;
  • the second determination module 102 is configured to determine the mouth area of the target person in each image frame based on each image frame;
  • the third determination module 103 is configured to determine the distance between the mouth area of the target person and the virtual microphone area based on the mouth area of the target person and the virtual microphone area in each image frame;
  • the second acquisition module 104 is used to acquire the voice signal in the spatial area, and determine the vocal audio of the target person based on the voice signal;
  • the volume adjustment module 105 is used to adjust the playback volume of the target person's vocal audio according to the distance between the mouth area of the target person and the virtual microphone area.
  • the first determination module 101 includes:
  • the first determination sub-module is used to identify each image frame in the image frame sequence and determine the image area of the handheld virtual microphone in each image frame;
  • the second determination sub-module is used to determine the virtual microphone area in each image frame based on the image area of the handheld virtual microphone in each image frame, and to determine the person holding the virtual microphone in each image frame as the target person of each image frame.
  • the second determination module 102 includes:
  • the third determination sub-module is used to obtain the mouth key points of the target person in each image frame
  • the fourth determination sub-module is used to determine the mouth area of the target person in each image frame according to the key points of the target person's mouth in each image frame.
  • the third determination module 103 includes:
  • the fourth determination sub-module is used to determine the first preset identification point of the mouth area of the target person in each image frame
  • the fifth determination sub-module is used to determine the second preset identification point of the virtual microphone area in each image frame
  • the sixth determination sub-module is used to determine the distance between the target person's mouth area and the virtual microphone area based on the first preset identification point and the second preset identification point in each image frame.
  • the fourth determination sub-module is further configured to, for the mouth area in each image frame, determine the center point of the target person's mouth area as the first preset identification point of the target person's mouth area, based on the mouth area or the mouth key points of the target person;
  • the fifth determination sub-module is further configured to, for the virtual microphone area in each image frame, determine the center point of the virtual microphone area as the second preset identification point of the virtual microphone area, based on the virtual microphone area.
  • the second acquisition module 104 includes:
  • the first acquisition sub-module is used to perform speech separation based on the speech signal and obtain the vocal audio information of the people in the spatial area, where the vocal audio information of a person includes the person's vocal audio and the sound zone corresponding to the vocal audio;
  • the sixth determination sub-module is used to determine the location of the target person in the spatial area based on the target person in each image frame;
  • a seventh determination sub-module is used to determine the vocal audio of the target person based on the location of the target person in the spatial area and the vocal audio information.
  • the volume adjustment module 105 is also used to adjust the playback volume of the target person's vocal audio according to the distance between the target person's mouth area and the virtual microphone area, based on the preset correspondence between distance and playback volume.
  • the volume control device further includes:
  • a mixing module used to mix the target person's vocal audio and accompaniment audio, and play them at the playback volume through an audio playback device in the space area.
  • the electronic device 10 includes one or more processors 11 and memories 12 .
  • the processor 11 may be a central processing unit (CPU) or other form of processing unit with data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device to perform desired functions.
  • Memory 12 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory.
  • the volatile memory may include, for example, random access memory (RAM) and/or cache memory (cache).
  • the non-volatile memory may include, for example, read-only memory (ROM), hard disk, flash memory, etc.
  • one or more computer program instructions may be stored on the computer-readable storage medium, and the processor 11 may execute the program instructions to implement the volume control methods of the various embodiments of the present disclosure described above and/or other desired functionality.
  • Various contents such as input signals, signal components, noise components, etc. may also be stored in the computer-readable storage medium.
  • the electronic device 10 may further include an input device 13 and an output device 14, and these components are interconnected through a bus system and/or other forms of connection mechanisms (not shown).
  • the input device 13 may be the above-mentioned microphone or microphone array, used to capture the input signal of the sound source.
  • the input device 13 may also include, for example, a keyboard, a mouse, and the like.
  • the output device 14 can output various information to the outside, including determined distance information, direction information, etc.
  • the output device 14 may include, for example, a display, a speaker, a printer, a communication network and remote output devices connected thereto, and the like.
  • the electronic device may include any other suitable components depending on the specific application.
  • embodiments of the present disclosure may also be a computer program product, which includes computer program instructions that, when executed by a processor, cause the processor to perform the steps of the volume control method according to the various embodiments of the present disclosure described in the "Exemplary Method" section above in this specification.
  • the computer program product may be written with program code for performing the operations of embodiments of the present disclosure in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server execute on.
  • embodiments of the present disclosure may also be a computer-readable storage medium having computer program instructions stored thereon, which, when executed by a processor, cause the processor to perform the steps of the volume control method according to the various embodiments of the present disclosure described in the "Exemplary Method" section above in this specification.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Reverberation, Karaoke And Other Acoustics (AREA)

Abstract

Disclosed are a volume control method and apparatus, a storage medium and an electronic device. The method includes: acquiring an image frame sequence of a spatial area that includes the persons in the spatial area, and a voice signal in the spatial area; determining the virtual microphone area and the mouth area of the target person in each image frame; determining the distance between the mouth area of the target person in the spatial area and the virtual microphone area; determining the vocal audio of the target person; and adjusting the playback volume of the target person's vocal audio according to the distance between the mouth area and the virtual microphone area. In the embodiments of the present disclosure, the change in the distance between the target person's mouth area and the virtual microphone area is detected, and the playback volume of the target person's vocal audio is adjusted according to the detected change, achieving simple and fast adjustment of the vocal playback volume.

Description

音量控制方法、装置、存储介质和电子设备
本公开要求在2022年4月8日提交的、申请号为202210368353.1、发明名称为“音量控制方法、装置、存储介质和电子设备”的中国专利申请的优先权,其全部内容通过引用结合在本公开中。
技术领域
本公开涉及人工智能技术,尤其涉及一种音量控制方法、装置、存储介质和电子设备。
背景技术
随着技术的不断发展,唱歌系统已不在局限于采用传统的实体麦克风进行唱歌,也可以采用手势或者手握其他物体所形成的虚拟麦克风进行唱歌。传统采用实体麦克风的唱歌系统中通常可以通过麦克风中的收音装置或音量调节装置调节人声的播放的音量。
发明内容
现有采用虚拟麦克风的唱歌系统中使用的是虚拟麦克风,其并无收音装置或音量调节装置,因此无法通过虚拟麦克风调节人声播放音量,导致用户体验感差。
为了解决上述技术问题,提出了本公开。本公开的实施例提供了一种音量控制方法、装置、存储介质和电子设备。
本公开的第一个方面,提供了一种音量控制方法,包括:获取空间区域内的包括空间区域内的人员的图像帧序列;基于所述图像帧序列中的各图像帧,确定所述各图像帧中的虚拟麦克风区域和目标人员;基于所述各图像帧,确定所述各图像帧中的目标人员的嘴部区域;基于所述各图像帧中的目标人员的嘴部区域和虚拟麦克风区域,确定所述目标人员的嘴部区域与虚拟麦克风区域之间的距离;获取空间区域内的语音信号,基于所述语音信号确定所述目标人员的人声音频;根据所述目标人员的嘴部区域与虚拟麦克风区域之间的距离,调整所述目标人员的所述人声音频的播放音量。
本公开的第二个方面,提供了一种音量控制系统,包括:位于空间区域内的语音采集装置,图像采集装置,音频播放装置,控制器,其中,所述音频播放装置用于在控制器控制下播放音频,所述控制器用于执行本公开第一方面实施例提出的方法。
本公开的第三个方面,提供了一种音量控制装置,包括:第一获取模块,用于获取空间区域内的包括空间区域内的人员的图像帧序列;第一确定模块,用于基于所述图像帧序列中的各图像帧,确定所述各图 像帧中的虚拟麦克风区域和目标人员;第二确定模块,用于基于所述各图像帧,确定所述各图像帧中的目标人员的嘴部区域;第三确定模块,用于基于所述各图像帧中的目标人员的嘴部区域和虚拟麦克风区域,确定所述目标人员的嘴部区域与虚拟麦克风区域之间的距离;第二获取模块,用于获取空间区域内的语音信号,基于所述语音信号确定所述目标人员的人声音频;音量调整模块,用于根据所述目标人员的嘴部区域与虚拟麦克风区域之间的距离,调整所述目标人员的人声音频的播放音量。
本公开的第四个方面,提供了一种计算机可读存储介质,所述存储介质存储有计算机程序,所述计算机程序用于执行本公开第一方面实施例提出的方法。
本公开的第五个方面,提供了一种电子设备,所述电子设备包括:处理器;用于存储所述处理器可执行指令的存储器;所述处理器,用于从所述存储器中读取所述可执行指令,并执行所述可执行指令以实现本公开第一方面实施例提出的方法。
本公开实施例中通过检测目标人员的嘴部和虚拟麦克风之间的距离变化,并根据检测到的距离变化及时调整目标人员的人声音频的播放音量,实现了人声播放音量的简单、快捷的调整,进而提高了用户的歌唱体验效果。
附图说明
图1是本公开所适用的场景图;
图2是本公开一示例性实施例提供的音量控制方法的流程示意图;
图3是本公开一示例性实施例提供的步骤S202的流程示意图;
图4是本公开一示例性实施例提供的步骤S203的流程示意图;
图5是本公开一示例性实施例提供的一图像帧中脸部关键点的示意图;
图6是本公开一示例性实施例提供的步骤S204的流程示意图;
图7是本公开一示例性实施例提供的步骤S205的流程示意图;
图8是本公开一示例性实施例提供的音量控制系统的结构程示意图;
图9是本公开一示例性实施例提供的音量控制装置的结构示意图;
图10是本公开一示例性实施例提供的电子设备的结构图。
具体实施方式
下面,将参考附图详细地描述根据本公开的示例实施例。显然,所描述的实施例仅仅是本公开的一部分实施例,而不是本公开的全部实施例,应理解,本公开不受这里描述的示例实施例的限制。
应注意到:除非另外具体说明,否则在这些实施例中阐述的部件和步骤的相对布置、数字表达式和数值不限制本公开的范围。
申请概述
在实现本公开的过程中,发明人发现,由于使用了虚拟麦克风的唱歌系统中的虚拟麦克风并无收音装置或音量调节装置,导致唱歌系统无法通过虚拟麦克风调节人声播放音量,导致用户体验感差。
示例性系统
本公开的技术方案可以应用于辅助使用虚拟麦克风的唱歌系统进行人声音量的调节。例如,使用虚拟麦克风的唱歌系统的场景可以为车辆内部、mini KTV等。图1示出了本公开的一个应用场景。如图1所示,使用虚拟麦克风的唱歌系统可以包括语音采集装置、图像采集装置、音频播放装置和控制器。语音采集装置、图像采集装置和音频播放装置与控制器通讯连接。图像采集装置可以为单目眼摄像头、双目摄像头或TOF(Time of Flight)摄像头等,语音采集装置可以是麦克风或是麦克风阵列等,音频播放装置可以扬声器或音箱设备等,控制器可以为计算平台或服务器等。
本公开可以通过图像采集装置获得空间区域中的图像帧序列。通过语音采集装置获得空间区域中的语音信号。将采集的语音信号和图像帧序列发送给控制器,控制器对图像帧序列和语音信号进行处理,得到目标人员的嘴部区域与虚拟麦克风之间的距离,通过嘴部区域与虚拟麦克风区域之间的距离得到目标人员的人声音频的播放音量,并控制音频播放装置以播放音量播放目标人员的人声音频。
本公开中通过根据目标人员的嘴部区域和虚拟麦克风之间的距离变化,并根据距离变化调整目标人员的人声音频的播放音量,实现了对人声播放音量的简单、快捷调整,进而提高了用户的歌唱体验效果。
示例性方法
图2本公开一示例性实施例提供的一种音量控制方法的流程示意图。本实施例可应用在电子设备上,如图2所示,包括如下步骤:
步骤S201,获取空间区域内的包括空间区域内的人员的图像帧序列。
其中,空间区域可以为进行唱歌的空间,例如空间区域可以为车辆内部空间、mini KTV包房内部空间等。
示例性的,可以通过空间区域中设置的图像采集装置采集空间区域中的视频,然后通过图像识别技术识别出包括有空间区域内的人员的图像帧,然后将包括有空间区域内的人员的图像帧按照时间顺序排列, 得到图像帧序列。其中,在对空间区域内的人员进行识别时,可以当识别出图像帧中的人体的特定部位(例如,脸部、头部或是躯干部等)时,确定该图像帧中包括有空间区域内的人员。
步骤S202,基于图像帧序列中的各图像帧,确定各图像帧中的虚拟麦克风区域和目标人员。
其中,利用图像识别技术对图像帧序列中的各图像帧进行识别,确定各图像帧中的目标人员和虚拟麦克风区域。虚拟麦克风可以为预设定的手势或者手持的物体(例如,水瓶或者手机等)。目标人员为空间区域内正在唱歌的人员。当图像帧中包括多个人员时,需要对每一人员进行识别,确定其是否为目标人员。
需要说明是的,步骤S201中对各图像帧的识别可以是粗略的图像识别,目的是在可以确定出图像帧是否有空间区域内的人员。步骤S202中采用的图像识别方式,相较于步骤S201中采用的图像识别方式而言,图像识别精度更高,需要确定出图像帧中的目标人员和虚拟麦克风区域,以便后续步骤基于各图像帧中的目标人员和虚拟麦克风区域进行进一步的后续处理。
步骤S203,基于图像帧序列中的各图像帧,确定各图像帧中的目标人员的嘴部区域。
其中,通过步骤S202识别出各图像帧中的目标人员,利用图像识别技术对各图像帧中的目标人员进行识别,确定各图像帧中的目标人员的嘴部区域。一个示例性的,可以通过训练好的用于识别嘴部区域的神经网络对各图像帧中的目标人员进行识别,以得到各图像帧中的目标人员的嘴部区域,该神经网络可以为快速区域卷积神经网络(Faster Region Convolutional Neural Networks,Faster-RCNN),YOLO(You Only Look Once)等。另一个示例性的,可以通过训练好的用于识别脸部的神经网络确定各图像帧中的目标人员的脸部关键点,根据各图像帧中目标人员的脸部关键点确定各图像帧中脸部关键点,基于各图像帧中的目标人员的脸部关键点确定各图像像帧中目标人员的嘴部关键点,根据各图像帧中的目标人员的嘴部关键点,确定各图像帧中的目标人员的嘴部区域。再一个示例性的,可以通过训练好的用于识别脸部的神经网络对各图像帧中的目标人员识别,得到各图像帧中的目标人员的脸部图像,检测各图像帧中的目标人员的脸部图像中的嘴部区域是否存在遮挡,当检测到存在遮挡时,可以确定脸部图像的预设位置为嘴部区域,从而得到目标人员的嘴部区域,预设位置可以脸部图像的下部等;当不存在遮挡时,可以通过神经网络等确定目标人员的嘴部区域。
步骤S204,基于各图像帧中的目标人员的嘴部区域和虚拟麦克风区域,确定目标人员的嘴部区域与虚拟麦克风区域之间的距离。
其中,基于步骤S202确定的各图像帧中的虚拟麦克风区域和步骤S203确定的各图像帧中的目标人员的嘴部区域,计算各图像帧中的目标人员的嘴部区域与虚拟麦克风区域之间的距离。通过各图像帧中的目标人员的嘴部区域与虚拟麦克风区域之间的距离,确定出目标人员的嘴部区域和虚拟麦克风区域之间的距离。示例性的,可以先获取各图像帧中的嘴部区域的第一预设点和虚拟麦克风区域的第二预设点,如将各 图像帧中嘴部区域中的下唇区域的中心点作为第一预设点,将各图像帧中虚拟麦克风区域中的顶部作为第二预设点,通过计算各图像帧中第一预设点和第二预设点之间的距离,得到目标人员的嘴部区域和虚拟麦克风区域之间的距离。其中,目标人员的嘴部区域和虚拟麦克风区域之间的距离可以为欧氏距离、曼哈顿距离、切比雪夫距离、闵可夫斯基距离或马氏距离等。其中,确定出的目标人员的嘴部区域与虚拟麦克风区域之间的距离可以为各图像帧中的目标人员的嘴部区域与虚拟麦克风区域之间的距离,也可以是根据各图像帧中的目标人员的嘴部区域与虚拟麦克风区域之间的距离确定的目标人员的嘴部区域与虚拟麦克风区域之间的最终距离。以上示例用于对本实施例进行说明,在实际应用时,可以根据实际需求设定。
步骤S205,获取空间区域内的语音信号,基于语音信号确定目标人员的人声音频。
其中,空间区域中设置有语音采集装置。通过空间区域中设置的语音采集装置采集空间区域内的音频信号。音频信号包括语音信号和噪声信号,语音信号包括空间区域内部的人员的人声音频。
可以通过音频降噪等技术对语音采集装置采集的音频信号进行人声分离,得到语音信号。根据步骤S202确定各图像帧中的目标人员,确定空间区域内的目标人员的位置,根据空间区域内的目标人员的位置,通过音区定位技术确定语音信号中每一人声音频对应的音区,建立人声音频与音区的对应关系,根据目标人员的位置和音区的位置,确定目标人员对应的音区,根据目标人员对应的音区以及人声音频与音区的对应关系,确定目标人员的人声音频,并提取该人声音频。
步骤S206,根据目标人员的嘴部区域与虚拟麦克风区域之间的距离,调整目标人员的人声音频的播放音量。
其中,根据空间区域内的目标人员的嘴部区域与虚拟麦克风区域之间的距离与播放音量之间的对应关系,确定出空间区域内的目标人员的人声音频的播放音量,并控制音频播放装置以确定的播放音量播放目标人员的人声音频。示例性的,目标人员的嘴部区域与虚拟麦克风区域之间的距离与播放音量之间的对应关系可以预先设置为每一距离对应一播放音量,如距离为5cm、10cm、15cm分别对应的播放音量为20dB(分贝)、15dB、10dB等。还可以设置一个距离与播放音量之间一一对应的公式,根据该公式计算每一距离对应的播放音量。以上示例用于对本实施例进行说明,在实际应用时,可以根据实际需求设定。
示例性的,可以通过一个图像帧序列确定一个播放音量,将目标人员的人声音频以该播放音量播放,此时可以根据各图像帧中的目标人员的嘴部区域与虚拟麦克风区域之间的距离,确定目标人员的嘴部区域与虚拟麦克风区域之间的最终距离,根据目标人员的嘴部区域与虚拟麦克风区域之间的最终距离确定标人员的人声音频的播放音量,以该播放音量播放目标人员的人声音频。还可以根据图像帧序列中的各图像帧的目标人员的嘴部区域与虚拟麦克风区域之间的距离,确定各图像帧的目标人员的嘴部区域与虚拟麦克风区域之间的距离对应的目标人员的播放音量,以各图像帧中的目标人员的播放音量播放目标人员的人声音 频。
本公开实施例中通过对图像帧序列中的目标人员的嘴部区域和虚拟麦克风区域的距离,确定目标人员的人声音频的播放音量,并调整人声音频以该播放音量播放,从而实现了通过虚拟麦克风对人声播放音量简单、快捷的控制,进而提高了用户的歌唱体验效果。
在本公开的一个实施例中,如图3所示,步骤202可包括如下步骤:
步骤S2021,对图像帧序列中的各图像帧进行识别,确定各图像帧中的手持虚拟麦克风的图像区域。
步骤S2022,基于各图像帧中的手持虚拟麦克风的图像区域,确定各图像帧中的虚拟麦克风区域,将各图像帧中持有虚拟麦克风的人员确定为各图像帧中的目标人员。
其中,当虚拟麦克风为手持的物体时,可以通过训练好的用于感兴趣区域识别的神经网络识别出各图像帧中的手持虚拟麦克风的图像区域,提取各图像帧中的手持虚拟麦克风的图像区域,之后对各图像帧中的手持虚拟麦克风的图像区域识别,得到各图像帧中的虚拟麦克风区域。当虚拟麦克风为预设定的手势时,可以通过神经网络识别各图像帧中具有预设定的手势的手部区域,然后将各图像帧中具有预设定的手势的手部区域确定为各图像帧的虚拟麦克风区域。神经网络可以卷积神经网络(Convolutional Neural Networks,CNN)或快速区域卷积神经网络。在空间区域中的人员为单个时,当识别出任一图像帧中具有手持虚拟麦克风的图像区域时,将该图像帧中的人员确定为该图像帧的目标人员。
当空间区域中的人员为多个时,建立各图像帧中的人员与其手部区域的对应关系。当虚拟麦克风为手持的物体时,根据各图像帧中的手持虚拟麦克风的图像区域,确定各图像帧中的持有虚拟麦克风的手部区域,基于各图像帧中的人员与其手部区域的对应关系,得到各图像帧中持有虚拟麦克风的人员,将各图像帧中持有虚拟麦克风的人员确定为各图像帧的目标人员;当虚拟麦克风为预设定的手势时,基于各图像帧中人员与其手部区域的对应关系,将各图像帧中对应预设定的手势的手部区域的人员确定为各图像帧中的持有虚拟麦克风的人员,将各图像帧中持有虚拟麦克风的人员确定为各图像帧的目标人员。
本公开实施例中,通过对各图像帧中的手持虚拟麦克风的图像区域进行识别,通过各图像帧中的手持虚拟麦克风的图像区域确定目标人员和虚拟麦克风区域,实现了准确的对各图像帧中目标人员和虚拟麦克风区域的识别。
在本公开一个实施例中,如图4所示,步骤203可包括如下步骤:
步骤S2031,获取各图像帧中的目标人员的嘴部关键点。
其中,可以通过训练好的用于识别脸部关键点的神经网络确定各图像帧中的目标人员的脸部关键点。该神经网络可以为卷积神经网络、快速区域卷积神经网络或YOLO等。脸部关键点包括有嘴部关键点、眼部关键点、鼻部关键点和脸部轮廓关键点,在每一图像帧中可以根据脸部关键点确定出嘴部关键点,图5 示出了一个图像帧中的目标人员的脸部关键点的示意图,如图5所示,脸部共有68个关键点,每一关键点对应一个序号,根据序号与脸部位置的对应关系,得到该图像帧中的目标人员的嘴部关键点,在图5中,序号49~68的关键点为嘴部关键点。也可以直接通过神经网络确定各图像帧中的目标人员的嘴部关键点。
步骤S2032,根据各图像帧中的目标人员的嘴部关键点,确定各图像帧中的目标人员的嘴部区域。
其中,每一嘴部关键点具有位置信息,该位置信息可以是嘴部关键点的坐标值。可以根据各图像帧中嘴部关键点的位置信息确定各图像帧中的目标人员的嘴部区域。示例性的,可以根据嘴部关键点的位置信息,在嘴部关键点外部形成外接检测框,该外接检测框中包括嘴部关键点。将该外接检测的区域确定为嘴部区域。该外接检测框可以为矩形也可以为其他形状。
本公开实施例中,先确定各图像帧中的目标人员的嘴部关键点,然后通过嘴部关键点确定目标人员的嘴部区域,为实现快速准确确定目标人员嘴部关键点提供了一种实现方式。
在本公开一个实施例中,如图6所示,步骤204可包括如下步骤:
步骤S2041,确定各图像帧中的目标人员的嘴部区域的第一预设标识点。
其中,在每一图像帧中,可以利用神经网络等获取各图像帧中的目标人员的嘴部区域的嘴部关键点,将目标人员的嘴部关键点中的任意一个嘴部关键点作为该图像帧中的目标人员的嘴部区域的第一预设标识点。示例性的,可以将各图像帧中的目标人员的嘴部区域中的上唇中心位置、下唇中心位置、嘴角位置、中心位置、上唇顶部位置或下唇顶部位置等作为第一预设标识点。需要注意的是,各图像帧中的嘴部区域的第一预设标识点为相同的点。例如,各图像帧中的目标人员的嘴部区域的上唇中心位置对应的点作为各图像帧中的目标人员的嘴部区域的第一预设标识点。
步骤S2042,确定各图像帧中的虚拟麦克风区域的第二预设标识点。
其中,在每一图像帧中可以将虚拟麦克风区域的任意一个位置作为该虚拟麦克风区域的第二预设标识点。示例性的,可以确定各图像帧中的虚拟麦克风区域的顶点位置、中心位置、上部区域的中心位置或下部区域的中心位置等作为第二预设点。需要注意的是,各图像帧中的虚拟麦克风区域的第二预设标识点为相同的点。例如,确定各图像帧中的虚拟麦克风区域的上部区域的中心位置对应的点作为各图像帧中的虚拟麦克风区域的第二预设标识点。
步骤S2043,根据各图像帧中的第一预设标识点与第二预设标识点,确定目标人员的嘴部区域与虚拟麦克风区域之间的距离。
其中,在每一图像帧中,根据第一预设标识点的坐标值和第二预设标识点的坐标值,确定该图像帧中第一预设标识点和第二预设标识点之间的距离。示例性的,可以通过一个图像帧序列确定一个播放音量,将目标人员的人声音频以该播放音量播放,此时可以将各图像帧中第一预设标识点和第二预设标识点之间 的距离的平均值确定为目标人员的嘴部区域与虚拟麦克风区域之间的最终距离,也可以将各图像帧中的第一预设标识点和第二预设标识之间的距离输入训练好的用于确定距离的神经网络,得到目标人员的嘴部区域与虚拟麦克风区域之间的最终距离;根据目标人员的嘴部区域与虚拟麦克风区域之间的最终距离,确定出目标人员的人声音频的播放音量。还可以根据各图像帧的目标人员的第一预设标识点与第二预设标识点,确定出各图像帧的目标人员的嘴部区域与虚拟麦克风区域之间的距离,根据各图像帧的目标人员的嘴部区域与虚拟麦克风区域之间的距离,确定各图像帧中的目标人员的的播放音量,以各图像帧中的目标人员的播放音量播放目标人员的人声音频。
在本公开的一个实施例中,步骤S2041包括:针对各图像帧中的嘴部区域,基于目标人员的嘴部区域或嘴部关键点,确定目标人员的嘴部区域的中心点为目标人员的嘴部区域的第一预设标识点。
其中,在每一图像帧中,可以根据目标人员的嘴部区域的外接检测框的顶点的坐标值,确定目标人员的嘴部区域的中心点,还可以通过确定目标人员的嘴部区域的外部轮廓数据,根据目标人员的嘴部区域的外部轮廓数据,确定目标人员的嘴部区域的中心点,也可以通过目标人员的嘴部区域中的嘴部关键点的坐标值确定目标人员的嘴部区域的中心点;再将该嘴部区域的中心点确定为目标人员的嘴部区域的第一预设标识点。
在本公开的一个实施例中,步骤S2042包括:针对各图像帧中的虚拟麦克风的区域,基于虚拟麦克风区域,确定虚拟麦克风区域的中心点为虚拟麦克风区域的第二预设标识点。
其中,在每一图像帧中,可以通过虚拟麦克风区域的外接检测框的顶点的坐标值,确定虚拟麦克风区域的中心点,还可以通过确定虚拟麦克风区域的外部轮廓数据,根据虚拟麦克风区域的外部轮廓数据,确定虚拟麦克风区域的中心点;再将该虚拟麦克风区域的中心点确定为该虚拟麦克风区域的第二预设标识点。
示例性的,在一个图像帧中,获取目标人员的嘴部区域的第一预设标识点的坐标值和虚拟麦克风区域的第二预设点的坐标值。根据公式(1)计算出该图像帧中的目标人员的嘴部区域与虚拟麦克风区域之间的距离。
其中,(x1,y1,z1)为第一预设标识点的坐标值,(x2,y2,z2)为第二预设标识点的坐标值,d为嘴部区域与虚拟麦克风区域之间的距离。
在本公开的一个实施例中,如图7所示,步骤205可包括如下步骤:
步骤S2051,基于语音信号进行语音分离,获取空间区域内的人员的人声音频信息。
其中,空间区域内的人员的人声音频信息包括:人员的人声音频和人声音频对应的音区。通过对音频采集装置采集的音频信号进行声学降噪处理,以得空间区域内的人员的语音信号,同时基于声源定位技术 确定空间区域内的人员的人声音频的音区。
音频信号的声学降噪处理可以包括:先获取参考信号,根据参考信息,对音频信号进行声反馈处理,以消除音频信号中的啸叫,其中可以通过啸叫抑制算法对音频信号进行声反馈处理,参考信号为用于播放人声音频的音频播放装置的播放信号;然后,对经过声反馈处理的音频信号进行降噪处理,以消除语音信号中噪音,得到干净的空间区域内的人员的语音信号,该语音信号包括空间区域内所有人员的人声音频,其中可以采用谱减法和OMLSA(Optimally-modified Log-spectral Amplitude)算法对音频信号进行降噪处理。可以通过音源定位技术确定每个人员的人声音频对应的音区,并建立人声音频与音区的对应关系。
步骤S2052,基于各图像帧中的目标人员,确定各图像帧中的目标人员的位置。
其中,根据步骤S202确定各图像帧中的目标人员,获取各图像帧中的目标人员的位置。示例性的,可以提取各图像帧中的目标人员的区域图像,将各图像帧中的目标人员的区域图像输入训练好的神经网络,得到各图像帧中目标人员的位置。
步骤S2053,基于各图像帧中的目标人员的位置和人声音频信息,确定目标人员的人声音频。
其中,可以通过各图像帧中的目标人员的位置,确定目标人员的最终位置。示例性的,可以对各图像帧中的目标人员的位置进行加和求平均,得到空间区域内的目标人员的最终位置,也可以将各图像帧中的目标人员的位置输入神经网络,得到空间区域内的目标人员的最终位置。根据目标人员的最终位置和人声音频信息中的音区的位置,确定目标人员对应的音区;根据人声音频与音区的对应关系,提取该音区对应的人声音频,即得到目标人员的人声音频。还可以在各图像帧中选择预设定的图像帧作为关键图像帧,根据关键图像帧中的目标人员的位置确定目标人员对应的音区,提取该音区对应的人声音频,即得到目标人员的人声音频。
本公开实施例中,根据语音信号得到空间区域内的人员的人声音频和人声音频对应的音区,之后基于空间区域内的目标人员的位置以及空间区域内的人员的人声音频和人声音频对应的音区,确定空间区域内的目标人员的人声音频。实现了快速准确的对目标人员的人声音频的确定。
在本公开的一个实施例中,步骤S206包括:基于预设的距离与播放音量之间的对应关系,根据目标人员的嘴部区域与虚拟麦克风区域之间的距离,调整目标人员的人声音频的播放音量。
其中,可以预先设置距离与播放音量之间的对应关系,根据距离与播放音量之间的对应关系、以及目标人员的嘴部区域与虚拟麦克风区域之间的距离,确定目标人员的人声音频的播放音量。例如,设定一个基准音量距离,该基准音量距离下播放音量不调整,假设基准音量距离为5cm,则预设的距离与播放音量之间的对应关系为v=20log10(0.05/d),v表示播放音量,其单位可以为dB,d表示嘴部区域与虚拟麦克风区域之间的距离,其单位可以为m(米)。以上仅仅是本实施例的一个例子,在实际使用中,可以调整不同的 参数达到最优的体验。
在本公开的一个实施例中,还包括:将目标人员的人声音频与伴奏音频混合,通过空间区域内的音频播放装置以所述播放音量播放。其中将空间区域内的目标人员的人声音频与伴奏音频混合,得到混合伴奏人声音频,将混合伴奏人声音频中的目标人员的人声音频通过空间区域内的音频播放装置以目标人员的人声音频的播放音量播放。其中,混合伴奏人声中伴奏音频可以通过空间区域内的音频播放装置以预设播放音量播放,也可以通过空间区域内的音频播放装置跟随人声音频的播放音量调整后播放。
本公开实施例提供的任一种音量控制方法可以由任意适当的具有数据处理能力的设备执行,包括但不限于:终端设备和服务器等。或者,本公开实施例提供的任一种音量控制方法可以由处理器执行,如处理器通过调用存储器存储的相应指令来执行本公开实施例提及的任一种音量控制方法。下文不再赘述。
示例性音量控制系统
图8是本公开一个实施例中音量控制系统的结构框图。如图8所示,包括:位于空间区域内的语音采集装置,图像采集装置,音频播放装置,控制器,其中,音频播放装置用于在控制器控制下播放音频,控制器用于执行所述的音量控制方法。
在本公开的一个实施例中,图像采集装置用于采集空间区域内的图像帧序列,音频采集装置用于采集空间区域内的语音信号,控制器用于处理图像帧序列和语音信号,以得到空间区域内的目标人员的人声音频的播放音量,并控制音频播放装置以播放音量播放空间区域的目标人员的人声音频。
示例性装置
图9是本公开一个实施例中音量控制装置的结构框图。如图9所示,音量控制装置包括:第一获取模块100、第一确定模块101、第二确定模块102、第三确定模块103、第二获取模块104、音量调整模块105。
第一获取模块100,用于获取空间区域内的包括空间区域内的人员的图像帧序列;
第一确定模块101,用于基于所述图像帧序列中的各图像帧,确定所述各图像帧中的虚拟麦克风区域和目标人员;
第二确定模块102,用于基于所述各图像帧,确定所述各图像帧中的目标人员的嘴部区域;
第三确定模块103,用于基于所述各图像帧中的目标人员的嘴部区域和虚拟麦克风区域,确定目标人员的嘴部区域与虚拟麦克风区域之间的距离;
第二获取模块104,用于获取空间区域内的语音信号,基于所述语音信号确定所述目标人员的人声音频;
音量调整模块105,用于根据所述目标人员的嘴部区域与虚拟麦克风区域之间的距离,调整所述目标 人员的所述人声音频的播放音量。
在本公开的一个实施例中,第一确定模块101包括:
第一确定子模块,用于对所述图像帧序列中的各图像帧进行识别,确定所述各图像帧中的手持虚拟麦克风的图像区域;
第二确定子模块,用于基于所述各图像帧中的手持虚拟麦克风的图像区域,确定所述各图像帧中的虚拟麦克风区域,将所述各图像帧中持有所述虚拟麦克风的人员确定为所述各图像帧中的目标人员。
在本公开的一个实施例中,第二确定模块102包括:
第三确定子模块,用于获取所述各图像帧中的目标人员的嘴部关键点;
第四确定子模块,用于根据所述各图像帧中的目标人员的嘴部关键点,确定所述各图像帧中的目标人员的嘴部区域。
在本公开的一个实施例中,第三确定模块103包括:
第四确定子模块,用于确定所述各图像帧中的目标人员的嘴部区域的第一预设标识点;
第五确定子模块,用于确定所述各图像帧中的虚拟麦克风区域的第二预设标识点;
第六确定子模块,用于根据所述各图像帧中的所述第一预设标识点与所述第二预设标识点,确定所述目标人员的嘴部区域与虚拟麦克风区域之间的距离。
在本公开的一个实施例中,第四确定子模块,还用于针对所述各图像帧中的嘴部区域,基于所述目标人员的嘴部区域或嘴部关键点,确定所述目标人员的嘴部区域的中心点为所述目标人员的嘴部区域的第一预设标识点;
第五确定子模块,还用于针对所述各图像帧中的虚拟麦克风的区域,基于所述虚拟麦克风区域,确定所述麦克风区域的中心点为所述虚拟麦克风区域的第二预设标识点。
在本公开的一个实施例中,第二获取模块104包括:
第一获取子模块,用于基于所述语音信号进行语音分离,获取空间区域内的人员的人声音频信息,所述人员的人声音频信息包括:人员的人声音频和人声音频对应的音区;
第六确定子模块,用于基于所述各图像帧中的目标人员,确定所述空间区域内的目标人员的位置;
第七确定子模块,用于基于所述空间区域内的目标人员的位置和所述人声音频信息,确定所述目标人员的人声音频。
在本公开的一个实施例中,音量调整模块105还用于基于预设的距离与播放音量之间的对应关系,根据所述目标人员的嘴部区域与虚拟麦克风区域之间的距离,调整所述目标人员的人声音频的播放音量。
在本公开的一个实施例中,所述音量控制装置还包括:
混合模块,用于将所述目标人员的人声音频与伴奏音频混合,通过空间区域内的音频播放装置以所述播放音量播放。
示例性电子设备
下面,参考图10来描述根据本公开实施例的电子设备。如图10所示,电子设备10包括一个或多个处理器11和存储器12。
处理器11可以是中央处理单元(CPU)或者具有数据处理能力和/或指令执行能力的其他形式的处理单元,并且可以控制电子设备中的其他组件以执行期望的功能。
存储器12可以包括一个或多个计算机程序产品,所述计算机程序产品可以包括各种形式的计算机可读存储介质,例如易失性存储器和/或非易失性存储器。所述易失性存储器例如可以包括随机存取存储器(RAM)和/或高速缓冲存储器(cache)等。所述非易失性存储器例如可以包括只读存储器(ROM)、硬盘、闪存等。在所述计算机可读存储介质上可以存储一个或多个计算机程序指令,处理器11可以运行所述程序指令,以实现上文所述的本公开的各个实施例的音量控制方法以及/或者其他期望的功能。在所述计算机可读存储介质中还可以存储诸如输入信号、信号分量、噪声分量等各种内容。
在一个示例中,电子设备10还可以包括:输入装置13和输出装置14,这些组件通过总线系统和/或其他形式的连接机构(未示出)互连。
例如,输入装置13可以是上述的麦克风或麦克风阵列,用于捕捉声源的输入信号。此外,该输入设备13还可以包括例如键盘、鼠标等等。
该输出装置14可以向外部输出各种信息,包括确定出的距离信息、方向信息等。该输出设备14可以包括例如显示器、扬声器、打印机、以及通信网络及其所连接的远程输出设备等等。
当然,为了简化,图10中仅示出了该电子设备中与本公开有关的组件中的一些,省略了诸如总线、输入/输出接口等等的组件。除此之外,根据具体应用情况,电子设备还可以包括任何其他适当的组件。
示例性计算机程序产品和计算机可读存储介质
除了上述方法和设备以外,本公开的实施例还可以是计算机程序产品,其包括计算机程序指令,所述计算机程序指令在被处理器运行时使得所述处理器执行本说明书上述“示例性方法”部分中描述的根据本公开各种实施例的音量控制方法中的步骤。
所述计算机程序产品可以以一种或多种程序设计语言的任意组合来编写用于执行本公开实施例操作的程序代码,所述程序设计语言包括面向对象的程序设计语言,诸如Java、C++等,还包括常规的过程式程序 设计语言,诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算设备上执行、部分地在用户设备上执行、作为一个独立的软件包执行、部分在用户计算设备上部分在远程计算设备上执行、或者完全在远程计算设备或服务器上执行。
此外,本公开的实施例还可以是计算机可读存储介质,其上存储有计算机程序指令,所述计算机程序指令在被处理器运行时使得所述处理器执行本说明书上述“示例性方法”部分中描述的根据本公开各种实施例的音量控制方法中的步骤。
以上结合具体实施例描述了本公开的基本原理,但是,需要指出的是,在本公开中提及的优点、优势、效果等仅是示例而非限制,不能认为这些优点、优势、效果等是本公开的各个实施例必须具备的。另外,上述公开的具体细节仅是为了示例的作用和便于理解的作用,而非限制,上述细节并不限制本公开为必须采用上述具体的细节来实现。
本领域的技术人员可以对本公开进行各种改动和变型而不脱离本申请的精神和范围。这样,倘若本申请的这些修改和变型属于本公开权利要求及其等同技术的范围之内,则本公开也意图包含这些改动和变型在内。

Claims (12)

  1. 一种音量控制方法,包括:
    获取空间区域内的包括空间区域内的人员的图像帧序列;
    基于所述图像帧序列中的各图像帧,确定所述各图像帧中的虚拟麦克风区域和目标人员;
    基于所述各图像帧,确定所述各图像帧中的目标人员的嘴部区域;
    基于所述各图像帧中的目标人员的嘴部区域和虚拟麦克风区域,确定目标人员的嘴部区域与虚拟麦克风区域之间的距离;
    获取空间区域内的语音信号,基于所述语音信号确定所述目标人员的人声音频;
    根据所述目标人员的嘴部区域与虚拟麦克风区域之间的距离,调整所述目标人员的人声音频的播放音量。
  2. 根据权利要求1所述的方法,其中,所述基于所述图像帧序列中的各图像帧,确定所述各图像帧中的虚拟麦克风区域和目标人员,包括:
    对所述图像帧序列中的各图像帧进行识别,确定所述各图像帧中的手持虚拟麦克风的图像区域;
    基于所述各图像帧中的手持虚拟麦克风的图像区域,确定所述各图像帧中的虚拟麦克风区域,将所述各图像帧中持有所述虚拟麦克风的人员确定为所述各图像帧中的目标人员。
  3. 根据权利要求1所述的方法,其中,基于所述图像帧序列中的各图像帧,确定所述各图像帧中的目标人员的嘴部区域,包括:
    获取所述各图像帧中的目标人员的嘴部关键点;
    根据所述各图像帧中的目标人员的嘴部关键点,确定所述各图像帧中的目标人员的嘴部区域。
  4. 根据权利要求3所述的方法,其中,所述基于所述各图像帧中的目标人员的嘴部区域和虚拟麦克风区域,确定目标人员的嘴部区域与虚拟麦克风区域之间的距离,包括:
    确定所述各图像帧中的目标人员的嘴部区域的第一预设标识点;
    确定所述各图像帧中的虚拟麦克风区域的第二预设标识点;
    根据所述各图像帧中的所述第一预设标识点与所述第二预设标识点,确定所述目标人员的嘴部区域与虚拟麦克风区域之间的距离。
  5. 根据权利要求4所述的方法,其中,所述确定所述各图像帧中的嘴部区域的第一预设标识点,包括:
    针对所述各图像帧中的嘴部区域第一预设标识点,基于所述目标人员的嘴部区域或嘴部关键点,确定所述目标人员的嘴部区域的中心点为所述目标人员的嘴部区域的第一预设标识点;
    所述确定所述各图像帧中的虚拟麦克风的第二预设标识点,包括:
    针对所述各图像帧中的虚拟麦克风的区域第二预设标识点,基于所述虚拟麦克风区域,确定所述麦克 风区域的中心点为所述虚拟麦克风区域的第二预设标识点。
  6. 根据权利要求1-5中任一项所述的方法,其中,所述基于所述语音信号,确定所述目标人员的人声音频,包括:
    基于所述语音信号进行语音分离,获取空间区域内的人员的人声音频信息,所述人员的人声音频信息包括:人员的人声音频和人声音频对应的音区;
    基于所述各图像帧中的目标人员,确定所述各图像帧中的目标人员的位置;
    基于所述各图像帧中的目标人员的位置和所述人声音频信息,确定所述目标人员的人声音频。
  7. 根据权利要求1-5中任一项所述的方法,其中,所述根据所述目标人员的嘴部区域与虚拟麦克风区域之间的距离,调整所述目标人员的人声音频的播放音量,包括:
    基于预设的距离与播放音量之间的对应关系,根据所述目标人员的嘴部区域与虚拟麦克风区域之间的距离,调整所述目标人员的人声音频的播放音量。
  8. 根据权利要求1-5中任一项所述的方法,所述目标人员的人声音频的播放音量之后,还包括:
    将所述目标人员的人声音频与伴奏音频混合,通过空间区域内的音频播放装置以所述播放音量播放。
  9. 一种音量控制系统,包括:
    位于空间区域内的语音采集装置,图像采集装置,音频播放装置,控制器,其中,所述音频播放装置用于在控制器控制下播放音频,所述控制器用于执行权利要求1-8任一项所述的方法。
  10. 一种音量控制装置,包括:
    第一获取模块,用于获取空间区域内的包括空间区域内的人员的图像帧序列;
    第一确定模块,用于基于所述图像帧序列中的各图像帧,确定所述各图像帧中的虚拟麦克风区域和目标人员;
    第二确定模块,用于基于所述各图像帧,确定所述各图像帧中的目标人员的嘴部区域;
    第三确定模块,用于基于所述各图像帧中的目标人员的嘴部区域和虚拟麦克风区域,确定目标人员的嘴部区域与虚拟麦克风区域之间的距离;
    第二获取模块,用于获取空间区域内的语音信号,基于所述语音信号确定所述目标人员的人声音频;
    音量调整模块,用于根据所述目标人员的嘴部区域与虚拟麦克风区域之间的距离,调整所述目标人员的人声音频的播放音量。
  11. 一种计算机可读存储介质,所述存储介质存储有计算机程序,所述计算机程序用于执行上述权利要求1-8任一项所述的方法。
  12. 一种电子设备,所述电子设备包括:
    处理器;
    用于存储所述处理器可执行指令的存储器;
    所述处理器,用于从所述存储器中读取所述可执行指令,并执行所述可执行指令以实现上述1-8任一项所述的方法。
PCT/CN2023/087019 2022-04-08 2023-04-07 音量控制方法、装置、存储介质和电子设备 WO2023193803A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210368353.1 2022-04-08
CN202210368353.1A CN114911449A (zh) 2022-04-08 2022-04-08 音量控制方法、装置、存储介质和电子设备

Publications (1)

Publication Number Publication Date
WO2023193803A1 true WO2023193803A1 (zh) 2023-10-12

Family

ID=82763803

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/087019 WO2023193803A1 (zh) 2022-04-08 2023-04-07 音量控制方法、装置、存储介质和电子设备

Country Status (2)

Country Link
CN (1) CN114911449A (zh)
WO (1) WO2023193803A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114911449A (zh) * 2022-04-08 2022-08-16 南京地平线机器人技术有限公司 音量控制方法、装置、存储介质和电子设备

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009225379A (ja) * 2008-03-18 2009-10-01 Fujitsu Ltd 音声処理装置、音声処理方法、音声処理プログラム
US20140126754A1 (en) * 2012-11-05 2014-05-08 Nintendo Co., Ltd. Game system, game process control method, game apparatus, and computer-readable non-transitory storage medium having stored therein game program
CN107534725A (zh) * 2015-05-19 2018-01-02 华为技术有限公司 一种语音信号处理方法及装置
CN105245811A (zh) * 2015-10-16 2016-01-13 广东欧珀移动通信有限公司 一种录像方法及装置
CN111932619A (zh) * 2020-07-23 2020-11-13 安徽声讯信息技术有限公司 结合图像识别和语音定位的麦克风跟踪系统及方法
CN112423191A (zh) * 2020-11-18 2021-02-26 青岛海信商用显示股份有限公司 一种视频通话设备和音频增益方法
CN114170559A (zh) * 2021-12-14 2022-03-11 北京地平线信息技术有限公司 车载设备的控制方法、装置和车辆
CN114911449A (zh) * 2022-04-08 2022-08-16 南京地平线机器人技术有限公司 音量控制方法、装置、存储介质和电子设备

Also Published As

Publication number Publication date
CN114911449A (zh) 2022-08-16

Similar Documents

Publication Publication Date Title
CN107799126B (zh) 基于有监督机器学习的语音端点检测方法及装置
EP3614377B1 (en) Object recognition method, computer device and computer readable storage medium
JP5323770B2 (ja) ユーザ指示取得装置、ユーザ指示取得プログラムおよびテレビ受像機
US20240087587A1 (en) Wearable system speech processing
US6754373B1 (en) System and method for microphone activation using visual speech cues
US20150325240A1 (en) Method and system for speech input
WO2016150001A1 (zh) 语音识别的方法、装置及计算机存储介质
CN107346661B (zh) 一种基于麦克风阵列的远距离虹膜跟踪与采集方法
US20160140964A1 (en) Speech recognition system adaptation based on non-acoustic attributes
EP3956883A1 (en) Identifying input for speech recognition engine
US20120259638A1 (en) Apparatus and method for determining relevance of input speech
US20230386461A1 (en) Voice user interface using non-linguistic input
WO2023193803A1 (zh) 音量控制方法、装置、存储介质和电子设备
JP2003216955A (ja) ジェスチャ認識方法、ジェスチャ認識装置、対話装置及びジェスチャ認識プログラムを記録した記録媒体
WO2020125038A1 (zh) 语音控制方法及装置
CN108877787A (zh) 语音识别方法、装置、服务器及存储介质
CN111341350A (zh) 人机交互控制方法、系统、智能机器人及存储介质
WO2019171780A1 (ja) 個人識別装置および特徴収集装置
CN114779922A (zh) 教学设备的控制方法、控制设备、教学系统和存储介质
WO2019153382A1 (zh) 智能音箱及播放控制方法
WO2021166811A1 (ja) 情報処理装置および行動モード設定方法
WO2021134250A1 (zh) 情绪管理方法、设备及计算机可读存储介质
JP2001067098A (ja) 人物検出方法と人物検出機能搭載装置
JP2021076715A (ja) 音声取得装置、音声認識システム、情報処理方法、及び情報処理プログラム
KR20210039583A (ko) 멀티모달 기반 사용자 구별 방법 및 장치

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23784365

Country of ref document: EP

Kind code of ref document: A1