CN117351976A - Display device, voice processing method and device - Google Patents

Display device, voice processing method and device

Info

Publication number
CN117351976A
CN117351976A (application CN202210753751.5A)
Authority
CN
China
Prior art keywords
target
angle
voice
parameter
distance
Prior art date
Legal status
Pending
Application number
CN202210753751.5A
Other languages
Chinese (zh)
Inventor
伊子旭
胡永双
丁强
Current Assignee
Qingdao Hisense Commercial Display Co Ltd
Original Assignee
Qingdao Hisense Commercial Display Co Ltd
Priority date
Filing date
Publication date
Application filed by Qingdao Hisense Commercial Display Co Ltd
Priority to CN202210753751.5A
Publication of CN117351976A
Legal status: Pending

Classifications

    • G10L21/0208 Noise filtering (under G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation)
    • G06V40/161 Human faces: detection, localisation, normalisation
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • H04N7/15 Conference systems
    • G10L2021/02082 Noise filtering where the noise is echo or reverberation of the speech
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; beamforming


Abstract

Embodiments of the present application provide a display device, a voice processing method, and a voice processing apparatus. The display device comprises a controller, an image acquisition device, and a voice acquisition device. The controller is configured to: obtain an image to be recognized and process it to obtain a mapping table; process the voice to be processed with a beamforming algorithm to determine the incoming-wave angle corresponding to that voice; if a target angle parameter corresponding to the incoming-wave angle exists in the mapping table, obtain the target distance corresponding to the target angle parameter; determine the target reverberation parameter and target gain parameter corresponding to the target distance; determine the target voice corresponding to the voice to be processed according to those parameters; and send the target voice to a target terminal. By processing an image of the target scene to determine the incoming-wave direction of the voice and the corresponding distance information, and applying gain processing and dereverberation to the captured voice audio according to that distance information, call quality is improved.

Description

Display device, voice processing method and device
Technical Field
Embodiments of the present application relate to the field of display technology, and more particularly, to a display device, a voice processing method, and a voice processing apparatus.
Background
With the continuous development of multimedia display technology, large display screens with voice-call and remote-video functions are applied in teleconferences, allowing people in different regions to communicate and discuss in real time through the conference display and expanding the scenarios in which teleconferences are used.
In the prior art, during a teleconference, the large display screen records the speaker's voice and transmits it to the remote conference terminal. To prevent the recorded voice from degrading when the speaker is far from the display's microphone, the display applies gain boosting and dereverberation to the recorded voice, improving the quality of the voice transmitted in real time during the teleconference.
However, when the speaker is close to the display's microphone, applying dereverberation to the recorded sound makes the voice played at the remote conference terminal sound dry and thin, degrading the call quality of the conference.
Disclosure of Invention
The display device, voice processing method, and voice processing apparatus provided by the embodiments of the present application determine the incoming-wave direction of the voice and the corresponding distance information by processing an image of the target scene, and perform gain processing and dereverberation on the captured voice audio according to the distance information, thereby improving call quality.
In a first aspect, embodiments of the present application provide a display device, including:
the image acquisition device is used for acquiring an image to be identified of the target scene;
the voice acquisition device is used for acquiring the voice to be processed of the target person;
a controller configured to:
obtaining an image to be recognized, and processing the image to be recognized to obtain a mapping table, where the mapping table includes at least one group of distance parameters and corresponding angle parameters;
obtaining voice to be processed, processing the voice to be processed with a beamforming algorithm to determine the incoming-wave angle corresponding to the voice to be processed, and, if a target angle parameter corresponding to the incoming-wave angle exists in the mapping table, obtaining the target distance corresponding to the target angle parameter;
determining a target reverberation parameter corresponding to the target distance, determining a target gain parameter corresponding to the target distance, determining the target voice corresponding to the voice to be processed according to the target reverberation parameter and the target gain parameter, and sending the target voice to a target terminal.
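The text does not name a specific beamforming algorithm for the incoming-wave-angle step. As a rough illustration only, the following sketch estimates the angle for a single microphone pair via a basic cross-correlation TDOA estimate; the function name, default microphone spacing, and sound speed are assumptions, not values from the patent.

```python
import numpy as np

def estimate_incoming_angle(sig_left, sig_right, fs, mic_spacing=0.1, c=343.0):
    """Estimate the incoming-wave angle (degrees) for a two-microphone pair.

    Hypothetical sketch: the delay between the two channels is found by
    cross-correlation, then converted to an angle with the far-field
    relation sin(theta) = c * tdoa / mic_spacing.
    """
    corr = np.correlate(sig_left, sig_right, mode="full")
    lag = np.argmax(corr) - (len(sig_right) - 1)  # delay in samples
    tdoa = lag / fs                               # delay in seconds
    # Clamp to the physically valid range before taking arcsin.
    sin_theta = np.clip(c * tdoa / mic_spacing, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))
```

A real microphone array would use all channels (e.g. SRP or MUSIC style processing); this pairwise version only shows the geometric idea.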
In one possible design, when determining the target reverberation parameter corresponding to the target distance, the controller is configured to:
if the target distance is smaller than a preset minimum distance, determine that the target reverberation parameter is 1;
if the target distance is greater than or equal to the preset minimum distance, determine the target reverberation parameter corresponding to the target distance according to the following formula:
τ=1+0.01log(10*d)
where τ is the target reverberation parameter and d is the target distance.
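The formula above translates directly into code. A minimal sketch follows, assuming a base-10 logarithm and a hypothetical 0.5 m default for the preset minimum distance, since the text specifies neither:

```python
import math

def target_reverb_param(d, d_min=0.5):
    """Target reverberation parameter from the embodiment's formula.

    tau = 1 + 0.01 * log10(10 * d) when d >= d_min, else 1.
    The log base and the d_min default are assumptions.
    """
    if d < d_min:
        return 1.0
    return 1.0 + 0.01 * math.log10(10.0 * d)
```

At d = 1 m this gives tau = 1.01, and at d = 10 m it gives tau = 1.02, so the parameter grows very slowly with distance.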
In one possible design, when determining the target gain parameter corresponding to the target distance, the controller is configured to:
determine the target gain parameter corresponding to the target distance according to the following formula:
where θ is the target gain parameter and d is the target distance.
In one possible design, when determining the target voice corresponding to the voice to be processed according to the target reverberation parameter and the target gain parameter, the controller is specifically configured to:
determining a target coefficient according to the product of the target reverberation parameter and the target gain parameter;
and obtaining the target voice according to the target coefficient and the voice to be processed.
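These two steps reduce to a scalar multiply. A minimal sketch follows, modeling the voice to be processed as an array of samples; the function name and the element-wise-scaling interpretation of "obtaining the target voice" are assumptions:

```python
import numpy as np

def apply_target_coefficient(speech, tau, theta):
    """Scale captured speech by the product of the reverberation and gain
    parameters.

    Hypothetical sketch: tau and theta are the per-distance parameters
    from the preceding steps; the target voice is modeled as a simple
    element-wise scaling of the samples.
    """
    k = tau * theta  # target coefficient
    return k * np.asarray(speech, dtype=np.float64)
```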
In one possible design, when processing the image to be recognized to obtain the mapping table, the controller is specifically configured to:
performing face recognition processing and face ranging processing on the image to be recognized to obtain at least one face image and the distance parameter and position parameter corresponding to each face image;
determining corresponding angle parameters of each face image according to the corresponding position parameters of each face image;
and generating a mapping table according to the distance parameters and the angle parameters corresponding to all the face images.
In one possible design, after determining the angle parameter corresponding to each face image according to the position parameter corresponding to each face image, the controller is further configured to:
obtaining at least one angle interval according to a preset angle interval parameter, where each angle interval includes a minimum angle parameter and a maximum angle parameter;
determining the angle interval matching the angle parameter corresponding to each face image, taking the mean of the distance parameters corresponding to all face images belonging to the same angle interval as the mean distance for that angle interval, and generating the mapping table from all angle intervals and their corresponding mean distances.
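The interval-averaging step above can be sketched as follows, assuming a hypothetical 10-degree value for the preset angle interval parameter and representing each detected face by an (angle, distance) pair:

```python
from collections import defaultdict

def build_mapping_table(faces, interval_deg=10):
    """Build {(min_angle, max_angle): mean_distance} from detected faces.

    `faces` is a list of (angle_deg, distance_m) pairs. The 10-degree
    interval width is an assumed value for the preset angle interval
    parameter; the patent does not specify it.
    """
    buckets = defaultdict(list)
    for angle, dist in faces:
        lo = int(angle // interval_deg) * interval_deg
        buckets[(lo, lo + interval_deg)].append(dist)
    # Mean distance per interval, as described in the embodiment.
    return {iv: sum(ds) / len(ds) for iv, ds in buckets.items()}
```

Two faces recognized at 12 and 15 degrees, for example, land in the same (10, 20) interval, and the table stores the mean of their distances.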
In one possible design, when obtaining the target distance corresponding to the target angle parameter if that parameter exists in the mapping table, the controller is further configured to:
if it is determined that a target angle interval matching the target angle parameter exists in the mapping table, determine the mean distance corresponding to the target angle interval as the target distance corresponding to the target angle parameter, where the target angle parameter is greater than or equal to the minimum angle parameter of the target angle interval and less than or equal to the maximum angle parameter of the target angle interval.
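The matching rule above can be sketched as a lookup over a table keyed by (minimum angle, maximum angle) intervals; the function and variable names are hypothetical:

```python
def lookup_target_distance(mapping_table, angle):
    """Return the mean distance for the interval containing `angle`.

    Matches the embodiment's rule: the angle must be >= the interval's
    minimum angle parameter and <= its maximum angle parameter.
    Returns None when no interval matches.
    """
    for (lo, hi), mean_dist in mapping_table.items():
        if lo <= angle <= hi:
            return mean_dist
    return None
```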
In one possible design, the voice acquisition device is a microphone array comprising at least two microphones.
In a second aspect, an embodiment of the present application provides a speech processing method, including:
obtaining an image to be recognized, and processing the image to be recognized to obtain a mapping table, where the mapping table includes at least one group of distance parameters and corresponding angle parameters;
obtaining voice to be processed, processing the voice to be processed with a beamforming algorithm to determine the incoming-wave angle corresponding to the voice to be processed, and, if a target angle parameter corresponding to the incoming-wave angle exists in the mapping table, obtaining the target distance corresponding to the target angle parameter;
determining a target reverberation parameter corresponding to the target distance, determining a target gain parameter corresponding to the target distance, determining the target voice corresponding to the voice to be processed according to the target reverberation parameter and the target gain parameter, and sending the target voice to a target terminal.
In a third aspect, an embodiment of the present application provides a speech processing apparatus, including:
an acquisition module, configured to obtain an image to be recognized and process the image to obtain a mapping table, where the mapping table includes at least one group of distance parameters and corresponding angle parameters;
a processing module, configured to obtain voice to be processed, process the voice with a beamforming algorithm to determine the incoming-wave angle corresponding to the voice, and, if a target angle parameter corresponding to the incoming-wave angle exists in the mapping table, obtain the target distance corresponding to the target angle parameter;
a determining module, configured to determine a target reverberation parameter corresponding to the target distance, determine a target gain parameter corresponding to the target distance, determine the target voice corresponding to the voice to be processed according to the target reverberation parameter and the target gain parameter, and send the target voice to a target terminal.
According to the display device, voice processing method, and voice processing apparatus provided by the embodiments of the present application, the positions of all teleconference participants are obtained from the camera image, and each participant's distance from the microphones and angle relative to the camera are determined. When a person is detected speaking, the distance is determined from the speaker's position, and intelligent gain and dereverberation processing is applied to the speaker's voice according to that distance, improving the call quality of the teleconference.
Drawings
To describe the embodiments of the present application or the related art more clearly, the drawings required for the embodiments are briefly introduced below. The drawings described below show some embodiments of the present application; those of ordinary skill in the art may derive other drawings from them.
Fig. 1 is a schematic diagram of an operation scenario between a display device and a control apparatus according to an embodiment of the present invention;
Fig. 2 is a block diagram of the configuration of a control apparatus according to an embodiment of the present invention;
Fig. 3 is a hardware configuration block diagram of a display device according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of the software configuration in a display device according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of an application icon control interface in a display device according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a display device according to an embodiment of the present invention;
Fig. 7 is a first schematic flowchart of a voice processing method according to an embodiment of the present invention;
Fig. 8 is a schematic diagram of positions corresponding to face images according to an embodiment of the present invention;
Fig. 9 is a second schematic flowchart of a voice processing method according to an embodiment of the present invention;
Fig. 10 is a third schematic flowchart of a voice processing method according to an embodiment of the present invention;
Fig. 11 is a schematic structural diagram of a voice processing apparatus according to an embodiment of the present invention.
Detailed Description
For clarity of the purposes, embodiments, and advantages of the present application, the exemplary embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described exemplary embodiments are only some, not all, of the embodiments of the present application.
Based on the exemplary embodiments described herein, all other embodiments obtained by those of ordinary skill in the art without inventive effort fall within the scope of the appended claims. Furthermore, while the disclosure is presented in the context of one or more exemplary embodiments, it should be appreciated that individual aspects of the disclosure may separately constitute a complete embodiment.
It should be noted that the brief description of the terms in the present application is only for convenience in understanding the embodiments described below, and is not intended to limit the embodiments of the present application. Unless otherwise indicated, these terms should be construed in their ordinary and customary meaning.
The terms "first," "second," "third," and the like in the description, the claims, and the drawings above are used to distinguish between similar objects or entities and do not necessarily describe a particular order or sequence, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances, so that the embodiments of the application can, for example, be practiced in sequences other than those illustrated or described herein.
Furthermore, the terms "comprise" and "have," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to those elements expressly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
The term "module" as used in this application refers to any known or later developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware and/or software code that is capable of performing the function associated with that element.
Fig. 1 is a schematic diagram of an operation scenario between a display device and a control apparatus according to an embodiment of the present invention. As shown in fig. 1, a user may operate the display device 200 through a mobile terminal 300 and a control apparatus 100. The control apparatus 100 may be a remote control, and communication between the remote control and the display device may use infrared protocol communication, Bluetooth protocol communication, wireless, or other wired means to control the display device 200. The user may control the display device 200 by inputting user instructions through keys on the remote control, voice input, control panel input, and the like. In some embodiments, mobile terminals, tablet computers, notebook computers, and other smart devices may also be used to control the display device 200.
Fig. 2 is a block diagram of the configuration of a control apparatus according to an exemplary embodiment. As shown in fig. 2, the control apparatus 100 includes a controller 110, a communication interface 130, a user input/output interface 140, a memory, and a power supply. The control apparatus 100 may receive a user's input operation instruction and convert it into an instruction that the display device 200 can recognize and respond to, serving as an intermediary for interaction between the user and the display device 200. The communication interface 130 is configured to communicate with the outside and includes at least one of a WIFI chip, a Bluetooth module, NFC, or an alternative module. The user input/output interface 140 includes at least one of a microphone, a touch pad, a sensor, keys, or an alternative module.
Fig. 3 is a hardware configuration block diagram of a display device according to an embodiment of the present invention. The display apparatus 200 shown in fig. 3 includes at least one of a modem 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, a memory, a power supply, and a user interface 280. The controller includes a central processor, a video processor, an audio processor, a graphics processor, a RAM, a ROM, and first to nth input/output interfaces. The display 260 may be at least one of a liquid crystal display, an OLED display, a touch display, and a projection display, and may also be a projection device and a projection screen. The modem 210 receives broadcast television signals through wired or wireless reception and demodulates the audio/video signal, as well as EPG data signals, from among a plurality of wireless or wired broadcast television signals. The detector 230 is used to collect signals of the external environment or of interaction with the outside. The controller 250 and the modem 210 may be located in separate devices; that is, the modem 210 may also be located in a device external to the main device containing the controller 250, such as an external set-top box.
In some embodiments, the controller 250 controls the operation of the display device and responds to user operations through various software control programs stored on the memory. The controller 250 controls the overall operation of the display apparatus 200. The user may input a user command through a Graphical User Interface (GUI) displayed on the display 260, and the user input interface receives the user input command through the Graphical User Interface (GUI). Alternatively, the user may input the user command by inputting a specific sound or gesture, and the user input interface recognizes the sound or gesture through the sensor to receive the user input command.
In some embodiments, a "user interface" is a media interface for interaction and exchange of information between an application or operating system and a user that enables conversion between an internal form of information and a form acceptable to the user. A commonly used presentation form of the user interface is a graphical user interface (Graphic User Interface, GUI), which refers to a user interface related to computer operations that is displayed in a graphical manner. It may be an interface element such as an icon, a window, a control, etc. displayed in a display screen of the electronic device, where the control may include at least one of a visual interface element such as an icon, a button, a menu, a tab, a text box, a dialog box, a status bar, a navigation bar, a Widget, etc.
Fig. 4 is a schematic diagram of the software configuration in a display device according to an embodiment of the present invention. As shown in fig. 4, the system is divided into four layers, from top to bottom: an application layer (abbreviated as "application layer"), an application framework (Application Framework) layer (abbreviated as "framework layer"), an Android runtime (Android Runtime) and system library layer (abbreviated as "system runtime layer"), and a kernel layer. The kernel layer contains at least one of the following drivers: audio driver, display driver, Bluetooth driver, camera driver, WIFI driver, USB driver, HDMI driver, sensor drivers (e.g., fingerprint sensor, temperature sensor, pressure sensor, etc.), power supply driver, and the like.
Fig. 5 is a schematic diagram of an application icon control interface in a display device according to an embodiment of the present invention. As shown in fig. 5, the application layer includes at least one icon control that an application can display on the display, for example: a live television application icon control, a video-on-demand application icon control, a media center application icon control, an application center icon control, a game application icon control, and the like. A live television application can provide live television from different signal sources. Unlike live television, a video-on-demand application provides video playback from various storage sources. A media center application may provide various applications for playing multimedia content. An application center may be provided to store various applications.
During a teleconference, the large display screen records the speaker's voice and transmits it to the remote conference terminal. To prevent the recorded voice from degrading when the speaker is far from the display's microphone, the display applies gain boosting and dereverberation to the recorded voice, improving the quality of the voice transmitted in real time during the teleconference. However, when the speaker is close to the display's microphone, applying dereverberation to the recorded sound makes the voice played at the remote conference terminal sound dry and thin, degrading the call quality of the conference.
To solve the problem of poor call voice quality in teleconference scenarios in the prior art, the present application provides a display device, a voice processing method, and a voice processing apparatus, which determine the incoming-wave direction of the voice and the corresponding distance information by processing an image of the target scene, and perform gain processing and dereverberation on the captured voice audio according to the distance information, thereby improving call quality.
The technical solution of the present application is described in detail below with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments.
Fig. 6 is a schematic structural diagram of a display device according to an embodiment of the present invention. As shown in fig. 6, the display device provided by this embodiment includes a controller 601, a display 602, an image acquisition device 603, and a voice acquisition device 604. Illustratively, the image acquisition device 603 is a camera mounted on the display device. The voice acquisition device 604 is a microphone; specifically, to improve the voice quality of the display device during voice calls, the voice acquisition device is a microphone array comprising at least two microphones. Illustratively, the voice acquisition device 604 is a microphone array comprising 6 microphones.
In the embodiment of the present invention, the controller 601 acquires an image to be recognized of the target scene through the image acquisition device 603. The target scene is the application scene of the current display device. Specifically, when the target scene is a teleconference scene, the image to be recognized contains all the teleconference participants facing the display device. The controller 601 collects the voice to be processed of the target person through the voice acquisition device 604. Specifically, the target person is the person speaking in the conference.
In the embodiment of the present invention, the controller 601 obtains the positions of all teleconference participants captured by the camera and determines each participant's distance from the microphones and angle relative to the camera. When a speaker is detected, the controller determines the distance from the speaker's position and performs intelligent gain and dereverberation processing on the speaker's voice according to that distance, improving the call quality of the teleconference.
The technical solution of the present application is described in detail below with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments.
Fig. 7 is a schematic flow chart of a voice processing method according to an embodiment of the present invention, and an execution subject of the embodiment may be the controller in the embodiment shown in fig. 6. As shown in fig. 7, the method includes:
s701: and obtaining an image to be identified, and processing the image to be identified to obtain a mapping table, wherein the mapping table comprises at least one group of distance parameters and corresponding angle parameters.
In the embodiment of the invention, after the display device is started, an instruction for acquiring an image is generated, and the image acquisition device is controlled according to the instruction to shoot a scene graph corresponding to the display device. The image acquisition device is illustratively a camera mounted directly above the display device. When the display device is applied to a teleconference, the target scene shot by the camera contains all the people participating in the conference facing the display device, that is, the image to be identified shot by the camera contains the facial image features of at least one conference participant.
In the embodiment of the invention, a face image recognition algorithm and a face ranging algorithm are preset in a controller of display equipment. After obtaining an image to be recognized including facial image features of at least one participant, the image to be recognized is subjected to face recognition processing and face ranging processing, and at least one face image and a distance parameter and a position parameter corresponding to each face image are obtained.
Specifically, a face image recognition algorithm is adopted to recognize all face images contained in the image to be recognized and the position of each face image, wherein the position of each face image is the position data of the recognized face image within the image to be recognized, namely its offset from the center of the image to be recognized. During the shooting of the image to be identified, the distance between the human eyes and the camera can be determined by a face ranging algorithm from the focal length of the camera, the pixel width of the human eyes recognized by the camera, and the typical distance between a person's two eyes. Illustratively, while shooting the image to be recognized for the current conference site, the camera adjusts its focal length multiple times according to the recognized face image positions so as to measure the distance between each conference participant and the display device.
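The face-ranging step described above can be sketched with the standard pinhole-camera relation. The function name and the interocular-distance constant below are assumptions for illustration, not values given in the text:

```python
# Illustrative pinhole-model sketch of the face-ranging step.
AVERAGE_INTEROCULAR_DISTANCE_M = 0.063  # typical adult eye-to-eye spacing (assumed)

def estimate_face_distance(focal_length_px: float, eye_span_px: float) -> float:
    """Estimate camera-to-face distance from the pixel width between the eyes."""
    if eye_span_px <= 0:
        raise ValueError("eye span in pixels must be positive")
    return focal_length_px * AVERAGE_INTEROCULAR_DISTANCE_M / eye_span_px
```

For example, with an assumed 1000 px focal length, a face whose eyes span 21 px would be estimated at about 3 meters.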
In the embodiment of the invention, after all face images and the distance parameters and the position parameters corresponding to each face image are obtained, the corresponding angle parameters of each face image can be determined according to the position parameters corresponding to each face image, and then a mapping table is generated according to the distance parameters and the angle parameters corresponding to all face images.
Fig. 8 is a schematic diagram of a position corresponding to a face image according to an embodiment of the present invention. Specifically, the position corresponding to the face image is shown in fig. 8. Illustratively, after the image to be identified is processed to obtain the position information of all recognized face images within it, the angle parameter of each face image is determined. The specific calculation is as follows: the bottom center of the image to be identified is taken as the origin, the slope of the straight line between the face image's position and the origin is determined, and the angle parameter of the face image is determined from that slope. After the angle parameters of all face images are calculated, a mapping table is generated according to the distance parameters and angle parameters corresponding to all the face images.
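The angle calculation from the bottom-center origin can be sketched as follows. The coordinate convention (image origin at top-left, y growing downward) and the function name are assumptions, but the result matches the 45 to 135 degree range seen in Table 1:

```python
import math

def face_angle_deg(face_x: float, face_y: float, image_w: int, image_h: int) -> float:
    """Angle of a face position measured from the bottom-center origin of the image.

    0 degrees points toward the right edge and 90 degrees straight up.
    Illustrative sketch only.
    """
    dx = face_x - image_w / 2.0   # horizontal offset from the bottom-center origin
    dy = image_h - face_y         # height above the bottom edge
    return math.degrees(math.atan2(dy, dx))
```

A face at the exact horizontal center of the image maps to 90 degrees, regardless of its height.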
For example, the head portraits of 6 conference participants are identified in the image to be identified, and are respectively identified as No. 1 to No. 6, and then the mapping tables obtained according to the distance parameters and the angle parameters corresponding to No. 1 to No. 6 are shown in table 1:
TABLE 1
                    No. 1   No. 2   No. 3   No. 4   No. 5   No. 6
Distance parameter  2 m     3 m     4 m     4 m     3 m     2 m
Angle parameter     45°     60°     85°     95°     120°    135°
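Table 1 can be held as a simple in-memory mapping from angle parameter (degrees) to distance parameter (meters). The dict representation is a hypothetical choice; the text only specifies the table's contents:

```python
# Table 1 as a mapping {angle_parameter_deg: distance_parameter_m}.
mapping_table = {45: 2.0, 60: 3.0, 85: 4.0, 95: 4.0, 120: 3.0, 135: 2.0}
```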
S702: obtaining the voice to be processed, carrying out beam forming algorithm processing on the voice to be processed, determining an incoming wave angle corresponding to the voice to be processed, and if a target angle parameter corresponding to the incoming wave angle exists in the mapping table, obtaining a target distance corresponding to the target angle parameter.
In the embodiment of the invention, in the process of providing the teleconference by the display equipment, when the voice acquisition device of the display equipment detects voice, voice audio of the speaking voice of the conference personnel is recorded as the voice to be processed. Illustratively, echo cancellation is performed on the voice to be processed, and the voice to be processed after echo cancellation is analyzed by adopting a beamforming algorithm.
In the embodiment of the invention, a beam forming algorithm, namely an adaptive beam forming algorithm, is preset in the display equipment, and the algorithm can identify the azimuth of the recorded voice. In the embodiment of the invention, the azimuth of the speaker relative to the display equipment in the current conference can be obtained by carrying out the wave beam forming algorithm processing on the voice to be processed, and the incoming wave angle of the speaker of the current conference is obtained according to the azimuth information.
In the embodiment of the invention, if a target angle parameter corresponding to the incoming wave angle exists in the mapping table, the target distance corresponding to the target angle parameter is obtained. For example, if an angle parameter consistent with the incoming wave angle can be found in the mapping table obtained in S701, the person corresponding to that angle parameter is determined to be the person currently speaking. The angle parameter consistent with the incoming wave angle is taken as the target angle parameter, and the distance parameter corresponding to the target angle parameter is taken as the target distance.
In the embodiment of the invention, if a target angle parameter corresponding to the incoming wave angle does not exist in the mapping table, that is, the currently acquired voice may not be the voice of a speaker, the target distance is set to zero. Illustratively, when the target distance is zero, the voice to be recognized is directly muted.
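The lookup with its mute-on-miss fallback can be sketched as below; the function name is hypothetical:

```python
def target_distance(mapping_table: dict, incoming_angle: int) -> float:
    """Look up the speaker distance for the beamformed incoming-wave angle.

    A return value of zero means no enrolled participant matches the angle,
    and the caller mutes the captured audio, as described above. Sketch only.
    """
    return mapping_table.get(incoming_angle, 0.0)
```

For example, with the Table 1 contents, a 60-degree incoming wave resolves to 3 meters, while an unmatched angle resolves to 0 (mute).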
S703: determining target reverberation parameters corresponding to the target distances, determining target gain parameters corresponding to the target distances, determining target voice corresponding to the voice to be processed according to the target reverberation parameters and the target gain parameters, and sending the target voice to the target terminal.
In the embodiment of the invention, after the target distance between the current speaker and the display device is determined, the audio can be adjusted according to the reverberation parameter and the gain parameter corresponding to the target distance.
For example, the corresponding target reverberation parameter can be determined according to the target distance. Specifically, when the target distance is smaller than the preset minimum distance, the target reverberation parameter is determined to be 1. For example, if the preset minimum distance is 3 meters, the voice of every speaker less than 3 meters from the display device gets a reverberation coefficient of 1. When the target distance is greater than or equal to the preset minimum distance, the target reverberation parameter corresponding to the target distance is determined according to formula (1):
τ=1+0.01log(10*d) (1)
where τ is the target reverberation parameter and d is the target distance.
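The piecewise rule around formula (1) can be sketched as follows. The text does not state the logarithm's base, so base 10 is assumed here, and the 3-meter threshold is the example value given above:

```python
import math

PRESET_MIN_DISTANCE_M = 3.0  # example threshold from the text

def target_reverb_param(d: float) -> float:
    """tau per formula (1): 1 for near speakers, else 1 + 0.01 * log(10 * d).

    Base-10 logarithm is an assumption; the text only writes log.
    """
    if d < PRESET_MIN_DISTANCE_M:
        return 1.0
    return 1.0 + 0.01 * math.log10(10.0 * d)
```

So a speaker at 1 meter gets tau = 1, while a speaker at 10 meters gets tau = 1 + 0.01 * log10(100) = 1.02.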
For example, the corresponding target gain parameter may be determined according to the target distance, specifically, the target gain parameter corresponding to the target distance is determined according to formula (2):
where θ is the target gain parameter and d is the target distance.
In the embodiment of the present invention, after determining the target reverberation parameter corresponding to the target distance and determining the target gain parameter corresponding to the target distance, the target coefficient is determined as the product of the target reverberation parameter τ and the target gain parameter θ.
In the embodiment of the invention, after the echo cancellation processing is performed on the voice to be processed, dereverberation processing and voice gain processing are performed on the echo-cancelled voice according to the calculated target coefficient to obtain the processed target voice, and the target voice is sent to the target terminal corresponding to the teleconference, thereby improving the playback effect of the target voice at the other end of the teleconference.
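The combined use of the two parameters can be sketched as a single scaling step. This is greatly simplified: real dereverberation is filter-based, but the text only specifies that a single coefficient τ·θ drives the gain and dereverberation processing, so a per-sample scaling stands in for it here:

```python
def apply_target_coefficient(samples, tau, theta):
    """Scale echo-cancelled samples by the target coefficient tau * theta.

    Simplified stand-in for the gain and dereverberation processing
    described in the text; function name and approach are illustrative.
    """
    coeff = tau * theta
    return [s * coeff for s in samples]
```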
According to the voice processing method provided by this embodiment, the positions of all persons participating in the teleconference are obtained from the camera image, and each participant's distance from the microphones and angle relative to the camera are determined. When a person is detected speaking, the distance is determined from the speaker's position, and intelligent gain and dereverberation processing is applied to the speaker's voice according to that distance, improving the call quality of the teleconference.
Fig. 9 is a second schematic flowchart of a voice processing method according to an embodiment of the present invention. In the embodiment of the present invention, based on the embodiment provided in fig. 7, another implementation of obtaining the image to be identified and processing it to obtain the mapping table in S701 is described in detail. As shown in fig. 9, the method includes:
S901: performing face recognition processing and face ranging processing on the image to be recognized to obtain at least one face image and the distance parameter and position parameter corresponding to each face image.
S902: and determining the corresponding angle parameter of each face image according to the corresponding position parameter of each face image.
In the embodiment of the present invention, the method and effect achieved by S901 to S902 are identical to the method and effect achieved by S701 in the embodiment of fig. 7, and are not described herein.
S903: at least one angle interval is obtained according to preset angle interval parameters, wherein each angle interval comprises a minimum angle parameter and a maximum angle parameter.
In the embodiment of the invention, a plurality of angle intervals are preset. Illustratively, table 2 lists at least one angle interval preset in the controller of the display device, where the preset angle interval parameter is 30 degrees.
TABLE 2
S904: determining an angle interval matched with the angle parameter corresponding to each face image, taking the average value of the distance parameters corresponding to all face images belonging to the same angle interval as the average value distance corresponding to the angle interval, and generating a mapping table according to all the angle intervals and the corresponding average value distances.
For example, if the angle parameters corresponding to persons No. 1, No. 2, and No. 3 are 10 degrees, 15 degrees, and 20 degrees respectively, the angle interval matching all three is the first angle interval; and if the distance parameters corresponding to persons No. 1, No. 2, and No. 3 are 1 meter, 2 meters, and 3 meters respectively, the average of these distance parameters is taken as the average value distance corresponding to the first angle interval.
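The binning step in the example above can be sketched as follows. The half-open interval bounds and zero-based indexing are assumptions; the text states only that each interval has a minimum and a maximum angle parameter:

```python
from collections import defaultdict

INTERVAL_DEG = 30  # the preset angle interval parameter from the text

def build_interval_table(faces):
    """Bin (angle_deg, distance_m) pairs into 30-degree intervals and average.

    Returns {interval_index: mean_distance}, where index 0 is the first
    interval covering [0, 30). Illustrative sketch only.
    """
    buckets = defaultdict(list)
    for angle, dist in faces:
        buckets[int(angle) // INTERVAL_DEG].append(dist)
    return {k: sum(v) / len(v) for k, v in buckets.items()}
```

Persons No. 1 to No. 3 at (10°, 1 m), (15°, 2 m), and (20°, 3 m) all fall into interval 0, whose mean distance comes out to 2 meters, matching the example.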
After obtaining all face images and the distance parameters and the position parameters corresponding to each face image, obtaining the average value distance corresponding to each angle interval according to the minimum angle parameter and the maximum angle parameter contained in each angle interval in the angle interval table, and generating a new mapping table according to all angle intervals and the corresponding average value distances, as shown in table 3:
TABLE 3
According to the voice processing method provided by this embodiment, at least one angle interval is obtained according to the preset angle interval parameter, the angle interval matching the angle parameter of each face image is determined, the average of the distance parameters of all face images belonging to the same angle interval is taken as that interval's average value distance, and the mapping table is generated from all the angle intervals and their corresponding average value distances. As a result, once the incoming wave direction of the voice to be recognized is obtained, the corresponding target distance can be determined directly from the mapping table, which shortens the data processing flow for determining the target distance and improves voice processing efficiency.
Fig. 10 is a third schematic flowchart of a voice processing method according to an embodiment of the present invention. In the embodiment of the present invention, based on the embodiment provided in fig. 9, the voice processing performed after the mapping table is generated in S904 from all the angle intervals and their corresponding average value distances is described in detail. As shown in fig. 10, the method includes:
S1001: and obtaining an image to be identified, and generating a mapping table according to the image to be identified, wherein the mapping table comprises at least one group of distance parameters and corresponding angle parameters.
S1001: and obtaining the voice to be processed, carrying out beamforming algorithm processing on the voice to be processed, and determining the incoming wave angle corresponding to the voice to be processed.
In the embodiment of the present invention, the methods and effects implemented in S1001 to S1002 are identical to the methods and effects implemented in S701 to S702 in the embodiment of fig. 7, and are not described herein.
S1003: if the target angle interval which is consistent with the target angle parameter is determined to exist in the mapping table, determining the average value distance corresponding to the target angle interval as the target distance corresponding to the target angle parameter, wherein the target angle parameter is larger than or equal to the minimum angle parameter corresponding to the target angle interval, and the target angle parameter is smaller than or equal to the maximum angle parameter corresponding to the target angle interval.
In the embodiment of the present invention, the incoming wave angle determined for the voice to be processed in S1002 may have a certain deviation. If table 1 is used as the mapping table when determining the target distance, the target angle parameter corresponding to the incoming wave angle may not be found in the table, that is, the target distance corresponding to the target angle parameter cannot be obtained from table 1. Therefore, table 3 may be used as the mapping table instead: when it is determined that a target angle interval matching the target angle parameter exists in the mapping table, the average value distance corresponding to that target angle interval is determined as the target distance corresponding to the target angle parameter. Specifically, the target angle parameter is greater than or equal to the minimum angle parameter of the target angle interval and less than or equal to the maximum angle parameter of the target angle interval.
For example, when the angle parameter corresponding to the person No. 4 is 75 degrees, it may be determined that the angle interval in table 3 matched with the angle parameter corresponding to the person No. 4 is the third angle interval, and then the average distance 3 m of the third angle interval is determined as the target distance corresponding to the person No. 4.
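The interval lookup for the 75-degree example can be sketched as below; the function name and zero-based indexing (index 2 corresponds to the "third angle interval" covering 60 to 90 degrees) are assumptions:

```python
def interval_distance(interval_table, angle_deg, interval_deg=30):
    """Map an incoming-wave angle to its angle interval's mean distance.

    Returns 0.0 (mute) when the angle falls in an interval with no
    enrolled participants. Illustrative sketch only.
    """
    return interval_table.get(int(angle_deg) // interval_deg, 0.0)
```

With the third interval's mean distance of 3 meters, a 75-degree incoming wave resolves to 3 meters.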
S1004: determining target reverberation parameters corresponding to the target distances, determining target gain parameters corresponding to the target distances, determining target voice corresponding to the voice to be processed according to the target reverberation parameters and the target gain parameters, and sending the target voice to the target terminal.
In the embodiment of the present invention, the method and effect implemented in S1004 are identical to the method and effect implemented in S703 in the embodiment of fig. 7, and are not described herein.
According to the voice processing method provided by this embodiment, the target distance corresponding to the incoming wave direction is determined from a mapping table comprising a plurality of angle intervals. This avoids the situation where an error in the incoming wave direction makes the target distance unobtainable, improves the method of determining the target distance from the target angle parameter, and improves the accuracy of voice processing.
Fig. 11 is a schematic structural diagram of a speech processing device according to an embodiment of the present invention. As shown in fig. 11, the voice processing apparatus includes: the obtaining module 1101, the processing module 1102 and the determining module 1103.
The obtaining module 1101 is configured to obtain an image to be identified, and process the image to be identified to obtain a mapping table, where the mapping table includes at least one set of distance parameters and corresponding angle parameters.
The processing module 1102 is configured to obtain a voice to be processed, perform beamforming algorithm processing on the voice to be processed, determine an incoming wave angle corresponding to the voice to be processed, and obtain a target distance corresponding to the target angle parameter if there is the target angle parameter corresponding to the incoming wave angle in the mapping table.
The determining module 1103 is configured to determine a target reverberation parameter corresponding to the target distance, determine a target gain parameter corresponding to the target distance, determine a target voice corresponding to the voice to be processed according to the target reverberation parameter and the target gain parameter, and send the target voice to a target terminal.
In one possible design, the determining module 1103 is specifically configured to determine that the target reverberation parameter is 1 if the target distance is less than a preset minimum distance; if the target distance is greater than or equal to the preset minimum distance, determining a target reverberation parameter corresponding to the target distance according to the following formula:
τ=1+0.01log(10*d)
Where τ is the target reverberation parameter and d is the target distance.
In one possible design, the determining module 1103 is specifically configured to determine the target gain parameter corresponding to the target distance according to the following formula:
where θ is the target gain parameter and d is the target distance.
In one possible design, the determining module 1103 is specifically configured to determine a target coefficient according to a product of the target reverberation parameter and the target gain parameter; and obtaining target voice according to the target coefficient and the voice to be processed.
In one possible design, the obtaining module 1101 is specifically configured to perform face recognition processing and face ranging processing on the image to be recognized, and obtain at least one face image, and a distance parameter and a position parameter corresponding to each face image; determining corresponding angle parameters of each face image according to the corresponding position parameters of each face image; and generating a mapping table according to the distance parameters and the angle parameters corresponding to all the face images.
In one possible design, the obtaining module 1101 is specifically configured to obtain at least one angle interval according to a preset angle interval parameter, where each angle interval includes a minimum angle parameter and a maximum angle parameter; and determining an angle interval matched with the angle parameter corresponding to each face image, taking the average value of the distance parameters corresponding to all face images belonging to the same angle interval as the average value distance corresponding to the angle interval, and generating a mapping table according to all the angle intervals and the corresponding average value distance.
In one possible design, the processing module 1102 is specifically configured to determine, if it is determined that a target angle interval matching the target angle parameter exists in the mapping table, the mean distance corresponding to the target angle interval as the target distance corresponding to the target angle parameter, where the target angle parameter is greater than or equal to the minimum angle parameter corresponding to the target angle interval, and the target angle parameter is less than or equal to the maximum angle parameter corresponding to the target angle interval.
The device provided in this embodiment may be used to implement the technical solution of the foregoing method embodiment, and its implementation principle and technical effects are similar, and this embodiment will not be described herein again.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that the technical schemes described in the foregoing embodiments can still be modified, or some or all of their technical features can be replaced by equivalents, and such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.
The foregoing description, for purposes of explanation, has been presented in conjunction with specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and the practical application, to thereby enable others skilled in the art to best utilize the embodiments and various embodiments with various modifications as are suited to the particular use contemplated.

Claims (10)

1. A display device, characterized by comprising:
the image acquisition device is used for acquiring an image to be identified of the target scene;
the voice acquisition device is used for acquiring the voice to be processed of the target person;
a controller configured to:
obtaining an image to be identified, and processing the image to be identified to obtain a mapping table, wherein the mapping table comprises at least one group of distance parameters and corresponding angle parameters;
obtaining voice to be processed, carrying out beam forming algorithm processing on the voice to be processed, determining an incoming wave angle corresponding to the voice to be processed, and if a target angle parameter corresponding to the incoming wave angle exists in the mapping table, obtaining a target distance corresponding to the target angle parameter;
Determining a target reverberation parameter corresponding to the target distance, determining a target gain parameter corresponding to the target distance, determining a target voice corresponding to the voice to be processed according to the target reverberation parameter and the target gain parameter, and transmitting the target voice to a target terminal.
2. The display device of claim 1, wherein the controller is configured, when performing the determining the target reverberation parameter corresponding to the target distance, to:
if the target distance is smaller than a preset minimum distance, determining that the target reverberation parameter is 1;
if the target distance is greater than or equal to the preset minimum distance, determining a target reverberation parameter corresponding to the target distance according to the following formula:
τ=1+0.01log(10*d)
where τ is the target reverberation parameter and d is the target distance.
3. The display device according to claim 1, wherein the controller is configured, when performing the determining the target gain parameter corresponding to the target distance, to:
determining a target gain parameter corresponding to the target distance according to the following formula:
where θ is the target gain parameter and d is the target distance.
4. The display device according to claim 1, wherein the controller is configured, when executing the determining the target speech corresponding to the speech to be processed according to the target reverberation parameter and the target gain parameter, to:
determining a target coefficient according to the product of the target reverberation parameter and the target gain parameter;
and obtaining target voice according to the target coefficient and the voice to be processed.
5. The display device according to claim 1, wherein the controller is configured, when executing the processing of the image to be identified to obtain a mapping table, to:
performing face recognition processing and face ranging processing on the image to be recognized to obtain at least one face image and distance parameters and position parameters corresponding to each face image;
determining corresponding angle parameters of each face image according to the corresponding position parameters of each face image;
and generating a mapping table according to the distance parameters and the angle parameters corresponding to all the face images.
6. The display device of claim 5, wherein the controller is configured, after performing the determining the corresponding angle parameter for each face image based on the corresponding position parameter for each face image, to further:
Obtaining at least one angle interval according to preset angle interval parameters, wherein each angle interval comprises a minimum angle parameter and a maximum angle parameter;
and determining an angle interval matched with the angle parameter corresponding to each face image, taking the average value of the distance parameters corresponding to all face images belonging to the same angle interval as the average value distance corresponding to the angle interval, and generating a mapping table according to all the angle intervals and the corresponding average value distance.
7. The display device of claim 6, wherein the controller, when executing the step of obtaining a target distance corresponding to the target angle parameter if a target angle parameter corresponding to the incoming wave angle exists in the mapping table, is further configured to:
if the target angle interval which accords with the target angle parameter is determined to exist in the mapping table, determining the average distance corresponding to the target angle interval as the target distance corresponding to the target angle parameter, wherein the target angle parameter is larger than or equal to the minimum angle parameter corresponding to the target angle interval, and the target angle parameter is smaller than or equal to the maximum angle parameter corresponding to the target angle interval.
8. A display device as claimed in any one of claims 1 to 7, characterized in that the speech acquisition means is a microphone array comprising at least two microphones.
9. A voice processing method, characterized by being applied to a controller of a display device, comprising:
obtaining an image to be identified, and processing the image to be identified to obtain a mapping table, wherein the mapping table comprises at least one group of distance parameters and corresponding angle parameters;
obtaining voice to be processed, carrying out beam forming algorithm processing on the voice to be processed, determining an incoming wave angle corresponding to the voice to be processed, and if a target angle parameter corresponding to the incoming wave angle exists in the mapping table, obtaining a target distance corresponding to the target angle parameter;
determining a target reverberation parameter corresponding to the target distance, determining a target gain parameter corresponding to the target distance, determining a target voice corresponding to the voice to be processed according to the target reverberation parameter and the target gain parameter, and transmitting the target voice to a target terminal.
10. A speech processing apparatus, comprising:
the acquisition module is used for acquiring an image to be identified, and processing the image to be identified to acquire a mapping table, wherein the mapping table comprises at least one group of distance parameters and corresponding angle parameters;
The processing module is used for obtaining the voice to be processed, carrying out beam forming algorithm processing on the voice to be processed, determining an incoming wave angle corresponding to the voice to be processed, and obtaining a target distance corresponding to a target angle parameter if the target angle parameter corresponding to the incoming wave angle exists in the mapping table;
the determining module is used for determining a target reverberation parameter corresponding to the target distance, determining a target gain parameter corresponding to the target distance, determining a target voice corresponding to the voice to be processed according to the target reverberation parameter and the target gain parameter, and sending the target voice to a target terminal.
CN202210753751.5A 2022-06-29 2022-06-29 Display device, voice processing method and device Pending CN117351976A (en)


Publications (1)

Publication Number Publication Date
CN117351976A true CN117351976A (en) 2024-01-05


Similar Documents

Publication Publication Date Title
CN111641794B (en) Sound signal acquisition method and electronic equipment
CN110333837B (en) Conference system, communication method and device
CN113676592B (en) Recording method, recording device, electronic equipment and computer readable medium
CN112069863B (en) Face feature validity determination method and electronic equipment
CN111091845A (en) Audio processing method and device, terminal equipment and computer storage medium
CN116097120A (en) Display method and display device
JP6108925B2 (en) Imaging device, focus adjustment system, focus instruction device, focus adjustment method, and program
CN109032554A (en) A kind of audio-frequency processing method and electronic equipment
CN112839165B (en) Method and device for realizing face tracking camera shooting, computer equipment and storage medium
US11227423B2 (en) Image and sound pickup device, sound pickup control system, method of controlling image and sound pickup device, and method of controlling sound pickup control system
US20110043598A1 (en) Remote communication apparatus and method of estimating a distance between an imaging device and a user image-captured
CN117351976A (en) Display device, voice processing method and device
US11665391B2 (en) Signal processing device and signal processing system
CN111093028A (en) Information processing method and electronic equipment
JP5151131B2 (en) Video conferencing equipment
EP4075794A1 (en) Region of interest based adjustment of camera parameters in a teleconferencing environment
US11875800B2 (en) Talker prediction method, talker prediction device, and communication system
US11232796B2 (en) Voice activity detection using audio and visual analysis
CN113824916A (en) Image display method, device, equipment and storage medium
JP2021197658A (en) Sound collecting device, sound collecting system, and sound collecting method
CN110730378A (en) Information processing method and system
JP5529617B2 (en) Remote conference apparatus, remote conference method, and remote conference program
JP2004112511A (en) Display controller and method therefor
US11937057B2 (en) Face detection guided sound source localization pan angle post processing for smart camera talker tracking and framing
CN111432155B (en) Video call method, electronic device and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination