CN112261236B - Method and equipment for mute processing in multi-person voice - Google Patents


Info

Publication number
CN112261236B
CN112261236B
Authority
CN
China
Prior art keywords
information
user
current face
face
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011051895.3A
Other languages
Chinese (zh)
Other versions
CN112261236A
Inventor
程翰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Lianshang Network Technology Co Ltd
Original Assignee
Shanghai Lianshang Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Lianshang Network Technology Co Ltd
Priority to CN202011051895.3A
Publication of CN112261236A
Application granted
Publication of CN112261236B
Legal status: Active
Anticipated expiration

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M3/00Automatic or semi-automatic exchanges
    • H04M3/42Systems providing special services or facilities to subscribers
    • H04M3/56Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
    • H04M3/561Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities by multiplexing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M3/00Automatic or semi-automatic exchanges
    • H04M3/42Systems providing special services or facilities to subscribers
    • H04M3/56Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
    • H04M3/568Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/141Systems for two-way working between two video terminals, e.g. videophone

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Telephone Function (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application aims to provide a method and equipment for mute processing in multi-person voice. The method comprises: acquiring current face position information of a first user during a multi-person voice call; determining, according to the current face position information, whether subsequent input voice information of the user equipment needs to be muted; if so, muting the subsequent input voice information; otherwise, leaving it unmuted. Compared with manual muting, the user does not need to perform any operation, which provides great convenience, enhances the user experience, improves the call quality of multi-person voice, and reduces interference during multi-person voice calls.

Description

Method and equipment for mute processing in multi-person voice
Technical Field
The present application relates to the field of communications, and more particularly, to a technique for mute processing in multi-person speech.
Background
With the development of communication technology, multi-person voice calls have been widely adopted in fields such as remote conferencing, in-game team coordination, online singing, and live streaming. At present, when a user needs to mute during a multi-person voice call, the user must manually tap a preset button to do so.
Disclosure of Invention
An object of the present application is to provide a method and apparatus for silence processing in multi-person speech.
According to an aspect of the present application, there is provided a method for silence processing in a multi-person voice, the method including:
acquiring current face position information of a first user in a multi-user voice communication process;
determining whether mute processing needs to be carried out on subsequent input voice information of the user equipment or not according to the current face position information; if so, performing mute processing on the subsequent input voice information; otherwise, the subsequent input voice information is not subjected to mute processing.
According to an aspect of the present application, there is provided a user equipment for muting processing in a multi-person voice, the apparatus including:
the one-to-one module 11 is used for acquiring the current face position information of the first user in the conversation process of the multi-person voice;
a second module 12, configured to determine whether muting processing needs to be performed on subsequent input voice information of the user equipment according to the current face position information; if so, performing mute processing on the subsequent input voice information; otherwise, the subsequent input voice information is not subjected to mute processing.
According to an aspect of the present application, there is provided an apparatus for mute processing in a multi-person voice, wherein the apparatus comprises:
a processor; and
a memory arranged to store computer executable instructions that, when executed, cause the processor to:
acquiring current face position information of a first user in a multi-user voice communication process;
determining whether mute processing needs to be carried out on subsequent input voice information of the user equipment or not according to the current face position information; if so, performing mute processing on the subsequent input voice information; otherwise, the subsequent input voice information is not subjected to mute processing.
According to one aspect of the application, there is provided a computer-readable medium storing instructions that, when executed, cause a system to:
acquiring current face position information of a first user in a multi-user voice communication process;
determining whether mute processing needs to be carried out on subsequent input voice information of the user equipment or not according to the current face position information; if so, performing mute processing on the subsequent input voice information; otherwise, the subsequent input voice information is not subjected to mute processing.
Compared with the prior art, the present application obtains the current face position information of the user and can determine from it whether subsequent input voice information of the user equipment needs to be muted, so that the input voice information of the user equipment is muted automatically and in real time and interference from irrelevant speech is avoided. Compared with manual muting, the user does not need to perform any operation manually, which provides great convenience, enhances the user experience, improves the call quality of multi-person voice, and reduces interference during multi-person voice calls.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 illustrates a flow diagram of a method for silence handling in multi-person speech according to one embodiment of the present application;
FIG. 2 illustrates a user equipment structure diagram for muting in multi-person speech according to one embodiment of the present application;
FIG. 3 illustrates an exemplary system that can be used to implement the various embodiments described in this application.
The same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
The present application is described in further detail below with reference to the attached figures.
In a typical configuration of the present application, the terminal, the device serving the network, and the trusted party each include one or more processors (e.g., Central Processing Units (CPUs)), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read-Only Memory (ROM) or flash memory. Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PCM), programmable random access memory (PRAM), static random-access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device.
The device referred to in this application includes, but is not limited to, a user device, a network device, or a device formed by integrating a user device and a network device through a network. The user equipment includes, but is not limited to, any mobile electronic product, such as a smart phone, a tablet computer, etc., capable of performing human-computer interaction with a user (e.g., human-computer interaction through a touch panel), and the mobile electronic product may employ any operating system, such as an Android operating system, an iOS operating system, etc. The network Device includes an electronic Device capable of automatically performing numerical calculation and information processing according to a preset or stored instruction, and the hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded Device, and the like. The network device includes but is not limited to a computer, a network host, a single network server, a plurality of network server sets or a cloud of a plurality of servers; here, the Cloud is composed of a large number of computers or web servers based on Cloud Computing (Cloud Computing), which is a kind of distributed Computing, one virtual supercomputer consisting of a collection of loosely coupled computers. Including, but not limited to, the internet, a wide area network, a metropolitan area network, a local area network, a VPN network, a wireless Ad Hoc network (Ad Hoc network), etc. Preferably, the device may also be a program running on the user device, the network device, or a device formed by integrating the user device and the network device, the touch terminal, or the network device and the touch terminal through a network.
Of course, those skilled in the art will appreciate that the foregoing is by way of example only, and that other existing or future devices, which may be suitable for use in the present application, are also encompassed within the scope of the present application and are hereby incorporated by reference.
In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
Fig. 1 shows a flowchart of a method for mute processing in multi-person speech according to an embodiment of the present application, the method including step S11 and step S12. In step S11, the user equipment acquires current face position information of the first user during a call of the multi-person voice; in step S12, the user equipment determines whether muting processing needs to be performed on subsequent input voice information of the user equipment according to the current face position information; if so, performing mute processing on the subsequent input voice information; otherwise, the subsequent input voice information is not subjected to mute processing.
In step S11, the user equipment acquires the current face position information of the first user during the call of the multi-person voice. In some embodiments, the multi-person voice may be a multi-person voice call, a multi-person voice online conference, or the like, or may be a multi-person video call, a multi-person video online conference, or the like. In some embodiments, the current face position information of the first user may be current orientation information of the first user's face relative to a screen or microphone on the user device, e.g., the current face is facing forward on the screen, the current face is 30 degrees off the screen clockwise, and the current face is 45 degrees off the screen counterclockwise. In some embodiments, the current face position information of the first user may also be current distance information and/or current direction information of the first user's face relative to a screen or microphone on the user device, e.g., a position of the current face 10 centimeters directly in front of the microphone, a position of the current face 30 centimeters in a direction 50 degrees in front of the right of the microphone, a position of the current face 20 centimeters in a direction 40 degrees in front of the left of the microphone. In some embodiments, the current face position information of the first user may be obtained in real time, for example, the current face position information of the first user may be obtained from video stream information containing the face of the first user, which is collected in real time by a camera in the user device. In some embodiments, the current face position information of the first user may also be acquired at predetermined time intervals, for example, the current face position information of the first user may be acquired from current photo information containing the face of the first user, which is taken by a camera in the user device at predetermined time intervals (for example, 1 second). In some embodiments, the current face position information of the first user may also be obtained whenever a predetermined trigger condition is satisfied, for example, the current face position information of the first user may be determined by performing sound source localization on input voice information of the first user whenever the input voice information is collected by a microphone.
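As a non-limiting illustration of the interval-based acquisition mode just described, the following Python sketch polls a camera at a fixed period; capture_frame, estimate_face_position, and on_position are hypothetical callables, since the embodiment does not prescribe a specific camera API or pose estimator.

```python
# Minimal sketch of the interval-based acquisition mode described above.
# capture_frame() and estimate_face_position() are hypothetical placeholders:
# the embodiment does not name a particular camera API or pose estimator.
import time

def acquire_face_position_loop(capture_frame, estimate_face_position,
                               on_position, interval_s=1.0,
                               stop_flag=lambda: False):
    """Poll the camera every `interval_s` seconds and report the current
    face position (orientation / direction / distance) to `on_position`."""
    while not stop_flag():
        frame = capture_frame()                   # e.g. one photo from the front camera
        position = estimate_face_position(frame)  # None if no face is visible
        on_position(position)
        time.sleep(interval_s)
```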
In step S12, the user equipment determines, according to the current face position information, whether subsequent input voice information of the user equipment needs to be muted; if so, the subsequent input voice information is muted; otherwise, it is not. In some embodiments, whether muting is required may be determined according to whether the current face orientation information exceeds a predetermined deviation angle threshold (for example, with a deviation angle threshold of 30 degrees and the face facing the screen or microphone defined as 0 degrees, if the current face is deviated 45 degrees counterclockwise from the screen, which is greater than 30 degrees, it is determined that muting is required). In some embodiments, whether muting is required may also be determined according to whether the current face distance information is greater than or equal to a predetermined distance threshold (for example, with a distance threshold of 15 centimeters and the screen or microphone at 0 centimeters, if the current face is 20 centimeters from the screen, which is greater than 15 centimeters, it is determined that muting is required). In some embodiments, whether muting is required may be determined according to whether the current face direction information exceeds a predetermined direction angle threshold (for example, with a direction angle threshold of 45 degrees and the direction directly in front of the screen or microphone defined as 0 degrees, if the current face lies in a direction 50 degrees to the front right of the screen, which is greater than 45 degrees, it is determined that muting is required). In some embodiments, the determination may be made by combining two or all three of these criteria; for example, if the current face is deviated 45 degrees counterclockwise from the screen and is located 20 centimeters away in a direction 50 degrees to the front right of the screen, it may be determined that muting is required because 45 degrees exceeds the deviation angle threshold (e.g., 30 degrees), 20 centimeters exceeds the distance threshold (e.g., 15 centimeters), and 50 degrees exceeds the direction angle threshold (e.g., 45 degrees). In some embodiments, if the multi-person voice is a multi-person video call, a multi-person video online conference, or the like, the subsequent input voice information may be the voice information in the subsequently input video stream.
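The threshold checks described above can be summarized in the following sketch; the deviation, direction, and distance thresholds are the example values from the text, and combining the criteria with `any` is only one of the policies the embodiment allows (a single criterion, or two or all three, may equally be used).

```python
# Hedged sketch of the threshold checks described above. The thresholds
# (30 degrees deviation, 45 degrees direction, 15 cm distance) are the
# example values from the text, not values mandated by the embodiment.
from dataclasses import dataclass
from typing import Optional

@dataclass
class FacePosition:
    deviation_deg: Optional[float] = None   # face orientation away from screen/mic
    direction_deg: Optional[float] = None   # bearing of the face relative to screen/mic
    distance_cm: Optional[float] = None     # distance of the face from screen/mic

def should_mute(p: FacePosition,
                deviation_threshold=30.0,
                direction_threshold=45.0,
                distance_threshold=15.0) -> bool:
    """Return True if any available measurement exceeds its threshold."""
    checks = []
    if p.deviation_deg is not None:
        checks.append(p.deviation_deg > deviation_threshold)
    if p.direction_deg is not None:
        checks.append(p.direction_deg > direction_threshold)
    if p.distance_cm is not None:
        checks.append(p.distance_cm > distance_threshold)
    return any(checks)
```

For instance, under the example thresholds, should_mute(FacePosition(deviation_deg=45.0, distance_cm=20.0)) returns True, matching the combined example in the text.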
In some embodiments, if the multi-person voice is a multi-person voice call, a multi-person voice online conference, or the like, muting the subsequent input voice information includes, but is not limited to: stopping sending the subsequent input voice information to other users or servers in the multi-person voice, reducing the volume of the subsequent input voice information, or stopping microphone capture of the subsequent input voice information. If the multi-person voice is a multi-person video call, a multi-person video online conference, or the like, muting the subsequent input voice information includes, but is not limited to: sending only subsequent input video stream information that contains no audio information to other users or servers in the multi-person voice, reducing the audio volume in the subsequent input video stream information, or capturing only subsequent input video stream information that contains no audio information, so that the other users in the multi-person voice cannot hear the first user's voice, or only see a video stream of the first user that contains no audio. By acquiring the current face position information of the user, the application can determine from it whether subsequent input voice information of the user equipment needs to be muted, so that the input voice information is muted automatically and in real time and interference from irrelevant speech is avoided. Compared with manual muting, the user does not need to perform any operation manually, which provides great convenience, enhances the user experience, improves the call quality of multi-person voice, and reduces interference during multi-person voice calls.
In some embodiments, the current face position information comprises current face orientation information and/or current face direction information; wherein the step S11 includes a step S13 (not shown). In step S13, during a call of a voice of multiple persons, the user equipment determines current face orientation information and/or current face direction information of the first user according to the first image information collected by the camera device in the user equipment. In some embodiments, the current face orientation information is current orientation information of the first user's face relative to a screen on the user device, e.g., the current face is facing forward on the screen, the current face is 30 degrees off the screen clockwise. In some embodiments, the current face direction information is current direction information of the first user's face relative to a screen or microphone on the user device, e.g., a direction with the current face directly in front of the microphone and the current face 50 degrees in front of the microphone. In some embodiments, the first image information may be video stream information containing the first user's face captured by a camera on the user device in real time, or the first image information may also be current photo information containing the first user's face taken by the camera on the user device at predetermined time intervals (e.g., 1 second). In some embodiments, during a call of a multi-person voice, current face orientation information and/or current face direction information of a first user may be acquired from first image information containing a first user face collected by a camera on a user device. In some embodiments, if the first image information does not include the face of the first user, which indicates that the face of the first user is not in the vicinity of the user device, at this time, it may be determined that the current face position information of the first user is "the current face is far from the screen", and it is directly determined that the subsequent input voice information of the user device needs to be muted.
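A minimal sketch of deriving current face direction information from the first image information follows, assuming a face detector that returns a pixel bounding box and a pinhole-camera approximation with an assumed horizontal field of view; neither the detector nor the field of view is specified by the embodiment, and a frame with no detected face is treated as "face far from the screen", as in the text.

```python
# Minimal sketch, assuming a face detector that returns a bounding box in
# pixel coordinates (the embodiment does not name a specific detector).
# The horizontal bearing of the face relative to the camera axis is derived
# from the box centre under a pinhole-camera approximation with an assumed
# horizontal field of view.
import math
from typing import Optional, Tuple

def face_direction_from_frame(face_box: Optional[Tuple[int, int, int, int]],
                              frame_width: int,
                              horizontal_fov_deg: float = 70.0) -> Optional[float]:
    """face_box = (x, y, w, h) in pixels, or None if no face was detected.
    Returns the approximate bearing of the face in degrees (0 = directly in
    front of the camera), or None when no face is visible, in which case the
    caller may decide to mute directly."""
    if face_box is None:
        return None
    x, _, w, _ = face_box
    face_center_x = x + w / 2.0
    # Map the pixel offset from the image centre to an angle within the FOV.
    offset_ratio = (face_center_x - frame_width / 2.0) / (frame_width / 2.0)
    half_fov_rad = math.radians(horizontal_fov_deg / 2.0)
    return math.degrees(math.atan(offset_ratio * math.tan(half_fov_rad)))
```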
In some embodiments, the current face position information further includes current face distance information, the camera device being a depth camera device; wherein the step S13 includes: and in the communication process of the multi-person voice, the user equipment determines the current face orientation information and/or the current face direction information and/or the current face distance information of the first user according to the first image information collected by the depth camera device. In some embodiments, the current face position information is current distance information of the first user's face relative to a screen on the user device, e.g., the current face is 10 centimeters from the screen. In some embodiments, the camera on the user equipment is a depth camera, and in the call process of the multi-user voice, the current face distance information of the first user can be additionally acquired from the first image information which is collected by the depth camera on the user equipment and contains the face of the first user, compared with a common camera.
In some embodiments, wherein the step S13 includes step S14 (not shown). In step S14, during a call of a voice of multiple persons, a user equipment identifies a human face object from first image information acquired by a camera in the user equipment; and determining current face orientation information and/or current face direction information of the first user according to the recognition result. In some embodiments, during a call of a multi-user voice, a face object needs to be recognized in first image information which is collected by a camera on user equipment and contains a first user face. In some embodiments, the current face orientation information of the first user is determined according to the face feature information of the identified face object on the first image, for example, different face feature information corresponds to the face object with different face orientation information. In some embodiments, the current face direction information of the first user is determined according to the position information of the identified face object in the first image, for example, the position information of the face object in different directions relative to the camera in the first image information is different.
In some embodiments, the face object is a face object of the first user; wherein the step S14 includes: and the user equipment identifies the face object of the first user from the first image information acquired by the camera device according to the pre-acquired face feature information of the first user in the call process of the multi-user voice. In some embodiments, if the first image information only includes the face object of the first user, according to the pre-acquired face feature information of the first user, the recognition speed of recognizing the face object of the first user from the first image information including the face of the first user, which is acquired by a camera on the user equipment, may be increased, and the recognition efficiency may be increased. In some embodiments, if the first image information includes a plurality of face objects, the face object of the first user needs to be accurately identified from the first image information including the plurality of face objects, which is acquired by a camera on the user equipment, according to the pre-acquired face feature information of the first user.
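One way to realize the recognition of the first user's face object among several detected faces is to compare face feature vectors against the pre-acquired feature information, as in the following sketch; the embedding model and the 0.6 similarity threshold are assumptions for illustration only.

```python
# Sketch of picking out the first user's face among several detected faces by
# comparing face feature vectors against a pre-acquired reference vector.
# The embedding model is assumed; any fixed-length face descriptor fits.
import math
from typing import List, Optional, Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def match_first_user(face_embeddings: List[Sequence[float]],
                     reference_embedding: Sequence[float],
                     threshold: float = 0.6) -> Optional[int]:
    """Return the index of the detected face that best matches the first
    user's pre-acquired feature vector, or None if no face is similar enough.
    The 0.6 threshold is an illustrative assumption."""
    best_idx, best_sim = None, threshold
    for i, emb in enumerate(face_embeddings):
        sim = cosine_similarity(emb, reference_embedding)
        if sim >= best_sim:
            best_idx, best_sim = i, sim
    return best_idx
```

When the first image contains only one face, the same comparison simply confirms that the detected face is indeed the first user before any position estimate is used.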
In some embodiments, the current face position information comprises current face distance information and/or face direction information; wherein the step S11 includes a step S15 (not shown). In step S15, during a call of a multi-user voice, the user equipment determines current face distance information and/or current face direction information of the first user by performing sound source localization on input voice information corresponding to the user equipment. In some embodiments, during a call of a multi-user voice, by performing sound source localization on input voice information corresponding to a user device or audio information in video stream information corresponding to the user device, position information of a first user face relative to a microphone may be obtained, where the position information includes distance information of the first user face relative to the microphone, that is, current face distance information, and direction information of the first user face relative to the microphone, that is, current face direction information.
In some embodiments, the step S15 includes a step S16 (not shown) and a step S17 (not shown). In step S16, during a call of a multi-user voice, the user equipment identifies voice information included in input voice information corresponding to the user equipment; in step S17, the user equipment determines the current face distance information and/or current face direction information of the first user by performing sound source localization on the recognized human voice information. In some embodiments, it is necessary to first identify voice information contained in input voice information corresponding to the user equipment or audio information in video stream information corresponding to the user equipment, and then perform sound source localization on the voice information to obtain current face distance information of the first user face relative to the microphone and current face direction information of the first user face relative to the microphone.
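The sound source localization itself is not prescribed by the embodiment; the following sketch assumes a two-microphone array and estimates the direction of the first user's face from the time difference of arrival obtained by cross-correlation. The sample rate, microphone spacing, and lag sign convention are assumptions and would need calibration on a real device.

```python
# Illustrative sketch of sound source localization with an assumed
# two-microphone array, using the time difference of arrival (TDOA)
# estimated by cross-correlation.
import numpy as np

SPEED_OF_SOUND = 343.0  # metres per second

def estimate_direction(mic_left: np.ndarray, mic_right: np.ndarray,
                       sample_rate: int = 16000,
                       mic_spacing_m: float = 0.10) -> float:
    """Return the estimated azimuth of the speaker in degrees
    (0 = straight ahead). The sign of the lag depends on the microphone
    layout and correlation convention and should be calibrated."""
    corr = np.correlate(mic_left, mic_right, mode="full")
    lag = np.argmax(corr) - (len(mic_right) - 1)   # delay in samples
    tdoa = lag / sample_rate                       # delay in seconds
    # Far-field approximation: sin(theta) = c * tdoa / d, clipped to [-1, 1].
    sin_theta = np.clip(SPEED_OF_SOUND * tdoa / mic_spacing_m, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))
```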
In some embodiments, the vocal information is voice information of the first user; wherein the step S16 includes: in the process of a multi-user voice call, user equipment identifies and obtains voice information of a first user from input voice information corresponding to the user equipment according to pre-acquired voiceprint characteristic information of the first user; wherein the step S17 includes: and the user equipment determines the current face distance information and/or the current face direction information of the first user by carrying out sound source positioning on the sound information of the first user. In some embodiments, if the voice information only includes the voice information of the first user, according to the pre-acquired voiceprint feature information of the first user, the recognition speed of recognizing the voice information of the first user from the input voice information corresponding to the user equipment or the audio information in the video stream information corresponding to the user equipment may be increased, and the recognition efficiency may be increased. In some embodiments, if the voice information includes voice information of multiple users, the voice information of the first user needs to be accurately identified from the input voice information corresponding to the user equipment or the audio information in the video stream information corresponding to the user equipment according to the pre-acquired voiceprint feature information of the first user.
In some embodiments, the determining whether muting processing of subsequent input speech information of the user equipment is required according to the current face position information includes: determining the holding duration information of the current face position information according to the current face position information and by combining historical face position information; and if the holding duration information is greater than or equal to a preset duration threshold, determining whether to mute the subsequent input voice information of the user equipment or not according to the current face position information. In some embodiments, based on the current face position information and each previous historical face position information and its corresponding acquisition time, the holding duration information corresponding to the current face position information may be obtained, for example, the acquisition time corresponding to the historical face position information L1 is 1 minute before, the acquisition time corresponding to the historical face position information L2 is 2 minutes before, and the acquisition time corresponding to the historical face position information L3 is 3 minutes before, where the historical face position information L1 and the historical face position information L2 are the same as the current face position information, so that the holding duration information corresponding to the current face position information may be determined to be "2 minutes". In some embodiments, if the holding duration information (e.g., 2 minutes) corresponding to the current face position information is greater than or equal to a predetermined duration threshold (e.g., 30 seconds), it is determined whether muting processing is required for subsequent input voice information of the user equipment according to the current face position information, so that whether a temporary face position change or a long-time face position change is identified, for example, whether a temporary turn or a long-time turn is identified.
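A sketch of the holding-duration check follows, assuming the historical face position information is kept as timestamped entries; the rule for deciding that two positions count as "the same" is an assumption, since the embodiment does not define it.

```python
# Sketch of the "holding duration" check described above: the current face
# position only triggers the mute decision once it has persisted longer than
# a threshold, filtering out brief glances away (e.g. a momentary head turn).
import time

def holding_duration_s(history, current_position, same=lambda a, b: a == b):
    """`history` is a list of (timestamp, position) tuples of previously
    acquired positions, newest last. Returns how long the current position
    has been held, in seconds."""
    now = time.time()
    start = now
    for ts, pos in reversed(history):
        if same(pos, current_position):
            start = ts
        else:
            break
    return now - start

def should_evaluate_mute(history, current_position, min_hold_s=30.0):
    """Only evaluate the mute decision for position changes that persist."""
    return holding_duration_s(history, current_position) >= min_hold_s
```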
In some embodiments, wherein the current face position information comprises current face orientation information and/or current face direction information and/or the current face distance information; wherein the determining whether to mute the subsequent input voice information of the user equipment according to the current face position information includes at least one of: determining whether mute processing needs to be carried out on subsequent input voice information of the user equipment according to whether the current face orientation information meets a preset deviation angle threshold value; determining whether mute processing needs to be carried out on subsequent input voice information of the user equipment according to whether the current face direction information meets a preset direction angle threshold value; and determining whether the subsequent input voice information of the user equipment needs to be subjected to mute processing or not according to whether the current face distance information is larger than or equal to a preset distance threshold or not. In some embodiments, whether muting of subsequent input speech information of the user device is required may be determined based on whether the current face orientation information satisfies a predetermined deviation angle threshold, for example, the deviation angle threshold is 30 degrees, the face is facing forward at 0 degrees to the screen or the microphone, and if the current face orientation information includes that the current face is 45 degrees away from the screen counterclockwise and is greater than the deviation angle threshold by 30 degrees, muting of subsequent input speech information is determined to be required. In some embodiments, whether muting of subsequent input voice information of the user device is required may be determined based on whether the current face distance information is greater than or equal to a predetermined distance threshold, for example, the distance threshold is 15 centimeters and the position of the screen or microphone is 0 centimeters, and if the current face is 20 centimeters away from the screen and greater than 15 centimeters, muting of subsequent input voice information is determined to be required. In some embodiments, whether or not muting of the subsequent input voice information of the user equipment is required may be determined according to whether or not the current face direction information satisfies a predetermined direction angle threshold, for example, the direction angle threshold is 45 degrees, the direction right in front of the screen or the microphone is 0 degrees, and if the current face is in the direction of 50 degrees right in front of the screen and greater than 45 degrees, it is determined that muting of the subsequent input voice information is required. In some embodiments, whether muting processing is required for the subsequent input voice information of the user equipment may be determined according to one of the three, or may be determined comprehensively according to two of the three, or may be determined comprehensively according to the three.
In some embodiments, the determining whether or not mute processing of subsequent input voice information of the user equipment is required according to the current face position information includes steps S18 (not shown) and S19 (not shown). In step S18, the user equipment determines face movement related information of the first user according to the current face position information in combination with historical face position information of the first user; in step S19, the user equipment determines whether subsequent input voice information of the user equipment needs to be muted according to the face movement related information. In some embodiments, from the current face position information and each previous historical face position information together with its acquisition time, face movement related information of the first user can be obtained, including but not limited to face movement direction information, face movement offset information, and face movement speed information, and it can then be determined from this information whether muting is required. In some embodiments, the face movement direction information indicates the movement direction of the first user's face, where the face movement may be caused by a body movement of the first user or may be a rotation of the face; for example, the face moves from directly in front of the screen away from the screen, or the face rotates clockwise toward the screen. Whether the face movement direction of the first user is a direction away from the screen or the microphone is determined according to the face movement direction information; if so, it is determined that the subsequent input voice information of the user equipment needs to be muted; otherwise, it is determined that muting is not needed. In some embodiments, the face movement offset information indicates the face offset distance or face offset angle of the current face position compared with the most recently acquired historical face position, for example, the current face position is offset by 10 centimeters, or by 10 degrees, from the most recently acquired historical face position. Whether the face offset distance or face offset angle is greater than or equal to predetermined offset threshold information (e.g., an offset distance threshold of 5 centimeters, or an offset angle threshold of 5 degrees) is then determined; if so, it is determined that the subsequent input voice information of the user equipment needs to be muted; otherwise, the movement is most likely a facial movement caused by the first user's natural body motion (e.g., ordinary swaying, stretching, lowering the head, and the like), and it is determined that muting is not needed.
In some embodiments, the face movement speed information is indicative of a current movement speed of the first user's face, determining whether the first user's face movement speed is greater than or equal to a predetermined speed threshold based on the first user's face movement speed information; if yes, determining that mute processing needs to be carried out on subsequent input voice information of the user equipment; otherwise, the maximum probability is a facial movement caused by the natural body motion of the first user, thereby determining that no muting of subsequent input speech information of the user device is required. In some embodiments, whether muting processing is required for the subsequent input voice information of the user equipment may be determined according to one of the face movement direction information, the face movement offset information, and the face movement speed information, or whether muting processing is required for the subsequent input voice information of the user equipment may be determined comprehensively according to two of the above three information, or whether muting processing is required for the subsequent input voice information of the user equipment may be determined comprehensively according to the above three information.
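The face movement related information named above can be derived from two consecutive face positions, for example as follows; representing positions as 2-D coordinates in centimetres relative to the device is an assumption made only for illustration.

```python
# Sketch of deriving the face movement related information named above
# (movement direction, offset, speed) from two consecutive face positions.
import math

def face_movement_info(prev_pos, prev_ts, curr_pos, curr_ts):
    """prev_pos / curr_pos: (x_cm, y_cm) of the face relative to the device;
    prev_ts / curr_ts: acquisition times in seconds.
    Returns (direction_deg, offset_cm, speed_cm_per_s)."""
    dx = curr_pos[0] - prev_pos[0]
    dy = curr_pos[1] - prev_pos[1]
    offset_cm = math.hypot(dx, dy)
    direction_deg = math.degrees(math.atan2(dy, dx))
    dt = max(curr_ts - prev_ts, 1e-6)   # guard against division by zero
    speed = offset_cm / dt
    return direction_deg, offset_cm, speed
```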
In some embodiments, the face movement related information comprises face movement direction information; wherein the step S19 includes a step S20 (not shown). In step S20, the user equipment determines, according to the face movement direction information, whether the face movement direction of the first user is a direction away from the user equipment; if so, it is determined that subsequent input voice information of the user equipment needs to be muted; otherwise, it is determined that muting is not needed. In some embodiments, the face movement direction information indicates the movement direction of the first user's face, where the face movement may be caused by a body movement of the first user or may be a rotation of the face; for example, the face moves from directly in front of the screen away from the screen, or the face rotates clockwise toward the screen. Whether the face movement direction of the first user is a direction away from the screen or the microphone is determined according to the face movement direction information; if so, it is determined that the subsequent input voice information of the user equipment needs to be muted; otherwise, it is determined that muting is not needed.
In some embodiments, the face movement related information further comprises face movement offset information; wherein the step S20 includes: the user equipment determines, according to the face movement direction information, whether the face movement direction of the first user is a direction away from the user equipment, and determines, according to the face movement offset information, whether the offset is greater than or equal to predetermined offset threshold information; if the face movement direction is a direction away from the user equipment and the face movement offset information is greater than or equal to the predetermined offset threshold information, it is determined that subsequent input voice information of the user equipment needs to be muted; otherwise, it is determined that muting is not needed. In some embodiments, the face movement offset information indicates the face offset distance or face offset angle of the current face position compared with the most recently acquired historical face position, for example, the current face position is offset by 10 centimeters, or by 10 degrees, from the most recently acquired historical face position. The offset is compared with predetermined offset threshold information (for example, an offset distance threshold of 5 centimeters, or an offset angle threshold of 5 degrees); if the face movement offset information is less than the predetermined offset threshold information, the movement is most likely a facial movement caused by the first user's natural body motion (e.g., ordinary swaying, stretching, lowering the head, and the like). If both the direction condition and the offset condition are met, it is determined that the subsequent input voice information of the user equipment needs to be muted; otherwise, it is determined that muting is not needed.
In some embodiments, the face movement related information further comprises face movement speed information; wherein the step S20 includes: the user equipment determines, according to the face movement direction information, whether the face movement direction of the first user is a direction away from the user equipment, and determines, according to the face movement speed information, whether the face movement speed of the first user is greater than or equal to a predetermined speed threshold; if the face movement direction is a direction away from the user equipment and the face movement speed is greater than or equal to the predetermined speed threshold, it is determined that subsequent input voice information of the user equipment needs to be muted; otherwise, it is determined that muting is not needed. In some embodiments, the face movement speed information indicates the current movement speed of the first user's face; if the face movement speed is less than the predetermined speed threshold, the movement is most likely a facial movement caused by the first user's natural body motion (e.g., ordinary swaying, stretching, lowering the head, and the like). If both the direction condition and the speed condition are met, it is determined that the subsequent input voice information of the user equipment needs to be muted; otherwise, it is determined that muting is not needed.
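The direction, offset, and speed conditions of the preceding embodiments can be combined as in the following sketch; the 5-centimetre offset threshold and the speed threshold are illustrative assumptions, and "moving away" is reduced here to "the distance to the device increased".

```python
# Combined sketch of the direction / offset / speed checks from the
# embodiments above. The thresholds are illustrative assumptions.
def should_mute_on_movement(prev_distance_cm, curr_distance_cm,
                            offset_cm, speed_cm_per_s,
                            offset_threshold_cm=5.0,
                            speed_threshold_cm_per_s=10.0) -> bool:
    """Mute only when the face is moving away from the device AND the
    movement is large or fast enough not to be ordinary body motion
    (swaying, stretching, lowering the head, and so on)."""
    moving_away = curr_distance_cm > prev_distance_cm
    significant = (offset_cm >= offset_threshold_cm or
                   speed_cm_per_s >= speed_threshold_cm_per_s)
    return moving_away and significant
```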
In some embodiments, muting the subsequent input voice information includes, but is not limited to, the following modes (a combined sketch follows this list):
1) stopping sending the subsequent input voice information to other users in the multi-person voice
In some embodiments, if the multi-person voice is a multi-person voice call, a multi-person voice online conference, or the like, the sending of subsequent input voice information to other users or servers in the multi-person voice may be stopped, and if the multi-person voice is a multi-person video call, a multi-person video online conference, or the like, the sending of subsequent input video stream information that does not include audio information may be only sent to other users or servers in the multi-person voice.
2) Reducing the volume of the subsequent input voice information
In some embodiments, if the multi-user voice is a multi-user voice call, a multi-user voice online conference, etc., the volume of the subsequent input voice information may be reduced, and if the multi-user voice is a multi-user video call, a multi-user video online conference, etc., the volume of the audio in the subsequent input video stream information may be reduced.
3) Stopping collecting the subsequent input voice information
In some embodiments, if the multi-person voice is a multi-person voice call, a multi-person voice online conference, or the like, the microphone may stop acquiring subsequent input voice information, and if the multi-person voice is a multi-person video call, a multi-person video online conference, or the like, the camera may acquire subsequent input video stream information that does not include audio information.
4) Any combination of the above mute processing modes
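The mute processing modes listed above can be expressed as a single dispatcher, sketched below; the client object and its method names are hypothetical placeholders, since the embodiment does not bind the modes to any particular conferencing SDK.

```python
# Sketch of the mute-processing modes listed above as a single dispatcher.
# The `client` object and its method names are hypothetical placeholders;
# a real conferencing SDK exposes equivalents under its own names.
from enum import Enum, auto

class MuteMode(Enum):
    STOP_SENDING = auto()    # 1) stop sending subsequent voice to peers/server
    REDUCE_VOLUME = auto()   # 2) attenuate subsequent voice (or the audio track)
    STOP_CAPTURE = auto()    # 3) stop microphone capture entirely

def apply_mute(client, mode: MuteMode, volume_factor: float = 0.0) -> None:
    if mode is MuteMode.STOP_SENDING:
        client.set_audio_upload_enabled(False)   # keep capturing, drop the uplink
    elif mode is MuteMode.REDUCE_VOLUME:
        client.set_capture_gain(volume_factor)   # e.g. 0.0 .. 1.0
    elif mode is MuteMode.STOP_CAPTURE:
        client.stop_microphone()                 # video (if any) keeps streaming
```

Mode 4) simply applies two or more of these calls together, for example stopping the uplink while also reducing the capture gain.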
Fig. 2 shows a block diagram of a user equipment for muting in multi-person speech according to an embodiment of the present application, the user equipment comprising a one-to-one module 11 and a second module 12. The one-to-one module 11 is configured to acquire current face position information of the first user during a multi-person voice call; the second module 12 is configured to determine, according to the current face position information, whether subsequent input voice information of the user equipment needs to be muted; if so, the subsequent input voice information is muted; otherwise, it is not.
And the one-to-one module 11 is used for acquiring the current face position information of the first user in the call process of the multi-person voice. In some embodiments, the multi-person voice may be a multi-person voice call, a multi-person voice online conference, or the like, or may be a multi-person video call, a multi-person video online conference, or the like. In some embodiments, the current face position information of the first user may be current orientation information of the first user's face relative to a screen or microphone on the user device, e.g., the current face is facing forward on the screen, the current face is 30 degrees off the screen clockwise, and the current face is 45 degrees off the screen counterclockwise. In some embodiments, the current face position information of the first user may also be current distance information and/or current direction information of the first user's face relative to a screen or microphone on the user device, e.g., a position of the current face 10 centimeters directly in front of the microphone, a position of the current face 30 centimeters in a direction 50 degrees in front of the right of the microphone, a position of the current face 20 centimeters in a direction 40 degrees in front of the left of the microphone. In some embodiments, the current face position information of the first user may be obtained in real time, for example, the current face position information of the first user may be obtained from video stream information containing the face of the first user, which is collected in real time by a camera in the user device. In some embodiments, the current face position information of the first user may also be acquired at predetermined time intervals, for example, the current face position information of the first user may be acquired from current photo information containing the face of the first user, which is taken by a camera in the user device at predetermined time intervals (for example, 1 second). In some embodiments, the current face position information of the first user may also be obtained whenever a predetermined trigger condition is satisfied, for example, the current face position information of the first user may be determined by performing sound source localization on input voice information of the first user whenever the input voice information is collected by a microphone.
A second module 12, configured to determine whether muting processing needs to be performed on subsequent input voice information of the user equipment according to the current face position information; if so, performing mute processing on the subsequent input voice information; otherwise, the subsequent input voice information is not subjected to mute processing. In some embodiments, whether or not muting of subsequent input speech information of the user device is required may be determined based on whether or not the current face orientation information satisfies a predetermined deviation angle threshold (e.g., the deviation angle threshold is 30 degrees, the face is facing forward on the screen or the microphone is 0 degrees, and if the current face orientation information includes that the current face is 45 degrees off the screen counterclockwise and is greater than the deviation angle threshold 30 degrees, it is determined that muting of subsequent input speech information is required). In some embodiments, whether or not to mute the subsequent input voice information of the user equipment may also be determined based on whether or not the current face distance information is greater than or equal to a predetermined distance threshold (e.g., the distance threshold is 15 centimeters, the screen or microphone is located at 0 centimeters, and if the current face is located 20 centimeters away from the screen, greater than 15 centimeters, it is determined that muting of the subsequent input voice information is required). In some embodiments, whether or not the subsequent input voice information of the user equipment needs to be muted may be determined based on whether or not the current face direction information satisfies a predetermined direction angle threshold (e.g., the direction angle threshold is 45 degrees, and 0 degrees directly in front of the screen or the microphone, and if the current face is in a direction of 50 degrees right in front of the screen, which is greater than 45 degrees, it is determined that the subsequent input voice information needs to be muted). In some embodiments, it may be determined comprehensively according to the three or two of the three that whether or not muting processing is required for the subsequent input voice information of the user equipment is required, for example, if the current face orientation information includes that the current face is deviated from the screen by 45 degrees counterclockwise and the current face is located at a position 20 centimeters away from the screen in a direction 50 degrees forward right of the screen, it may be determined that muting processing is required for the subsequent input voice information of the user equipment according to the deviation of the screen by 45 degrees greater than a predetermined deviation angle threshold (e.g., 30 degrees) and the deviation of the screen by 20 centimeters greater than a predetermined distance threshold (e.g., 15 centimeters) and the deviation of the screen by 50 degrees forward right of the screen by greater than a predetermined direction angle threshold (e.g., 45 degrees). In some embodiments, if the multi-person voice is a multi-person voice call, a multi-person voice online conference, or the like, the subsequently input voice information may be voice information in the subsequently input video stream. 
In some embodiments, if the multi-user voice is a multi-user voice call, a multi-user voice online conference, etc., muting the subsequent input voice information includes, but is not limited to, stopping sending the subsequent input voice information to other users or servers in the multi-user voice, reducing the volume of the subsequent input voice information, stopping collecting the subsequent input voice information by a microphone, if the multi-user voice is a multi-user video call, a multi-user video online conference, etc., muting the subsequent input voice information includes, but is not limited to, sending subsequent input video stream information not containing audio information only to other users or servers in the multi-user voice, reducing the audio volume in the subsequent input video stream information, collecting subsequent input video stream information not containing audio information only by a camera, so that the other users in the multi-user voice can not hear the voice of the first user, alternatively, the other users are made to see only the first user's video stream that does not contain audio information. According to the method and the device, through acquiring the current face position information of the user, whether muting processing needs to be carried out on the subsequent input voice information of the user equipment can be determined according to the current face position information, so that muting processing is carried out on the input voice information of the user equipment automatically in real time, interference of irrelevant voice can be avoided, compared with a manual muting mode of the user, any operation is not required to be manually carried out by the user, great convenience can be provided for the user, user experience can be enhanced, the call quality of multi-user voice is improved, and interference in multi-user voice call is reduced.
In some embodiments, the current face position information comprises current face orientation information and/or current face direction information; wherein module 11 comprises module 13 (not shown). Module 13 is configured to determine, during the call of the multi-person voice, the current face orientation information and/or the current face direction information of the first user according to first image information collected by a camera device in the user equipment. Here, the specific implementation of module 13 is the same as or similar to the embodiment related to step S13 in FIG. 1, and is therefore not described again and is incorporated herein by reference.
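As a hedged illustration of one way current face orientation information might be derived from a camera frame (not the disclosed implementation), the sketch below uses the open-source face_recognition library and a crude nose-to-eye asymmetry heuristic; both the library choice and the heuristic are assumptions made only for this sketch:

```python
# Crude yaw heuristic; a production system might instead use full head-pose estimation.
import numpy as np
import face_recognition

def rough_yaw_estimate(frame):
    """Return a value in [-1, 1]: near 0 when the face roughly faces the camera,
    toward +/-1 when the face is strongly turned. Returns None if no face is found."""
    landmarks = face_recognition.face_landmarks(frame)
    if not landmarks:
        return None
    lm = landmarks[0]
    nose = np.mean(lm["nose_tip"], axis=0)
    left_eye = np.mean(lm["left_eye"], axis=0)
    right_eye = np.mean(lm["right_eye"], axis=0)
    d_left = np.linalg.norm(nose - left_eye)
    d_right = np.linalg.norm(nose - right_eye)
    # When the head turns, one nose-to-eye distance shrinks relative to the other.
    return (d_right - d_left) / (d_right + d_left)
```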
In some embodiments, the current face position information further includes current face distance information, and the camera device is a depth camera device; wherein module 13 is configured to: determine, during the call of the multi-person voice, the current face orientation information and/or the current face direction information and/or the current face distance information of the first user according to the first image information collected by the depth camera device. Here, the related operations are the same as or similar to those of the embodiment shown in FIG. 1, and are therefore not described again and are incorporated herein by reference.
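A minimal sketch of deriving current face distance information from a depth camera frame is given below, assuming a per-pixel depth map in millimetres and a face bounding box from a detector; the units, data layout, and the choice of the median for robustness are assumptions:

```python
import numpy as np

def face_distance_cm(depth_map_mm, face_box):
    """depth_map_mm: 2-D array of per-pixel depth in millimetres from the depth camera device.
    face_box: (top, right, bottom, left) bounding box of the detected face in pixel coordinates."""
    top, right, bottom, left = face_box
    region = depth_map_mm[top:bottom, left:right]
    valid = region[region > 0]           # discard pixels with missing depth
    if valid.size == 0:
        return None
    return float(np.median(valid)) / 10.0  # median is robust to outliers; mm -> cm
```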
In some embodiments, module 13 includes module 14 (not shown). Module 14 is configured to identify a face object from the first image information collected by the camera device in the user equipment during the call of the multi-person voice, and to determine the current face orientation information and/or the current face direction information of the first user according to the identification result. Here, the specific implementation of module 14 is the same as or similar to the embodiment related to step S14 in FIG. 1, and is therefore not described again and is incorporated herein by reference.
In some embodiments, the face object is the face object of the first user; wherein module 14 is configured to: identify, during the call of the multi-person voice, the face object of the first user from the first image information collected by the camera device according to pre-acquired face feature information of the first user. Here, the related operations are the same as or similar to those of the embodiment shown in FIG. 1, and are therefore not described again and are incorporated herein by reference.
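One plausible realization of identifying the first user's face object from pre-acquired face feature information is sketched below using the open-source face_recognition library; the library, the tolerance value, and the enrollment-encoding workflow are assumptions, not the patented method:

```python
# Illustrative sketch; library choice and tolerance are assumptions.
import face_recognition

def find_first_user_face(frame, first_user_encoding, tolerance=0.6):
    """Return the bounding box of the first user's face in a camera frame, or None.

    frame: an RGB image array captured by the camera device.
    first_user_encoding: a face encoding computed in advance from an enrollment photo
                         (standing in for the "pre-acquired face feature information").
    """
    locations = face_recognition.face_locations(frame)
    encodings = face_recognition.face_encodings(frame, locations)
    for box, encoding in zip(locations, encodings):
        match = face_recognition.compare_faces([first_user_encoding], encoding, tolerance=tolerance)
        if match[0]:
            return box  # (top, right, bottom, left) of the first user's face
    return None
```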
In some embodiments, the current face position information comprises current face distance information and/or current face direction information; wherein module 11 comprises module 15 (not shown). Module 15 is configured to determine, during the call of the multi-person voice, the current face distance information and/or the current face direction information of the first user by performing sound source localization on input voice information corresponding to the user equipment. Here, the specific implementation of module 15 is the same as or similar to the embodiment related to step S15 in FIG. 1, and is therefore not described again and is incorporated herein by reference.
In some embodiments, module 15 includes module 16 (not shown) and module 17 (not shown). Module 16 is configured to identify vocal information contained in the input voice information corresponding to the user equipment during the call of the multi-person voice; module 17 is configured to determine the current face distance information and/or the current face direction information of the first user by performing sound source localization on the identified vocal information. Here, the specific implementations of module 16 and module 17 are the same as or similar to the embodiments related to steps S16 and S17 in FIG. 1, and are therefore not described again and are incorporated herein by reference.
In some embodiments, the vocal information is the vocal information of the first user; wherein module 16 is configured to: identify, during the call of the multi-person voice, the vocal information of the first user from the input voice information corresponding to the user equipment according to pre-acquired voiceprint feature information of the first user; and module 17 is configured to: determine the current face distance information and/or the current face direction information of the first user by performing sound source localization on the vocal information of the first user. Here, the related operations are the same as or similar to those of the embodiment shown in FIG. 1, and are therefore not described again and are incorporated herein by reference.
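Sound source localization itself is not detailed here; as a hedged sketch only, one common technique is GCC-PHAT time-delay estimation between two microphones, from which a rough arrival direction follows. The two-microphone geometry, constants, and function names below are assumptions and do not describe the disclosed implementation:

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the time delay of arrival (seconds) between two 1-D microphone signals via GCC-PHAT."""
    n = sig.shape[0] + ref.shape[0]
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-15), n=n)   # phase transform weighting
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(fs)

def face_direction_from_tdoa(tau, mic_distance_m, speed_of_sound=343.0):
    """Convert a time delay into a rough arrival angle (radians) for a far-field two-microphone array."""
    ratio = np.clip(speed_of_sound * tau / mic_distance_m, -1.0, 1.0)
    return np.arcsin(ratio)
```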
In some embodiments, the determining whether mute processing needs to be performed on the subsequent input voice information of the user equipment according to the current face position information includes: determining holding duration information of the current face position information according to the current face position information in combination with historical face position information; and if the holding duration information is greater than or equal to a predetermined duration threshold, determining, according to the current face position information, whether mute processing needs to be performed on the subsequent input voice information of the user equipment. Here, the related operations are the same as or similar to those of the embodiment shown in FIG. 1, and are therefore not described again and are incorporated herein by reference.
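A minimal sketch of this holding-duration check is given below, assuming timestamped position samples and a caller-supplied comparison function for deciding whether two samples represent the same face position; the threshold value and helper name are assumptions:

```python
import time

DURATION_THRESHOLD_S = 2.0  # assumed value for the predetermined duration threshold

def held_long_enough(history, current, same_position, now=None):
    """history: list of (timestamp, position) samples, oldest first.
    same_position: callable deciding whether two positions count as the same.
    Returns True if the current position has been held for at least the threshold."""
    now = time.time() if now is None else now
    held_since = now
    for ts, pos in reversed(history):
        if same_position(pos, current):
            held_since = ts          # the position was already held at this earlier sample
        else:
            break
    return (now - held_since) >= DURATION_THRESHOLD_S
```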
In some embodiments, the current face position information comprises current face orientation information and/or current face direction information and/or current face distance information; wherein the determining whether mute processing needs to be performed on the subsequent input voice information of the user equipment according to the current face position information includes at least one of: determining whether mute processing needs to be performed on the subsequent input voice information of the user equipment according to whether the current face orientation information satisfies a predetermined deviation angle threshold; determining whether mute processing needs to be performed on the subsequent input voice information of the user equipment according to whether the current face direction information satisfies a predetermined direction angle threshold; and determining whether mute processing needs to be performed on the subsequent input voice information of the user equipment according to whether the current face distance information is greater than or equal to a predetermined distance threshold. Here, the related operations are the same as or similar to those of the embodiment shown in FIG. 1, and are therefore not described again and are incorporated herein by reference.
In some embodiments, module 12 includes module 18 (not shown) and module 19 (not shown). Module 18 is configured to determine face movement related information of the first user according to the current face position information in combination with historical face position information of the first user; module 19 is configured to determine, according to the face movement related information, whether mute processing needs to be performed on the subsequent input voice information of the user equipment. Here, the specific implementations of module 18 and module 19 are the same as or similar to the embodiments related to steps S18 and S19 in FIG. 1, and are therefore not described again and are incorporated herein by reference.
In some embodiments, the face movement related information comprises face movement direction information; wherein module 19 includes module 20 (not shown). Module 20 is configured to determine, according to the face movement direction information, whether the face movement direction of the first user is a direction away from the user equipment; if so, it is determined that mute processing needs to be performed on the subsequent input voice information of the user equipment; otherwise, it is determined that mute processing does not need to be performed on the subsequent input voice information of the user equipment. Here, the specific implementation of module 20 is the same as or similar to the embodiment related to step S20 in FIG. 1, and is therefore not described again and is incorporated herein by reference.
In some embodiments, the face movement related information further comprises face movement offset information; wherein module 20 is configured to: determine, according to the face movement direction information, whether the face movement direction of the first user is a direction away from the user equipment, and determine, according to the face movement offset information, whether the face movement offset is greater than or equal to predetermined offset threshold information; if the face movement direction is a direction away from the user equipment and the face movement offset information is greater than or equal to the predetermined offset threshold information, determine that mute processing needs to be performed on the subsequent input voice information of the user equipment; otherwise, determine that mute processing does not need to be performed on the subsequent input voice information of the user equipment. Here, the related operations are the same as or similar to those of the embodiment shown in FIG. 1, and are therefore not described again and are incorporated herein by reference.
In some embodiments, the face movement related information further comprises face movement speed information; wherein module 20 is configured to: determine, according to the face movement direction information, whether the face movement direction of the first user is a direction away from the user equipment, and determine, according to the face movement speed information, whether the face movement speed of the first user is greater than or equal to a predetermined speed threshold; if the face movement direction is a direction away from the user equipment and the face movement speed is greater than or equal to the predetermined speed threshold, determine that mute processing needs to be performed on the subsequent input voice information of the user equipment; otherwise, determine that mute processing does not need to be performed on the subsequent input voice information of the user equipment. Here, the related operations are the same as or similar to those of the embodiment shown in FIG. 1, and are therefore not described again and are incorporated herein by reference.
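One way the direction, offset, and speed criteria above might be combined is sketched below; representing face positions as 3-D coordinates relative to the user equipment, and the particular threshold values, are assumptions made only for illustration:

```python
import numpy as np

OFFSET_THRESHOLD_CM = 10.0   # assumed value for the predetermined offset threshold
SPEED_THRESHOLD_CM_S = 20.0  # assumed value for the predetermined speed threshold

def should_mute_from_motion(prev_pos, curr_pos, dt_s):
    """prev_pos / curr_pos: 3-D face positions (cm) relative to the user equipment at the origin;
    dt_s: time between the two samples, in seconds. Variant combining direction, offset, and speed."""
    prev_pos, curr_pos = np.asarray(prev_pos, float), np.asarray(curr_pos, float)
    displacement = curr_pos - prev_pos
    offset = np.linalg.norm(displacement)
    speed = offset / dt_s if dt_s > 0 else 0.0
    # Moving away: the distance to the device (origin) is increasing.
    moving_away = np.linalg.norm(curr_pos) > np.linalg.norm(prev_pos)
    return moving_away and offset >= OFFSET_THRESHOLD_CM and speed >= SPEED_THRESHOLD_CM_S
```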
In some embodiments, muting the subsequent input voice information includes, but is not limited to:
1) stopping sending the subsequent input voice information to other users in the multi-person voice;
2) reducing the volume of the subsequent input voice information;
3) stopping collecting the subsequent input voice information;
4) any combination of the above mute processing modes.
Here, the related mute processing modes are the same as or similar to those of the embodiment shown in FIG. 1, and are therefore not described again and are incorporated herein by reference; a brief illustrative sketch of these modes follows.
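As a hedged illustration only, a client might dispatch these mute processing modes as follows; the RTC client object and its method names are hypothetical and do not refer to any specific SDK or to the disclosed implementation:

```python
from enum import Enum, auto

class MuteMode(Enum):
    STOP_SENDING = auto()   # stop sending captured audio to other participants / the server
    REDUCE_VOLUME = auto()  # attenuate the outgoing audio instead of cutting it entirely
    STOP_CAPTURE = auto()   # stop collecting audio from the microphone

def apply_mute(client, modes, gain=0.0):
    """client: a hypothetical RTC client object exposing the methods used below."""
    if MuteMode.STOP_SENDING in modes:
        client.set_audio_track_enabled(False)   # hypothetical method
    if MuteMode.REDUCE_VOLUME in modes:
        client.set_outgoing_audio_gain(gain)    # hypothetical method
    if MuteMode.STOP_CAPTURE in modes:
        client.stop_microphone_capture()        # hypothetical method
```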
FIG. 3 illustrates an exemplary system that can be used to implement the various embodiments described in this application.
In some embodiments, as shown in FIG. 3, the system 300 can be implemented as any of the devices in the various embodiments described. In some embodiments, system 300 may include one or more computer-readable media (e.g., system memory or NVM/storage 320) having instructions and one or more processors (e.g., processor(s) 305) coupled with the one or more computer-readable media and configured to execute the instructions to implement modules to perform the actions described herein.
For one embodiment, system control module 310 may include any suitable interface controllers to provide any suitable interface to at least one of processor(s) 305 and/or any suitable device or component in communication with system control module 310.
The system control module 310 may include a memory controller module 330 to provide an interface to the system memory 315. Memory controller module 330 may be a hardware module, a software module, and/or a firmware module.
System memory 315 may be used, for example, to load and store data and/or instructions for system 300. For one embodiment, system memory 315 may include any suitable volatile memory, such as suitable DRAM. In some embodiments, the system memory 315 may include a double data rate type four synchronous dynamic random access memory (DDR4 SDRAM).
For one embodiment, system control module 310 may include one or more input/output (I/O) controllers to provide an interface to NVM/storage 320 and communication interface(s) 325.
For example, NVM/storage 320 may be used to store data and/or instructions. NVM/storage 320 may include any suitable non-volatile memory (e.g., flash memory) and/or may include any suitable non-volatile storage device(s) (e.g., one or more Hard Disk Drives (HDDs), one or more Compact Disc (CD) drives, and/or one or more Digital Versatile Disc (DVD) drives).
NVM/storage 320 may include storage resources that are physically part of the device on which system 300 is installed or may be accessed by the device and not necessarily part of the device. For example, NVM/storage 320 may be accessible over a network via communication interface(s) 325.
Communication interface(s) 325 may provide an interface for system 300 to communicate over one or more networks and/or with any other suitable device. System 300 may wirelessly communicate with one or more components of a wireless network according to any of one or more wireless network standards and/or protocols.
For one embodiment, at least one of the processor(s) 305 may be packaged together with logic for one or more controller(s) (e.g., memory controller module 330) of the system control module 310. For one embodiment, at least one of the processor(s) 305 may be packaged together with logic for one or more controller(s) of the system control module 310 to form a System In Package (SiP). For one embodiment, at least one of the processor(s) 305 may be integrated on the same die with logic for one or more controller(s) of the system control module 310. For one embodiment, at least one of the processor(s) 305 may be integrated on the same die with logic for one or more controller(s) of the system control module 310 to form a system on a chip (SoC).
In various embodiments, system 300 may be, but is not limited to being: a server, a workstation, a desktop computing device, or a mobile computing device (e.g., a laptop computing device, a handheld computing device, a tablet, a netbook, etc.). In various embodiments, system 300 may have more or fewer components and/or different architectures. For example, in some embodiments, system 300 includes one or more cameras, a keyboard, a Liquid Crystal Display (LCD) screen (including a touch screen display), a non-volatile memory port, multiple antennas, a graphics chip, an Application Specific Integrated Circuit (ASIC), and speakers.
The present application also provides a computer readable storage medium having stored thereon computer code which, when executed, performs the method described in any one of the foregoing.
The present application also provides a computer program product, which when executed by a computer device, performs the method of any of the preceding claims.
The present application further provides a computer device, comprising:
one or more processors;
a memory for storing one or more computer programs;
the one or more computer programs, when executed by the one or more processors, cause the one or more processors to implement the method of any preceding claim.
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, implemented using Application Specific Integrated Circuits (ASICs), general purpose computers or any other similar hardware devices. In one embodiment, the software programs of the present application may be executed by a processor to implement the steps or functions described above. Likewise, the software programs (including associated data structures) of the present application may be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Additionally, some of the steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
In addition, some of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application through the operation of the computer. Those skilled in the art will appreciate that the form in which the computer program instructions reside on a computer-readable medium includes, but is not limited to, source files, executable files, installation package files, and the like, and that the manner in which the computer program instructions are executed by a computer includes, but is not limited to: the computer directly executes the instruction, or the computer compiles the instruction and then executes the corresponding compiled program, or the computer reads and executes the instruction, or the computer reads and installs the instruction and then executes the corresponding installed program. Computer-readable media herein can be any available computer-readable storage media or communication media that can be accessed by a computer.
Communication media includes media by which communication signals, including, for example, computer readable instructions, data structures, program modules, or other data, are transmitted from one system to another. Communication media may include conductive transmission media such as cables and wires (e.g., fiber optics, coaxial, etc.) and wireless (non-conductive transmission) media capable of propagating energy waves such as acoustic, electromagnetic, RF, microwave, and infrared. Computer readable instructions, data structures, program modules, or other data may be embodied in a modulated data signal, for example, in a wireless medium such as a carrier wave or similar mechanism such as is embodied as part of spread spectrum techniques. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. The modulation may be analog, digital or hybrid modulation techniques.
By way of example, and not limitation, computer-readable storage media may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. For example, computer-readable storage media include, but are not limited to, volatile memory such as random access memory (RAM, DRAM, SRAM); and non-volatile memory such as flash memory, various read-only memories (ROM, PROM, EPROM, EEPROM), magnetic and ferromagnetic/ferroelectric memories (MRAM, FeRAM); and magnetic and optical storage devices (hard disk, tape, CD, DVD); or other now known media or later developed that can store computer-readable information/data for use by a computer system.
An embodiment according to the present application comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to perform a method and/or a solution according to the aforementioned embodiments of the present application.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims (13)

1. A method for silence processing in multi-person voice, applied to a user equipment, wherein the method comprises:
acquiring current face position information of a first user during a call of the multi-person voice;
determining whether mute processing needs to be carried out on subsequent input voice information of the user equipment or not according to the current face position information; if so, performing mute processing on the subsequent input voice information; otherwise, the subsequent input voice information is not subjected to mute processing;
wherein, the determining whether to mute the subsequent input voice information of the user equipment according to the current face position information comprises:
determining face movement related information of the first user according to the current face position information and by combining historical face position information of the first user, wherein the face movement related information comprises face movement direction information and face movement speed information, and the face movement speed information is used for indicating the current movement speed of the face of the first user;
determining whether the face movement direction of the first user is a direction away from the user equipment according to the face movement direction information and determining whether the face movement speed of the first user is greater than or equal to a predetermined speed threshold according to the face movement speed information; if the face movement direction is a direction far away from the user equipment and the face movement speed is greater than or equal to a preset speed threshold value, determining that mute processing needs to be carried out on subsequent input voice information of the user equipment; otherwise, determining that the subsequent input voice information of the user equipment does not need to be subjected to mute processing.
2. The method of claim 1, wherein the current face position information comprises current face orientation information and/or current face direction information;
wherein the acquiring current face position information of a first user during the call of the multi-person voice comprises:
during the call of the multi-person voice, determining the current face orientation information and/or the current face direction information of the first user according to first image information acquired by a camera device in the user equipment.
3. The method of claim 2, wherein the current face position information further includes current face distance information, and the camera device is a depth camera device;
wherein the determining, during the call of the multi-person voice, the current face orientation information and/or the current face direction information of the first user according to the first image information acquired by the camera device in the user equipment comprises:
during the call of the multi-person voice, determining the current face orientation information and/or the current face direction information and/or the current face distance information of the first user according to the first image information acquired by the depth camera device.
4. The method according to claim 2, wherein the determining, during the call of the multi-person voice, the current face orientation information and/or the current face direction information of the first user according to the first image information acquired by the camera device in the user equipment comprises:
during the call of the multi-person voice, identifying a face object from the first image information acquired by the camera device in the user equipment;
and determining current face orientation information and/or current face direction information of the first user according to the recognition result.
5. The method of claim 4, wherein the face object is a face object of the first user;
wherein the identifying, during the call of the multi-person voice, a face object from the first image information acquired by the camera device in the user equipment comprises:
during the call of the multi-person voice, identifying the face object of the first user from the first image information acquired by the camera device according to pre-acquired face feature information of the first user.
6. The method of claim 2, wherein the current face position information comprises current face distance information and/or current face direction information;
wherein the acquiring current face position information of the first user during the call of the multi-person voice comprises:
during the call of the multi-person voice, determining the current face distance information and/or the current face direction information of the first user by performing sound source localization on input voice information corresponding to the user equipment.
7. The method of claim 6, wherein the determining, during the call of the multi-person voice, the current face distance information and/or the current face direction information of the first user by performing sound source localization on the input voice information corresponding to the user equipment comprises:
during the call of the multi-person voice, identifying vocal information contained in the input voice information corresponding to the user equipment;
and determining the current face distance information and/or the current face direction information of the first user by performing sound source localization on the identified vocal information.
8. The method of claim 7, wherein the vocal information is the vocal information of the first user;
wherein the identifying, during the call of the multi-person voice, vocal information contained in the input voice information corresponding to the user equipment comprises:
during the call of the multi-person voice, identifying the vocal information of the first user from the input voice information corresponding to the user equipment according to pre-acquired voiceprint feature information of the first user;
wherein the determining the current face distance information and/or the current face direction information of the first user by performing sound source localization on the identified vocal information comprises:
determining the current face distance information and/or the current face direction information of the first user by performing sound source localization on the vocal information of the first user.
9. The method of claim 1, wherein the determining whether muting of subsequent input speech information of the user device is required based on the current face position information comprises:
determining the holding duration information of the current face position information according to the current face position information and by combining historical face position information;
and if the holding duration information is greater than or equal to a preset duration threshold, determining whether to mute the subsequent input voice information of the user equipment or not according to the current face position information.
10. The method according to claim 1 or 9, wherein the current face position information comprises current face orientation information and/or current face direction information and/or current face distance information;
wherein the determining whether to mute the subsequent input voice information of the user equipment according to the current face position information includes at least one of:
determining whether mute processing needs to be carried out on subsequent input voice information of the user equipment according to whether the current face orientation information meets a preset deviation angle threshold value;
determining whether mute processing needs to be carried out on subsequent input voice information of the user equipment according to whether the current face direction information meets a preset direction angle threshold value;
and determining whether the subsequent input voice information of the user equipment needs to be subjected to mute processing or not according to whether the current face distance information is larger than or equal to a preset distance threshold or not.
11. The method of any of claims 1-10, wherein the muting the subsequent input speech information comprises at least one of:
stopping sending the subsequent input voice information to other users in the multi-person voice;
reducing the volume of the subsequent input voice information;
and stopping collecting the subsequent input voice information.
12. An apparatus for silence processing in multi-person speech, the apparatus comprising:
a processor; and
a memory arranged to store computer executable instructions that, when executed, cause the processor to perform the method of any of claims 1 to 11.
13. A computer-readable medium storing instructions that, when executed, cause a system to perform the operations of any of the methods of claims 1-11.
CN202011051895.3A 2020-09-29 2020-09-29 Method and equipment for mute processing in multi-person voice Active CN112261236B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011051895.3A CN112261236B (en) 2020-09-29 2020-09-29 Method and equipment for mute processing in multi-person voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011051895.3A CN112261236B (en) 2020-09-29 2020-09-29 Method and equipment for mute processing in multi-person voice

Publications (2)

Publication Number Publication Date
CN112261236A CN112261236A (en) 2021-01-22
CN112261236B true CN112261236B (en) 2022-02-15

Family

ID=74234009

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011051895.3A Active CN112261236B (en) 2020-09-29 2020-09-29 Method and equipment for mute processing in multi-person voice

Country Status (1)

Country Link
CN (1) CN112261236B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114448951B (en) * 2022-04-07 2022-07-01 宁波均联智行科技股份有限公司 Method and device for automatically canceling mute of vehicle

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5772069B2 (en) * 2011-03-04 2015-09-02 ソニー株式会社 Information processing apparatus, information processing method, and program
US9386147B2 (en) * 2011-08-25 2016-07-05 Verizon Patent And Licensing Inc. Muting and un-muting user devices
US9319513B2 (en) * 2012-07-12 2016-04-19 International Business Machines Corporation Automatic un-muting of a telephone call
US9392088B2 (en) * 2013-01-09 2016-07-12 Lenovo (Singapore) Pte. Ltd. Intelligent muting of a mobile device
CN103516889A (en) * 2013-01-10 2014-01-15 深圳市中兴移动通信有限公司 Method and device for silencing mobile terminal
US10776073B2 (en) * 2018-10-08 2020-09-15 Nuance Communications, Inc. System and method for managing a mute button setting for a conference call
US10785421B2 (en) * 2018-12-08 2020-09-22 Fuji Xerox Co., Ltd. Systems and methods for implementing personal camera that adapts to its surroundings, both co-located and remote

Also Published As

Publication number Publication date
CN112261236A (en) 2021-01-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant