WO2021212608A1 - Method, apparatus and computer device for locating a sound source user - Google Patents

Method, apparatus and computer device for locating a sound source user

Info

Publication number
WO2021212608A1
WO2021212608A1 PCT/CN2020/093425 CN2020093425W WO2021212608A1 WO 2021212608 A1 WO2021212608 A1 WO 2021212608A1 CN 2020093425 W CN2020093425 W CN 2020093425W WO 2021212608 A1 WO2021212608 A1 WO 2021212608A1
Authority
WO
WIPO (PCT)
Prior art keywords
span
designated
user
orientation
sound source
Prior art date
Application number
PCT/CN2020/093425
Other languages
English (en)
French (fr)
Inventor
龚连银
苏雄飞
周宝
陈远旭
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2021212608A1 publication Critical patent/WO2021212608A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes

Definitions

  • This application relates to the fields of artificial intelligence and blockchain, and particularly relates to methods, devices and computer equipment for locating users of sound sources.
  • the main purpose of this application is to provide a method for locating users of sound sources, which aims to solve the technical problem that the existing robot positioning system cannot meet the requirements for precise positioning in various scenarios.
  • This application proposes a method for locating a sound source user, including:
  • obtaining the designated orientation corresponding to the sound source identified by sound source localization, and the visual centerline orientation corresponding to the robot's current spatial position; obtaining a pre-rotation spatial area span according to the designated orientation and the visual centerline orientation; controlling the robot to rotate according to the pre-rotation spatial area span until the designated orientation lies within the robot's visual range; judging whether a user portrait of a designated user is obtained within the robot's field of view; if so, obtaining action data of the designated user, processing it in a preset manner to obtain a processing result, and inputting the processing result into a VGG network for recognition calculation to obtain the action type corresponding to the action data; receiving the data result output by the VGG network after recognition calculation, and judging from that data result whether the sound source orientation is consistent with the designated orientation, where the data result includes that the action type is a mouth action; and if so, determining that the designated user in the designated orientation is the sound source user.
  • This application also provides a device for locating a user of a sound source, including:
  • a first acquisition module, used to acquire the designated orientation corresponding to the sound source identified by sound source localization, and the visual centerline orientation corresponding to the robot's current spatial position; an obtaining module, used to obtain the pre-rotation spatial area span according to the designated orientation and the visual centerline orientation; a rotation module, used to control the robot to rotate according to the pre-rotation spatial area span until the designated orientation lies within the robot's visual range; a first judgment module, used to judge whether a user portrait of a designated user is obtained within the robot's field of view; a second acquisition module, used to acquire, if so, the action data of the designated user, process it in a preset manner to obtain a processing result, and input the processing result into the VGG network for recognition calculation to obtain the action type corresponding to the action data; a receiving module, used to receive the data result output by the VGG network after recognition calculation and to judge from that data result whether the sound source orientation is consistent with the designated orientation; and a determination module, used to determine, if so, that the designated user in the designated orientation is the sound source user.
  • the present application also provides a computer device, including a memory and a processor, the memory stores a computer program, and the processor implements the steps of the method when the computer program is executed.
  • the present application also provides a computer-readable storage medium on which a computer program is stored, and the steps of the method are implemented when the computer program is executed by the processor.
  • In visual localization, this application uses a person's series of action data as the input to the VGG network, improves the accuracy of distinction through the action data, and uses visual localization and sound localization together to improve the accuracy with which the robot locates the target user who is speaking.
  • Fig. 1 is a schematic flowchart of a method for locating a sound source user according to an embodiment of the present application
  • Fig. 2 is a schematic structural diagram of an apparatus for locating a user of a sound source according to an embodiment of the present application
  • Fig. 3 is a schematic diagram of the internal structure of a computer device according to an embodiment of the present application.
  • a method for locating a sound source user includes:
  • S1 Obtain the specified orientation corresponding to the sound source identified by the sound source localization, and the visual centerline orientation corresponding to the current spatial position of the robot.
  • Sound source localization is realized by a microphone array. By setting delay parameters for each microphone in the array and controlling different delay parameters, different azimuth pointings can be achieved; the localization area can be divided into a grid, each grid point delays each microphone in the time domain, and the delayed signals are then summed to compute the sound pressure of the microphone array. The sound source orientation, that is, the azimuth of the sound source relative to the robot, namely the designated orientation, is determined from this sound pressure.
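  • As an illustration of the grid-scan delay-and-sum idea described above, the following sketch steers a linear microphone array to a grid of candidate azimuths and picks the azimuth whose summed power is largest; the array geometry, sampling rate and far-field assumption are illustrative and not taken from the patent.
```python
import numpy as np

def locate_azimuth(frames, mic_x, fs, c=343.0, n_grid=180):
    """Delay-and-sum scan: for each candidate azimuth (grid point), delay each
    microphone in the frequency domain, sum the channels, and return the
    azimuth whose summed power ("sound pressure") is largest.
    frames: (n_mics, n_samples) signals; mic_x: mic positions along one axis (m)."""
    spectra = np.fft.rfft(frames, axis=1)
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / fs)
    azimuths = np.linspace(0.0, np.pi, n_grid)            # candidate grid points
    powers = np.empty(n_grid)
    for g, az in enumerate(azimuths):
        delays = mic_x * np.cos(az) / c                   # per-microphone delay
        steering = np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])
        beam = (spectra * steering).sum(axis=0)           # align and sum channels
        powers[g] = np.sum(np.abs(beam) ** 2)
    return np.degrees(azimuths[np.argmax(powers)])        # designated orientation
```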
  • The robot has both sound source localization and visual localization, and the visual centerline orientation is the center position within the field of view. For example, it is determined according to whether the robot uses a monocular or a binocular structure: in a monocular structure, the direction of the line passing through the center of the single eye and perpendicular to the plane of the robot's face is taken as the visual centerline orientation; in a binocular structure, the perpendicular bisector through the midpoint of the line connecting the two eyes, perpendicular to the plane of the robot's face, is taken as the visual centerline orientation.
  • S2 Obtain the pre-rotation spatial area span according to the designated orientation and the visual centerline orientation.
  • The spatial area span is the area corresponding to the arc from the robot's current visual centerline orientation to the designated orientation: either the arc area swept when rotating counterclockwise from the current visual centerline orientation to the designated orientation, or the arc area swept when rotating clockwise from the current visual centerline orientation to the designated orientation. Preliminary localization by sound is used to help the robot quickly adjust the orientation of visual localization, improving response sensitivity and accuracy.
  • S3 Control the robot to rotate according to the pre-rotated spatial area span, and rotate to the designated position within the vision range of the robot.
  • the designated orientation is within the vision range of the robot, including any position within the vision range, and preferably the designated orientation coincides with the orientation of the visual centerline to improve the accuracy of visual positioning.
  • Rotation includes rotating the head equipped with a camera, or rotating the entire body of the robot. During the rotation process, the camera can be aligned with the speaker's position by controlling the waist and head yaw angle of the robot, that is, aiming at the designated position.
  • S4 Determine whether the user portrait of the designated user is obtained within the field of view of the robot.
  • the user portrait includes a head portrait, so that by recognizing the mouth movements in the head portrait, it is estimated whether the user is speaking or not.
  • S5 If so, obtain the action data of the designated user, process it in a preset manner to obtain a processing result, and input the processing result into the VGG network for recognition calculation to obtain the action type corresponding to the action data.
  • When a head portrait exists, the user is considered to be possibly speaking; the mouth action is further acquired and, after being processed in the preset manner, is input into the VGG network for in-depth analysis of the mouth action type. The preset-manner processing includes splicing the acquired mouth motion video information into a single piece of picture information carrying a time sequence, so that it can be recognized by the VGG network.
  • S6 Receive the data result output after the VGG network recognition calculation, and judge whether the sound source position is consistent with the specified position according to the data result of the VGG network, where the data result includes that the action type belongs to the mouth movement.
  • S7 If so, determine that the designated user in the designated orientation is the sound source user.
  • The data result output by the VGG network includes whether a mouth action exists: for example, if the mouth shape changes substantially along the time sequence in the picture information, a mouth action is considered to exist; otherwise it does not. If the VGG network judges that the designated user at the designated orientation has a mouth action, and at the same time the sound source orientation given by sound source localization is consistent with the designated orientation, the designated user is determined to be the sound source user. Combining the advantages of visual localization and sound source localization achieves precise localization of the sound source user, so the speaker can be found quickly, improving the human-computer interaction experience and effect between the speaker and the robot.
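  • A minimal sketch of this decision fusion, assuming the VGG output has already been reduced to an action-type label and that azimuths are compared with an illustrative angular tolerance (the patent only requires the two orientations to be consistent):
```python
def is_sound_source_user(action_type, sound_azimuth_deg, designated_azimuth_deg,
                         tol_deg=10.0):
    """Fuse the two cues: the recognized action type must be a mouth action and
    the sound-source azimuth must agree with the designated azimuth."""
    mouth_action = (action_type == "mouth")
    diff = abs(sound_azimuth_deg - designated_azimuth_deg) % 360.0
    azimuth_match = min(diff, 360.0 - diff) <= tol_deg
    return mouth_action and azimuth_match
```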
  • The embodiments of this application determine the approximate location of the target user through sound source localization and quickly give a localization result; the target user is then precisely located through visual localization, in which the person's series of action data is used as the input to the VGG network, and the action data improves the accuracy of distinguishing the target user.
  • Before the action data is input into the VGG network, it must go through a specific data processing method so that the processed data can be recognized and computed by the VGG network, eliminating the interference of human-like dummies or user-like objects with visual localization. The target user refers to the designated user within the field of view.
  • step S5 of obtaining the action data of the designated user and processing it in a preset manner to obtain the processing result, and inputting the processing result to the VGG network for identification calculation to obtain the action type corresponding to the action data includes:
  • S51: Obtain the action data of the designated user within a designated time period, the action data being a continuous multi-frame action sequence;
  • S52: Merge and splice the continuous multi-frame action sequence into one piece of static image data via p(t) = ∑_i p_i·B_{i,k}(t), where p_i ∈ R^n denotes the key point at time t, i denotes the index of the key point, B_{i,k}(t) denotes the transformation matrix, k denotes the dimension, and p(t) is the static image data output for t ∈ [t_i, t_{i+1});
  • S53: Input the static image data into the VGG network for recognition calculation.
  • This application applies image and video recognition technology from the field of artificial intelligence. The designated time period refers to the continuous time span of the mouth motion video collected by the camera.
  • The mouth motion video captured by the camera is split into a continuous multi-frame action sequence and spliced in time order, forming the mouth motion video into one piece of static image data so that it can be recognized and computed by the VGG network. Each person's behavior can be determined by a number of key points, including mouth actions; for example, if a mouth action has 15 key points, then i = 0 to 14. The input side of the VGG network is adapted so that it can process continuous multi-frame action sequences and thus recognize mouth actions.
  • B_{i,k}(t) denotes the transformation matrix and k denotes the dimension (an example transformation matrix is given as a formula figure in the original); p(t) is the output result for t ∈ [t_i, t_{i+1}), and R^n denotes integers within the real numbers.
  • The formula can also be written in an equivalent matrix form (also given as a formula figure in the original): over any final time period t ∈ [t_i, t_{i+1}), the key-point information of these users is synthesized from the motion key points of multiple frames, so multi-frame continuous motion sequences are combined into a single input information structure, the VGG network's classification result can then target motion actions, and M_6 denotes a 6*6 matrix.
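  • The following sketch makes one possible reading of this merging step concrete: it treats B_{i,k}(t) as a degree-k B-spline basis (Cox-de Boor recursion) and blends the per-frame mouth key points p_i into a single time-ordered array that can be fed to the network; the basis choice, knot vector and sampling density are assumptions for illustration, not details fixed by the patent.
```python
import numpy as np

def bspline_basis(i, k, t, knots):
    """Cox-de Boor recursion for a degree-k basis function B_{i,k}(t)
    (an assumed interpretation of the transformation matrix in the text)."""
    if k == 0:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left = 0.0
    if knots[i + k] != knots[i]:
        left = (t - knots[i]) / (knots[i + k] - knots[i]) * bspline_basis(i, k - 1, t, knots)
    right = 0.0
    if knots[i + k + 1] != knots[i + 1]:
        right = (knots[i + k + 1] - t) / (knots[i + k + 1] - knots[i + 1]) * bspline_basis(i + 1, k - 1, t, knots)
    return left + right

def merge_frames(keypoints, k=3, samples=64):
    """keypoints: (n_frames, n_points, 2) mouth key points, one row per frame.
    Blend them over time as p(t) = sum_i p_i * B_{i,k}(t) and stack the sampled
    p(t) values into one static array, the single 'image' handed to the network."""
    n_frames = keypoints.shape[0]
    assert n_frames > k, "need more frames than the spline degree"
    knots = np.concatenate(([0.0] * k, np.linspace(0.0, 1.0, n_frames - k + 1), [1.0] * k))
    ts = np.linspace(0.0, 1.0, samples, endpoint=False)
    merged = np.zeros((samples,) + keypoints.shape[1:])
    for s, t in enumerate(ts):
        for i in range(n_frames):
            merged[s] += keypoints[i] * bspline_basis(i, k, t, knots)
    return merged
```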
  • Further, before step S5 of obtaining the action data of the designated user, processing it in a preset manner to obtain the processing result, and inputting the processing result into the VGG network for recognition calculation to obtain the action type corresponding to the action data, the method includes:
  • S50a: Determine whether the number of designated users within the robot's field of view is two or more; S50b: if so, select, according to the Yolov3 algorithm, the block area corresponding to each designated user in the field-of-view map corresponding to the robot's field of view; S50c: separately intercept the series of actions within the designated time period corresponding to each block area as the action data.
  • When several people are present at the same designated orientation or within the current field of view, this embodiment first uses the Yolov3 algorithm to select, with boxes, the positions where the people are located, that is, the block areas, and then separately intercepts each person's series of actions as that user's action data; using time-dimension information allows higher-dimensional features to be obtained and improves analysis accuracy.
  • Yolov3 is a one-stage End2End object detector.
  • Yolov3 divides the input image into S*S grids, and each grid predicts B bounding boxes.
  • The predicted content of each bounding box includes: Location (x, y, w, h), a Confidence Score, and the probabilities of C categories, so the number of channels in the Yolov3 output layer is S*S*B*(5+C).
  • the loss function of Yolov3 consists of three parts: Location error, Confidence error and classification error.
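  • A small sketch of the output bookkeeping described above; S, B and C here are common Yolov3-style defaults chosen for illustration, not values fixed by the patent.
```python
def yolov3_output_size(S=13, B=3, C=80):
    """Each of the S*S grid cells predicts B boxes, and each box carries
    Location (x, y, w, h), a Confidence Score and C class probabilities,
    so a cell predicts B*(5+C) values and the whole output holds
    S*S*B*(5+C) values (the quantity the text calls the channel count)."""
    per_cell = B * (5 + C)
    return S * S * per_cell

# Example: the defaults above give 13 * 13 * 3 * 85 = 43095 predicted values.
```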
  • step S2 of obtaining the pre-rotated spatial region span according to the designated orientation and the orientation of the visual centerline includes:
  • S21: Obtain the first area span when rotating clockwise from the visual centerline orientation to the designated orientation, and the second area span when rotating counterclockwise from the visual centerline orientation to the designated orientation;
  • S22: Compare the sizes of the first area span and the second area span;
  • S23: When the first area span is greater than the second area span, take the second area span as the spatial area span; when the first area span is not greater than the second area span, take the first area span as the spatial area span.
  • This embodiment takes the existence of one designated orientation as an example. When sound from the sound source at the designated orientation is received, the visual centerline orientation is rotated toward the direction corresponding to the designated orientation, so that the designated orientation lies within the rotated field of view; preferably the designated orientation coincides with the pre-rotation-adjusted visual centerline orientation. To allow a quick response, the robot is controlled to use the arc area with the smaller span as the spatial area span to be rotated through.
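  • A minimal sketch of steps S21-S23, assuming orientations are expressed as angles in degrees (a representation the patent does not specify):
```python
def pre_rotation_span(centerline_deg, designated_deg):
    """Return the smaller rotation toward the designated orientation.
    Positive = rotate by the first (clockwise) area span, negative = rotate by
    the second (counterclockwise) area span; ties keep the first span."""
    first_span = (designated_deg - centerline_deg) % 360.0    # clockwise arc
    second_span = (centerline_deg - designated_deg) % 360.0   # counterclockwise arc
    return first_span if first_span <= second_span else -second_span
```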
  • Further, when the number of designated orientations is two or more and the spatial area span accordingly includes two or more spans, step S2 of obtaining the pre-rotation spatial area span according to the designated orientation and the visual centerline orientation includes:
  • S31: Obtain the first total area span corresponding to rotating clockwise from the visual centerline orientation through all designated orientations, and the second total area span corresponding to rotating counterclockwise from the visual centerline orientation through all designated orientations; S32: compare the sizes of the first total area span and the second total area span; S33: when the first total area span is greater than the second total area span, take the second total area span as the spatial area span; when the first total area span is not greater than the second total area span, take the first total area span as the spatial area span.
  • This embodiment of the application takes the existence of multiple designated orientations as an example, that is, multiple areas emit sound simultaneously or in succession, so the multiple areas need to be visually and precisely located in turn. First, from all the arc intervals covering the paths from each designated orientation to the pre-rotation visual centerline orientation, the largest coverage arc interval is selected as the total area span. Starting from the pre-rotation visual centerline orientation, the largest coverage arc interval swept when rotating clockwise through each designated orientation in turn is taken as the first total area span; the largest coverage arc interval swept when rotating counterclockwise through each designated orientation in turn is taken as the second total area span. After the rotation direction is selected, the action data of the user corresponding to each designated orientation is analyzed in turn to achieve precise localization of the speaker.
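  • Extending the same idea to several designated orientations (steps S31-S33), again with illustrative degree angles:
```python
def pre_rotation_total_span(centerline_deg, designated_degs):
    """First total area span: largest clockwise arc needed to sweep from the
    centerline through every designated orientation; second total area span:
    the counterclockwise counterpart. Keep the smaller of the two."""
    first_total = max((d - centerline_deg) % 360.0 for d in designated_degs)
    second_total = max((centerline_deg - d) % 360.0 for d in designated_degs)
    if first_total <= second_total:
        return "clockwise", first_total
    return "counterclockwise", second_total
```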
  • step S6 of receiving the data result output after the VGG network identification calculation, and judging whether the sound source azimuth is consistent with the designated azimuth according to the data result of the VGG network includes:
  • S61: Analyze whether the data result includes a mouth opening-and-closing action; S62: if so, determine again whether the current sound source orientation is the designated orientation; S63: if so, determine that the sound source orientation is consistent with the designated orientation; otherwise, they are inconsistent.
  • By analyzing whether a mouth opening-and-closing action exists, a preliminary judgment is made as to whether the user is speaking. If the preliminary judgment is that the user is speaking, sound source localization is called again for auxiliary analysis; if both sound source localization and visual localization indicate the designated user as the speaker, the designated user is judged to be the speaker. That is, if a mouth action exists and the orientation of the sound attributed to the designated user is correct, the designated user is determined to be speaking. If the two judgments do not converge on the same user, for example a mouth action exists but the sound attributed to the designated user does not come from that orientation, the judgment procedure is repeated in a loop to search for the sound source user, that is, the speaker.
  • VGG itself can only process static picture information and identify the features of marked points in a picture, for example recognizing fruit types from the features of marked points; it cannot directly measure action information such as a mouth opening-and-closing action. In this embodiment, the multi-frame pictures of the action video are spliced and input into VGG, the change trajectory of the marked-point positions in the picture is obtained from the VGG output data, and it is judged whether the mouth has an opening-and-closing action; this is combined with sound source localization to judge whether the mouth opening-and-closing action is consistent with the orientation given by sound source localization. If the user's mouth in the video captured at that orientation shows an opening-and-closing action and at the same time a sound source also exists at that orientation, the user is determined to be the speaker, that is, the sound source user. The sound source orientation is still determined by microphone array sound source localization technology.
  • Further, before step S61 of analyzing whether the data result includes a mouth opening-and-closing action, the method includes:
  • S60a: Determine whether the camera's focusing condition is normal relative to the designated user's distance from the camera; S60b: if so, determine whether the resolution of the user portrait obtained under that focusing condition is within a preset range; S60c: if so, proceed with the VGG network recognition calculation; otherwise terminate the calculation.
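  • A minimal sketch of this gating check (steps S60a-S60c); the focus tolerance and resolution range are illustrative assumptions, since the patent only states that they are preset:
```python
def should_run_vgg(user_distance_m, focus_distance_m, portrait_height_px,
                   focus_tol_m=0.5, resolution_range=(200, 4000)):
    """Run the VGG recognition calculation only if the camera focus is normal
    for the user's distance and the portrait resolution lies in the preset range."""
    focus_ok = abs(user_distance_m - focus_distance_m) <= focus_tol_m
    low, high = resolution_range
    resolution_ok = low <= portrait_height_px <= high
    return focus_ok and resolution_ok
```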
  • Preferably, to further ensure the privacy and security of the action data, the action data can also be stored in a node of a blockchain.
  • The blockchain referred to here is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database, a chain of data blocks generated in association with one another using cryptographic methods; each data block contains a batch of network transaction information, used to verify the validity of its information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
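  • A generic, simplified sketch of chaining action-data blocks by hash, only to illustrate the tamper-evidence property mentioned above; it does not model the platform layers named in the text.
```python
import hashlib
import json
import time

def make_block(action_data, prev_hash):
    """Store one batch of action data in a block that carries the previous
    block's hash, so any later modification of stored data becomes detectable."""
    block = {"timestamp": time.time(), "data": action_data, "prev": prev_hash}
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    block["hash"] = hashlib.sha256(payload).hexdigest()
    return block
```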
  • this solution can also be applied in the field of smart transportation to promote the construction of smart cities.
  • This embodiment uses resolution to exclude interference from virtual characters on electronic screens with the localization of the real speaker. Because an electronic screen is reflective, the resolution of an image or video of a real user captured at the same distance and under the same focusing conditions is far higher than the resolution of a virtual user captured from an electronic screen. When the resolution does not meet the requirement, the VGG network recognition calculation is terminated directly, and the conclusion that the sound source orientation is inconsistent with the designated orientation is output.
  • an apparatus for locating a user of a sound source includes:
  • the first acquisition module 1 is used to acquire the designated orientation corresponding to the sound source identified by the sound source localization and the visual centerline orientation corresponding to the current spatial position of the robot.
  • Sound source localization is realized by a microphone array. By setting delay parameters for each microphone in the array and controlling different delay parameters, different azimuth pointings can be achieved; the localization area can be divided into a grid, each grid point delays each microphone in the time domain, and the delayed signals are then summed to compute the sound pressure of the microphone array. The sound source orientation, that is, the azimuth of the sound source relative to the robot, namely the designated orientation, is determined from this sound pressure.
  • The robot has both sound source localization and visual localization, and the visual centerline orientation is the center position within the field of view. For example, it is determined according to whether the robot uses a monocular or a binocular structure: in a monocular structure, the direction of the line passing through the center of the single eye and perpendicular to the plane of the robot's face is taken as the visual centerline orientation; in a binocular structure, the perpendicular bisector through the midpoint of the line connecting the two eyes, perpendicular to the plane of the robot's face, is taken as the visual centerline orientation.
  • The obtaining module 2 is used to obtain the pre-rotation spatial area span according to the designated orientation and the visual centerline orientation.
  • The spatial area span is the area corresponding to the arc from the robot's current visual centerline orientation to the designated orientation: either the arc area swept when rotating counterclockwise from the current visual centerline orientation to the designated orientation, or the arc area swept when rotating clockwise from the current visual centerline orientation to the designated orientation. Preliminary localization by sound is used to help the robot quickly adjust the orientation of visual localization, improving response sensitivity and accuracy.
  • the rotation module 3 is used to control the rotation of the robot according to the pre-rotated spatial area span, and rotate to a designated position within the vision range of the robot.
  • the designated orientation is within the vision range of the robot, including any position within the vision range, and preferably the designated orientation coincides with the orientation of the visual centerline to improve the accuracy of visual positioning.
  • Rotation includes rotating the head equipped with a camera, or rotating the entire body of the robot. During the rotation process, the camera can be aligned with the speaker's position by controlling the waist and head yaw angle of the robot, that is, aiming at the designated position.
  • the first judgment module 4 is used for judging whether the user portrait of the designated user is obtained in the field of view of the robot.
  • the user portrait includes a head portrait, so that by recognizing the mouth movements in the head portrait, it is estimated whether the user is speaking or not.
  • the second acquisition module 5 is used to acquire the action data of the specified user if it is, and process it in a preset manner to obtain the processing result, and input the processing result into the VGG network for identification calculation to obtain the action type corresponding to the action data.
  • the preset method processing includes splicing the acquired mouth motion video information into a single picture information carrying a time sequence, so as to be recognized by the VGG network.
  • the receiving module 6 is used to receive the data result output after the VGG network identification calculation, and judge whether the sound source position is consistent with the designated position according to the data result of the VGG network, wherein the data result includes that the action type is a mouth movement.
  • the judging module 7 is used for judging that the designated user in the designated position is the sound source user if it is so.
  • The data result output by the VGG network includes whether a mouth action exists: for example, if the mouth shape changes substantially along the time sequence in the picture information, a mouth action is considered to exist; otherwise it does not. If the VGG network judges that the designated user at the designated orientation has a mouth action, and at the same time the sound source orientation given by sound source localization is consistent with the designated orientation, the designated user is determined to be the sound source user. Combining the advantages of visual localization and sound source localization achieves precise localization of the sound source user, so the speaker can be found quickly, improving the human-computer interaction experience and effect between the speaker and the robot.
  • The embodiments of this application determine the approximate location of the target user through sound source localization and quickly give a localization result; the target user is then precisely located through visual localization, in which the person's series of action data is used as the input to the VGG network, and the action data improves the accuracy of distinguishing the target user. Before the action data is input into the VGG network, it must go through a specific data processing method so that the processed data can be recognized and computed by the VGG network, eliminating the interference of human-like dummies or user-like objects with visual localization. The target user refers to the designated user within the field of view.
  • the second acquisition module 5 includes:
  • the first acquiring unit is used to acquire the action data of the specified user in the specified time period, and the action data is a continuous multi-frame action sequence;
  • the splicing unit is used to merge and splice the continuous multi-frame action sequence into one piece of static image data via p(t) = ∑_i p_i·B_{i,k}(t), where p_i ∈ R^n denotes the key point at time t, i denotes the index of the key point, B_{i,k}(t) denotes the transformation matrix, k denotes the dimension, and p(t) is the static image data output for t ∈ [t_i, t_{i+1});
  • the input unit is used to input the static image data to the VGG network for recognition calculation.
  • The designated time period refers to the continuous time span of the mouth motion video captured by the camera. The mouth motion video captured by the camera is split into a continuous multi-frame action sequence and spliced in time order, forming the mouth motion video into one piece of static image data so that it can be recognized and computed by the VGG network. Each person's behavior can be determined by a number of key points, including mouth actions; for example, if a mouth action has 15 key points, then i = 0 to 14. The input side of the VGG network is adapted so that it can process continuous multi-frame action sequences and thus recognize mouth actions.
  • B_{i,k}(t) denotes the transformation matrix and k denotes the dimension (an example transformation matrix is given as a formula figure in the original); p(t) is the output result for t ∈ [t_i, t_{i+1}), and R^n denotes integers within the real numbers. The formula can also be written in an equivalent matrix form (also given as a formula figure in the original): over any final time period t ∈ [t_i, t_{i+1}), the key-point information of these users is synthesized from the motion key points of multiple frames, so multi-frame continuous motion sequences are combined into a single input information structure, the VGG network's classification result can then target motion actions, and M_6 denotes a 6*6 matrix.
  • the device for locating the user of the sound source includes:
  • a second judgment module, used to judge whether the number of designated users within the robot's field of view is two or more; a selection module, used to select, if so, the block area corresponding to each designated user in the field-of-view map corresponding to the robot's field of view according to the Yolov3 algorithm; and an interception module, used to separately intercept the series of actions within the designated time period corresponding to each block area as the action data.
  • Yolov3 is a one-stage End2End object detector.
  • Yolov3 divides the input image into S*S grids, and each grid predicts B bounding boxes.
  • The predicted content of each bounding box includes: Location (x, y, w, h), a Confidence Score, and the probabilities of C categories, so the number of channels in the Yolov3 output layer is S*S*B*(5+C).
  • the loss function of Yolov3 consists of three parts: Location error, Confidence error and classification error.
  • Further, the obtaining module 2 includes:
  • a second acquisition unit, used to acquire the first area span when rotating clockwise from the visual centerline orientation to the designated orientation, and the second area span when rotating counterclockwise from the visual centerline orientation to the designated orientation; a first comparison unit, used to compare the sizes of the first area span and the second area span; and a first setting unit, used to take the second area span as the spatial area span when the first area span is greater than the second area span, and to take the first area span as the spatial area span when the first area span is not greater than the second area span.
  • This embodiment takes the existence of a designated orientation as an example.
  • When sound from the sound source at the designated orientation is received, the visual centerline orientation is rotated toward the direction corresponding to the designated orientation, so that the designated orientation lies within the rotated field of view; preferably the designated orientation coincides with the pre-rotation-adjusted visual centerline orientation. To allow a quick response, the robot is controlled to use the arc area with the smaller span as the spatial area span to be rotated through.
  • Further, in another embodiment, the obtaining module 2 includes:
  • a third acquisition unit, used to acquire the first total area span corresponding to rotating clockwise from the visual centerline orientation through all designated orientations, and the second total area span corresponding to rotating counterclockwise from the visual centerline orientation through all designated orientations; a second comparison unit, used to compare the sizes of the first total area span and the second total area span; and a second setting unit, used to take the second total area span as the spatial area span when the first total area span is greater than the second total area span, and to take the first total area span as the spatial area span when the first total area span is not greater than the second total area span.
  • the embodiment of the present application takes the existence of multiple designated orientations as an example, that is, multiple areas emit sounds at the same time or consecutively, and multiple areas need to be visually accurately positioned in sequence.
  • First, from all the arc intervals covering the paths from each designated orientation to the pre-rotation visual centerline orientation, the largest coverage arc interval is selected as the total area span. Starting from the pre-rotation visual centerline orientation, the largest coverage arc interval swept when rotating clockwise through each designated orientation in turn is taken as the first total area span; the largest coverage arc interval swept when rotating counterclockwise through each designated orientation in turn is taken as the second total area span. After the rotation direction is selected, the action data of the user corresponding to each designated orientation is analyzed in turn to achieve precise localization of the speaker.
  • the receiving module 6 includes:
  • an analysis unit, used to analyze whether the data result includes a mouth opening-and-closing action; a confirmation unit, used to determine again, if so, whether the current sound source orientation is the designated orientation; and a judgment unit, used to judge, if so, that the sound source orientation is consistent with the designated orientation, and otherwise that they are inconsistent.
  • By analyzing whether a mouth opening-and-closing action exists, a preliminary judgment is made as to whether the user is speaking. If the preliminary judgment is that the user is speaking, sound source localization is called again for auxiliary analysis; if both sound source localization and visual localization indicate the designated user as the speaker, the designated user is judged to be the speaker. That is, if a mouth action exists and the orientation of the sound attributed to the designated user is correct, the designated user is determined to be speaking. If the two judgments do not converge on the same user, for example a mouth action exists but the sound attributed to the designated user does not come from that orientation, the judgment procedure is repeated in a loop to search for the sound source user, that is, the speaker.
  • VGG itself can only process static picture information and identify the features of marked points in a picture, for example recognizing fruit types from the features of marked points; it cannot directly measure action information such as a mouth opening-and-closing action. In this embodiment, the multi-frame pictures of the action video are spliced and input into VGG, the change trajectory of the marked-point positions in the picture is obtained from the VGG output data, and it is judged whether the mouth has an opening-and-closing action; this is combined with sound source localization to judge whether the mouth opening-and-closing action is consistent with the orientation given by sound source localization. If the user's mouth in the video captured at that orientation shows an opening-and-closing action and at the same time a sound source also exists at that orientation, the user is determined to be the speaker, that is, the sound source user. The sound source orientation is still determined by microphone array sound source localization technology.
  • the receiving module 6 includes:
  • a first judgment unit, used to judge whether the camera's focusing condition is normal relative to the designated user's distance from the camera; a second judgment unit, used to judge, if so, whether the resolution of the user portrait obtained under that focusing condition is within a preset range; and a control unit, used to proceed with the VGG network recognition calculation if so, and otherwise terminate the calculation.
  • This embodiment uses resolution to exclude interference from virtual characters on electronic screens with the localization of the real speaker. Because an electronic screen is reflective, the resolution of an image or video of a real user captured at the same distance and under the same focusing conditions is far higher than the resolution of a virtual user captured from an electronic screen. When the resolution does not meet the requirement, the VGG network recognition calculation is terminated directly, and the conclusion that the sound source orientation is inconsistent with the designated orientation is output.
  • an embodiment of the present application also provides a computer device.
  • the computer device may be a server, and its internal structure may be as shown in FIG. 3.
  • The computer device includes a processor, a memory, a network interface and a database connected through a system bus, wherein the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, a computer program, and a database.
  • The internal memory provides an environment for the operation of the operating system and the computer program stored in the non-volatile storage medium.
  • the database of the computer equipment is used to store all the data needed in the process of locating the user of the sound source.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer program is executed by the processor to realize the method of locating the user of the sound source.
  • The processor executes the method for locating a sound source user, including: obtaining the designated orientation corresponding to the sound source identified by sound source localization, and the visual centerline orientation corresponding to the robot's current spatial position; obtaining the pre-rotation spatial area span according to the designated orientation and the visual centerline orientation; controlling the robot to rotate according to the pre-rotation spatial area span until the designated orientation lies within the robot's visual range; judging whether a user portrait of a designated user is obtained within the robot's field of view; if so, obtaining the action data of the designated user, processing it in a preset manner to obtain a processing result, and inputting the processing result into the VGG network for recognition calculation to obtain the action type corresponding to the action data; receiving the data result output by the VGG network after recognition calculation, and judging from that data result whether the sound source orientation is consistent with the designated orientation, where the data result includes that the action type is a mouth action; and if so, determining that the designated user in the designated orientation is the sound source user.
  • By using a person's series of action data as the input to the VGG network in visual localization, the computer device improves the accuracy of distinction through the action data and uses visual localization and sound localization together, improving the accuracy with which the robot locates the target user who is speaking.
  • FIG. 3 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
  • An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored.
  • When the computer program is executed by a processor, a method for locating a sound source user is implemented, including: obtaining the designated orientation corresponding to the sound source identified by sound source localization, and the visual centerline orientation corresponding to the robot's current spatial position; obtaining the pre-rotation spatial area span according to the designated orientation and the visual centerline orientation; controlling the robot to rotate according to the pre-rotation spatial area span until the designated orientation lies within the robot's visual range; judging whether a user portrait of a designated user is obtained within the robot's field of view; if so, obtaining the action data of the designated user, processing it in a preset manner to obtain a processing result, and inputting the processing result into the VGG network for recognition calculation to obtain the action type corresponding to the action data; receiving the data result output by the VGG network after recognition calculation, and judging from that data result whether the sound source orientation is consistent with the designated orientation, where the data result includes that the action type is a mouth action; and if so, determining that the designated user in the designated orientation is the sound source user.
  • By using a person's series of action data as the input to the VGG network in visual localization, the computer-readable storage medium improves the accuracy of distinction through the action data and uses visual localization and sound localization together, improving the accuracy with which the robot locates the target user who is speaking.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), dual-rate SDRAM (SSRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link (Synchlink) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Manipulator (AREA)
  • Image Analysis (AREA)

Abstract

This application relates to artificial intelligence and blockchain technology and discloses a method for locating a sound source user, including: obtaining the designated orientation corresponding to the sound source identified by sound source localization, and the visual centerline orientation corresponding to the robot's current spatial position; obtaining a pre-rotation spatial area span according to the designated orientation and the visual centerline orientation; controlling the robot to rotate according to the pre-rotation spatial area span until the designated orientation lies within the robot's visual range; judging whether a user portrait of a designated user is obtained within the robot's field of view; if so, obtaining the action data of the designated user, processing it in a preset manner to obtain a processing result, and inputting the processing result into a VGG network for recognition calculation to obtain the action type; receiving the data result output by the VGG network after recognition calculation, and judging from that data result whether the sound source orientation is consistent with the designated orientation; and if so, determining that the designated user in the designated orientation is the sound source user, thereby improving localization accuracy.

Description

Method, apparatus and computer device for locating a sound source user
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on April 24, 2020 with application number 202010334984.2 and the invention title "Method, apparatus and computer device for locating a sound source user", the entire contents of which are incorporated into this application by reference.
Technical Field
This application relates to the fields of artificial intelligence and blockchain, and in particular to a method, an apparatus and a computer device for locating a sound source user.
Background
Existing robot systems generally use only one modality, vision or sound, for localization. However, the inventors realized that visual localization places high demands on the usage environment: it requires good lighting, the function is essentially unusable when the user is outside the camera's range, and the large amount of data that visual localization must process places high demands on the robot system's computing power. Sound localization has low accuracy, cannot satisfy interaction scenarios that require precise tracking, and is even less accurate in noisy environments. Existing robot localization systems therefore cannot meet the demand for precise localization in a variety of scenarios.
Technical Problem
The main purpose of this application is to provide a method for locating a sound source user, aiming to solve the technical problem that existing robot localization systems cannot meet the demand for precise localization in a variety of scenarios.
Technical Solution
This application proposes a method for locating a sound source user, including:
obtaining the designated orientation corresponding to the sound source identified by sound source localization, and the visual centerline orientation corresponding to the robot's current spatial position; obtaining a pre-rotation spatial area span according to the designated orientation and the visual centerline orientation; controlling the robot to rotate according to the pre-rotation spatial area span until the designated orientation lies within the robot's visual range; judging whether a user portrait of a designated user is obtained within the robot's field of view; if so, obtaining the action data of the designated user, processing it in a preset manner to obtain a processing result, and inputting the processing result into a VGG network for recognition calculation to obtain the action type corresponding to the action data; receiving the data result output by the VGG network after recognition calculation, and judging from that data result whether the sound source orientation is consistent with the designated orientation, where the data result includes that the action type is a mouth action; and if so, determining that the designated user in the designated orientation is the sound source user.
This application further provides an apparatus for locating a sound source user, including:
a first acquisition module, used to acquire the designated orientation corresponding to the sound source identified by sound source localization, and the visual centerline orientation corresponding to the robot's current spatial position; an obtaining module, used to obtain the pre-rotation spatial area span according to the designated orientation and the visual centerline orientation; a rotation module, used to control the robot to rotate according to the pre-rotation spatial area span until the designated orientation lies within the robot's visual range; a first judgment module, used to judge whether a user portrait of a designated user is obtained within the robot's field of view; a second acquisition module, used to acquire, if so, the action data of the designated user, process it in a preset manner to obtain a processing result, and input the processing result into the VGG network for recognition calculation to obtain the action type corresponding to the action data; a receiving module, used to receive the data result output by the VGG network after recognition calculation and judge from that data result whether the sound source orientation is consistent with the designated orientation; and a determination module, used to determine, if so, that the designated user in the designated orientation is the sound source user.
This application further provides a computer device, including a memory and a processor, the memory storing a computer program, and the processor implementing the steps of the method when executing the computer program.
This application further provides a computer-readable storage medium on which a computer program is stored, the steps of the method being implemented when the computer program is executed by a processor.
Beneficial Effects
In visual localization, this application uses a person's series of action data as the input to the VGG network, improves the accuracy of distinction through the action data, and uses visual localization and sound localization together to improve the accuracy with which the robot locates the target user who is speaking.
Brief Description of the Drawings
Fig. 1 is a schematic flowchart of a method for locating a sound source user according to an embodiment of this application;
Fig. 2 is a schematic structural diagram of an apparatus for locating a sound source user according to an embodiment of this application;
Fig. 3 is a schematic diagram of the internal structure of a computer device according to an embodiment of this application.
Best Mode for Carrying Out the Invention
Referring to Fig. 1, a method for locating a sound source user according to an embodiment of this application includes:
S1: Obtain the designated orientation corresponding to the sound source identified by sound source localization, and the visual centerline orientation corresponding to the robot's current spatial position.
Sound source localization is realized by a microphone array. By setting delay parameters for each microphone in the array and controlling different delay parameters, different azimuth pointings can be achieved; the localization area can be divided into a grid, each grid point delays each microphone in the time domain, and the delayed signals are then summed to compute the sound pressure of the microphone array. The sound source orientation, that is, the azimuth of the sound source relative to the robot, namely the designated orientation, is determined from this sound pressure. The robot has both sound source localization and visual localization, and the visual centerline orientation is the center position within the field of view. For example, it is determined according to whether the robot uses a monocular or a binocular structure: in a monocular structure, the direction of the line passing through the center of the single eye and perpendicular to the plane of the robot's face is taken as the visual centerline orientation; in a binocular structure, the perpendicular bisector through the midpoint of the line connecting the two eyes, perpendicular to the plane of the robot's face, is taken as the visual centerline orientation.
S2: Obtain the pre-rotation spatial area span according to the designated orientation and the visual centerline orientation.
The spatial area span is the area corresponding to the arc from the robot's current visual centerline orientation to the designated orientation: either the arc area swept when rotating counterclockwise from the current visual centerline orientation to the designated orientation, or the arc area swept when rotating clockwise from the current visual centerline orientation to the designated orientation. Preliminary localization by sound is used to help the robot quickly adjust the orientation of visual localization, improving response sensitivity and accuracy.
S3: Control the robot to rotate according to the pre-rotation spatial area span, until the designated orientation lies within the robot's visual range.
The designated orientation lying within the robot's visual range includes lying at any position within the visual range; preferably the designated orientation coincides with the visual centerline orientation, to improve the accuracy of visual localization. Rotation includes rotating the head equipped with the camera, or rotating the robot's entire body. During rotation, the camera can be aimed at the speaker's orientation, that is, at the designated orientation, by coordinating control of the robot's waist and head yaw angles.
S4: Judge whether a user portrait of a designated user is obtained within the robot's field of view.
The user portrait includes a head portrait, so that whether the user is speaking can be preliminarily estimated by recognizing mouth actions in the head portrait.
S5: If so, obtain the action data of the designated user, process it in a preset manner to obtain a processing result, and input the processing result into the VGG network for recognition calculation to obtain the action type corresponding to the action data.
When a head portrait exists, the user is considered to be possibly speaking; the mouth action is further acquired and, after being processed in the preset manner, is input into the VGG network for in-depth analysis of the mouth action type. The preset-manner processing includes splicing the acquired mouth motion video information into a single piece of picture information carrying a time sequence, so that it can be recognized by the VGG network.
S6: Receive the data result output by the VGG network after recognition calculation, and judge from the VGG network's data result whether the sound source orientation is consistent with the designated orientation, where the data result includes that the action type is a mouth action.
S7: If so, determine that the designated user in the designated orientation is the sound source user.
The data result output by the VGG network includes whether a mouth action exists: for example, if the mouth shape changes substantially along the time sequence in the picture information, a mouth action is considered to exist; otherwise it does not. If the VGG network judges that the designated user at the designated orientation has a mouth action, and at the same time the sound source orientation given by sound source localization is consistent with the designated orientation, the designated user is determined to be the sound source user. Combining the advantages of visual localization and sound source localization achieves precise localization of the sound source user, so the speaker can be found quickly, improving the human-computer interaction experience and effect between the speaker and the robot. The embodiments of this application determine the approximate location of the target user through sound source localization and quickly give a localization result; the target user is then precisely located through visual localization, in which the person's series of action data is used as the input to the VGG network, and the action data improves the accuracy of distinguishing the target user. Before the action data is input into the VGG network, it must go through a specific data processing method so that the processed data can be recognized and computed by the VGG network, eliminating the interference of human-like dummies or user-like objects with visual localization. The target user refers to the designated user within the field of view.
Further, step S5 of obtaining the action data of the designated user, processing it in a preset manner to obtain the processing result, and inputting the processing result into the VGG network for recognition calculation to obtain the action type corresponding to the action data includes:
S51: Obtain the action data of the designated user within a designated time period, the action data being a continuous multi-frame action sequence; S52: merge and splice the continuous multi-frame action sequence, via the formula
Figure PCTCN2020093425-appb-000001
into one piece of static image data, where p_i ∈ R^n denotes the key point at time t, i denotes the index of the key point, B_{i,k}(t) denotes the transformation matrix, k denotes the dimension, and p(t) is the static image data output for t ∈ [t_i, t_{i+1}); S53: input the static image data into the VGG network for recognition calculation.
This application applies image and video recognition technology from the field of artificial intelligence, where the designated time period refers to the continuous time span of the mouth motion video collected by the camera. The mouth motion video captured by the camera is split into a continuous multi-frame action sequence and spliced in time order, forming the mouth motion video into one piece of static image data so that it can be recognized and computed by the VGG network. Each person's behavior can be determined by a number of key points, including mouth actions; for example, if a mouth action has 15 key points, then i = 0 to 14. The input side of the VGG network is adapted so that it can process continuous multi-frame action sequences and thus recognize mouth actions. B_{i,k}(t) denotes the transformation matrix and k denotes the dimension, for example
Figure PCTCN2020093425-appb-000002
p(t) is the output result for t ∈ [t_i, t_{i+1}), and R^n denotes integers within the real numbers.
This formula can also be written as
Figure PCTCN2020093425-appb-000003
which is equivalent to saying that, over any final time period t ∈ [t_i, t_{i+1}), the key-point information of these users is synthesized from the motion key points of multiple frames, thereby combining multi-frame continuous motion sequences into a single input information structure; the VGG network's classification result can then target motion actions, and M_6 denotes a 6*6 matrix.
Further, before step S5 of obtaining the action data of the designated user, processing it in a preset manner to obtain the processing result, and inputting the processing result into the VGG network for recognition calculation to obtain the action type corresponding to the action data, the method includes:
S50a: Determine whether the number of designated users within the robot's field of view is two or more; S50b: if so, select, according to the Yolov3 algorithm, the block area corresponding to each designated user in the field-of-view map corresponding to the robot's field of view; S50c: separately intercept the series of actions within the designated time period corresponding to each block area as the action data.
When several people are present at the same designated orientation or within the current field of view, this embodiment of the application first uses the Yolov3 algorithm to select, with boxes, the positions where the people are located, that is, the block areas, and then separately intercepts each person's series of actions as that user's action data; using time-dimension information allows higher-dimensional features to be obtained and improves analysis accuracy. Yolov3 is a one-stage, end-to-end object detector. Yolov3 divides the input image into S*S grid cells, and each cell predicts B bounding boxes; the predicted content of each bounding box includes Location (x, y, w, h), a Confidence Score and the probabilities of C categories, so the number of channels in the Yolov3 output layer is S*S*B*(5+C). The Yolov3 loss function consists of three parts: Location error, Confidence error and classification error.
Further, step S2 of obtaining the pre-rotation spatial area span according to the designated orientation and the visual centerline orientation includes:
S21: Obtain the first area span when rotating clockwise from the visual centerline orientation to the designated orientation, and the second area span when rotating counterclockwise from the visual centerline orientation to the designated orientation; S22: compare the sizes of the first area span and the second area span; S23: when the first area span is greater than the second area span, take the second area span as the spatial area span; when the first area span is not greater than the second area span, take the first area span as the spatial area span.
This embodiment takes the existence of one designated orientation as an example. When sound from the sound source at the designated orientation is received, the visual centerline orientation is rotated toward the direction corresponding to the designated orientation, so that the designated orientation lies within the rotated field of view; preferably the designated orientation coincides with the pre-rotation-adjusted visual centerline orientation. To allow a quick response, the robot is controlled to use the arc area with the smaller span as the spatial area span to be rotated through.
Further, when the number of designated orientations is two or more and the spatial area span accordingly includes two or more spans, step S2 of obtaining the pre-rotation spatial area span according to the designated orientation and the visual centerline orientation includes:
S31: Obtain the first total area span corresponding to rotating clockwise from the visual centerline orientation through all designated orientations, and the second total area span corresponding to rotating counterclockwise from the visual centerline orientation through all designated orientations; S32: compare the sizes of the first total area span and the second total area span; S33: when the first total area span is greater than the second total area span, take the second total area span as the spatial area span; when the first total area span is not greater than the second total area span, take the first total area span as the spatial area span.
This embodiment of the application takes the existence of multiple designated orientations as an example, that is, multiple areas emit sound simultaneously or in succession, so the multiple areas need to be visually and precisely located in turn. First, from all the arc intervals covering the paths from each designated orientation to the pre-rotation visual centerline orientation, the largest coverage arc interval is selected as the total area span. Starting from the pre-rotation visual centerline orientation, the largest coverage arc interval swept when rotating clockwise through each designated orientation in turn is taken as the first total area span; the largest coverage arc interval swept when rotating counterclockwise through each designated orientation in turn is taken as the second total area span. After the rotation direction is selected, the action data of the user corresponding to each designated orientation is analyzed in turn to achieve precise localization of the speaker.
Further, step S6 of receiving the data result output by the VGG network after recognition calculation and judging from the VGG network's data result whether the sound source orientation is consistent with the designated orientation includes:
S61: Analyze whether the data result includes a mouth opening-and-closing action; S62: if so, determine again whether the current sound source orientation is the designated orientation; S63: if so, determine that the sound source orientation is consistent with the designated orientation; otherwise, they are inconsistent.
By analyzing whether a mouth opening-and-closing action exists, a preliminary judgment is made as to whether the user is speaking. If the preliminary judgment is that the user is speaking, sound source localization is called again for auxiliary analysis; if both sound source localization and visual localization indicate the designated user as the speaker, the designated user is judged to be the speaker. That is, if a mouth action exists and the orientation of the sound attributed to the designated user is correct, the designated user is determined to be speaking. If the two judgments do not converge on the same user, for example a mouth action exists but the sound attributed to the designated user does not come from that orientation, the judgment procedure is repeated in a loop to search for the sound source user, that is, the speaker. VGG itself can only process static picture information and identify the features of marked points in a picture, for example recognizing fruit types from the features of marked points; it cannot directly measure action information such as a mouth opening-and-closing action. In this embodiment, the multi-frame pictures of the action video are spliced and input into VGG, the change trajectory of the marked-point positions in the picture is obtained from the VGG output data, and it is judged whether the mouth has an opening-and-closing action; this is combined with sound source localization to judge whether the mouth opening-and-closing action is consistent with the orientation given by sound source localization. If the user's mouth in the video captured at that orientation shows an opening-and-closing action and at the same time a sound source also exists at that orientation, the user is determined to be the speaker, that is, the sound source user. The sound source orientation is still determined by microphone array sound source localization technology.
Further, before step S61 of analyzing whether the data result includes a mouth opening-and-closing action, the method includes:
S60a: Determine whether the camera's focusing condition is normal relative to the designated user's distance from the camera; S60b: if so, determine whether the resolution of the user portrait obtained under that focusing condition is within a preset range; S60c: if so, proceed with the VGG network recognition calculation; otherwise terminate the calculation.
Preferably, to further ensure the privacy and security of the action data, the action data can also be stored in a node of a blockchain.
It should be noted that the blockchain referred to in the present invention is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database, a chain of data blocks generated in association with one another using cryptographic methods; each data block contains a batch of network transaction information, used to verify the validity of its information (anti-counterfeiting) and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
In addition, this solution can also be applied in the field of smart transportation to promote the construction of smart cities. This embodiment uses resolution to exclude interference from virtual characters on electronic screens with the localization of the real speaker. Because an electronic screen is reflective, the resolution of an image or video of a real user captured at the same distance and under the same focusing conditions is far higher than the resolution of a virtual user captured from an electronic screen. When the resolution does not meet the requirement, the VGG network recognition calculation is terminated directly, and the conclusion that the sound source orientation is inconsistent with the designated orientation is output.
Referring to Fig. 2, an apparatus for locating a sound source user according to an embodiment of this application includes:
a first acquisition module 1, used to acquire the designated orientation corresponding to the sound source identified by sound source localization, and the visual centerline orientation corresponding to the robot's current spatial position.
Sound source localization is realized by a microphone array. By setting delay parameters for each microphone in the array and controlling different delay parameters, different azimuth pointings can be achieved; the localization area can be divided into a grid, each grid point delays each microphone in the time domain, and the delayed signals are then summed to compute the sound pressure of the microphone array. The sound source orientation, that is, the azimuth of the sound source relative to the robot, namely the designated orientation, is determined from this sound pressure. The robot has both sound source localization and visual localization, and the visual centerline orientation is the center position within the field of view. For example, it is determined according to whether the robot uses a monocular or a binocular structure: in a monocular structure, the direction of the line passing through the center of the single eye and perpendicular to the plane of the robot's face is taken as the visual centerline orientation; in a binocular structure, the perpendicular bisector through the midpoint of the line connecting the two eyes, perpendicular to the plane of the robot's face, is taken as the visual centerline orientation.
An obtaining module 2, used to obtain the pre-rotation spatial area span according to the designated orientation and the visual centerline orientation.
The spatial area span is the area corresponding to the arc from the robot's current visual centerline orientation to the designated orientation: either the arc area swept when rotating counterclockwise from the current visual centerline orientation to the designated orientation, or the arc area swept when rotating clockwise from the current visual centerline orientation to the designated orientation. Preliminary localization by sound is used to help the robot quickly adjust the orientation of visual localization, improving response sensitivity and accuracy.
A rotation module 3, used to control the robot to rotate according to the pre-rotation spatial area span until the designated orientation lies within the robot's visual range.
The designated orientation lying within the robot's visual range includes lying at any position within the visual range; preferably the designated orientation coincides with the visual centerline orientation, to improve the accuracy of visual localization. Rotation includes rotating the head equipped with the camera, or rotating the robot's entire body. During rotation, the camera can be aimed at the speaker's orientation, that is, at the designated orientation, by coordinating control of the robot's waist and head yaw angles.
A first judgment module 4, used to judge whether a user portrait of a designated user is obtained within the robot's field of view.
The user portrait includes a head portrait, so that whether the user is speaking can be preliminarily estimated by recognizing mouth actions in the head portrait.
A second acquisition module 5, used to acquire, if so, the action data of the designated user, process it in a preset manner to obtain a processing result, and input the processing result into the VGG network for recognition calculation to obtain the action type corresponding to the action data.
When a head portrait exists, the user is considered to be possibly speaking; the mouth action is further acquired and, after being processed in the preset manner, is input into the VGG network for in-depth analysis of the mouth action type. The preset-manner processing includes splicing the acquired mouth motion video information into a single piece of picture information carrying a time sequence, so that it can be recognized by the VGG network.
A receiving module 6, used to receive the data result output by the VGG network after recognition calculation, and judge from the VGG network's data result whether the sound source orientation is consistent with the designated orientation, where the data result includes that the action type is a mouth action.
A determination module 7, used to determine, if so, that the designated user in the designated orientation is the sound source user.
The data result output by the VGG network includes whether a mouth action exists: for example, if the mouth shape changes substantially along the time sequence in the picture information, a mouth action is considered to exist; otherwise it does not. If the VGG network judges that the designated user at the designated orientation has a mouth action, and at the same time the sound source orientation given by sound source localization is consistent with the designated orientation, the designated user is determined to be the sound source user. Combining the advantages of visual localization and sound source localization achieves precise localization of the sound source user, so the speaker can be found quickly, improving the human-computer interaction experience and effect between the speaker and the robot. The embodiments of this application determine the approximate location of the target user through sound source localization and quickly give a localization result; the target user is then precisely located through visual localization, in which the person's series of action data is used as the input to the VGG network, and the action data improves the accuracy of distinguishing the target user. Before the action data is input into the VGG network, it must go through a specific data processing method so that the processed data can be recognized and computed by the VGG network, eliminating the interference of human-like dummies or user-like objects with visual localization. The target user refers to the designated user within the field of view.
Further, the second acquisition module 5 includes:
a first acquisition unit, used to acquire the action data of the designated user within a designated time period, the action data being a continuous multi-frame action sequence; and a splicing unit, used to merge and splice the continuous multi-frame action sequence, via the formula
Figure PCTCN2020093425-appb-000004
into one piece of static image data, where p_i ∈ R^n denotes the key point at time t, i denotes the index of the key point, B_{i,k}(t) denotes the transformation matrix, k denotes the dimension, and p(t) is the static image data output for t ∈ [t_i, t_{i+1}); and an input unit, used to input the static image data into the VGG network for recognition calculation.
The designated time period refers to the continuous time span of the mouth motion video captured by the camera. The mouth motion video captured by the camera is split into a continuous multi-frame action sequence and spliced in time order, forming the mouth motion video into one piece of static image data so that it can be recognized and computed by the VGG network. Each person's behavior can be determined by a number of key points, including mouth actions; for example, if a mouth action has 15 key points, then i = 0 to 14. The input side of the VGG network is adapted so that it can process continuous multi-frame action sequences and thus recognize mouth actions. B_{i,k}(t) denotes the transformation matrix and k denotes the dimension, for example
Figure PCTCN2020093425-appb-000005
p(t) is the output result for t ∈ [t_i, t_{i+1}), and R^n denotes integers within the real numbers.
This formula can also be written as
Figure PCTCN2020093425-appb-000006
which is equivalent to saying that, over any final time period t ∈ [t_i, t_{i+1}), the key-point information of these users is synthesized from the motion key points of multiple frames, thereby combining multi-frame continuous motion sequences into a single input information structure; the VGG network's classification result can then target motion actions, and M_6 denotes a 6*6 matrix.
Further, the apparatus for locating a sound source user includes:
a second judgment module, used to judge whether the number of designated users within the robot's field of view is two or more; a selection module, used to select, if so, the block area corresponding to each designated user in the field-of-view map corresponding to the robot's field of view according to the Yolov3 algorithm; and an interception module, used to separately intercept the series of actions within the designated time period corresponding to each block area as the action data.
When several people are present at the same designated orientation or within the current field of view, this embodiment of the application first uses the Yolov3 algorithm to select, with boxes, the positions where the people are located, that is, the block areas, and then separately intercepts each person's series of actions as that user's action data; using time-dimension information allows higher-dimensional features to be obtained and improves analysis accuracy. Yolov3 is a one-stage, end-to-end object detector. Yolov3 divides the input image into S*S grid cells, and each cell predicts B bounding boxes; the predicted content of each bounding box includes Location (x, y, w, h), a Confidence Score and the probabilities of C categories, so the number of channels in the Yolov3 output layer is S*S*B*(5+C). The Yolov3 loss function consists of three parts: Location error, Confidence error and classification error.
Further, the obtaining module 2 includes:
a second acquisition unit, used to acquire the first area span when rotating clockwise from the visual centerline orientation to the designated orientation, and the second area span when rotating counterclockwise from the visual centerline orientation to the designated orientation; a first comparison unit, used to compare the sizes of the first area span and the second area span; and a first setting unit, used to take the second area span as the spatial area span when the first area span is greater than the second area span, and to take the first area span as the spatial area span when the first area span is not greater than the second area span.
This embodiment takes the existence of one designated orientation as an example. When sound from the sound source at the designated orientation is received, the visual centerline orientation is rotated toward the direction corresponding to the designated orientation, so that the designated orientation lies within the rotated field of view; preferably the designated orientation coincides with the pre-rotation-adjusted visual centerline orientation. To allow a quick response, the robot is controlled to use the arc area with the smaller span as the spatial area span to be rotated through.
Further, in another embodiment, the obtaining module 2 includes:
a third acquisition unit, used to acquire the first total area span corresponding to rotating clockwise from the visual centerline orientation through all designated orientations, and the second total area span corresponding to rotating counterclockwise from the visual centerline orientation through all designated orientations; a second comparison unit, used to compare the sizes of the first total area span and the second total area span; and a second setting unit, used to take the second total area span as the spatial area span when the first total area span is greater than the second total area span, and to take the first total area span as the spatial area span when the first total area span is not greater than the second total area span.
This embodiment of the application takes the existence of multiple designated orientations as an example, that is, multiple areas emit sound simultaneously or in succession, so the multiple areas need to be visually and precisely located in turn. First, from all the arc intervals covering the paths from each designated orientation to the pre-rotation visual centerline orientation, the largest coverage arc interval is selected as the total area span. Starting from the pre-rotation visual centerline orientation, the largest coverage arc interval swept when rotating clockwise through each designated orientation in turn is taken as the first total area span; the largest coverage arc interval swept when rotating counterclockwise through each designated orientation in turn is taken as the second total area span. After the rotation direction is selected, the action data of the user corresponding to each designated orientation is analyzed in turn to achieve precise localization of the speaker.
Further, the receiving module 6 includes:
an analysis unit, used to analyze whether the data result includes a mouth opening-and-closing action; a confirmation unit, used to determine again, if so, whether the current sound source orientation is the designated orientation; and a judgment unit, used to judge, if so, that the sound source orientation is consistent with the designated orientation, and otherwise that they are inconsistent.
By analyzing whether a mouth opening-and-closing action exists, a preliminary judgment is made as to whether the user is speaking. If the preliminary judgment is that the user is speaking, sound source localization is called again for auxiliary analysis; if both sound source localization and visual localization indicate the designated user as the speaker, the designated user is judged to be the speaker. That is, if a mouth action exists and the orientation of the sound attributed to the designated user is correct, the designated user is determined to be speaking. If the two judgments do not converge on the same user, for example a mouth action exists but the sound attributed to the designated user does not come from that orientation, the judgment procedure is repeated in a loop to search for the sound source user, that is, the speaker. VGG itself can only process static picture information and identify the features of marked points in a picture, for example recognizing fruit types from the features of marked points; it cannot directly measure action information such as a mouth opening-and-closing action. In this embodiment, the multi-frame pictures of the action video are spliced and input into VGG, the change trajectory of the marked-point positions in the picture is obtained from the VGG output data, and it is judged whether the mouth has an opening-and-closing action; this is combined with sound source localization to judge whether the mouth opening-and-closing action is consistent with the orientation given by sound source localization. If the user's mouth in the video captured at that orientation shows an opening-and-closing action and at the same time a sound source also exists at that orientation, the user is determined to be the speaker, that is, the sound source user. The sound source orientation is still determined by microphone array sound source localization technology.
Further, the receiving module 6 includes:
a first judgment unit, used to judge whether the camera's focusing condition is normal relative to the designated user's distance from the camera; a second judgment unit, used to judge, if so, whether the resolution of the user portrait obtained under that focusing condition is within a preset range; and a control unit, used to proceed with the VGG network recognition calculation if so, and otherwise terminate the calculation.
This embodiment uses resolution to exclude interference from virtual characters on electronic screens with the localization of the real speaker. Because an electronic screen is reflective, the resolution of an image or video of a real user captured at the same distance and under the same focusing conditions is far higher than the resolution of a virtual user captured from an electronic screen. When the resolution does not meet the requirement, the VGG network recognition calculation is terminated directly, and the conclusion that the sound source orientation is inconsistent with the designated orientation is output.
Referring to Fig. 3, an embodiment of this application further provides a computer device. The computer device may be a server, and its internal structure may be as shown in Fig. 3. The computer device includes a processor, a memory, a network interface and a database connected through a system bus, wherein the processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store all the data needed in the process of locating the sound source user. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements the method for locating a sound source user.
The processor executes the method for locating a sound source user, including: obtaining the designated orientation corresponding to the sound source identified by sound source localization, and the visual centerline orientation corresponding to the robot's current spatial position; obtaining the pre-rotation spatial area span according to the designated orientation and the visual centerline orientation; controlling the robot to rotate according to the pre-rotation spatial area span until the designated orientation lies within the robot's visual range; judging whether a user portrait of a designated user is obtained within the robot's field of view; if so, obtaining the action data of the designated user, processing it in a preset manner to obtain a processing result, and inputting the processing result into the VGG network for recognition calculation to obtain the action type corresponding to the action data; receiving the data result output by the VGG network after recognition calculation, and judging from the VGG network's data result whether the sound source orientation is consistent with the designated orientation, where the data result includes that the action type is a mouth action; and if so, determining that the designated user in the designated orientation is the sound source user.
By using a person's series of action data as the input to the VGG network in visual localization, the computer device improves the accuracy of distinction through the action data and uses visual localization and sound localization together, improving the accuracy with which the robot locates the target user who is speaking.
Those skilled in the art will understand that the structure shown in Fig. 3 is only a block diagram of part of the structure related to the solution of this application, and does not constitute a limitation on the computer device to which the solution of this application is applied.
An embodiment of this application further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, a method for locating a sound source user is implemented, including: obtaining the designated orientation corresponding to the sound source identified by sound source localization, and the visual centerline orientation corresponding to the robot's current spatial position; obtaining the pre-rotation spatial area span according to the designated orientation and the visual centerline orientation; controlling the robot to rotate according to the pre-rotation spatial area span until the designated orientation lies within the robot's visual range; judging whether a user portrait of a designated user is obtained within the robot's field of view; if so, obtaining the action data of the designated user, processing it in a preset manner to obtain a processing result, and inputting the processing result into the VGG network for recognition calculation to obtain the action type corresponding to the action data; receiving the data result output by the VGG network after recognition calculation, and judging from the VGG network's data result whether the sound source orientation is consistent with the designated orientation, where the data result includes that the action type is a mouth action; and if so, determining that the designated user in the designated orientation is the sound source user.
By using a person's series of action data as the input to the VGG network in visual localization, the computer-readable storage medium improves the accuracy of distinction through the action data and uses visual localization and sound localization together, improving the accuracy with which the robot locates the target user who is speaking.
Those of ordinary skill in the art will understand that all or part of the procedures in the methods of the embodiments can be implemented by a computer program instructing relevant hardware; the computer program can be stored in a computer-readable storage medium and, when executed, may include the procedures of the embodiments of the methods; the computer-readable storage medium may be non-volatile or volatile. Any reference to memory, storage, a database or other media provided in this application and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (SSRSDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), etc.

Claims (20)

  1. A method for locating a sound source user, comprising:
    obtaining the designated orientation corresponding to the sound source identified by sound source localization, and the visual centerline orientation corresponding to the robot's current spatial position;
    obtaining a pre-rotation spatial area span according to the designated orientation and the visual centerline orientation;
    controlling the robot to rotate according to the pre-rotation spatial area span until the designated orientation lies within the visual range of the robot;
    judging whether a user portrait of a designated user is obtained within the field of view of the robot;
    if so, obtaining the action data of the designated user, processing it in a preset manner to obtain a processing result, and inputting the processing result into a VGG network for recognition calculation to obtain the action type corresponding to the action data;
    receiving the data result output by the VGG network after recognition calculation, and judging from the data result of the VGG network whether the sound source orientation is consistent with the designated orientation, wherein the data result includes that the action type is a mouth action;
    if so, determining that the designated user in the designated orientation is the sound source user.
  2. The method for locating a sound source user according to claim 1, wherein the step of obtaining the action data of the designated user, processing it in a preset manner to obtain the processing result, and inputting the processing result into the VGG network for recognition calculation to obtain the action type corresponding to the action data comprises:
    obtaining the action data of the designated user within a designated time period, the action data being a continuous multi-frame action sequence;
    merging and splicing the continuous multi-frame action sequence, via the formula
    Figure PCTCN2020093425-appb-100001
    into one piece of static image data, wherein p_i ∈ R^n denotes the key point at time t, i denotes the index of the key point, B_{i,k}(t) denotes the transformation matrix, k denotes the dimension, and p(t) is the static image data output for t ∈ [t_i, t_{i+1});
    inputting the static image data into the VGG network for recognition calculation.
  3. The method for locating a sound source user according to claim 1, wherein before the step of obtaining the action data of the designated user, processing it in a preset manner to obtain the processing result, and inputting the processing result into the VGG network for recognition calculation to obtain the action type corresponding to the action data, the method comprises:
    judging whether the number of the designated users within the field of view of the robot is two or more;
    if so, selecting, according to the Yolov3 algorithm, the block area corresponding to each designated user in the field-of-view map corresponding to the field of view of the robot;
    separately intercepting the series of actions within the designated time period corresponding to each block area as the action data.
  4. The method for locating a sound source user according to claim 1, wherein the step of obtaining the pre-rotation spatial area span according to the designated orientation and the visual centerline orientation comprises:
    obtaining the first area span when rotating clockwise from the visual centerline orientation to the designated orientation, and the second area span when rotating counterclockwise from the visual centerline orientation to the designated orientation;
    comparing the sizes of the first area span and the second area span;
    when the first area span is greater than the second area span, taking the second area span as the spatial area span, and when the first area span is not greater than the second area span, taking the first area span as the spatial area span.
  5. The method for locating a sound source user according to claim 1, wherein the number of the designated orientations is two or more and the spatial area span includes two or more spans, and the step of obtaining the pre-rotation spatial area span according to the designated orientation and the visual centerline orientation comprises:
    obtaining the first total area span corresponding to rotating clockwise from the visual centerline orientation through all the designated orientations, and the second total area span corresponding to rotating counterclockwise from the visual centerline orientation through all the designated orientations;
    comparing the sizes of the first total area span and the second total area span;
    when the first total area span is greater than the second total area span, taking the second total area span as the spatial area span, and when the first total area span is not greater than the second total area span, taking the first total area span as the spatial area span.
  6. The method for locating a sound source user according to claim 1, wherein the step of receiving the data result output by the VGG network after recognition calculation and judging from the data result of the VGG network whether the sound source orientation is consistent with the designated orientation comprises:
    analyzing whether the data result includes a mouth opening-and-closing action;
    if so, determining again whether the current sound source orientation is the designated orientation;
    if so, determining that the sound source orientation is consistent with the designated orientation, and otherwise that they are inconsistent.
  7. The method for locating a sound source user according to claim 6, wherein before the step of analyzing whether the data result includes a mouth opening-and-closing action, the method comprises:
    judging whether the focusing condition of the camera is normal relative to the distance of the designated user from the camera;
    if so, judging whether the resolution of the user portrait obtained under the focusing condition is within a preset range;
    if so, proceeding with the VGG network recognition calculation, and otherwise terminating the calculation.
  8. An apparatus for locating a sound source user, comprising:
    a first acquisition module, used to acquire the designated orientation corresponding to the sound source identified by sound source localization, and the visual centerline orientation corresponding to the robot's current spatial position;
    an obtaining module, used to obtain a pre-rotation spatial area span according to the designated orientation and the visual centerline orientation;
    a rotation module, used to control the robot to rotate according to the pre-rotation spatial area span until the designated orientation lies within the visual range of the robot;
    a first judgment module, used to judge whether a user portrait of a designated user is obtained within the field of view of the robot;
    a second acquisition module, used to acquire, if so, the action data of the designated user, process it in a preset manner to obtain a processing result, and input the processing result into a VGG network for recognition calculation to obtain the action type corresponding to the action data;
    a receiving module, used to receive the data result output by the VGG network after recognition calculation and judge from the data result of the VGG network whether the sound source orientation is consistent with the designated orientation;
    a determination module, used to determine, if so, that the designated user in the designated orientation is the sound source user.
  9. The apparatus for locating a sound source user according to claim 8, wherein the second acquisition module comprises:
    a first acquisition unit, used to acquire the action data of the designated user within a designated time period, the action data being a continuous multi-frame action sequence;
    a splicing unit, used to merge and splice the continuous multi-frame action sequence, via the formula
    Figure PCTCN2020093425-appb-100002
    合并拼接成一个静态图像数据,其中,p i∈R n,表示t时刻的关键点,i表示关键点的序号;B i,k(t)表示变换矩阵,k表示维度;p(t)是t∈[t i,t i+1)时间内输出的静态图像数据;
    输入单元,用于将所述静态图像数据输入至VGG网络进行识别计算。
  10. 根据权利要求8所述的定位声源用户的装置,包括:
    第二判断模块,用于判断所述机器人的视野范围内的所述指定用户数量是否为两个及以上;
    选择模块,用于若所述指定用户数量为两个及以上,则根据Yolov3算法在所述机器人的视野范围对应的视野图中,选择出各所述指定用户分别对应的方块区域;
    截取模块,用于分别截取各所述方块区域对应的所述指定时间段内的系列动作作为所述动作数据。
  11. 一种计算机设备,包括存储器和处理器,所述存储器存储有计算机程序,所述处理器执行所述计算机程序时实现定位声源用户的方法,其中,定位声源用户的方法,包括:
    获取声源定位识别到的声音来源对应的指定方位,以及机器人当前所处空间位置对应的视觉中心线方位;
    根据所述指定方位以及所述视觉中心线方位,得到预旋转的空间区域跨度;
    根据所述预旋转的空间区域跨度控制机器人旋转,旋转至所述指定方位位于所述机器人的视觉范围内;
    判断在所述机器人的视野范围内是否获取到指定用户的用户画像;
    若是,则获取所述指定用户的动作数据并经过预设方式处理,得到处理结果,并将所述处理结果输入至VGG网络进行识别计算,以得到所述动作数据对应的动作类型;
    接收所述VGG网络识别计算后输出的数据结果,并根据所述VGG网络的数据结果判断声源方位是否与所述指定方位相一致,其中,所述数据结果包括所述动作类型属于嘴部动作;
    若是,则判定所述指定方位的指定用户为声源用户。
  12. The computer device according to claim 11, wherein the step of acquiring the action data of the designated user and processing it in a preset manner to obtain a processing result, and inputting the processing result into the VGG network for recognition computation to obtain the action type corresponding to the action data comprises:
    acquiring action data of the designated user within a designated time period, the action data being a continuous multi-frame action sequence;
    merging and stitching the continuous multi-frame action sequence into one piece of static image data via
    p(t) = Σ_i p_i · B_{i,k}(t),
    where p_i ∈ R^n denotes a key point at time t, i denotes the index of the key point, B_{i,k}(t) denotes the transformation matrix, k denotes the dimension, and p(t) is the static image data output within t ∈ [t_i, t_{i+1});
    inputting the static image data into the VGG network for recognition computation.
  13. The computer device according to claim 11, wherein before the step of acquiring the action data of the designated user and processing it in a preset manner to obtain a processing result, and inputting the processing result into the VGG network for recognition computation to obtain the action type corresponding to the action data, the method comprises:
    determining whether the number of designated users within the field of view of the robot is two or more;
    if so, selecting, according to the Yolov3 algorithm, a box region corresponding to each designated user in the field-of-view image corresponding to the field of view of the robot;
    separately extracting, from each box region, the series of actions within the designated time period as the action data.
  14. The computer device according to claim 11, wherein the step of obtaining the pre-rotation spatial region span according to the designated orientation and the visual centerline orientation comprises:
    acquiring a first region span for rotating clockwise from the visual centerline orientation to the designated orientation, and a second region span for rotating counterclockwise from the visual centerline orientation to the designated orientation;
    comparing the magnitudes of the first region span and the second region span;
    when the first region span is greater than the second region span, taking the second region span as the spatial region span, and when the first region span is not greater than the second region span, taking the first region span as the spatial region span.
  15. The computer device according to claim 11, wherein the number of designated orientations is two or more, the spatial region span includes two or more spans, and the step of obtaining the pre-rotation spatial region span according to the designated orientation and the visual centerline orientation comprises:
    acquiring a first total region span corresponding to rotating clockwise from the visual centerline orientation past all the designated orientations, and a second total region span corresponding to rotating counterclockwise from the visual centerline orientation past all the designated orientations;
    comparing the magnitudes of the first total region span and the second total region span;
    when the first total region span is greater than the second total region span, taking the second total region span as the spatial region span, and when the first total region span is not greater than the second total region span, taking the first total region span as the spatial region span.
  16. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements a method for locating a sound source user, the method comprising:
    acquiring a designated orientation corresponding to a sound source identified by sound source localization, and a visual centerline orientation corresponding to the current spatial position of a robot;
    obtaining a pre-rotation spatial region span according to the designated orientation and the visual centerline orientation;
    controlling the robot to rotate according to the pre-rotation spatial region span until the designated orientation lies within the visual range of the robot;
    determining whether a user portrait of a designated user is acquired within the field of view of the robot;
    if so, acquiring action data of the designated user and processing it in a preset manner to obtain a processing result, and inputting the processing result into a VGG network for recognition computation to obtain an action type corresponding to the action data;
    receiving a data result output by the VGG network after the recognition computation, and determining, according to the data result of the VGG network, whether the sound source orientation is consistent with the designated orientation, wherein the data result includes that the action type belongs to a mouth action;
    if so, determining that the designated user at the designated orientation is the sound source user.
  17. The computer-readable storage medium according to claim 16, wherein the step of acquiring the action data of the designated user and processing it in a preset manner to obtain a processing result, and inputting the processing result into the VGG network for recognition computation to obtain the action type corresponding to the action data comprises:
    acquiring action data of the designated user within a designated time period, the action data being a continuous multi-frame action sequence;
    merging and stitching the continuous multi-frame action sequence into one piece of static image data via
    p(t) = Σ_i p_i · B_{i,k}(t),
    where p_i ∈ R^n denotes a key point at time t, i denotes the index of the key point, B_{i,k}(t) denotes the transformation matrix, k denotes the dimension, and p(t) is the static image data output within t ∈ [t_i, t_{i+1});
    inputting the static image data into the VGG network for recognition computation.
  18. The computer-readable storage medium according to claim 16, wherein before the step of acquiring the action data of the designated user and processing it in a preset manner to obtain a processing result, and inputting the processing result into the VGG network for recognition computation to obtain the action type corresponding to the action data, the method comprises:
    determining whether the number of designated users within the field of view of the robot is two or more;
    if so, selecting, according to the Yolov3 algorithm, a box region corresponding to each designated user in the field-of-view image corresponding to the field of view of the robot;
    separately extracting, from each box region, the series of actions within the designated time period as the action data.
  19. The computer-readable storage medium according to claim 16, wherein the step of obtaining the pre-rotation spatial region span according to the designated orientation and the visual centerline orientation comprises:
    acquiring a first region span for rotating clockwise from the visual centerline orientation to the designated orientation, and a second region span for rotating counterclockwise from the visual centerline orientation to the designated orientation;
    comparing the magnitudes of the first region span and the second region span;
    when the first region span is greater than the second region span, taking the second region span as the spatial region span, and when the first region span is not greater than the second region span, taking the first region span as the spatial region span.
  20. The computer-readable storage medium according to claim 16, wherein the number of designated orientations is two or more, the spatial region span includes two or more spans, and the step of obtaining the pre-rotation spatial region span according to the designated orientation and the visual centerline orientation comprises:
    acquiring a first total region span corresponding to rotating clockwise from the visual centerline orientation past all the designated orientations, and a second total region span corresponding to rotating counterclockwise from the visual centerline orientation past all the designated orientations;
    comparing the magnitudes of the first total region span and the second total region span;
    when the first total region span is greater than the second total region span, taking the second total region span as the spatial region span, and when the first total region span is not greater than the second total region span, taking the first total region span as the spatial region span.
PCT/CN2020/093425 2020-04-24 2020-05-29 Method and apparatus for locating a sound source user, and computer device WO2021212608A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010334984.2 2020-04-24
CN202010334984.2A CN111650558B (zh) 2020-04-24 Method and apparatus for locating a sound source user, and computer device

Publications (1)

Publication Number Publication Date
WO2021212608A1 true WO2021212608A1 (zh) 2021-10-28

Family

ID=72340980

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/093425 WO2021212608A1 (zh) 2020-04-24 2020-05-29 Method and apparatus for locating a sound source user, and computer device

Country Status (2)

Country Link
CN (1) CN111650558B (zh)
WO (1) WO2021212608A1 (zh)


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101295016B (zh) * 2008-06-13 2011-04-27 河北工业大学 Autonomous sound source searching and locating method
CN105184214B (zh) * 2015-07-20 2019-02-01 北京进化者机器人科技有限公司 Human body locating method and system based on sound source localization and face detection
CN105760824B (zh) * 2016-02-02 2019-02-01 北京进化者机器人科技有限公司 Moving human body tracking method and system
US10474883B2 (en) * 2016-11-08 2019-11-12 Nec Corporation Siamese reconstruction convolutional neural network for pose-invariant face recognition
CN110569808A (zh) * 2019-09-11 2019-12-13 腾讯科技(深圳)有限公司 Living body detection method and apparatus, and computer device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2145935A (en) * 1983-09-05 1985-04-11 Tomy Kogyo Co Voice recognition toy
CN103235645A (zh) * 2013-04-25 2013-08-07 上海大学 Floor-standing display interface adaptive tracking and adjusting device and method
CN106970698A (zh) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 Household intelligent device
CN209356668U (zh) * 2018-11-23 2019-09-06 中国科学院电子学研究所 Sound source localization and recognition device
US20190344428A1 (en) * 2019-03-08 2019-11-14 Lg Electronics Inc. Robot
CN110691196A (zh) * 2019-10-30 2020-01-14 歌尔股份有限公司 Sound source localization method for an audio device, and audio device

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220013080A1 (en) * 2018-10-29 2022-01-13 Goertek Inc. Directional display method and apparatus for audio device and audio device
US11551633B2 (en) * 2018-10-29 2023-01-10 Goertek Inc. Directional display method and apparatus for audio device and audio device
CN113762219A (zh) * 2021-11-03 2021-12-07 恒林家居股份有限公司 Person recognition method and system in a mobile conference room, and storage medium
CN114594892A (zh) * 2022-01-29 2022-06-07 深圳壹秘科技有限公司 Remote interaction method, remote interaction device, and computer storage medium
CN114594892B (zh) * 2022-01-29 2023-11-24 深圳壹秘科技有限公司 Remote interaction method, remote interaction device, and computer storage medium

Also Published As

Publication number Publication date
CN111650558A (zh) 2020-09-11
CN111650558B (zh) 2023-10-10


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20932543

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20932543

Country of ref document: EP

Kind code of ref document: A1