CN111650558A - Method, device and computer equipment for positioning sound source user

Method, device and computer equipment for positioning sound source user

Info

Publication number
CN111650558A
Authority
CN
China
Prior art keywords
sound source
user
designated
region span
visual
Prior art date
Legal status
Granted
Application number
CN202010334984.2A
Other languages
Chinese (zh)
Other versions
CN111650558B (en)
Inventor
龚连银
苏雄飞
周宝
陈远旭
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202010334984.2A priority Critical patent/CN111650558B/en
Priority to PCT/CN2020/093425 priority patent/WO2021212608A1/en
Publication of CN111650558A publication Critical patent/CN111650558A/en
Application granted granted Critical
Publication of CN111650558B publication Critical patent/CN111650558B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Manipulator (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to artificial intelligence and blockchain technology, and discloses a method for locating a sound source user, which comprises the following steps: acquiring the designated orientation corresponding to a sound source identified by sound source localization and the visual centerline orientation corresponding to the robot's current spatial position; obtaining a pre-rotation spatial region span from the designated orientation and the visual centerline orientation; controlling the robot to rotate according to the pre-rotation spatial region span until the designated orientation lies within the robot's visual range; judging whether a user portrait of a designated user is acquired within the robot's field of view; if so, acquiring the action data of the designated user, processing it in a preset manner, and inputting the processing result into a VGG network for identification calculation to obtain the action type; receiving the data result output by the VGG network identification calculation and judging, according to that result, whether the sound source orientation is consistent with the designated orientation; and if so, determining that the designated user in the designated orientation is the sound source user, thereby achieving accurate positioning.

Description

Method, device and computer equipment for positioning sound source user
Technical Field
The present application relates to the field of computers, and more particularly, to a method, an apparatus, and a computer device for locating a sound source user.
Background
Existing robot systems typically have only one positioning modality, visual or audio. Visual positioning, however, places high demands on the usage environment: it needs good lighting, is essentially unavailable when the user is outside the camera's range, must process a large amount of data, and therefore demands high computing capability from the robot system. Sound-based positioning has low precision, cannot support interactive scenarios that require accurate tracking, and becomes even less precise in noisy environments. Existing robot positioning systems therefore cannot meet the requirement of accurate positioning across various scenarios.
Disclosure of Invention
The application mainly aims to provide a method for positioning a sound source user, and aims to solve the technical problem that the existing robot positioning system cannot meet the requirement of accurate positioning in various scenes.
The application provides a method for positioning a sound source user, which comprises the following steps:
acquiring a designated orientation corresponding to a sound source identified by sound source localization and a visual centerline orientation corresponding to the current spatial position of the robot;
obtaining a pre-rotation space region span according to the designated orientation and the visual central line orientation;
controlling the robot to rotate according to the pre-rotated space region span until the designated orientation is within the visual range of the robot;
judging whether a user portrait of a specified user is acquired in the visual field range of the robot;
if so, acquiring the action data of the specified user, processing the action data in a preset mode to obtain a processing result, and inputting the processing result into a VGG network for identification calculation to obtain an action type corresponding to the action data;
receiving a data result output after the VGG network identification calculation, and judging whether the sound source orientation is consistent with the designated orientation according to the data result of the VGG network, wherein the data result comprises that the action type belongs to a mouth action;
and if so, judging that the designated user in the designated orientation is the sound source user.
Preferably, the step of obtaining the action data of the specified user, processing the action data in a preset manner to obtain a processing result, and inputting the processing result to a VGG network for identification calculation to obtain an action type corresponding to the action data includes:
acquiring action data of the designated user in a designated time period, wherein the action data is a continuous multi-frame action sequence;
merging and splicing successive frames of the action sequence into one piece of static image data through
p(t) = Σ_i p_i · B_{i,k}(t)
wherein p_i ∈ R^n represents the key point at time t, i represents the serial number of the key point, B_{i,k}(t) denotes a transformation matrix, k denotes the dimension, and p(t) is the static image data output over t ∈ [t_i, t_{i+1});
and inputting the static image data into a VGG network for identification calculation.
Preferably, before the step of obtaining the action data of the specified user, processing the action data in a preset manner to obtain a processing result, and inputting the processing result to the VGG network for identification calculation to obtain an action type corresponding to the action data, the method includes:
judging whether the number of the appointed users in the visual field range of the robot is two or more;
if yes, selecting square areas corresponding to the designated users in a visual field graph corresponding to the visual field range of the robot according to a Yolov3 algorithm;
and respectively intercepting a series of actions in the specified time period corresponding to each square area as the action data.
Preferably, the step of obtaining a pre-rotated spatial region span according to the designated orientation and the visual centerline orientation comprises:
obtaining a first region span when rotated clockwise from the visual centerline orientation to the specified orientation and a second region span when rotated counterclockwise from the visual centerline orientation to the specified orientation;
comparing the size of the first region span and the second region span;
when the first region span is greater than the second region span, the second region span is taken as the spatial region span, and when the first region span is not greater than the second region span, the first region span is taken as the spatial region span.
Preferably, the number of the designated orientations is two or more, the spatial region span includes two or more, and the step of obtaining the pre-rotated spatial region span according to the designated orientations and the visual centerline orientation includes:
obtaining a first total region span corresponding to clockwise rotation from the visual centerline orientation through all of the designated orientations, and a second total region span corresponding to counterclockwise rotation from the visual centerline orientation through all of the designated orientations;
comparing the size of the first total region span and the second total region span;
when the first total region span is greater than the second total region span, the second total region span is taken as the spatial region span, and when the first total region span is not greater than the second total region span, the first total region span is taken as the spatial region span.
Preferably, the step of receiving the data result output after the VGG network identification calculation, and determining whether the sound source position is consistent with the specified position according to the data result of the VGG network includes:
analyzing whether the data result comprises opening and closing actions of the mouth;
if yes, determining whether the current sound source position is the designated position again;
if so, judging that the sound source direction is consistent with the specified direction, otherwise, judging that the sound source direction is inconsistent with the specified direction.
Preferably, the step of analyzing whether the data result includes mouth opening and closing actions is preceded by the steps of:
judging whether the focusing condition of the camera is normal relative to the distance from the appointed user to the camera;
if yes, judging whether the resolution of the user portrait acquired under the focusing condition is within a preset range;
and if so, controlling the VGG network to perform the identification calculation, otherwise terminating the calculation.
The present application further provides an apparatus for locating a sound source user, comprising:
the first acquisition module is used for acquiring the designated position corresponding to the sound source identified by sound source positioning and the visual center line position corresponding to the current spatial position of the robot;
the obtaining module is used for obtaining pre-rotation space region span according to the designated orientation and the visual central line orientation;
the rotation module is used for controlling the robot to rotate according to the pre-rotated space region span until the specified direction is located in the visual range of the robot;
the first judgment module is used for judging whether a user portrait of a specified user is acquired in the visual field range of the robot;
the second acquisition module is used for acquiring the action data of the specified user, processing the action data in a preset mode to obtain a processing result, and inputting the processing result into the VGG network for identification calculation to obtain an action type corresponding to the action data;
the receiving module is used for receiving the data result output after the VGG network identification calculation, and judging whether the sound source position is consistent with the specified position according to the data result of the VGG network;
and the judging module is used for judging, if so, that the designated user in the designated orientation is the sound source user.
The present application further provides a computer device comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the above method when executing the computer program.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method as described above.
According to the method and the device, a person's sequential action data is used as the input of the VGG network during visual positioning, the discrimination precision is improved through the action data, and visual positioning and sound positioning are used together, so that the precision with which the robot locates the speaking target user is improved.
Drawings
FIG. 1 is a schematic flow chart of a method for locating a sound source user according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of an apparatus for locating a sound source user according to an embodiment of the present application;
fig. 3 is a schematic diagram of an internal structure of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Referring to fig. 1, a method for locating a sound source user according to an embodiment of the present application includes:
s1: and acquiring the designated position corresponding to the sound source identified by sound source positioning and the visual central line position corresponding to the current spatial position of the robot.
The sound source localization is achieved by a microphone array. The method comprises the steps of setting delay parameters for each microphone in an array, realizing different azimuth directions by controlling different delay parameters, carrying out grid division on a positioned area, delaying each microphone in a time domain by each grid point, summing up and calculating sound pressure of the microphone array, and determining the azimuth of a sound source through the sound pressure, namely the azimuth position of the sound source relative to a robot, namely an appointed azimuth. The robot has both sound source positioning and visual positioning, and the visual center line orientation is the center position in the visual field range. For example, the robot is determined according to whether the robot selects a monocular structure or a binocular structure, and the direction of a straight line which passes through the center of the monocular and is perpendicular to the plane where the face of the robot is located is taken as the direction of a visual center line in the monocular structure; the binocular structure takes the direction of a perpendicular bisector passing through the midpoint of the binocular connecting line and perpendicular to the plane where the face of the robot is located as the direction of the visual center line.
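To illustrate the delay-and-sum idea described above, the following Python sketch scans candidate azimuths on a grid, delays each microphone signal accordingly, and keeps the azimuth whose summed signal has the largest power. It is a minimal sketch under a planar far-field assumption; the array geometry, sampling rate and signal arrays are illustrative inputs, not values taken from the patent.

```python
import numpy as np

def estimate_azimuth(signals, mic_xy, fs, c=343.0, n_grid=360):
    """Delay-and-sum scan over candidate azimuths (far-field assumption).

    signals: (n_mics, n_samples) time-domain microphone signals.
    mic_xy:  (n_mics, 2) microphone coordinates in metres.
    Returns the azimuth in degrees whose steered sum has the largest power.
    """
    n_mics, n_samples = signals.shape
    best_az, best_power = 0.0, -np.inf
    for az in np.linspace(0.0, 360.0, n_grid, endpoint=False):
        direction = np.array([np.cos(np.radians(az)), np.sin(np.radians(az))])
        # Far-field delay of each microphone relative to the array origin, in samples.
        delays = (mic_xy @ direction) / c * fs
        steered = np.zeros(n_samples)
        for m in range(n_mics):
            # Compensate each microphone's delay before summing.
            steered += np.roll(signals[m], int(round(-delays[m])))
        power = np.mean(steered ** 2)  # summed "sound pressure" for this grid point
        if power > best_power:
            best_az, best_power = az, power
    return best_az
```

In practice the grid can also cover elevation and distance; the one-dimensional azimuth scan is only the simplest case of the grid division mentioned above.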
S2: and obtaining the pre-rotation space region span according to the designated orientation and the visual central line orientation.
The space region span comprises a region corresponding to the radian range from the current visual central line position of the robot to the designated position, a radian region corresponding to the rotation from the counterclockwise direction of the current visual central line position to the designated position, or a radian region corresponding to the rotation from the clockwise direction of the current visual central line position to the designated position. The robot is assisted to rapidly adjust the visual positioning direction by primarily positioning the sound source, and the response sensitivity and accuracy are improved.
S3: and controlling the robot to rotate according to the pre-rotated space region span until the specified orientation is located in the visual range of the robot.
The designated orientation is located in the visual range of the robot and comprises any position located in the visual range, and the designated orientation is preferably coincident with the orientation of the visual center line so as to improve the accuracy of visual positioning. The rotating includes rotating a head equipped with a camera, or rotating the entire body of the robot. The rotation process can align the camera to the speaker orientation, namely to the designated orientation by controlling the waist and head yaw angle of the robot to match.
S4: and judging whether the user portrait of the specified user is acquired in the visual field range of the robot.
The user image includes a head image, so that whether the user is speaking can be estimated by recognizing the mouth movement in the head image.
S5: and if so, acquiring the action data of the specified user, processing the action data in a preset mode to obtain a processing result, and inputting the processing result into the VGG network for identification calculation to obtain an action type corresponding to the action data.
When the head portrait exists, the user is considered to possibly speak, the mouth movement is further acquired, and after the mouth movement is processed in a preset mode, the mouth movement is input into a VGG network to perform deep analysis calculation on the mouth movement type. The preset mode processing comprises the step of splicing the acquired mouth motion video information into single picture information carrying a time sequence so as to be identified by the VGG network.
S6: and receiving a data result output after the VGG network identification calculation, and judging whether the sound source position is consistent with the specified position according to the data result of the VGG network, wherein the data result comprises that the action type belongs to mouth action.
S7: and if so, judging that the designated user of the designated direction is the sound source user.
The data result output by the VGG network includes whether there is a mouth motion, for example, if there is a large change in the mouth shape in the picture information according to the time sequence, it is considered that there is a mouth motion, otherwise, there is no mouth motion. And if the VGG network judges that the mouth movement exists in the designated user at the designated position and the pre-designated positions of the sound source position designated by the sound source positioning are consistent, determining that the designated user is the sound source user. The accurate positioning of the sound source user is realized by combining the advantages of visual positioning and sound source positioning, the speaker can be quickly found, and the human-computer interaction experience and the interaction effect of the speaker and the robot are improved. The method and the device for positioning the target user determine the approximate position of the target user through a sound source positioning technology, and quickly give a positioning result; and then, the target user is accurately positioned through visual positioning, and the accuracy of distinguishing the target user is improved through the action data by taking the series of action data of a person as the input of the VGG network in the visual positioning. Before the action data is input into the VGG network, the action data is processed in a specific data processing mode, so that the processed data can be identified and calculated by the VGG network, the interference of a dummy or an object similar to a user on visual positioning is eliminated, and the target user refers to a designated user in a visual field range.
Further, the step S5 of acquiring the motion data of the specified user, performing preset processing to obtain a processing result, and inputting the processing result to the VGG network for identification calculation to obtain the motion type corresponding to the motion data includes:
S51: acquiring action data of the designated user in a designated time period, wherein the action data is a continuous multi-frame action sequence;
S52: merging and splicing successive frames of the action sequence into one piece of static image data through
p(t) = Σ_i p_i · B_{i,k}(t)
wherein p_i ∈ R^n represents the key point at time t, i represents the serial number of the key point, B_{i,k}(t) denotes a transformation matrix, k denotes the dimension, and p(t) is the static image data output over t ∈ [t_i, t_{i+1});
S53: inputting the static image data into the VGG network for identification calculation.
This embodiment applies image and video recognition technology from the field of artificial intelligence. The designated time period refers to the continuous time span of the mouth-action video captured by the camera. The mouth-action video is split into a continuous multi-frame action sequence, the frames are spliced in time order, and the mouth-action video thus forms static image data that the VGG network can identify and calculate. Each person's behavior, including mouth movement, can be determined by a number of key points; for example, mouth movement can be described by 15 key points, i = 0 to 14. By adapting the input end of the VGG network, the continuous multi-frame action sequence can be processed and the mouth action recognized. B_{i,k}(t) denotes a transformation matrix and k denotes the dimension (a concrete example of B_{i,k}(t) is given as a formula image in the original publication); p(t) is the output result over t ∈ [t_i, t_{i+1}), and R^n denotes the n-dimensional real space (n being an integer). The formula can also be written in an equivalent matrix form, corresponding to any time period t ∈ [t_i, t_{i+1}), in which the key point information of the user is synthesized from the motion key points of multiple frames; this realizes an input structure that synthesizes and inputs the continuous multi-frame action sequence so that the VGG network classification result can target the action, where M_6 denotes a 6 × 6 matrix.
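To make the splicing step concrete, here is a small sketch that turns a continuous multi-frame sequence of mouth key points into a single static array carrying the time sequence, which can then be fed to a CNN such as VGG. It is only a sketch of the general idea: the frame and key-point counts are illustrative, and it uses plain row-wise stacking rather than the patent's transformation matrices B_{i,k}(t).

```python
import numpy as np

def splice_action_sequence(keypoint_frames):
    """Splice a continuous multi-frame action sequence into one static array.

    keypoint_frames: list of (n_keypoints, 2) arrays, e.g. 15 mouth key points
                     (i = 0..14) per frame, ordered by time.
    Returns a (n_frames, n_keypoints * 2) static "image" whose rows follow the
    time sequence, usable as a single input sample for a CNN such as VGG.
    """
    rows = [frame.reshape(-1) for frame in keypoint_frames]  # flatten each frame
    return np.stack(rows, axis=0)

# Example: 30 frames of 15 mouth key points become a 30 x 30 static array.
frames = [np.random.rand(15, 2) for _ in range(30)]
static_image = splice_action_sequence(frames)
```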
Further, before the step S5 of acquiring the motion data of the specified user, performing a preset processing to obtain a processing result, and inputting the processing result to a VGG network for performing identification calculation to obtain a motion type corresponding to the motion data, the method includes:
S50a: judging whether the number of designated users within the robot's field of view is two or more;
S50b: if so, selecting the square regions corresponding to the designated users in the field-of-view image corresponding to the robot's field of view according to the Yolov3 algorithm;
S50c: respectively intercepting, for each square region, the series of actions within the designated time period as the action data.
When several people are present in the same designated orientation or in the current field of view, this embodiment selects the positions of the multiple people, i.e. the square regions, with boxes according to the Yolov3 algorithm, and then intercepts each person's series of actions as the action data of the corresponding user; using the time-dimension information yields higher-dimensional features and improves analysis accuracy. Yolov3 is a one-stage, end-to-end object detector. It divides the input image into S × S cells; each cell predicts B bounding boxes, and each bounding box predicts a location (x, y, w, h), a confidence score and the probabilities of C categories, so the Yolov3 output layer contains S × S × B × (5 + C) values. The loss function of Yolov3 consists of three parts: the location error, the confidence error and the classification error.
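The following sketch shows how, once a detector has returned one square region per person in the field-of-view image, a per-user action clip can be cut out of each video frame. The detections are assumed to be available already (for example from a Yolov3-style detector); the (x, y, side) box format is an assumption made here for illustration, not the patent's data layout.

```python
def crop_action_clips(frames, person_boxes):
    """Cut one action clip per detected person out of a frame sequence.

    frames:       list of (H, W, 3) images covering the designated time period.
    person_boxes: list of (x, y, side) square regions, one per designated user,
                  e.g. as selected by a Yolov3-style detector.
    Returns a dict {user_index: list of cropped frames} used as per-user action data.
    """
    clips = {}
    for idx, (x, y, side) in enumerate(person_boxes):
        clips[idx] = [frame[y:y + side, x:x + side].copy() for frame in frames]
    return clips
```

As a rough size check of the output layer described above: with S = 13, B = 3 and C = 80 classes, the detector output contains 13 × 13 × 3 × (5 + 80) values.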
Further, the step S2 of obtaining a pre-rotated spatial region span according to the designated orientation and the visual centerline orientation includes:
S21: obtaining a first region span when rotated clockwise from the visual centerline orientation to the specified orientation, and a second region span when rotated counterclockwise from the visual centerline orientation to the specified orientation;
S22: comparing the size of the first region span and the second region span;
S23: when the first region span is greater than the second region span, the second region span is taken as the spatial region span, and when the first region span is not greater than the second region span, the first region span is taken as the spatial region span.
This embodiment takes a single designated orientation as an example. When a sound from the designated orientation is detected, the visual centerline orientation is rotated toward the designated orientation so that the designated orientation falls within the rotated field of view, preferably coinciding with the adjusted visual centerline. To enable a quick response, the arc region with the smaller span is chosen as the spatial region span to be rotated through.
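A minimal sketch of this choice, assuming azimuths are given in degrees with increasing angle meaning counterclockwise (a convention assumed here, not stated in the patent): it computes the clockwise and counterclockwise arcs from the visual centerline to the designated orientation and returns the smaller one together with the rotation direction.

```python
def pre_rotation_span(centerline_deg, designated_deg):
    """Return (span_deg, direction), keeping the smaller of the clockwise and
    counterclockwise arcs from the visual centerline to the designated azimuth."""
    clockwise = (centerline_deg - designated_deg) % 360.0        # first region span
    counterclockwise = (designated_deg - centerline_deg) % 360.0  # second region span
    if clockwise > counterclockwise:
        return counterclockwise, "counterclockwise"
    return clockwise, "clockwise"
```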
Further, when there are two or more designated orientations, the step S2 of obtaining a pre-rotated spatial region span according to the designated orientations and the visual centerline orientation includes:
S31: obtaining a first total region span corresponding to clockwise rotation from the visual centerline orientation through all of the designated orientations, and a second total region span corresponding to counterclockwise rotation from the visual centerline orientation through all of the designated orientations;
S32: comparing the size of the first total region span and the second total region span;
S33: when the first total region span is greater than the second total region span, the second total region span is taken as the spatial region span, and when the first total region span is not greater than the second total region span, the first total region span is taken as the spatial region span.
This embodiment takes multiple designated orientations as an example, i.e. several areas emit sound simultaneously or in succession, so the areas need to be visually located precisely one after another. For each rotation direction, the total region span is the largest arc interval covered when sweeping from the pre-rotation visual centerline orientation through all the designated orientations. Taking the pre-rotation visual centerline orientation as the starting point, the largest arc interval covered when rotating clockwise through all the designated orientations is the first total region span, and the largest arc interval covered when rotating counterclockwise through all the designated orientations is the second total region span. After the rotation direction is selected, the action data of the user corresponding to each designated orientation is analyzed in turn, achieving accurate positioning of the speaker.
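For the multi-orientation case, a similar sketch (same angle conventions assumed as above) compares the total arc needed to sweep through all designated orientations clockwise with the counterclockwise total and keeps the smaller one.

```python
def pre_rotation_total_span(centerline_deg, designated_degs):
    """Choose the rotation direction whose total arc through all designated
    azimuths is smaller, and return (total_span_deg, direction)."""
    # Largest arc covered when sweeping clockwise / counterclockwise from the centerline.
    cw_total = max((centerline_deg - d) % 360.0 for d in designated_degs)    # first total span
    ccw_total = max((d - centerline_deg) % 360.0 for d in designated_degs)   # second total span
    if cw_total > ccw_total:
        return ccw_total, "counterclockwise"
    return cw_total, "clockwise"
```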
Further, the step S6 of receiving the data result output after the VGG network identification calculation, and determining whether the sound source position is consistent with the specified position according to the data result of the VGG network includes:
S61: analyzing whether the data result includes a mouth opening-and-closing action;
S62: if so, determining again whether the current sound source orientation is the designated orientation;
S63: if so, judging that the sound source orientation is consistent with the designated orientation; otherwise, judging that they are inconsistent.
Analyzing whether a mouth opening-and-closing action exists gives a preliminary judgment of whether the user is speaking. If speaking is preliminarily judged, sound source localization is invoked again for auxiliary analysis; if both sound source localization and visual positioning point to the designated user as the speaker, the designated user is judged to be the speaker. That is, if the designated user shows mouth movement and the sound orientation is also correct, the designated user is determined to be speaking. If the two judgments do not agree, for example mouth movement exists but the sound orientation indicates that this designated user's direction is not the source, the judgment process is repeated in a loop until the sound source user is found. The VGG network by itself can only process static picture information and identify features of marked points in a picture, for example identifying a fruit type from image features; it cannot directly determine action information such as a mouth opening-and-closing action. In this embodiment, multiple frames of the action video are therefore spliced and then input into the VGG network; the trajectory of the marked-point positions is obtained from the VGG output data to judge whether a mouth opening-and-closing action exists, and sound source localization is combined to judge whether the mouth action and the sound source are consistent in orientation. If the video captured in an orientation shows a mouth opening-and-closing action and sound also comes from that orientation, the user is judged to be the speaker, i.e. the sound source user. The sound source orientation is still determined with the microphone-array sound source localization technique.
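The decision logic described above can be pictured with the following sketch: it takes the mouth key-point trajectory recovered from the network output, decides whether an opening-and-closing motion is present from the variation of the lip opening, and only declares a sound source user when the sound source azimuth also matches the designated orientation. The key-point indices, threshold and angular tolerance are illustrative assumptions, not values from the patent.

```python
import numpy as np

def is_sound_source_user(mouth_traj, sound_az, designated_az,
                         open_thresh=2.0, az_tol_deg=10.0):
    """mouth_traj: (n_frames, n_points, 2) mouth key-point trajectory.

    Uses the variation of the upper/lower-lip distance over time as evidence of
    an opening-and-closing action, then re-checks the sound source azimuth.
    """
    upper = mouth_traj[:, 3, 1]   # assumed index of an upper-lip key point
    lower = mouth_traj[:, 9, 1]   # assumed index of a lower-lip key point
    opening = np.abs(upper - lower)
    has_mouth_action = (opening.max() - opening.min()) > open_thresh
    # Minimal angular difference between sound azimuth and designated orientation.
    az_diff = abs((sound_az - designated_az + 180.0) % 360.0 - 180.0)
    return has_mouth_action and az_diff <= az_tol_deg
```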
Further, before the step S61 of analyzing whether the data result includes the mouth opening and closing action, the method includes:
S60a: judging whether the focusing condition of the camera is normal relative to the distance from the designated user to the camera;
S60b: if so, judging whether the resolution of the user portrait acquired under the focusing condition is within a preset range;
S60c: if so, controlling the VGG network to perform the identification calculation, otherwise terminating the calculation.
Preferably, to further ensure the privacy and security of the action data, the action data may also be stored in a node of a blockchain.
It should be noted that the blockchain referred to in the present invention is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database, a chain of data blocks associated with one another by cryptographic methods; each data block contains the information of a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and so on.
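As a toy illustration of how blocks are linked by cryptography (a generic sketch, not the storage scheme used by the patent or any particular blockchain platform), each block records the hash of the previous one, so tampering with earlier data breaks the chain:

```python
import hashlib
import json
import time

def make_block(data, prev_hash):
    """Link a batch of data (e.g. action data) to the previous block by hashing."""
    block = {"timestamp": time.time(), "data": data, "prev_hash": prev_hash}
    block["hash"] = hashlib.sha256(
        json.dumps(block, sort_keys=True).encode()).hexdigest()
    return block

genesis = make_block({"note": "genesis"}, prev_hash="0" * 64)
next_block = make_block({"action_data": "spliced key-point image"}, prev_hash=genesis["hash"])
# Changing genesis["data"] would change its hash and break the prev_hash link,
# which is how later blocks can verify the validity of earlier ones.
```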
In addition, the scheme can also be applied to the field of intelligent transportation, promoting the construction of smart cities.
In this embodiment, resolution is used to eliminate the interference of a virtual person shown on an electronic screen with locating the real speaker: because an electronic screen reflects light, an image or video of a real user captured at the same distance and under the same focusing conditions has a far higher resolution than that of a virtual user captured from the screen. When the resolution does not meet the requirement, the VGG network identification calculation is terminated directly and the conclusion that the sound source orientation is inconsistent with the designated orientation is output.
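A small sketch of this screening step, assuming the camera's focus state and the captured portrait resolution are already available as simple values; the resolution bounds are placeholders, not thresholds from the patent.

```python
def should_run_vgg(focus_ok, portrait_resolution,
                   res_range=(640 * 480, 4096 * 2160)):
    """Decide whether to let the VGG identification calculation proceed.

    focus_ok:            whether the camera focusing condition is normal for the
                         designated user's distance.
    portrait_resolution: pixel count of the acquired user portrait.
    A portrait whose resolution falls outside the expected range (e.g. a virtual
    person on a reflective electronic screen) terminates the calculation.
    """
    if not focus_ok:
        return False
    low, high = res_range
    return low <= portrait_resolution <= high
```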
Referring to fig. 2, an apparatus for locating a sound source user according to an embodiment of the present application includes:
the first obtaining module 1 is configured to obtain a designated position corresponding to a sound source identified by sound source positioning and a visual centerline position corresponding to a current spatial position of the robot.
The sound source localization is achieved by a microphone array. The method comprises the steps of setting delay parameters for each microphone in an array, realizing different azimuth directions by controlling different delay parameters, carrying out grid division on a positioned area, delaying each microphone in a time domain by each grid point, summing up and calculating sound pressure of the microphone array, and determining the azimuth of a sound source through the sound pressure, namely the azimuth position of the sound source relative to a robot, namely an appointed azimuth. The robot has both sound source positioning and visual positioning, and the visual center line orientation is the center position in the visual field range. For example, the robot is determined according to whether the robot selects a monocular structure or a binocular structure, and the direction of a straight line which passes through the center of the monocular and is perpendicular to the plane where the face of the robot is located is taken as the direction of a visual center line in the monocular structure; the binocular structure takes the direction of a perpendicular bisector passing through the midpoint of the binocular connecting line and perpendicular to the plane where the face of the robot is located as the direction of the visual center line.
The obtaining module 2 is used for obtaining the pre-rotation spatial region span according to the designated orientation and the visual centerline orientation.
The spatial region span is the region corresponding to the arc from the robot's current visual centerline orientation to the designated orientation: either the arc region swept when rotating counterclockwise from the current visual centerline orientation to the designated orientation, or the arc region swept when rotating clockwise to it. Preliminary localization of the sound source helps the robot quickly adjust its visual positioning direction, improving response sensitivity and accuracy.
The rotation module 3 is used for controlling the robot to rotate according to the pre-rotation spatial region span until the designated orientation lies within the visual range of the robot.
The designated orientation being within the robot's visual range includes any position inside the field of view; preferably the designated orientation coincides with the visual centerline orientation so as to improve the accuracy of visual positioning. The rotation may involve rotating the head equipped with the camera or rotating the robot's entire body. During rotation, the waist and head yaw angles of the robot are controlled in coordination so that the camera is aligned with the speaker's orientation, i.e. the designated orientation.
The first judgment module 4 is used for judging whether a user portrait of the designated user is acquired within the robot's field of view.
The user portrait includes a head image, so whether the user is speaking can be estimated by recognizing the mouth movement in the head image.
The second acquisition module 5 is used for, if so, acquiring the action data of the designated user, processing it in a preset manner to obtain a processing result, and inputting the processing result into the VGG network for identification calculation to obtain the action type corresponding to the action data.
When a head image is present, the user is considered to be possibly speaking, so the mouth movement is further acquired and, after being processed in the preset manner, input into the VGG network for deeper analysis of the mouth action type. The preset-manner processing includes splicing the captured mouth-motion video into a single piece of picture information carrying the time sequence, so that the VGG network can identify it.
The receiving module 6 is used for receiving the data result output after the VGG network identification calculation and judging whether the sound source orientation is consistent with the designated orientation according to the data result of the VGG network, wherein the data result includes that the action type belongs to a mouth action.
The judging module 7 is used for judging, if so, that the designated user in the designated orientation is the sound source user.
The data result output by the VGG network includes whether a mouth action exists: for example, if the mouth shape changes substantially across the time-ordered picture information, a mouth action is considered present; otherwise it is not. If the VGG network judges that the designated user at the designated orientation exhibits mouth movement, and the sound source orientation given by sound source localization is consistent with the designated orientation, the designated user is determined to be the sound source user. Combining the advantages of visual positioning and sound source localization achieves accurate positioning of the sound source user, lets the robot find the speaker quickly, and improves the human-computer interaction experience and the interaction effect between the speaker and the robot. The apparatus first determines the approximate position of the target user through sound source localization, quickly giving a positioning result, and then locates the target user precisely through visual positioning; using a person's sequential action data as the input of the VGG network during visual positioning improves the accuracy of distinguishing the target user. Before the action data is input into the VGG network, it is processed in a specific way so that the processed data can be identified and calculated by the VGG network, eliminating the interference of a dummy or an object resembling a user on visual positioning. The target user refers to the designated user within the field of view.
Further, the second obtaining module 5 includes:
the first acquisition unit is used for acquiring action data of the specified user in a specified time period, wherein the action data is a continuous multi-frame action sequence;
a splicing unit for merging and splicing the successive multi-frame action sequence into one piece of static image data through
p(t) = Σ_i p_i · B_{i,k}(t)
wherein p_i ∈ R^n represents the key point at time t, i represents the serial number of the key point, B_{i,k}(t) denotes a transformation matrix, k denotes the dimension, and p(t) is the static image data output over t ∈ [t_i, t_{i+1});
and the input unit is used for inputting the static image data into the VGG network for identification calculation.
The designated time period refers to the continuous time span of the mouth-action video captured by the camera. The mouth-action video is split into a continuous multi-frame action sequence, the frames are spliced in time order, and the mouth-action video thus forms static image data that the VGG network can identify and calculate. Each person's behavior, including mouth movement, can be determined by a number of key points; for example, mouth movement can be described by 15 key points, i = 0 to 14. By adapting the input end of the VGG network, the continuous multi-frame action sequence can be processed and the mouth action recognized. B_{i,k}(t) denotes a transformation matrix and k denotes the dimension (a concrete example of B_{i,k}(t) is given as a formula image in the original publication); p(t) is the output result over t ∈ [t_i, t_{i+1}), and R^n denotes the n-dimensional real space (n being an integer). The formula can also be written in an equivalent matrix form, corresponding to any time period t ∈ [t_i, t_{i+1}), in which the key point information of the user is synthesized from the motion key points of multiple frames; this realizes an input structure that synthesizes and inputs the continuous multi-frame action sequence so that the VGG network classification result can target the action, where M_6 denotes a 6 × 6 matrix.
Further, the apparatus for locating a sound source user comprises:
the second judgment module is used for judging whether the number of the specified users in the visual field range of the robot is two or more;
the selection module is used for, if so, selecting the square regions corresponding to the designated users in the field-of-view image corresponding to the robot's field of view according to the Yolov3 algorithm;
and the intercepting module is used for respectively intercepting the series of actions in the specified time period corresponding to each square area as the action data.
When several people are present in the same designated orientation or in the current field of view, this embodiment selects the positions of the multiple people, i.e. the square regions, with boxes according to the Yolov3 algorithm, and then intercepts each person's series of actions as the action data of the corresponding user; using the time-dimension information yields higher-dimensional features and improves analysis accuracy. Yolov3 is a one-stage, end-to-end object detector. It divides the input image into S × S cells; each cell predicts B bounding boxes, and each bounding box predicts a location (x, y, w, h), a confidence score and the probabilities of C categories, so the Yolov3 output layer contains S × S × B × (5 + C) values. The loss function of Yolov3 consists of three parts: the location error, the confidence error and the classification error.
Further, the obtaining module 2 includes:
a second acquisition unit configured to acquire a first region span when rotated clockwise from the visual centerline orientation to the specified orientation, and a second region span when rotated counterclockwise from the visual centerline orientation to the specified orientation;
a first comparison unit for comparing the sizes of the first region span and the second region span;
a first determining unit configured to determine the second region span as the spatial region span when the first region span is larger than the second region span, and determine the first region span as the spatial region span when the first region span is not larger than the second region span.
This embodiment takes a single designated orientation as an example. When a sound from the designated orientation is detected, the visual centerline orientation is rotated toward the designated orientation so that the designated orientation falls within the rotated field of view, preferably coinciding with the adjusted visual centerline. To enable a quick response, the arc region with the smaller span is chosen as the spatial region span to be rotated through.
Further, in another embodiment, the obtaining module 2 includes:
a third acquiring unit, configured to acquire a first total region span corresponding to clockwise rotation from the visual centerline orientation through all the designated orientations, and a second total region span corresponding to counterclockwise rotation from the visual centerline orientation through all the designated orientations;
a second comparing unit, configured to compare sizes of the first total region span and the second total region span;
a second determining unit configured to determine the second total region span as the spatial region span when the first total region span is greater than the second total region span, and determine the first total region span as the spatial region span when the first total region span is not greater than the second total region span.
This embodiment takes multiple designated orientations as an example, i.e. several areas emit sound simultaneously or in succession, so the areas need to be visually located precisely one after another. For each rotation direction, the total region span is the largest arc interval covered when sweeping from the pre-rotation visual centerline orientation through all the designated orientations. Taking the pre-rotation visual centerline orientation as the starting point, the largest arc interval covered when rotating clockwise through all the designated orientations is the first total region span, and the largest arc interval covered when rotating counterclockwise through all the designated orientations is the second total region span. After the rotation direction is selected, the action data of the user corresponding to each designated orientation is analyzed in turn, achieving accurate positioning of the speaker.
Further, the receiving module 6 includes:
the analysis unit is used for analyzing whether the data result comprises opening and closing actions of the mouth or not;
the determining unit is used for, if so, determining again whether the current sound source orientation is the designated orientation;
and the judging unit is used for, if so, judging that the sound source orientation is consistent with the designated orientation, and otherwise judging that they are inconsistent.
Analyzing whether a mouth opening-and-closing action exists gives a preliminary judgment of whether the user is speaking. If speaking is preliminarily judged, sound source localization is invoked again for auxiliary analysis; if both sound source localization and visual positioning point to the designated user as the speaker, the designated user is judged to be the speaker. That is, if the designated user shows mouth movement and the sound orientation is also correct, the designated user is determined to be speaking. If the two judgments do not agree, for example mouth movement exists but the sound orientation indicates that this designated user's direction is not the source, the judgment process is repeated in a loop until the sound source user is found. The VGG network by itself can only process static picture information and identify features of marked points in a picture, for example identifying a fruit type from image features; it cannot directly determine action information such as a mouth opening-and-closing action. In this embodiment, multiple frames of the action video are therefore spliced and then input into the VGG network; the trajectory of the marked-point positions is obtained from the VGG output data to judge whether a mouth opening-and-closing action exists, and sound source localization is combined to judge whether the mouth action and the sound source are consistent in orientation. If the video captured in an orientation shows a mouth opening-and-closing action and sound also comes from that orientation, the user is judged to be the speaker, i.e. the sound source user. The sound source orientation is still determined with the microphone-array sound source localization technique.
Further, the receiving module 6 includes:
the first judging unit is used for judging whether the focusing condition of the camera is normal relative to the distance from the specified user to the camera;
a second judging unit, configured to judge, if so, whether the resolution of the user portrait acquired under the focusing condition is within a preset range;
and a control unit, configured to control the VGG network to perform the identification calculation if so, and otherwise to terminate the calculation.
In this embodiment, resolution is used to eliminate the interference of a virtual person shown on an electronic screen with locating the real speaker: because an electronic screen reflects light, an image or video of a real user captured at the same distance and under the same focusing conditions has a far higher resolution than that of a virtual user captured from the screen. When the resolution does not meet the requirement, the VGG network identification calculation is terminated directly and the conclusion that the sound source orientation is inconsistent with the designated orientation is output.
Referring to fig. 3, a computer device is also provided in the embodiment of the present application; it may be a server, and its internal structure may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface and a database connected by a system bus. The processor of the computer device is used to provide computation and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store all data required by the process of locating the sound source user. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the method of locating a sound source user.
The processor executes the method for locating the sound source user, and the method comprises the following steps: acquiring a designated position corresponding to a sound source identified by sound source positioning and a visual center line position corresponding to a current spatial position of the robot; obtaining a pre-rotation space region span according to the designated orientation and the visual central line orientation; controlling the robot to rotate according to the pre-rotated space region span until the designated orientation is within the visual range of the robot; judging whether a user portrait of a specified user is acquired in the visual field range of the robot; if so, acquiring the action data of the specified user, processing the action data in a preset mode to obtain a processing result, and inputting the processing result into a VGG network for identification calculation to obtain an action type corresponding to the action data; receiving a data result output after the VGG network identification calculation, and judging whether a sound source position is consistent with the specified position according to the data result of the VGG network, wherein the data result comprises that the action type belongs to mouth action; and if so, judging that the designated user of the designated direction is the sound source user.
With the above computer device, a person's sequential action data is used as the input of the VGG network during visual positioning, the discrimination precision is improved through the action data, and visual positioning and sound positioning are used together, so that the precision with which the robot locates the speaking target user is improved.
In one embodiment, the step of acquiring, by the processor, the action data of the designated user, processing it in a preset manner to obtain a processing result, and inputting the processing result into a VGG network for identification calculation to obtain the action type corresponding to the action data includes: acquiring action data of the designated user in a designated time period, wherein the action data is a continuous multi-frame action sequence; merging and splicing successive frames of the action sequence into one piece of static image data through
p(t) = Σ_i p_i · B_{i,k}(t)
wherein p_i ∈ R^n represents the key point at time t, i represents the serial number of the key point, B_{i,k}(t) denotes a transformation matrix, k denotes the dimension, and p(t) is the static image data output over t ∈ [t_i, t_{i+1}); and inputting the static image data into the VGG network for identification calculation.
In an embodiment, before the step of acquiring the action data of the designated user, processing it in a preset manner to obtain a processing result, and inputting the processing result into a VGG network for identification calculation to obtain the action type corresponding to the action data, the processor performs: judging whether the number of designated users within the robot's field of view is two or more; if so, selecting the square regions corresponding to the designated users in the field-of-view image corresponding to the robot's field of view according to the Yolov3 algorithm; and respectively intercepting, for each square region, the series of actions within the designated time period as the action data.
In one embodiment, the step of obtaining the pre-rotated spatial region span according to the designated orientation and the visual centerline orientation by the processor comprises: obtaining a first region span when rotated clockwise from the visual centerline orientation to the specified orientation and a second region span when rotated counterclockwise from the visual centerline orientation to the specified orientation; comparing the size of the first region span and the second region span; when the first region span is greater than the second region span, the second region span is taken as the spatial region span, and when the first region span is not greater than the second region span, the first region span is taken as the spatial region span.
In one embodiment, there are two or more designated orientations and correspondingly two or more spatial region spans, and the step, executed by the processor, of obtaining the pre-rotation spatial region span according to the designated orientations and the visual centerline orientation includes: obtaining a first total region span for rotating clockwise from the visual centerline orientation through all of the designated orientations, and a second total region span for rotating counterclockwise from the visual centerline orientation through all of the designated orientations; comparing the sizes of the first total region span and the second total region span; and when the first total region span is greater than the second total region span, taking the second total region span as the spatial region span, or when the first total region span is not greater than the second total region span, taking the first total region span as the spatial region span.
In one embodiment, the step, executed by the processor, of receiving the data result output by the VGG network after the identification calculation and judging according to the data result whether the sound source orientation is consistent with the designated orientation includes: analyzing whether the data result includes a mouth opening and closing action; if so, re-determining whether the current sound source orientation is still the designated orientation; and if so, judging that the sound source orientation is consistent with the designated orientation, otherwise judging that the sound source orientation is inconsistent with the designated orientation.
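A minimal sketch of this consistency check follows, assuming the VGG output has been reduced to an action label and that "consistent" tolerates a small angular difference; the label string and the tolerance value are illustrative assumptions.

```python
def sound_source_matches(action_label, current_sound_deg, designated_deg, tolerance_deg=10.0):
    """Decide whether the visually located user is the sound source user."""
    if action_label != "mouth_open_close":
        return False
    # Re-confirm that the sound source still points at the designated orientation.
    diff = abs((current_sound_deg - designated_deg + 180.0) % 360.0 - 180.0)
    return diff <= tolerance_deg
```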
In one embodiment, before the step of analyzing whether the data result includes a mouth opening and closing action, the method executed by the processor includes: judging whether the focusing condition of the camera is normal for the distance from the designated user to the camera; if so, judging whether the resolution of the user portrait acquired under that focusing condition is within a preset range; and if so, controlling the VGG network to perform the identification calculation, otherwise terminating the calculation.
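One illustrative way to gate the identification calculation on the focusing condition and portrait resolution is sketched below; the threshold values stand in for the preset range mentioned above and are assumptions.

```python
def should_run_vgg(focus_ok, portrait, min_side=64, max_side=1024):
    """Gate the VGG identification calculation on focus and portrait resolution.

    focus_ok is whatever the camera driver reports about focusing at the
    current user distance; the pixel-side thresholds are purely illustrative.
    """
    if not focus_ok:
        return False                          # abnormal focus: terminate the calculation
    height, width = portrait.shape[:2]        # portrait is assumed to be an HxWx3 array
    return min_side <= min(height, width) <= max_side
```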
Those skilled in the art will appreciate that the architecture shown in fig. 3 is only a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects may be applied.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements a method for locating a sound source user, comprising: acquiring a designated orientation corresponding to a sound source identified by sound source positioning and a visual centerline orientation corresponding to the current spatial position of the robot; obtaining a pre-rotation spatial region span according to the designated orientation and the visual centerline orientation; controlling the robot to rotate according to the pre-rotation spatial region span until the designated orientation is within the visual field range of the robot; judging whether a user portrait of a designated user is acquired within the visual field range of the robot; if so, acquiring action data of the designated user, processing the action data in a preset manner to obtain a processing result, and inputting the processing result into a VGG network for identification calculation to obtain an action type corresponding to the action data; receiving the data result output by the VGG network after the identification calculation, and judging according to the data result whether the sound source orientation is consistent with the designated orientation, wherein the data result includes whether the action type belongs to a mouth action; and if so, judging that the designated user in the designated orientation is the sound source user.
With the computer-readable storage medium, sequential human action data is used as the input to the VGG network in visual positioning, so that discrimination accuracy is improved through the action data; by using visual positioning and sound positioning together, the accuracy with which the robot locates the speaking target user is improved.
In one embodiment, the step, executed by the processor, of acquiring the action data of the designated user, processing the action data in a preset manner to obtain a processing result, and inputting the processing result into the VGG network for identification calculation to obtain the action type corresponding to the action data includes: acquiring action data of the designated user within a designated time period, wherein the action data is a continuous multi-frame action sequence; merging and splicing the successive frames of the action sequence into one piece of static image data according to

p(t) = Σ_i p_i · B_{i,k}(t),

wherein p_i ∈ R^n denotes the key point at time t, i denotes the serial number of the key point, B_{i,k}(t) denotes a transformation matrix, k denotes a dimension, and p(t) is the static image data output over t ∈ [t_i, t_{i+1}); and inputting the static image data into the VGG network for identification calculation.
In one embodiment, before the step of acquiring the action data of the designated user, processing the action data in a preset manner to obtain a processing result, and inputting the processing result into the VGG network for identification calculation to obtain the action type corresponding to the action data, the method executed by the processor includes: judging whether the number of designated users within the visual field range of the robot is two or more; if so, selecting, according to the Yolov3 algorithm, the square area corresponding to each designated user in the visual field image corresponding to the visual field range of the robot; and capturing, for each square area respectively, the series of actions within the designated time period as the action data of that user.
In one embodiment, the step, executed by the processor, of obtaining the pre-rotation spatial region span according to the designated orientation and the visual centerline orientation includes: obtaining a first region span for rotating clockwise from the visual centerline orientation to the designated orientation and a second region span for rotating counterclockwise from the visual centerline orientation to the designated orientation; comparing the sizes of the first region span and the second region span; and when the first region span is greater than the second region span, taking the second region span as the spatial region span, or when the first region span is not greater than the second region span, taking the first region span as the spatial region span.
In one embodiment, there are two or more designated orientations and correspondingly two or more spatial region spans, and the step, executed by the processor, of obtaining the pre-rotation spatial region span according to the designated orientations and the visual centerline orientation includes: obtaining a first total region span for rotating clockwise from the visual centerline orientation through all of the designated orientations, and a second total region span for rotating counterclockwise from the visual centerline orientation through all of the designated orientations; comparing the sizes of the first total region span and the second total region span; and when the first total region span is greater than the second total region span, taking the second total region span as the spatial region span, or when the first total region span is not greater than the second total region span, taking the first total region span as the spatial region span.
In one embodiment, the step, executed by the processor, of receiving the data result output by the VGG network after the identification calculation and judging according to the data result whether the sound source orientation is consistent with the designated orientation includes: analyzing whether the data result includes a mouth opening and closing action; if so, re-determining whether the current sound source orientation is still the designated orientation; and if so, judging that the sound source orientation is consistent with the designated orientation, otherwise judging that the sound source orientation is inconsistent with the designated orientation.
In one embodiment, before the step of analyzing whether the data result includes a mouth opening and closing action, the method executed by the processor includes: judging whether the focusing condition of the camera is normal for the distance from the designated user to the camera; if so, judging whether the resolution of the user portrait acquired under that focusing condition is within a preset range; and if so, controlling the VGG network to perform the identification calculation, otherwise terminating the calculation.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, apparatus, article, or method that includes the element.
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are also included in the scope of the present application.

Claims (10)

1. A method for locating a sound source user, comprising:
acquiring a designated orientation corresponding to a sound source identified by sound source positioning and a visual centerline orientation corresponding to the current spatial position of the robot;
obtaining a pre-rotation spatial region span according to the designated orientation and the visual centerline orientation;
controlling the robot to rotate according to the pre-rotation spatial region span until the designated orientation is within the visual field range of the robot;
judging whether a user portrait of a designated user is acquired within the visual field range of the robot;
if so, acquiring action data of the designated user, processing the action data in a preset manner to obtain a processing result, and inputting the processing result into a VGG network for identification calculation to obtain an action type corresponding to the action data;
receiving the data result output by the VGG network after the identification calculation, and judging according to the data result whether the sound source orientation is consistent with the designated orientation, wherein the data result includes whether the action type belongs to a mouth action;
and if so, judging that the designated user in the designated orientation is the sound source user.
2. The method for locating a sound source user according to claim 1, wherein the step of acquiring the action data of the designated user, processing the action data in a preset manner to obtain a processing result, and inputting the processing result into the VGG network for identification calculation to obtain the action type corresponding to the action data comprises:
acquiring action data of the designated user in a designated time period, wherein the action data is a continuous multi-frame action sequence;
merging and splicing the successive frames of the action sequence into one piece of static image data according to
p(t) = Σ_i p_i · B_{i,k}(t),
wherein p_i ∈ R^n denotes the key point at time t, i denotes the serial number of the key point, B_{i,k}(t) denotes a transformation matrix, k denotes a dimension, and p(t) is the static image data output over t ∈ [t_i, t_{i+1});
and inputting the static image data into a VGG network for identification calculation.
3. The method for locating a sound source user according to claim 1, wherein before the step of acquiring the action data of the designated user, processing the action data in a preset manner to obtain a processing result, and inputting the processing result into the VGG network for identification calculation to obtain the action type corresponding to the action data, the method comprises:
judging whether the number of designated users within the visual field range of the robot is two or more;
if so, selecting, according to the Yolov3 algorithm, the square area corresponding to each designated user in the visual field image corresponding to the visual field range of the robot;
and respectively capturing, for each square area, the series of actions within the designated time period as the action data.
4. The method of claim 1, wherein the step of obtaining a pre-rotation spatial region span according to the designated orientation and the visual centerline orientation comprises:
obtaining a first region span when rotated clockwise from the visual centerline orientation to the specified orientation and a second region span when rotated counterclockwise from the visual centerline orientation to the specified orientation;
comparing the sizes of the first region span and the second region span;
when the first region span is greater than the second region span, the second region span is taken as the spatial region span, and when the first region span is not greater than the second region span, the first region span is taken as the spatial region span.
5. The method of claim 1, wherein the number of designated orientations is two or more, the spatial region span comprises two or more spans, and the step of obtaining the pre-rotation spatial region span according to the designated orientations and the visual centerline orientation comprises:
obtaining a first total region span corresponding to clockwise rotation from the visual centerline orientation through all of the designated orientations, and a second total region span corresponding to counterclockwise rotation from the visual centerline orientation through all of the designated orientations;
comparing the sizes of the first total region span and the second total region span;
when the first total region span is greater than the second total region span, the second total region span is taken as the spatial region span, and when the first total region span is not greater than the second total region span, the first total region span is taken as the spatial region span.
6. The method of claim 1, wherein the step of receiving the data result output by the VGG network after the identification calculation and judging, according to the data result of the VGG network, whether the sound source orientation is consistent with the designated orientation comprises:
analyzing whether the data result comprises opening and closing actions of the mouth;
if so, re-determining whether the current sound source orientation is still the designated orientation;
and if so, judging that the sound source orientation is consistent with the designated orientation, otherwise judging that the sound source orientation is inconsistent with the designated orientation.
7. The method of claim 6, wherein before the step of analyzing whether the data result includes mouth opening and closing actions, the method comprises:
judging whether the focusing condition of the camera is normal for the distance from the designated user to the camera;
if so, judging whether the resolution of the user portrait acquired under that focusing condition is within a preset range;
and if so, controlling the VGG network to perform the identification calculation, otherwise terminating the calculation.
8. An apparatus for locating a user of a sound source, comprising:
the first acquisition module is used for acquiring a designated orientation corresponding to a sound source identified by sound source positioning and a visual centerline orientation corresponding to the current spatial position of the robot;
the obtaining module is used for obtaining a pre-rotation spatial region span according to the designated orientation and the visual centerline orientation;
the rotation module is used for controlling the robot to rotate according to the pre-rotation spatial region span until the designated orientation is within the visual field range of the robot;
the first judgment module is used for judging whether a user portrait of a designated user is acquired within the visual field range of the robot;
the second acquisition module is used for acquiring action data of the designated user, processing the action data in a preset manner to obtain a processing result, and inputting the processing result into the VGG network for identification calculation to obtain an action type corresponding to the action data;
the receiving module is used for receiving the data result output by the VGG network after the identification calculation, and judging according to the data result whether the sound source orientation is consistent with the designated orientation;
and the judging module is used for judging, if the judgment is positive, that the designated user in the designated orientation is the sound source user.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202010334984.2A 2020-04-24 2020-04-24 Method, device and computer equipment for positioning sound source user Active CN111650558B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010334984.2A CN111650558B (en) 2020-04-24 2020-04-24 Method, device and computer equipment for positioning sound source user
PCT/CN2020/093425 WO2021212608A1 (en) 2020-04-24 2020-05-29 Method and apparatus for positioning sound source user, and computer device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010334984.2A CN111650558B (en) 2020-04-24 2020-04-24 Method, device and computer equipment for positioning sound source user

Publications (2)

Publication Number Publication Date
CN111650558A true CN111650558A (en) 2020-09-11
CN111650558B CN111650558B (en) 2023-10-10

Family

ID=72340980

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010334984.2A Active CN111650558B (en) 2020-04-24 2020-04-24 Method, device and computer equipment for positioning sound source user

Country Status (2)

Country Link
CN (1) CN111650558B (en)
WO (1) WO2021212608A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109121031B (en) * 2018-10-29 2020-11-17 歌尔科技有限公司 Directional display method and device for audio equipment and audio equipment
CN113762219A (en) * 2021-11-03 2021-12-07 恒林家居股份有限公司 Method, system and storage medium for identifying people in mobile conference room
CN114594892B (en) * 2022-01-29 2023-11-24 深圳壹秘科技有限公司 Remote interaction method, remote interaction device, and computer storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101295016A (en) * 2008-06-13 2008-10-29 河北工业大学 Sound source independent searching and locating method
CN105184214A (en) * 2015-07-20 2015-12-23 北京进化者机器人科技有限公司 Sound source positioning and human face detection-based human body positioning method and system
CN105760824A (en) * 2016-02-02 2016-07-13 北京进化者机器人科技有限公司 Moving body tracking method and system
US20180129869A1 (en) * 2016-11-08 2018-05-10 Nec Laboratories America, Inc. Siamese Reconstruction Convolutional Neural Network for Pose-invariant Face Recognition
CN110569808A (en) * 2019-09-11 2019-12-13 腾讯科技(深圳)有限公司 Living body detection method and device and computer equipment

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS6055985A (en) * 1983-09-05 1985-04-01 株式会社トミー Sound recognizing toy
CN103235645A (en) * 2013-04-25 2013-08-07 上海大学 Standing type display interface self-adaption tracking regulating device and method
CN106970698A (en) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 Domestic intelligent equipment
CN209356668U (en) * 2018-11-23 2019-09-06 中国科学院电子学研究所 Auditory localization identification device
WO2020184733A1 (en) * 2019-03-08 2020-09-17 엘지전자 주식회사 Robot
CN110691196A (en) * 2019-10-30 2020-01-14 歌尔股份有限公司 Sound source positioning method of audio equipment and audio equipment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101295016A (en) * 2008-06-13 2008-10-29 河北工业大学 Sound source independent searching and locating method
CN105184214A (en) * 2015-07-20 2015-12-23 北京进化者机器人科技有限公司 Sound source positioning and human face detection-based human body positioning method and system
CN105760824A (en) * 2016-02-02 2016-07-13 北京进化者机器人科技有限公司 Moving body tracking method and system
US20180129869A1 (en) * 2016-11-08 2018-05-10 Nec Laboratories America, Inc. Siamese Reconstruction Convolutional Neural Network for Pose-invariant Face Recognition
CN110569808A (en) * 2019-09-11 2019-12-13 腾讯科技(深圳)有限公司 Living body detection method and device and computer equipment

Also Published As

Publication number Publication date
WO2021212608A1 (en) 2021-10-28
CN111650558B (en) 2023-10-10

Similar Documents

Publication Publication Date Title
CN111650558A (en) Method, device and computer equipment for positioning sound source user
CN107341442B (en) Motion control method, motion control device, computer equipment and service robot
US10388022B2 (en) Image target tracking method and system thereof
US9959651B2 (en) Methods, devices and computer programs for processing images in a system comprising a plurality of cameras
CN102959946A (en) Augmenting image data based on related 3d point cloud data
JP2003346157A (en) Object tracking device and object tracking method
CN107563295B (en) Multi-Kinect-based all-dimensional human body tracking method and processing equipment
CN108257178A (en) For positioning the method and apparatus of the position of target body
CN111325144A (en) Behavior detection method and apparatus, computer device and computer-readable storage medium
CN111354022B (en) Target Tracking Method and System Based on Kernel Correlation Filtering
Kirchmaier et al. Dynamical information fusion of heterogeneous sensors for 3D tracking using particle swarm optimization
KR20190142553A (en) Tracking method and system using a database of a person's faces
CN114549738A (en) Unmanned vehicle indoor real-time dense point cloud reconstruction method, system, equipment and medium
JP5369873B2 (en) Judgment program and calibration device
Shao et al. Improving head pose estimation with a combined loss and bounding box margin adjustment
CN110188179B (en) Voice directional recognition interaction method, device, equipment and medium
CN112017212A (en) Training and tracking method and system of face key point tracking model
CN109191522B (en) Robot displacement correction method and system based on three-dimensional modeling
Ferreira et al. A real-time mosaicking algorithm using binary features for ROVs
JP2019012497A (en) Portion recognition method, device, program, and imaging control system
CN115471863A (en) Three-dimensional posture acquisition method, model training method and related equipment
KR102426594B1 (en) System and method for estimating the location of object in crowdsourcing environment
JP7269538B2 (en) Alignment system for liver surgery
Aarabi Application of spatial likelihood functions to multicamera object localization
KR101126415B1 (en) Apparatus for providing accelerated and augmented space on mobile device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant