CN111650558B - Method, device and computer equipment for positioning sound source user


Info

Publication number
CN111650558B
CN111650558B
Authority
CN
China
Prior art keywords
sound source
azimuth
span
user
appointed
Prior art date
Legal status
Active
Application number
CN202010334984.2A
Other languages
Chinese (zh)
Other versions
CN111650558A (en)
Inventor
龚连银
苏雄飞
周宝
陈远旭
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202010334984.2A
Priority to PCT/CN2020/093425 (WO2021212608A1)
Publication of CN111650558A
Application granted
Publication of CN111650558B


Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01S RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Manipulator (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to artificial intelligence and blockchain technology, and discloses a method for locating a sound source user, which comprises the following steps: acquiring a designated azimuth corresponding to a sound source identified by sound source localization, and a visual center line azimuth corresponding to the current spatial position of a robot; obtaining a pre-rotation spatial region span according to the designated azimuth and the visual center line azimuth; controlling the robot to rotate according to the pre-rotation spatial region span until the designated azimuth lies within the visual range of the robot; judging whether a user portrait of a specified user is acquired within the field of view of the robot; if yes, acquiring action data of the specified user, processing the action data in a preset manner to obtain a processing result, and inputting the processing result into a VGG network for recognition calculation to obtain the action type; receiving the data result output by the VGG network after the recognition calculation, and judging, according to the data result of the VGG network, whether the sound source azimuth is consistent with the designated azimuth; if yes, judging the specified user at the designated azimuth to be the sound source user, thereby achieving accurate positioning.

Description

Method, device and computer equipment for positioning sound source user
Technical Field
The present application relates to the field of computers, and in particular, to a method, an apparatus, and a computer device for locating a sound source user.
Background
Existing robotic systems typically locate users in only one way, either visually or acoustically. Visual positioning, however, places high demands on the environment: it requires good lighting, it is essentially unusable when the user is outside the camera's range, and the large volume of data to be processed places heavy demands on the robot's computing power. Acoustic positioning, on the other hand, has low accuracy, cannot support interaction scenarios that require precise tracking, and degrades further in noisy environments. Existing robot positioning systems therefore cannot meet the requirement of accurate positioning across a variety of scenarios.
Disclosure of Invention
The main purpose of the application is to provide a method for locating a sound source user, so as to solve the technical problem that existing robot positioning systems cannot achieve accurate positioning in a variety of scenarios.
The application provides a method for locating a sound source user, which comprises the following steps:
acquiring a designated azimuth corresponding to a sound source identified by sound source localization, and a visual center line azimuth corresponding to the current spatial position of a robot;
obtaining a pre-rotation spatial region span according to the designated azimuth and the visual center line azimuth;
controlling the robot to rotate according to the pre-rotation spatial region span until the designated azimuth lies within the visual range of the robot;
judging whether a user portrait of a specified user is acquired within the field of view of the robot;
if yes, acquiring action data of the specified user, processing the action data in a preset manner to obtain a processing result, and inputting the processing result into a VGG network for recognition calculation to obtain the action type corresponding to the action data;
receiving the data result output by the VGG network after the recognition calculation, and judging, according to the data result of the VGG network, whether the sound source azimuth is consistent with the designated azimuth, wherein the data result includes that the action type belongs to a mouth action;
if yes, judging the specified user at the designated azimuth to be the sound source user.
Preferably, the step of acquiring the action data of the specified user, processing the action data in a preset manner to obtain a processing result, and inputting the processing result into a VGG network for recognition calculation to obtain the action type corresponding to the action data comprises:
acquiring action data of the specified user within a specified time period, wherein the action data is a continuous multi-frame action sequence;
merging and stitching the continuous frames of the action sequence into one piece of static image data through p(t) = Σ_i p_i B_{i,k}(t), wherein p_i ∈ R^n denotes the key point at time t, i denotes the serial number of the key point, B_{i,k}(t) denotes a transformation matrix, k denotes the dimension, and p(t) is the static image data output for t ∈ [t_i, t_{i+1});
inputting the static image data into the VGG network for recognition calculation.
Preferably, before the step of acquiring the action data of the specified user, processing the action data in a preset manner to obtain a processing result, and inputting the processing result into a VGG network for recognition calculation to obtain the action type corresponding to the action data, the method comprises:
judging whether the number of specified users within the field of view of the robot is two or more;
if yes, selecting, according to the Yolov3 algorithm, the square area corresponding to each specified user in the field-of-view image corresponding to the field of view of the robot;
respectively intercepting the series of actions within the specified time period corresponding to each square area as the action data.
Preferably, the step of obtaining the pre-rotation spatial region span according to the designated azimuth and the visual center line azimuth comprises:
acquiring a first region span for rotating clockwise from the visual center line azimuth to the designated azimuth, and a second region span for rotating counterclockwise from the visual center line azimuth to the designated azimuth;
comparing the sizes of the first region span and the second region span;
taking the second region span as the spatial region span when the first region span is greater than the second region span, and taking the first region span as the spatial region span when the first region span is not greater than the second region span.
Preferably, when the number of designated azimuths is two or more, the spatial region span includes two or more spans, and the step of obtaining the pre-rotation spatial region span according to the designated azimuths and the visual center line azimuth comprises:
acquiring a first total region span corresponding to rotating clockwise from the visual center line azimuth through all the designated azimuths, and a second total region span corresponding to rotating counterclockwise from the visual center line azimuth through all the designated azimuths;
comparing the sizes of the first total region span and the second total region span;
taking the second total region span as the spatial region span when the first total region span is greater than the second total region span, and taking the first total region span as the spatial region span when the first total region span is not greater than the second total region span.
Preferably, the step of receiving the data result output by the VGG network after the recognition calculation and judging, according to the data result of the VGG network, whether the sound source azimuth is consistent with the designated azimuth comprises:
analyzing whether the data result includes an opening and closing action of the mouth;
if yes, determining again whether the current sound source azimuth is the designated azimuth;
if yes, judging that the sound source azimuth is consistent with the designated azimuth; otherwise, judging that they are inconsistent.
Preferably, before the step of analyzing whether the data result includes an opening and closing action of the mouth, the method comprises:
judging whether the focusing condition of the camera is normal relative to the distance between the specified user and the camera;
if yes, judging whether the resolution of the user portrait acquired under this focusing condition is within a preset range;
if yes, controlling the VGG network to perform the recognition calculation; otherwise, terminating the calculation.
The application also provides a device for locating a sound source user, which comprises:
The first acquisition module is used for acquiring the designated azimuth corresponding to the sound source identified by sound source localization, and the visual center line azimuth corresponding to the current spatial position of the robot;
the obtaining module is used for obtaining the pre-rotation spatial region span according to the designated azimuth and the visual center line azimuth;
the rotation module is used for controlling the robot to rotate according to the pre-rotation spatial region span until the designated azimuth lies within the visual range of the robot;
the first judging module is used for judging whether the user portrait of the specified user is acquired within the field of view of the robot;
the second acquisition module is used for, if yes, acquiring the action data of the specified user, processing the action data in a preset manner to obtain a processing result, and inputting the processing result into a VGG network for recognition calculation to obtain the action type corresponding to the action data;
the receiving module is used for receiving the data result output by the VGG network after the recognition calculation and judging, according to the data result of the VGG network, whether the sound source azimuth is consistent with the designated azimuth;
and the judging module is used for judging the specified user at the designated azimuth to be the sound source user if the sound source azimuth is consistent with the designated azimuth.
The application also provides a computer device comprising a memory storing a computer program and a processor implementing the steps of the above method when executing the computer program.
The application also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the method described above.
According to the application, the sequential motion data of a person is used as the input of the VGG network during visual positioning, the motion data improves the accuracy of discrimination, and visual positioning and sound positioning are used in combination, so that the accuracy with which the robot locates the speaking target user is improved.
Drawings
FIG. 1 is a flow chart of a method for locating a sound source user according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an apparatus for locating a sound source user according to an embodiment of the present application;
FIG. 3 is a schematic diagram showing an internal structure of a computer device according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
Referring to fig. 1, a method of locating a sound source user according to an embodiment of the present application includes:
s1: and acquiring the designated azimuth corresponding to the sound source identified by the sound source positioning and the visual center line azimuth corresponding to the current spatial position of the robot.
The above sound source localization is implemented with a microphone array. Delay parameters are set for each microphone in the array, and different azimuths are steered by controlling different delay parameters: the area to be located is divided into a grid, each grid point applies a time-domain delay to each microphone, the delayed signals of the microphone array are then summed, and the resulting sound pressure is used to determine the azimuth of the sound source, i.e. the azimuth of the sound source relative to the robot, which is the designated azimuth. The robot performs both sound source positioning and visual positioning, and the visual center line azimuth is the center position of its field of view. It is determined, for example, according to whether the robot has a monocular or a binocular structure: for a monocular structure, the visual center line azimuth is the direction of the straight line that passes through the center of the single camera and is perpendicular to the plane of the robot's face; for a binocular structure, it is the direction of the center line that passes through the midpoint of the line connecting the two cameras and is perpendicular to the plane of the robot's face.
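As an illustration of the grid-based delay-and-sum idea described above, the following sketch scans candidate azimuths and returns the one whose steered sum carries the most energy. It assumes a planar microphone array, far-field plane waves, and frequency-domain steering; none of these implementation details are prescribed by the patent.

```python
import numpy as np

def estimate_azimuth(signals, mic_xy, fs, c=343.0, grid_step_deg=2.0):
    """Steered-power (delay-and-sum) search over candidate azimuths.

    signals: (n_mics, n_samples) array of time-aligned microphone frames.
    mic_xy:  (n_mics, 2) microphone positions in metres.
    Returns the candidate azimuth (degrees) whose steered sum carries the
    most energy, i.e. the estimated sound source direction.
    """
    n_mics, n_samples = signals.shape
    spectra = np.fft.rfft(signals, axis=1)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    best_az, best_power = 0.0, -np.inf
    for az in np.arange(0.0, 360.0, grid_step_deg):
        # unit vector of the candidate direction (far-field assumption)
        u = np.array([np.cos(np.deg2rad(az)), np.sin(np.deg2rad(az))])
        delays = mic_xy @ u / c          # per-microphone delay in seconds
        # apply the delays as phase shifts so a source from `az` adds coherently
        steered = spectra * np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
        power = np.sum(np.abs(steered.sum(axis=0)) ** 2)
        if power > best_power:
            best_az, best_power = az, power
    return best_az
```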
S2: and obtaining the pre-rotated space region span according to the designated azimuth and the visual center line azimuth.
The spatial region span is the region corresponding to the arc from the robot's current visual center line azimuth to the designated azimuth, i.e. either the arc swept when the current visual center line azimuth rotates counterclockwise to the designated azimuth or the arc swept when it rotates clockwise to the designated azimuth. The initial sound source localization helps the robot quickly adjust its visual positioning direction, which improves response sensitivity and accuracy.
S3: and controlling the robot to rotate according to the pre-rotated space region span until the designated direction is positioned in the visual range of the robot.
The designated azimuth being located within the visual range of the robot includes any position within that range; preferably, the designated azimuth coincides with the visual center line azimuth, so as to improve the accuracy of visual positioning. The rotation includes rotating the camera-equipped head, or rotating the entire body of the robot. During rotation, the camera can be aligned with the speaker's direction, i.e. the designated azimuth, by coordinating the yaw angles of the robot's waist and head.
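A minimal sketch of the waist/head yaw coordination mentioned above is given below; the head joint limit and the way the turn is split between the two joints are illustrative assumptions rather than details taken from the patent.

```python
def plan_rotation(delta_yaw_deg, head_limit_deg=35.0):
    """Split a required yaw change between the head and the waist.

    The camera-equipped head takes as much of the turn as its (assumed) joint
    limit allows, and the waist absorbs the remainder, so that after both
    joints move the camera points at the designated azimuth.
    """
    head = max(-head_limit_deg, min(head_limit_deg, delta_yaw_deg))
    waist = delta_yaw_deg - head
    return waist, head
```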
S4: and judging whether the user portrait of the specified user is acquired in the visual field range of the robot.
The user portrait includes a head portrait, and a preliminary determination of whether the user is speaking is made by identifying mouth motion in the head portrait.
S5: if yes, acquiring action data of the appointed user, processing the action data in a preset mode to obtain a processing result, and inputting the processing result into a VGG network for identification calculation to obtain an action type corresponding to the action data.
When a head portrait is present, the user is presumed to be speaking; the mouth motion is then further acquired, processed in the preset manner, and input into the VGG network for in-depth analysis of the mouth motion type. The preset processing includes stitching the acquired mouth motion video information into a single piece of picture information carrying the time sequence, so that it can be recognized by the VGG network.
S6: and receiving the data result output after the VGG network identification calculation, and judging whether the sound source azimuth is consistent with the appointed azimuth according to the data result of the VGG network, wherein the data result comprises that the action type belongs to mouth action.
S7: if yes, judging the appointed user with the appointed azimuth as the sound source user.
The data result output by the VGG network includes whether a mouth action exists; for example, if the mouth shape in the picture information changes significantly over the time sequence, a mouth action is considered to exist, and otherwise it does not. If the VGG network judges that the specified user at the designated azimuth shows a mouth action, and the sound source azimuth given by sound source localization is consistent with the designated azimuth, the specified user is determined to be the sound source user. By combining the advantages of visual positioning and sound source localization, the sound source user is accurately located, the speaker can be found quickly, and the human-computer interaction experience and interaction effect between the speaker and the robot are improved. The embodiment of the application determines the approximate position of the target user through the sound source localization technique and quickly gives a positioning result, and then precisely locates the target user through visual positioning; in the visual positioning, the sequential motion data of the person is used as the input of the VGG network, and the motion data improves the accuracy with which the target user is distinguished. Before the motion data is input into the VGG network, it is processed in a specific way so that the processed data can be recognized and computed by the VGG network, which eliminates the interference of imitation persons or user-like objects on the visual positioning; the target user refers to the specified user within the field of view.
Further, the step S5 of acquiring the action data of the specified user, processing the action data in a preset manner to obtain a processing result, and inputting the processing result into a VGG network for recognition calculation to obtain the action type corresponding to the action data includes:
S51: acquiring action data of the specified user within a specified time period, wherein the action data is a continuous multi-frame action sequence;
S52: merging and stitching the continuous frames of the action sequence into one piece of static image data through p(t) = Σ_i p_i B_{i,k}(t), wherein p_i ∈ R^n denotes the key point at time t, i denotes the serial number of the key point, B_{i,k}(t) denotes a transformation matrix, k denotes the dimension, and p(t) is the static image data output for t ∈ [t_i, t_{i+1});
S53: inputting the static image data into the VGG network for recognition calculation.
The application applies image and video recognition technology in the artificial intelligence field. The specified time period refers to the continuous time span of the mouth action video captured by the camera. The mouth action video acquired by the camera is split into a continuous multi-frame action sequence, and the frames are stitched together in time order, so that the mouth action video is turned into static image data that the VGG network can recognize and compute. Each person's behavior, including mouth movements, can be determined by a few key points; for example, mouth movements use 15 key points, i = 0 to 14. By improving the input end of the VGG network, the VGG network can process a continuous multi-frame action sequence, so that the mouth action can be recognized. B_{i,k}(t) denotes a transformation matrix and k denotes the dimension; p(t) is the output result for t ∈ [t_i, t_{i+1}), and R^n denotes n-dimensional real space.
The formula can also be written in an expanded matrix form, which amounts to synthesizing the user's key point information within any time period t ∈ [t_i, t_{i+1}) from the multi-frame action key points, thereby realizing an input structure that synthesizes a multi-frame continuous action sequence so that the VGG network classification result can target the action; M_6 denotes a 6*6 matrix.
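A rough sketch of this stitching step is shown below, assuming mouth key points have already been extracted for every frame; the Gaussian window used as basis weights is only a stand-in for the transformation matrix B_{i,k}(t), whose exact form is not reproduced here.

```python
import numpy as np

def stitch_keypoint_sequence(frames, out_size=(224, 224)):
    """Merge a continuous multi-frame keypoint sequence into one static image.

    frames: (T, K, 2) array, T frames of K keypoints (e.g. K = 15 mouth points).
    Each synthesized row is a basis-weighted combination of the per-frame
    keypoints; the Gaussian window below is only a stand-in for the patent's
    transformation matrix B_{i,k}(t), whose exact form is not given here.
    """
    T, K, _ = frames.shape
    flat = frames.reshape(T, -1)                  # (T, 2K): keypoints per frame
    rows = []
    for t in range(T):
        w = np.exp(-0.5 * ((np.arange(T) - t) / 2.0) ** 2)
        w /= w.sum()                              # basis weights over the frames
        rows.append(w @ flat)                     # synthesized keypoints near t
    canvas = np.asarray(rows)                     # (T, 2K) "static image"
    lo, hi = canvas.min(), canvas.max()
    canvas = (canvas - lo) / (hi - lo + 1e-8) * 255.0
    # tile the small keypoint canvas up to the VGG input resolution
    rep = (out_size[0] // T + 1, out_size[1] // (2 * K) + 1)
    img = np.kron(canvas, np.ones(rep))
    return img[:out_size[0], :out_size[1]].astype(np.uint8)
```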
Further, before the step S5 of acquiring the action data of the specified user, processing the action data in a preset manner to obtain a processing result, and inputting the processing result into a VGG network for recognition calculation to obtain the action type corresponding to the action data, the method includes:
S50a: judging whether the number of specified users within the field of view of the robot is two or more;
S50b: if yes, selecting, according to the Yolov3 algorithm, the square area corresponding to each specified user in the field-of-view image corresponding to the field of view of the robot;
S50c: respectively intercepting the series of actions within the specified time period corresponding to each square area as the action data.
For the situation where several people are present at the same designated azimuth or within the current field of view, the embodiment of the application uses the Yolov3 algorithm to frame the position of each person with a square, i.e. a square area, and then crops each person's square area as the action data of the corresponding user; the time-dimension information is used to obtain higher-dimensional features and to improve the accuracy of the analysis. Yolov3 is a one-stage, end-to-end object detector. Yolov3 divides the input image into S x S cells; each cell predicts B bounding boxes, and each bounding box prediction contains the location (x, y, w, h), a confidence score, and the probabilities of C categories, so the Yolov3 output layer predicts S x S x B x (5 + C) values. The loss function of Yolov3 consists of three parts: the location error, the confidence error, and the classification error.
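The per-person cropping described above can be sketched as follows; the Yolov3 detector itself is assumed to be available elsewhere, and only its (x, y, w, h) person boxes are consumed.

```python
def crop_action_clips(frames, person_boxes):
    """Cut one action clip per detected person from the field-of-view frames.

    frames:       list of HxWx3 numpy arrays covering the specified time period.
    person_boxes: list of integer (x, y, w, h) boxes, e.g. person detections
                  from a Yolov3 model run on the first frame (the detector
                  itself is assumed to exist elsewhere and is not shown here).
    Returns {box index: list of cropped frames}, i.e. each user's action data.
    """
    clips = {i: [] for i in range(len(person_boxes))}
    for frame in frames:
        for i, (x, y, w, h) in enumerate(person_boxes):
            clips[i].append(frame[y:y + h, x:x + w].copy())
    return clips
```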
Further, the step S2 of obtaining the pre-rotation spatial region span according to the designated azimuth and the visual center line azimuth includes:
S21: acquiring a first region span for rotating clockwise from the visual center line azimuth to the designated azimuth, and a second region span for rotating counterclockwise from the visual center line azimuth to the designated azimuth;
S22: comparing the sizes of the first region span and the second region span;
S23: taking the second region span as the spatial region span when the first region span is greater than the second region span, and taking the first region span as the spatial region span when the first region span is not greater than the second region span.
In this embodiment, taking the case of a single designated azimuth as an example, when sound from the source at the designated azimuth is received, the visual center line azimuth is rotated toward the direction of the designated azimuth so that the designated azimuth falls within the rotated field of view; preferably, the pre-rotation brings the designated azimuth into coincidence with the adjusted visual center line azimuth. To allow a quick response, the rotation is controlled using the arc region with the smaller span.
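A minimal sketch of choosing the smaller rotation span, assuming azimuths are given in degrees and that clockwise offsets are measured as positive (a convention chosen here, not stated in the patent):

```python
def pre_rotation_span(center_deg, target_deg):
    """Return the smaller rotation span from the visual center line azimuth
    to the designated azimuth, signed: positive = clockwise, negative =
    counterclockwise (the sign convention is chosen here, not by the patent).
    """
    cw = (target_deg - center_deg) % 360.0        # first (clockwise) region span
    ccw = 360.0 - cw                              # second (counterclockwise) span
    return cw if cw <= ccw else -ccw
```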
Further, when the number of designated azimuths is two or more, the spatial region span includes two or more spans, and the step S2 of obtaining the pre-rotation spatial region span according to the designated azimuths and the visual center line azimuth includes:
S31: acquiring a first total region span corresponding to rotating clockwise from the visual center line azimuth through all the designated azimuths, and a second total region span corresponding to rotating counterclockwise from the visual center line azimuth through all the designated azimuths;
S32: comparing the sizes of the first total region span and the second total region span;
S33: taking the second total region span as the spatial region span when the first total region span is greater than the second total region span, and taking the first total region span as the spatial region span when the first total region span is not greater than the second total region span.
In this embodiment of the application, taking the case of several designated azimuths as an example, i.e. several areas emit sound simultaneously or in succession, visual precise positioning needs to be performed on these areas in turn. First, among all arc intervals covered from the designated azimuths to the pre-rotation visual center line azimuth, the largest covered arc interval is selected as the total region span: starting from the pre-rotation visual center line azimuth and rotating clockwise, the largest arc interval that passes through every designated azimuth in turn is taken as the first total region span; starting from the pre-rotation visual center line azimuth and rotating counterclockwise, the largest arc interval that passes through every designated azimuth in turn is taken as the second total region span. After the rotation direction is selected, the action data of the users at the designated azimuths is analyzed in turn, so that the speaker is precisely located.
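Under the same angle conventions, the multi-azimuth case can be sketched as follows; the total span for each rotation sense is taken as the largest offset that sweeps through every designated azimuth.

```python
def pre_rotation_total_span(center_deg, targets_deg):
    """Pick the rotation sense whose total sweep covers all designated azimuths
    with the smaller span. The clockwise total span is the largest clockwise
    offset from the center line among the targets, and likewise counterclockwise.
    Returns ('cw' or 'ccw', span in degrees).
    """
    cw_total = max((t - center_deg) % 360.0 for t in targets_deg)
    ccw_total = max((center_deg - t) % 360.0 for t in targets_deg)
    return ('cw', cw_total) if cw_total <= ccw_total else ('ccw', ccw_total)
```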
Further, the step S6 of receiving the data result output by the VGG network after the recognition calculation and judging, according to the data result of the VGG network, whether the sound source azimuth is consistent with the designated azimuth includes:
S61: analyzing whether the data result includes an opening and closing action of the mouth;
S62: if yes, determining again whether the current sound source azimuth is the designated azimuth;
S63: if yes, judging that the sound source azimuth is consistent with the designated azimuth; otherwise, judging that they are inconsistent.
Analyzing whether a mouth opening and closing action exists gives a preliminary judgment of whether speech is taking place. If speech is taking place, sound source localization is invoked again for auxiliary analysis; if both sound source localization and visual positioning point to the specified user as the speaker, the specified user is judged to be the speaker. That is, if the specified user shows a mouth action and the sound direction is also correct, it is determined that the specified user is speaking. If the judgment does not converge, the judgment flow is repeated to continue searching for the sound source user, i.e. the speaker, for example in the case where a mouth action exists but the sound for the given user does not originate from that direction. The VGG network by itself can only process static picture information and recognize the features of marked points in a picture, for example recognizing a fruit type from the features of marked points in the picture; motion information such as a mouth opening and closing action cannot be obtained directly from a VGG measurement. In this embodiment, multiple frames of the action video are stitched together and then input into the VGG network, the trajectory of the marked point positions in the pictures is obtained from the VGG output data, and it is judged whether a mouth opening and closing action exists; this is combined with sound source localization to judge whether the mouth opening and closing action is consistent with the localized sound source azimuth. If the video captured at an azimuth contains a user's mouth opening and closing action and, at the same time, sound source sound is present at that azimuth, the user is judged to be the speaker, i.e. the sound source user. The sound source azimuth here is still determined using the microphone array sound source localization technique.
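A sketch of this fusion decision is given below; the angular tolerance is an assumed parameter, since the patent only requires the re-measured sound source azimuth to match the designated azimuth.

```python
def confirm_sound_source_user(mouth_action_detected, doa_deg, designated_deg, tol_deg=10.0):
    """Fuse the VGG mouth-action result with a re-measured sound bearing.

    The specified user at the designated azimuth is confirmed as the sound
    source user only if a mouth opening/closing action was recognised AND the
    re-measured direction of arrival still matches the designated azimuth.
    The angular tolerance is an assumed parameter, not taken from the patent.
    """
    if not mouth_action_detected:
        return False
    diff = abs((doa_deg - designated_deg + 180.0) % 360.0 - 180.0)
    return diff <= tol_deg
```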
Further, before the step S61 of analyzing whether the data result includes an opening and closing action of the mouth, the method includes:
S60a: judging whether the focusing condition of the camera is normal relative to the distance between the specified user and the camera;
S60b: if yes, judging whether the resolution of the user portrait acquired under this focusing condition is within a preset range;
S60c: if yes, controlling the VGG network to perform the recognition calculation; otherwise, terminating the calculation.
Preferably, to further guarantee the privacy and security of the action data, the action data may also be stored in a node of a blockchain.
It should be noted that the blockchain referred to in the present invention is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association with one another by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and so on.
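For illustration only, a minimal hash-chained block for the action data might look as follows; a real deployment would rely on the blockchain platform's own SDK and consensus mechanism rather than this sketch.

```python
import hashlib
import json
import time

def make_block(action_data, prev_hash):
    """Build one hash-chained block holding a digest of the action data.

    Each block stores a SHA-256 digest of the (JSON-serializable) action data
    plus the previous block's hash, so tampering with stored motion data is
    detectable. This is only a toy illustration; a real deployment would use
    the blockchain platform's own SDK and consensus mechanism.
    """
    payload = {
        "timestamp": time.time(),
        "action_data_digest": hashlib.sha256(
            json.dumps(action_data, sort_keys=True).encode()
        ).hexdigest(),
        "prev_hash": prev_hash,
    }
    payload["hash"] = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()
    ).hexdigest()
    return payload
```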
In addition, this scheme can be applied to the intelligent transportation field, promoting the construction of smart cities.
In this embodiment, resolution is used to eliminate the interference of a virtual character on an electronic screen with the localization of the real speaker: because of the reflectivity of an electronic screen, the resolution of an image or video captured of a real user is far higher than that of a virtual user captured from an electronic screen under the same distance and focusing conditions. When the resolution does not meet the requirement, the VGG network recognition calculation is terminated directly, and the conclusion that the sound source azimuth is inconsistent with the designated azimuth is output.
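A sketch of the gating logic from steps S60a to S60c is given below; the resolution measure and the preset range are illustrative choices, since the patent only requires the portrait resolution to lie within a preset range.

```python
def should_run_vgg(focus_normal, portrait_resolution, resolution_range=(96, 1080)):
    """Gate the VGG recognition calculation (steps S60a to S60c).

    focus_normal:        whether the camera focus is normal for the user's distance.
    portrait_resolution: a resolution measure of the captured user portrait,
                         e.g. its pixel height (the exact measure and the preset
                         range are illustrative choices, not from the patent).
    """
    if not focus_normal:
        return False
    low, high = resolution_range
    return low <= portrait_resolution <= high
```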
Referring to fig. 2, an apparatus for locating a sound source user according to an embodiment of the present application includes:
the first acquiring module 1 is configured to acquire a designated azimuth corresponding to a sound source identified by sound source localization, and a visual center line azimuth corresponding to a current spatial position of the robot.
The above sound source localization is implemented with a microphone array. Delay parameters are set for each microphone in the array, and different azimuths are steered by controlling different delay parameters: the area to be located is divided into a grid, each grid point applies a time-domain delay to each microphone, the delayed signals of the microphone array are then summed, and the resulting sound pressure is used to determine the azimuth of the sound source, i.e. the azimuth of the sound source relative to the robot, which is the designated azimuth. The robot performs both sound source positioning and visual positioning, and the visual center line azimuth is the center position of its field of view. It is determined, for example, according to whether the robot has a monocular or a binocular structure: for a monocular structure, the visual center line azimuth is the direction of the straight line that passes through the center of the single camera and is perpendicular to the plane of the robot's face; for a binocular structure, it is the direction of the center line that passes through the midpoint of the line connecting the two cameras and is perpendicular to the plane of the robot's face.
The obtaining module 2 is used for obtaining the pre-rotation spatial region span according to the designated azimuth and the visual center line azimuth.
The spatial region span is the region corresponding to the arc from the robot's current visual center line azimuth to the designated azimuth, i.e. either the arc swept when the current visual center line azimuth rotates counterclockwise to the designated azimuth or the arc swept when it rotates clockwise to the designated azimuth. The initial sound source localization helps the robot quickly adjust its visual positioning direction, which improves response sensitivity and accuracy.
The rotation module 3 is used for controlling the robot to rotate according to the pre-rotation spatial region span until the designated azimuth lies within the visual range of the robot.
The designated azimuth being located within the visual range of the robot includes any position within that range; preferably, the designated azimuth coincides with the visual center line azimuth, so as to improve the accuracy of visual positioning. The rotation includes rotating the camera-equipped head, or rotating the entire body of the robot. During rotation, the camera can be aligned with the speaker's direction, i.e. the designated azimuth, by coordinating the yaw angles of the robot's waist and head.
The first judging module 4 is used for judging whether the user portrait of the specified user is acquired within the field of view of the robot.
The user portrait includes a head portrait, and a preliminary determination of whether the user is speaking is made by identifying mouth motion in the head portrait.
The second acquisition module 5 is used for, if yes, acquiring the action data of the specified user, processing the action data in a preset manner to obtain a processing result, and inputting the processing result into a VGG network for recognition calculation to obtain the action type corresponding to the action data.
When a head portrait is present, the user is presumed to be speaking; the mouth motion is then further acquired, processed in the preset manner, and input into the VGG network for in-depth analysis of the mouth motion type. The preset processing includes stitching the acquired mouth motion video information into a single piece of picture information carrying the time sequence, so that it can be recognized by the VGG network.
The receiving module 6 is used for receiving the data result output by the VGG network after the recognition calculation and judging, according to the data result of the VGG network, whether the sound source azimuth is consistent with the designated azimuth, wherein the data result includes that the action type belongs to a mouth action.
The judging module 7 is used for judging the specified user at the designated azimuth to be the sound source user if the sound source azimuth is consistent with the designated azimuth.
The data result output by the VGG network includes whether a mouth action exists; for example, if the mouth shape in the picture information changes significantly over the time sequence, a mouth action is considered to exist, and otherwise it does not. If the VGG network judges that the specified user at the designated azimuth shows a mouth action, and the sound source azimuth given by sound source localization is consistent with the designated azimuth, the specified user is determined to be the sound source user. By combining the advantages of visual positioning and sound source localization, the sound source user is accurately located, the speaker can be found quickly, and the human-computer interaction experience and interaction effect between the speaker and the robot are improved. The embodiment of the application determines the approximate position of the target user through the sound source localization technique and quickly gives a positioning result, and then precisely locates the target user through visual positioning; in the visual positioning, the sequential motion data of the person is used as the input of the VGG network, and the motion data improves the accuracy with which the target user is distinguished. Before the motion data is input into the VGG network, it is processed in a specific way so that the processed data can be recognized and computed by the VGG network, which eliminates the interference of imitation persons or user-like objects on the visual positioning; the target user refers to the specified user within the field of view.
Further, the second acquisition module 5 includes:
the first acquisition unit, used for acquiring action data of the specified user within a specified time period, wherein the action data is a continuous multi-frame action sequence;
the splicing unit, used for merging and stitching the continuous frames of the action sequence into one piece of static image data through p(t) = Σ_i p_i B_{i,k}(t), wherein p_i ∈ R^n denotes the key point at time t, i denotes the serial number of the key point, B_{i,k}(t) denotes a transformation matrix, k denotes the dimension, and p(t) is the static image data output for t ∈ [t_i, t_{i+1});
and the input unit, used for inputting the static image data into the VGG network for recognition calculation.
The specified time period refers to the continuous time span of the mouth action video captured by the camera. The mouth action video acquired by the camera is split into a continuous multi-frame action sequence, and the frames are stitched together in time order, so that the mouth action video is turned into static image data that the VGG network can recognize and compute. Each person's behavior, including mouth movements, can be determined by a few key points; for example, mouth movements use 15 key points, i = 0 to 14. By improving the input end of the VGG network, the VGG network can process a continuous multi-frame action sequence, so that the mouth action can be recognized. B_{i,k}(t) denotes a transformation matrix and k denotes the dimension; p(t) is the output result for t ∈ [t_i, t_{i+1}), and R^n denotes n-dimensional real space.
The formula can also be written in an expanded matrix form, which amounts to synthesizing the user's key point information within any time period t ∈ [t_i, t_{i+1}) from the multi-frame action key points, thereby realizing an input structure that synthesizes a multi-frame continuous action sequence so that the VGG network classification result can target the action; M_6 denotes a 6*6 matrix.
Further, the apparatus for locating a sound source user comprises:
the second judging module, used for judging whether the number of specified users within the field of view of the robot is two or more;
the selection module, used for, if yes, selecting, according to the Yolov3 algorithm, the square area corresponding to each specified user in the field-of-view image corresponding to the field of view of the robot;
and the intercepting module, used for respectively intercepting the series of actions within the specified time period corresponding to each square area as the action data.
For the situation where several people are present at the same designated azimuth or within the current field of view, the embodiment of the application uses the Yolov3 algorithm to frame the position of each person with a square, i.e. a square area, and then crops each person's square area as the action data of the corresponding user; the time-dimension information is used to obtain higher-dimensional features and to improve the accuracy of the analysis. Yolov3 is a one-stage, end-to-end object detector. Yolov3 divides the input image into S x S cells; each cell predicts B bounding boxes, and each bounding box prediction contains the location (x, y, w, h), a confidence score, and the probabilities of C categories, so the Yolov3 output layer predicts S x S x B x (5 + C) values. The loss function of Yolov3 consists of three parts: the location error, the confidence error, and the classification error.
Further, the obtaining module 2 includes:
a second acquisition unit, configured to acquire a first region span for rotating clockwise from the visual center line azimuth to the designated azimuth, and a second region span for rotating counterclockwise from the visual center line azimuth to the designated azimuth;
a first comparing unit, configured to compare the sizes of the first region span and the second region span;
and a first unit, configured to take the second region span as the spatial region span when the first region span is greater than the second region span, and take the first region span as the spatial region span when the first region span is not greater than the second region span.
In this embodiment, taking the case of a single designated azimuth as an example, when sound from the source at the designated azimuth is received, the visual center line azimuth is rotated toward the direction of the designated azimuth so that the designated azimuth falls within the rotated field of view; preferably, the pre-rotation brings the designated azimuth into coincidence with the adjusted visual center line azimuth. To allow a quick response, the rotation is controlled using the arc region with the smaller span.
Further, in another embodiment, the obtaining module 2 includes:
a third acquisition unit, configured to acquire a first total region span corresponding to rotating clockwise from the visual center line azimuth through all the designated azimuths, and a second total region span corresponding to rotating counterclockwise from the visual center line azimuth through all the designated azimuths;
a second comparing unit, configured to compare the sizes of the first total region span and the second total region span;
and a second unit, configured to take the second total region span as the spatial region span when the first total region span is greater than the second total region span, and take the first total region span as the spatial region span when the first total region span is not greater than the second total region span.
In this embodiment of the application, taking the case of several designated azimuths as an example, i.e. several areas emit sound simultaneously or in succession, visual precise positioning needs to be performed on these areas in turn. First, among all arc intervals covered from the designated azimuths to the pre-rotation visual center line azimuth, the largest covered arc interval is selected as the total region span: starting from the pre-rotation visual center line azimuth and rotating clockwise, the largest arc interval that passes through every designated azimuth in turn is taken as the first total region span; starting from the pre-rotation visual center line azimuth and rotating counterclockwise, the largest arc interval that passes through every designated azimuth in turn is taken as the second total region span. After the rotation direction is selected, the action data of the users at the designated azimuths is analyzed in turn, so that the speaker is precisely located.
Further, the receiving module 6 includes:
the analysis unit, used for analyzing whether the data result includes an opening and closing action of the mouth;
the determining unit, used for, if yes, determining again whether the current sound source azimuth is the designated azimuth;
and the judging unit, used for, if yes, judging that the sound source azimuth is consistent with the designated azimuth, and otherwise judging that they are inconsistent.
Analyzing whether a mouth opening and closing action exists gives a preliminary judgment of whether speech is taking place. If speech is taking place, sound source localization is invoked again for auxiliary analysis; if both sound source localization and visual positioning point to the specified user as the speaker, the specified user is judged to be the speaker. That is, if the specified user shows a mouth action and the sound direction is also correct, it is determined that the specified user is speaking. If the judgment does not converge, the judgment flow is repeated to continue searching for the sound source user, i.e. the speaker, for example in the case where a mouth action exists but the sound for the given user does not originate from that direction. The VGG network by itself can only process static picture information and recognize the features of marked points in a picture, for example recognizing a fruit type from the features of marked points in the picture; motion information such as a mouth opening and closing action cannot be obtained directly from a VGG measurement. In this embodiment, multiple frames of the action video are stitched together and then input into the VGG network, the trajectory of the marked point positions in the pictures is obtained from the VGG output data, and it is judged whether a mouth opening and closing action exists; this is combined with sound source localization to judge whether the mouth opening and closing action is consistent with the localized sound source azimuth. If the video captured at an azimuth contains a user's mouth opening and closing action and, at the same time, sound source sound is present at that azimuth, the user is judged to be the speaker, i.e. the sound source user. The sound source azimuth here is still determined using the microphone array sound source localization technique.
Further, the receiving module 6 includes:
the first judging unit, used for judging whether the focusing condition of the camera is normal relative to the distance between the specified user and the camera;
the second judging unit, used for, if yes, judging whether the resolution of the user portrait acquired under this focusing condition is within a preset range;
and the control unit, used for, if yes, controlling the VGG network to perform the recognition calculation, and otherwise terminating the calculation.
In this embodiment, resolution is used to eliminate the interference of a virtual character on an electronic screen with the localization of the real speaker: because of the reflectivity of an electronic screen, the resolution of an image or video captured of a real user is far higher than that of a virtual user captured from an electronic screen under the same distance and focusing conditions. When the resolution does not meet the requirement, the VGG network recognition calculation is terminated directly, and the conclusion that the sound source azimuth is inconsistent with the designated azimuth is output.
Referring to fig. 3, an embodiment of the present application further provides a computer device, which may be a server and whose internal structure may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store all the data required by the process of locating the sound source user. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a method of locating a sound source user.
The method for locating a sound source user executed by the processor comprises: acquiring a designated azimuth corresponding to a sound source identified by sound source localization, and a visual center line azimuth corresponding to the current spatial position of a robot; obtaining a pre-rotation spatial region span according to the designated azimuth and the visual center line azimuth; controlling the robot to rotate according to the pre-rotation spatial region span until the designated azimuth lies within the visual range of the robot; judging whether a user portrait of a specified user is acquired within the field of view of the robot; if yes, acquiring action data of the specified user, processing the action data in a preset manner to obtain a processing result, and inputting the processing result into a VGG network for recognition calculation to obtain the action type corresponding to the action data; receiving the data result output by the VGG network after the recognition calculation, and judging, according to the data result of the VGG network, whether the sound source azimuth is consistent with the designated azimuth, wherein the data result includes that the action type belongs to a mouth action; and if yes, judging the specified user at the designated azimuth to be the sound source user.
With the above computer device, the sequential motion data of a person is used as the input of the VGG network during visual positioning, the motion data improves the accuracy of discrimination, and visual positioning and sound positioning are used in combination, so that the accuracy with which the robot locates the speaking target user is improved.
In one embodiment, the step in which the above processor acquires the action data of the specified user, processes the action data in a preset manner to obtain a processing result, and inputs the processing result into a VGG network for recognition calculation to obtain the action type corresponding to the action data includes: acquiring action data of the specified user within a specified time period, wherein the action data is a continuous multi-frame action sequence; merging and stitching the continuous frames of the action sequence into one piece of static image data through p(t) = Σ_i p_i B_{i,k}(t), wherein p_i ∈ R^n denotes the key point at time t, i denotes the serial number of the key point, B_{i,k}(t) denotes a transformation matrix, k denotes the dimension, and p(t) is the static image data output for t ∈ [t_i, t_{i+1}); and inputting the static image data into the VGG network for recognition calculation.
In one embodiment, before the step in which the above processor acquires the action data of the specified user, processes the action data in a preset manner to obtain a processing result, and inputs the processing result into a VGG network for recognition calculation to obtain the action type corresponding to the action data, the method includes: judging whether the number of specified users within the field of view of the robot is two or more; if yes, selecting, according to the Yolov3 algorithm, the square area corresponding to each specified user in the field-of-view image corresponding to the field of view of the robot; and respectively intercepting the series of actions within the specified time period corresponding to each square area as the action data.
In one embodiment, the step in which the processor obtains the pre-rotation spatial region span according to the designated azimuth and the visual center line azimuth includes: acquiring a first region span for rotating clockwise from the visual center line azimuth to the designated azimuth, and a second region span for rotating counterclockwise from the visual center line azimuth to the designated azimuth; comparing the sizes of the first region span and the second region span; and taking the second region span as the spatial region span when the first region span is greater than the second region span, and taking the first region span as the spatial region span when the first region span is not greater than the second region span.
In one embodiment, when the number of designated azimuths is two or more, the spatial region span includes two or more spans, and the step in which the processor obtains the pre-rotation spatial region span according to the designated azimuths and the visual center line azimuth includes: acquiring a first total region span corresponding to rotating clockwise from the visual center line azimuth through all the designated azimuths, and a second total region span corresponding to rotating counterclockwise from the visual center line azimuth through all the designated azimuths; comparing the sizes of the first total region span and the second total region span; and taking the second total region span as the spatial region span when the first total region span is greater than the second total region span, and taking the first total region span as the spatial region span when the first total region span is not greater than the second total region span.
In one embodiment, the step in which the processor receives the data result output by the VGG network after the recognition calculation and judges, according to the data result of the VGG network, whether the sound source azimuth is consistent with the designated azimuth includes: analyzing whether the data result includes an opening and closing action of the mouth; if yes, determining again whether the current sound source azimuth is the designated azimuth; and if yes, judging that the sound source azimuth is consistent with the designated azimuth, and otherwise judging that they are inconsistent.
In one embodiment, before the step of analyzing whether the data result includes an opening and closing action of the mouth, the method includes: judging whether the focusing condition of the camera is normal relative to the distance between the specified user and the camera; if yes, judging whether the resolution of the user portrait acquired under this focusing condition is within a preset range; and if yes, controlling the VGG network to perform the recognition calculation, and otherwise terminating the calculation.
It will be appreciated by those skilled in the art that the architecture shown in fig. 3 is merely a block diagram of a portion of the architecture in connection with the present inventive arrangements and is not intended to limit the computer devices to which the present inventive arrangements are applicable.
An embodiment of the present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of locating a sound source user, comprising: acquiring a designated azimuth corresponding to a sound source identified by sound source localization and a visual center line azimuth corresponding to a current spatial position of a robot; obtaining a pre-rotation space region span according to the appointed azimuth and the vision center line azimuth; controlling the robot to rotate according to the pre-rotated space region span until the designated direction is positioned in the visual range of the robot; judging whether a user portrait of a specified user is acquired in the visual field range of the robot; if yes, acquiring action data of the appointed user, processing the action data in a preset mode to obtain a processing result, and inputting the processing result into a VGG network for identification calculation to obtain an action type corresponding to the action data; receiving a data result output after the VGG network identification calculation, and judging whether the sound source azimuth is consistent with the appointed azimuth according to the data result of the VGG network, wherein the data result comprises that the action type belongs to mouth action; if yes, judging the appointed user with the appointed azimuth as the sound source user.
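Putting the pieces together, the stored method can be pictured as the following flow. The DemoRobot class is a purely hypothetical stand-in for the robot's sensing and motion interfaces, and the flow reuses the pre_rotation_span, merge_sequence_to_image and confirm_sound_source sketches from the earlier examples.

import numpy as np

class DemoRobot:
    """Hypothetical stand-in for the robot's sensing and motion interfaces."""
    def localize_sound(self): return 40.0
    def visual_center_azimuth(self): return 0.0
    def rotate(self, span, direction): print(f"rotate {span:.0f} deg {direction}")
    def sees_user(self): return True
    def capture_action_clip(self): return np.random.rand(10, 18, 2)
    def vgg_recognize(self, static): return {"action_types": ["mouth_open_close"]}
    def current_user(self): return "user-1"

def locate_sound_source_user(robot):
    """Sound localization -> pre-rotation span -> rotate -> visual check ->
    action recognition -> azimuth consistency check."""
    specified = robot.localize_sound()
    span, direction = pre_rotation_span(robot.visual_center_azimuth(), specified)
    robot.rotate(span, direction)
    if not robot.sees_user():
        return None
    static = merge_sequence_to_image(robot.capture_action_clip())
    result = robot.vgg_recognize(static)
    if confirm_sound_source(result, robot.localize_sound(), specified):
        return robot.current_user()
    return None

print(locate_sound_source_user(DemoRobot()))   # user-1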
According to the computer readable storage medium, sequential motion data of a person is used as the input of the VGG network during visual positioning, which improves the accuracy of distinguishing users through the motion data, and visual positioning and sound positioning are used together, so that the accuracy with which the robot locates the target user who is speaking is improved.
In one embodiment, the step of obtaining the action data of the specified user by the above processor, processing the action data in a preset manner to obtain a processing result, and inputting the processing result to a VGG network for recognition calculation to obtain an action type corresponding to the action data includes: acquiring action data of the appointed user in an appointed time period, wherein the action data is a continuous multi-frame action sequence; merging and stitching successive frames of the action sequence into one piece of static image data through p(t) = Σ_i p_i B_{i,k}(t), wherein p_i ∈ R^n represents the key point at time t, and i represents the serial number of the key point; B_{i,k}(t) represents a transformation matrix, and k represents a dimension; p(t) is the static image data output for t ∈ [t_i, t_{i+1}); and inputting the static image data into a VGG network for identification calculation.
In one embodiment, the step of obtaining the action data of the specified user by the above processor, processing the action data in a preset manner to obtain a processing result, and inputting the processing result to a VGG network for recognition calculation to obtain an action type corresponding to the action data includes: judging whether the number of the appointed users in the visual field range of the robot is two or more; if yes, selecting square areas corresponding to the appointed users respectively from a view field diagram corresponding to the view field range of the robot according to a Yolov3 algorithm; and respectively intercepting a series of actions in the specified time period corresponding to each square area as the action data.
In one embodiment, the step of obtaining the pre-rotated spatial region span by the processor according to the specified azimuth and the visual center line azimuth includes: acquiring a first region span when rotating clockwise from the visual center line azimuth to the appointed azimuth and a second region span when rotating anticlockwise from the visual center line azimuth to the appointed azimuth; comparing the sizes of the first region span and the second region span; taking the second region span as the spatial region span when the first region span is greater than the second region span, and taking the first region span as the spatial region span when the first region span is not greater than the second region span.
In one embodiment, the number of the specified orientations is two or more, the spatial region span includes two or more spans, and the step of obtaining, by the processor, the pre-rotated spatial region span according to the specified orientations and the visual center line orientation includes: acquiring a first total area span corresponding to clockwise rotation from the visual center line orientation through all the specified orientations, and a second total area span corresponding to anticlockwise rotation from the visual center line orientation through all the specified orientations; comparing the first total area span with the second total area span; and when the first total area span is not larger than the second total area span, taking the first total area span as the spatial region span.
In one embodiment, the step of receiving the data result output after the VGG network identification calculation by the processor and determining whether the sound source azimuth is consistent with the designated azimuth according to the data result of the VGG network includes: analyzing whether the data result includes an opening and closing action of the mouth; if yes, re-determining whether the current sound source azimuth is still the appointed azimuth; if yes, judging that the sound source azimuth is consistent with the appointed azimuth, and otherwise judging that the sound source azimuth is inconsistent with the appointed azimuth.
In one embodiment, before the step of analyzing whether the data result includes opening and closing actions of the mouth, the method includes: judging whether the focusing condition of the camera is normal relative to the distance between the appointed user and the camera; if yes, judging whether the resolution of the user portrait acquired under the focusing condition is within a preset range; if yes, controlling the VGG network to recognize and calculate, otherwise, terminating calculation.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium provided by the present application and used in embodiments may include non-volatile and/or volatile memory. The nonvolatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, apparatus, article or method that comprises the element.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the scope of the application, and all equivalent structures or equivalent processes using the descriptions and drawings of the present application or directly or indirectly applied to other related technical fields are included in the scope of the application.

Claims (9)

1. A method of locating a sound source user, comprising:
acquiring a designated azimuth corresponding to a sound source identified by sound source localization and a visual center line azimuth corresponding to a current spatial position of a robot;
Obtaining a pre-rotation space region span according to the appointed azimuth and the vision center line azimuth;
controlling the robot to rotate according to the pre-rotated space region span until the designated direction is positioned in the visual range of the robot;
judging whether a user portrait of a specified user is acquired in the visual field range of the robot;
if yes, acquiring action data of the appointed user, processing the action data in a preset mode to obtain a processing result, and inputting the processing result into a VGG network for identification calculation to obtain an action type corresponding to the action data;
receiving a data result output after the VGG network identification calculation, and judging whether the sound source azimuth is consistent with the appointed azimuth according to the data result of the VGG network, wherein the data result comprises that the action type belongs to mouth action;
if yes, judging the appointed user in the appointed direction as a sound source user;
the step of obtaining the action data of the appointed user, processing the action data in a preset mode to obtain a processing result, inputting the processing result into a VGG network for recognition and calculation to obtain the action type corresponding to the action data comprises the following steps:
Acquiring action data of the appointed user in an appointed time period, wherein the action data is a continuous multi-frame action sequence;
merging and stitching successive frames of said action sequence into one piece of static image data through p(t) = Σ_i p_i B_{i,k}(t), wherein p_i ∈ R^n represents the key point at time t, and i represents the serial number of the key point; B_{i,k}(t) represents a transformation matrix, and k represents a dimension; p(t) is the static image data output for t ∈ [t_i, t_{i+1});
and inputting the static image data into a VGG network for identification calculation.
2. The method for locating a sound source user according to claim 1, wherein before the step of obtaining the action data of the specified user and processing the action data in a preset manner to obtain a processing result, inputting the processing result to a VGG network for recognition calculation to obtain an action type corresponding to the action data, the method comprises:
judging whether the number of the appointed users in the visual field range of the robot is two or more;
if yes, selecting square areas corresponding to the appointed users respectively from a view field diagram corresponding to the view field range of the robot according to a Yolov3 algorithm;
and respectively intercepting a series of actions in a specified time period corresponding to each square area as the action data.
3. The method of locating a sound source user according to claim 1, wherein said step of obtaining a pre-rotated spatial region span from said specified azimuth and said visual center line azimuth comprises:
acquiring a first region span when rotating clockwise from the visual center line azimuth to the appointed azimuth and a second region span when rotating anticlockwise from the visual center line azimuth to the appointed azimuth;
comparing the sizes of the first region span and the second region span;
taking the second region span as the spatial region span when the first region span is greater than the second region span, and taking the first region span as the spatial region span when the first region span is not greater than the second region span.
4. The method of locating a sound source user according to claim 1, wherein the number of specified orientations is two or more, the spatial region span includes two or more, and the step of obtaining a pre-rotated spatial region span from the specified orientations and the visual center line orientation includes:
acquiring a first total area span corresponding to clockwise rotation from the visual center line orientation through all the specified orientations, and a second total area span corresponding to anticlockwise rotation from the visual center line orientation through all the specified orientations;
Comparing the first total area span with the second total area span;
and when the first total area span is not larger than the second total area span, taking the first total area span as the space area span.
5. The method for locating a sound source user according to claim 1, wherein the step of receiving the VGG network identification calculated data result and determining whether the sound source position is consistent with the designated position according to the VGG network data result comprises:
analyzing whether the data result comprises opening and closing actions of the mouth;
if yes, re-determining whether the current sound source azimuth is the appointed azimuth;
if yes, judging that the sound source azimuth is consistent with the appointed azimuth, otherwise, judging that the sound source azimuth is inconsistent.
6. The method of locating a sound source user according to claim 5, wherein before the step of analyzing whether the data result includes a mouth opening and closing action, comprising:
judging whether the focusing condition of the camera is normal relative to the distance between the appointed user and the camera;
If yes, judging whether the resolution of the user portrait acquired under the focusing condition is within a preset range;
if yes, controlling the VGG network to recognize and calculate, otherwise, terminating calculation.
7. An apparatus for locating a sound source user for implementing the method of any one of claims 1 to 6, comprising:
the first acquisition module is used for acquiring a designated azimuth corresponding to the sound source identified by sound source localization and a visual center line azimuth corresponding to the current spatial position of the robot;
the obtaining module is used for obtaining a pre-rotated space region span according to the appointed azimuth and the vision center line azimuth;
the rotating module is used for controlling the robot to rotate according to the pre-rotated space region span until the designated direction is positioned in the visual range of the robot;
the first judging module is used for judging whether the user portrait of the appointed user is acquired in the visual field range of the robot;
the second acquisition module is used for acquiring the action data of the appointed user if yes, processing the action data in a preset mode to obtain a processing result, and inputting the processing result into a VGG network for recognition calculation to obtain an action type corresponding to the action data;
The receiving module is used for receiving the data result output after the VGG network identification calculation and judging whether the sound source azimuth is consistent with the appointed azimuth according to the data result of the VGG network;
and the judging module is used for judging, if the sound source azimuth is consistent with the appointed azimuth, that the appointed user at the appointed azimuth is the sound source user.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
CN202010334984.2A 2020-04-24 2020-04-24 Method, device and computer equipment for positioning sound source user Active CN111650558B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010334984.2A CN111650558B (en) 2020-04-24 2020-04-24 Method, device and computer equipment for positioning sound source user
PCT/CN2020/093425 WO2021212608A1 (en) 2020-04-24 2020-05-29 Method and apparatus for positioning sound source user, and computer device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010334984.2A CN111650558B (en) 2020-04-24 2020-04-24 Method, device and computer equipment for positioning sound source user

Publications (2)

Publication Number Publication Date
CN111650558A CN111650558A (en) 2020-09-11
CN111650558B true CN111650558B (en) 2023-10-10

Family

ID=72340980

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010334984.2A Active CN111650558B (en) 2020-04-24 2020-04-24 Method, device and computer equipment for positioning sound source user

Country Status (2)

Country Link
CN (1) CN111650558B (en)
WO (1) WO2021212608A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109121031B (en) * 2018-10-29 2020-11-17 歌尔科技有限公司 Directional display method and device for audio equipment and audio equipment
CN113762219A (en) * 2021-11-03 2021-12-07 恒林家居股份有限公司 Method, system and storage medium for identifying people in mobile conference room
CN114594892B (en) * 2022-01-29 2023-11-24 深圳壹秘科技有限公司 Remote interaction method, remote interaction device, and computer storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101295016A (en) * 2008-06-13 2008-10-29 河北工业大学 Sound source independent searching and locating method
CN105184214A (en) * 2015-07-20 2015-12-23 北京进化者机器人科技有限公司 Sound source positioning and human face detection-based human body positioning method and system
CN105760824A (en) * 2016-02-02 2016-07-13 北京进化者机器人科技有限公司 Moving body tracking method and system
CN110569808A (en) * 2019-09-11 2019-12-13 腾讯科技(深圳)有限公司 Living body detection method and device and computer equipment

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS6055985A (en) * 1983-09-05 1985-04-01 株式会社トミー Sound recognizing toy
CN103235645A (en) * 2013-04-25 2013-08-07 上海大学 Standing type display interface self-adaption tracking regulating device and method
CN106970698A (en) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 Domestic intelligent equipment
US10474883B2 (en) * 2016-11-08 2019-11-12 Nec Corporation Siamese reconstruction convolutional neural network for pose-invariant face recognition
CN209356668U (en) * 2018-11-23 2019-09-06 中国科学院电子学研究所 Auditory localization identification device
WO2020184733A1 (en) * 2019-03-08 2020-09-17 엘지전자 주식회사 Robot
CN110691196A (en) * 2019-10-30 2020-01-14 歌尔股份有限公司 Sound source positioning method of audio equipment and audio equipment

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101295016A (en) * 2008-06-13 2008-10-29 河北工业大学 Sound source independent searching and locating method
CN105184214A (en) * 2015-07-20 2015-12-23 北京进化者机器人科技有限公司 Sound source positioning and human face detection-based human body positioning method and system
CN105760824A (en) * 2016-02-02 2016-07-13 北京进化者机器人科技有限公司 Moving body tracking method and system
CN110569808A (en) * 2019-09-11 2019-12-13 腾讯科技(深圳)有限公司 Living body detection method and device and computer equipment

Also Published As

Publication number Publication date
CN111650558A (en) 2020-09-11
WO2021212608A1 (en) 2021-10-28

Similar Documents

Publication Publication Date Title
CN111650558B (en) Method, device and computer equipment for positioning sound source user
US10621991B2 (en) Joint neural network for speaker recognition
US10949649B2 (en) Real-time tracking of facial features in unconstrained video
CN107341442B (en) Motion control method, motion control device, computer equipment and service robot
US20210097717A1 (en) Method for detecting three-dimensional human pose information detection, electronic device and storage medium
US20110135102A1 (en) Method, computer readable storage medium and system for localizing acoustic source
US20180260643A1 (en) Verification method and system
US10515636B2 (en) Speech recognition using depth information
US10582117B1 (en) Automatic camera control in a video conference system
JP2002251234A (en) Human interface system by plural sensor
KR20160136817A (en) Method for displaying augmented reality of based 3d point cloud cognition, apparatus and system for executing the method
JP2003346157A (en) Object tracking device and object tracking method
Kirchmaier et al. Dynamical information fusion of heterogeneous sensors for 3D tracking using particle swarm optimization
CN111784776A (en) Visual positioning method and device, computer readable medium and electronic equipment
CN113793382A (en) Video image splicing seam searching method and video image splicing method and device
CN115562499B (en) Intelligent ring-based accurate interaction control method and system and storage medium
US20200401151A1 (en) Device motion control
CN109934165A (en) A kind of joint point detecting method, device, storage medium and electronic equipment
US11188787B1 (en) End-to-end room layout estimation
Guan et al. Registration based on scene recognition and natural features tracking techniques for wide-area augmented reality systems
Strobel et al. Joint audio-video signal processing for object localization and tracking
Picos et al. Three-dimensional pose tracking by image correlation and particle filtering
CN115471863A (en) Three-dimensional posture acquisition method, model training method and related equipment
Zhu et al. Speaker localization based on audio-visual bimodal fusion
WO2021149509A1 (en) Imaging device, imaging method, and program

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant