CN111048113A - Sound direction positioning processing method, device and system, computer equipment and storage medium - Google Patents


Info

Publication number: CN111048113A
Application number: CN201911311585.8A
Authority: CN (China)
Prior art keywords: sound source; sound; source direction; predicted; data
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN111048113B (en)
Inventor: 张明远
Assignee (current and original): Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd; priority to CN201911311585.8A
Publication of CN111048113A; application granted; publication of CN111048113B


Classifications

    • G10L25/51 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00–G10L21/00, specially adapted for particular use, for comparison or discrimination
    • G06V40/20 — Recognition of biometric, human-related or animal-related patterns in image or video data; movements or behaviour, e.g. gesture recognition
    • G10L15/20 — Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress-induced speech
    • G10L15/26 — Speech to text systems
    • G10L21/0216 — Noise filtering characterised by the method used for estimating noise
    • G10L2021/02166 — Microphone arrays; beamforming

Abstract

The application relates to a sound direction positioning processing method, apparatus, system, computer device and storage medium, wherein the method comprises the following steps: acquiring voice data collected from an environment; predicting at least one sound source direction based on the voice data; collecting images according to the at least one predicted sound source direction, and identifying the posture features of the language expression part of the target object from the collected images; and determining a final sound source direction from the at least one predicted sound source direction according to the matching degree between the posture features and the voice data. The scheme of the present application can improve the accuracy of locating the sound direction.

Description

Sound direction positioning processing method, device and system, computer equipment and storage medium
Technical Field
The present invention relates to the field of computer technology and speech processing technology, and in particular, to a method, an apparatus, a system, computer equipment, and a storage medium for sound direction positioning processing.
Background
With the rapid development of science and technology, voice technology is applied more and more widely and is involved in many application scenarios, such as speech recognition or localization of the direction of a voice.
In the traditional method, sound is collected through a microphone and the sound direction is located by analyzing the sound data. However, in noisy environments the collected sound data contains a great deal of noise, which leads to inaccurate localization of the sound direction.
Disclosure of Invention
In view of the above, it is necessary to provide a sound direction localization processing method, apparatus, system, computer device and storage medium for solving the problem of inaccurate sound direction localization in the conventional method.
A sound direction positioning processing method comprises the following steps:
acquiring voice data collected from an environment;
predicting at least one sound source direction based on the speech data;
collecting images according to at least one predicted sound source direction, and identifying the body state characteristics of the language expression part of the target object from the collected images;
and determining a final sound source direction from the at least one predicted sound source direction according to the matching degree between the posture characteristics and the voice data.
In one embodiment, the voice data is at least two paths of voice data sent from the same sound source; the predicted sound source direction includes a sound emission angle;
predicting at least one sound source direction from the speech data comprises:
determining the phase difference between each path of voice data;
and predicting the corresponding sound production angle of the voice data according to the phase difference.
In one embodiment, obtaining speech data collected from an environment comprises:
acquiring voice data from the environment through a sound acquisition equipment array to obtain at least two paths of voice data sent from the same sound source; the sound collection equipment array comprises at least two sound collection equipment;
the image acquisition according to the predicted at least one sound source direction comprises:
and controlling the image acquisition equipment to acquire images according to the predicted at least one sound source direction in the process of keeping the sound acquisition equipment array to acquire the voice data.
In one embodiment, the predicted at least one sound source direction is at least two; determining a final sound source direction from the predicted at least one sound source direction according to a matching degree between the posture features and the voice data includes:
acquiring predicted direction values corresponding to the predicted sound source directions respectively, wherein a predicted direction value represents the probability that the voice data originates from the corresponding predicted sound source direction;
determining a sound source direction value corresponding to each predicted sound source direction according to the predicted direction value corresponding to the predicted sound source direction and the matching degree corresponding to the image collected according to the predicted sound source direction, wherein a sound source direction value represents the probability that the predicted sound source direction is the final sound source direction; and
the predicted sound source direction corresponding to the maximum sound source direction value is selected as the final sound source direction.
In one embodiment, the target object is a human object; the language expression part is lip; the morphological characteristics of the language expression part are mouth shape characteristics;
identifying, from the captured image, a morphological feature of the linguistic expression portion of the target object includes:
locating a face region of a person object from the acquired image;
identifying a lip region in the face region;
mouth shape features are extracted from the lip region.
In one embodiment, the acquired images are a continuous sequence of images; the extracted mouth shape features are continuous mouth shape features corresponding to the image sequence; determining a final sound source direction from the predicted at least one sound source direction according to a matching degree between the posture features and the voice data includes:
recognizing continuous mouth shape characteristics to obtain a first sentence;
carrying out voice recognition on the voice data to obtain a second statement;
matching the first sentence with the second sentence to obtain the matching degree between the mouth shape characteristic and the voice data;
and determining a final sound source direction from the at least one predicted sound source direction according to the matching degree.
In one embodiment, the image collected in the final sound source direction includes at least two target objects; the matching degree is a first matching degree;
the method further comprises the following steps:
extracting voiceprint characteristic data of the voice data;
searching the voiceprint characteristic data of each target object from the stored voiceprint characteristic data;
matching the extracted voiceprint characteristic data with the searched voiceprint characteristic data to obtain second matching degrees corresponding to the target objects;
and identifying the sound-producing object from the target object according to the first matching degree and the second matching degree.
In one embodiment, identifying the sound-producing object from the target objects according to the first matching degree and the second matching degree comprises:
obtaining a prediction direction value corresponding to the final sound source direction;
determining the sound direction value of each target object according to the predicted direction value, the first matching degree and the second matching degree;
and identifying the target object corresponding to the maximum sound direction value as a sound production object.
In one embodiment, searching the stored voiceprint feature data for each target object includes:
for each target object, extracting extrinsic feature data of each target object from the image collected according to the final sound source direction;
matching the extrinsic feature data with stored extrinsic feature data of the target object;
and acquiring voiceprint characteristic data which are stored correspondingly to the matched extrinsic characteristic data to obtain the voiceprint characteristic data of the target object.
In one embodiment, the target object is a human object; the extrinsic feature data comprises face feature data;
extracting extrinsic feature data of each target object from the image collected according to the final sound source direction comprises the following steps:
locating a face region corresponding to each target object from an image collected according to the final sound source direction;
and carrying out face recognition on the face area to obtain face characteristic data.
In one embodiment, the method further comprises:
and storing, for the sound-producing object, the extracted extrinsic characteristic data and voiceprint characteristic data of the sound-producing object in correspondence with the sound-producing object, so as to update the stored extrinsic characteristic data and voiceprint characteristic data corresponding to the sound-producing object.
A sound direction localization processing apparatus, the apparatus comprising:
the direction prediction module is used for acquiring voice data collected from the environment; predicting at least one sound source direction based on the speech data;
the image acquisition module is used for acquiring images according to the predicted at least one sound source direction and identifying the body state characteristics of the language expression part of the target object from the acquired images;
and the direction positioning module is used for determining the final sound source direction from at least one predicted sound source direction according to the matching degree between the posture characteristics and the voice data.
In one embodiment, the voice data is at least two paths of voice data sent from the same sound source; the predicted sound source direction includes a sound emission angle. The direction prediction module is also used for determining the phase difference between each path of voice data; and predicting the corresponding sound production angle of the voice data according to the phase difference.
In one embodiment, the direction prediction module is further configured to collect voice data from an environment through the sound collection device array to obtain at least two paths of voice data from the same sound source; the sound collection equipment array comprises at least two sound collection equipment; and controlling the image acquisition equipment to acquire images according to the predicted at least one sound source direction in the process of keeping the sound acquisition equipment array to acquire the voice data.
In one embodiment, the predicted at least one sound source direction is at least two. The direction positioning module is also used for acquiring predicted direction values corresponding to the predicted sound source directions respectively; predicting a direction value for representing a probability that the voice data is derived from the direction of the sound source; determining a sound source direction value corresponding to the sound source direction according to a predicted direction value corresponding to the predicted sound source direction and the matching degree corresponding to the image collected according to the predicted sound source direction; a sound source direction value for representing a probability that the predicted sound source direction is the final sound source direction; the predicted sound source direction corresponding to the maximum sound source direction value is selected as the final sound source direction.
In one embodiment, the target object is a human object; the language expression part is lip; the morphological characteristics of the speech expression site are mouth shape characteristics. The direction positioning module is also used for positioning a human face area of the person object from the acquired image; identifying a lip region in the face region; mouth shape features are extracted from the lip region.
In one embodiment, the acquired images are a continuous sequence of images; the extracted mouth shape features are continuous mouth shape features corresponding to the image sequence. The direction positioning module is also used for identifying continuous mouth shape characteristics to obtain a first statement; carrying out voice recognition on the voice data to obtain a second statement; matching the first sentence with the second sentence to obtain the matching degree between the mouth shape characteristic and the voice data; and determining a final sound source direction from the at least one predicted sound source direction according to the matching degree.
In one embodiment, the image collected in the final sound source direction includes at least two target objects; the matching degree is a first matching degree. The device also includes:
the voice object recognition module is used for extracting voiceprint characteristic data of the voice data; searching the voiceprint characteristic data of each target object from the stored voiceprint characteristic data; matching the extracted voiceprint characteristic data with the searched voiceprint characteristic data to obtain second matching degrees corresponding to the target objects; and identifying the sound-producing object from the target objects according to the first matching degree and the second matching degree.
In one embodiment, the sound generating object identifying module is further configured to obtain a predicted direction value corresponding to a final sound source direction; determining the sound direction value of each target object according to the predicted direction value, the first matching degree and the second matching degree; and identifying the target object corresponding to the maximum sound direction value as a sound production object.
In one embodiment, the sound-generating object identification module is further configured to extract, for each target object, extrinsic feature data of each target object from an image collected in a final sound source direction; matching the extrinsic feature data with stored extrinsic feature data of the target object; and acquiring voiceprint characteristic data which are stored correspondingly to the matched extrinsic characteristic data to obtain the voiceprint characteristic data of the target object.
In one embodiment, the target object is a human object; the extrinsic feature data includes face feature data. The sound-producing object recognition module is also used for positioning the face area corresponding to each target object from the image collected according to the final sound source direction; and carrying out face recognition on the face area to obtain face characteristic data.
In one embodiment, the apparatus further comprises:
and the update storage module is used for storing, for the sound-producing object, the extracted extrinsic characteristic data and voiceprint characteristic data of the sound-producing object in correspondence with the sound-producing object, so as to update the stored extrinsic characteristic data and voiceprint characteristic data corresponding to the sound-producing object.
A sound direction localization processing system, the system comprising: a sound collection device and an image collection device;
a sound collection device for acquiring voice data collected from an environment; predicting at least one sound source direction based on the speech data;
the image acquisition equipment is used for carrying out image acquisition according to the predicted at least one sound source direction and identifying the body state characteristics of the language expression part of the target object from the acquired image;
the sound collection device is further configured to determine a final sound source direction from the predicted at least one sound source direction according to a degree of matching between the morphological feature and the speech data.
A computer device comprising a memory and a processor, the memory having stored therein a computer program which, when executed by the processor, causes the processor to perform the steps of the sound direction localization processing method of the embodiments of the present application.
A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, causes the processor to perform the steps in the sound direction localization processing method of the embodiments of the present application.
According to the sound direction positioning processing method, apparatus, system, computer device and storage medium above, at least one sound source direction is predicted for the voice data collected from the environment, image collection is performed according to the predicted at least one sound source direction, and the final sound source direction is determined according to the matching degree between the posture features of the language expression part of the target object in the collected image and the voice data. That is, voice data and image data are combined, and the sound source direction is located using multimodal data. In this way, even in a noisy environment, the posture features of the language expression part in the image can assist in locating the sound source direction, so the positioning accuracy is improved compared with the traditional method of locating the sound direction from the sound data alone.
Drawings
FIG. 1 is a diagram illustrating an exemplary implementation of a method for directional localization processing of a sound;
FIG. 2 is a diagram illustrating an application scenario of the sound direction localization processing method in another embodiment;
FIG. 3 is a flow chart illustrating a method for processing sound direction localization in one embodiment;
FIG. 4 is a schematic diagram of a microphone array in one embodiment;
FIG. 5 is a schematic diagram of an image acquisition device in one embodiment;
FIG. 6 is a flow diagram illustrating the steps of the sound-producing object recognition process in one embodiment;
FIG. 7 is a simplified flowchart of a sound direction localization processing method according to an embodiment;
FIG. 8 is a block diagram of a sound direction localization processing device in one embodiment;
FIG. 9 is a block diagram of a sound direction localization processing device in another embodiment;
FIG. 10 is a block diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Fig. 1 is a diagram illustrating an application scenario of a sound direction localization processing method according to an embodiment. Referring to fig. 1, the application scenario includes a sound capture device 110 and an image capture device 120 that are communicatively coupled. The sound collection device 110 is a device for collecting voice data. The image pickup device 120 is a device for picking up an image. It is understood that the sound collection device 110 or the image collection device 120 may be a smart device having computer processing capabilities. That is, the sound direction localization processing method in the embodiments of the present application is executed by the sound collection apparatus 110 or the image collection apparatus 120. For example, the smart speaker or the smart camera may execute the sound direction positioning processing method in the embodiments of the present application.
The target object may speak, i.e. produce sound, in the environment. The sound collection device 110 may collect speech data from the environment and predict at least one sound source direction from the speech data. By communicating with the image collection device 120, the sound collection device 110 may control the image collection device 120 to perform image collection in the predicted at least one sound source direction while the sound collection device array continues to collect voice data. The sound collection device 110 may control the image collection device 120 to recognize the posture features of the language expression portion of the target object from the collected image. The sound collection device 110 may determine the degree of matching between the posture features and the speech data, and may determine a final sound source direction from the predicted at least one sound source direction according to the matching degree.
It is to be understood that the sound collection device 110 and the image collection device 120 may be general devices without a computer processing function. The sound direction localization processing method in the embodiments of the present application is performed by the computer device 130 that is communicatively connected to the sound collection device 110 and the image collection device 120, respectively. The computer device 130 may be a desktop computer or a mobile terminal, and the mobile terminal may include at least one of a mobile phone, a tablet computer, a notebook computer, a personal digital assistant, a wearable device, and the like. Fig. 2 is a diagram illustrating an application scenario of the sound direction localization processing method in another embodiment.
The target object may speak, i.e. produce sound, in the environment. The computer device 130 may collect voice data from the environment by controlling the sound collection device 110, and may predict at least one sound source direction from the voice data. The computer device 130 may control the image collection device 120 to perform image collection in the predicted at least one sound source direction while the sound collection device array continues to collect voice data. The computer device 130 may identify the posture features of the language expression portion of the target object from the collected image; determine the matching degree between the posture features and the voice data; and determine a final sound source direction from the at least one predicted sound source direction according to the matching degree.
It can be understood that the sound direction positioning processing method in the embodiments of the present application is equivalent to using an artificial intelligence technique to automatically position the sound source direction.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
It can be understood that the sound direction positioning processing method in the embodiments of the present application may be applied to speech processing scenarios such as speech recognition or speech synthesis. Key technologies of Speech Technology are automatic speech recognition (ASR), speech synthesis (TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, and voice is expected to become one of the most promising modes of human-computer interaction.
FIG. 3 is a flowchart illustrating a sound direction localization processing method according to an embodiment. The sound direction positioning processing method in this embodiment may be applied to a computer device; the description below mainly takes the computer device being the sound collection device 110 or the computer device 130 described above as an example. Referring to FIG. 3, the method specifically includes the following steps:
s302, voice data collected from the environment are obtained.
The environment is made up of various natural factors, and may be a real environment or a virtual environment. A real environment is an environment that exists in real life; a virtual environment is an environment obtained by simulating a real environment.
Specifically, the voice data may be data that has been collected in advance, and the computer device may directly acquire the voice data collected from the environment. The computer device may also perform a voice capture process to capture voice data from the environment.
The target object may sound in the environment. The computer device may perform voice detection to detect whether there is voice input in the environment and, if so, collect input voice data from the environment.
The target object is an object that provides a sound source in the environment, that is, an object that emits sound to input voice data. For example, a person speaks in the environment, the computer device collects voice data of the person when speaking, and the person is the target object. It is to be understood that when the environment is a virtual environment, then the target object may be a virtual object, i.e., a virtualized target object.
It will be appreciated that when the computer device itself has the sound collection function, then the computer device may itself collect voice data from the environment. For example, when the computer device is an intelligent sound collection device, it may collect voice data from the environment itself. When the computer equipment does not have the sound collection function, the voice data can be collected from the environment by controlling the sound collection equipment.
In one embodiment, a computer device may collect voice data from an environment through a sound collection device of a single sound collection channel. In another embodiment, the computer device may also acquire the voice data from the environment through the sound acquisition device array with multiple sound acquisition channels to obtain at least two paths of voice data. It is understood that at least two sound collection devices are included in the array of sound collection devices.
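As a concrete illustration (not part of the patent disclosure), the following minimal Python sketch captures one block of multi-channel audio from a sound collection device array using the sounddevice library; the sample rate, channel count and block length are assumed values for illustration only.

    # Minimal sketch: capturing multi-channel audio from a microphone array.
    # The device parameters below are illustrative assumptions.
    import sounddevice as sd

    SAMPLE_RATE = 16000   # Hz, assumed
    NUM_CHANNELS = 4      # number of microphones in the assumed array
    DURATION_S = 1.0      # length of one capture window in seconds

    def capture_block():
        """Record one block of audio; returns an (N, NUM_CHANNELS) float32 array,
        i.e. one path of voice data per sound collection channel."""
        frames = int(SAMPLE_RATE * DURATION_S)
        block = sd.rec(frames, samplerate=SAMPLE_RATE,
                       channels=NUM_CHANNELS, dtype="float32")
        sd.wait()  # block until the recording is finished
        return block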
S304, predicting at least one sound source direction according to the voice data.
The sound source direction refers to a direction from which sound is coming, i.e., a direction from which voice data is generated.
In particular, the computer device may analyze the speech data to predict at least one sound source direction.
It is understood that the computer device may collect a plurality of voice data through a plurality of voice collecting channels, and predict at least one sound source direction by analyzing the plurality of voice data. It is understood that the multi-path voice data refers to at least two paths of voice data. A voice acquisition channel acquires a path of voice data.
In other embodiments, the computer device may also collect voice data of a single channel through a single sound collection channel, and predict at least one sound source direction according to the voice data of the single channel.
In one embodiment, the voice data is at least two paths of voice data sent from the same sound source; the predicted sound source direction includes a sound emission angle. In this embodiment, step S304 includes: determining the phase difference between each path of voice data; and predicting the corresponding sound production angle of the voice data according to the phase difference.
The sound emission angle refers to the angle of the position of the sound-emitting object relative to the sound collection device.
Phase is the position of a point within a wave's cycle at a particular instant in time.
It can be understood that different paths of voice data are collected by different sound collection channels, and the distances between the different sound collection channels and the target object differ, so the paths of voice data collected from the same sound source have different phases. Thus, there is a phase difference between the paths of voice data originating from the same sound source; in other words, the phase difference characterizes the difference in distance between the different sound collection channels and the target object. Since the relative positions of the sound collection channels are known, the sound emission angle corresponding to the voice data can be predicted from the phase difference. The sound emission angle may be a planar sound emission angle, that is, the angle, in the horizontal plane, of the position of the sound-emitting object relative to the sound collection device.
It should be noted that the computer device may calculate the phase differences between the multiple paths of voice data that correspond to the same direction, and then predict the sound emission angle, that is, obtain at least one predicted sound source direction.
In the above embodiment, the sound emission angle, that is, at least one sound source direction, can be predicted quickly and accurately from the phase differences between the different paths of voice data originating from the same sound source.
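For illustration only, the following Python sketch estimates a sound emission angle for a single pair of sound collection channels from the delay between the two paths of voice data (the time-domain counterpart of the phase difference); the microphone spacing, sample rate and the plain cross-correlation delay estimate are assumptions, not the patent's algorithm.

    # Sketch: estimating the sound emission angle of a two-microphone pair from
    # the delay of arrival; parameters below are illustrative assumptions.
    import numpy as np

    SPEED_OF_SOUND = 343.0  # m/s
    MIC_SPACING = 0.05      # m, distance between the two channels (assumed)
    SAMPLE_RATE = 16000     # Hz

    def estimate_angle(ch_left: np.ndarray, ch_right: np.ndarray) -> float:
        """Return the sound emission angle in degrees relative to the array broadside."""
        # Cross-correlate the two paths of voice data; the peak position gives the
        # sample delay between them.
        corr = np.correlate(ch_left, ch_right, mode="full")
        delay_samples = np.argmax(corr) - (len(ch_right) - 1)
        delay_seconds = delay_samples / SAMPLE_RATE
        # Convert the delay into an angle; clip to the physically valid range first.
        sin_theta = np.clip(delay_seconds * SPEED_OF_SOUND / MIC_SPACING, -1.0, 1.0)
        return float(np.degrees(np.arcsin(sin_theta)))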
S306, collecting images according to the predicted at least one sound source direction, and identifying the body state characteristics of the language expression part of the target object from the collected images.
The target object is an object that provides a sound source in the environment, that is, an object that emits sound to input voice data. The language expression part is a part used for expressing language. Posture refers to a pose or appearance. It can be understood that when the target object produces sound, a change in the posture of the language expression part may accompany it; for example, the utterance may be accompanied by a change of the lips, a change of gestures, or a change of body posture. The posture features of the language expression part change with the utterance of the target object to express different languages; that is, if the language expressed by the language expression part of the target object is different, the posture features of the language expression part are different.
It is understood that the language expression portion may be a portion where a change in posture can occur. It is understood that the language may be expressed by a change in pose.
In one embodiment, the verbal expression of a location includes at least one of a hand, a lip, and other body parts capable of undergoing a change in posture. The morphological characteristics of the speech expression part comprise at least one of gesture characteristics, mouth shape characteristics and body posture characteristics.
In one embodiment, the target object may comprise a person object, i.e. the target object may comprise a person. The linguistic expression sites may include lips. It is understood that the voice of the person is finally outputted through the opening and closing of the lips, and therefore, the speech expression portion may include the lips. The morphological characteristics of the linguistic expression site may then include mouth shape characteristics. The mouth shape is the shape of the mouth of a person, and especially refers to the shape of two lips when a certain sound is made in speech science.
It is understood that when the target object is a human object, the language expression part may also include a hand and other body parts that can be posture-changed. It is understood that the target object may express the language by way of gesture changes (i.e. gestures) of the hand, body gesture changes, and the like.
In other embodiments, the target object may also include a robot object or an animal object, etc. The language expression site may include the mouth of a robot or the mouth of an animal subject.
In particular, the computer device may perform image acquisition, i.e. acquiring images located in the direction of the sound source, in accordance with the predicted at least one sound source direction.
It should be noted that, acquiring voice data is a continuous process, and during the image acquisition process, the computer device still acquires voice data from the environment synchronously. That is, speech data is collected from the environment while image collection is performed in accordance with at least one predicted sound source direction. Therefore, the voice data in the embodiments of the present application refers to continuously collected voice data.
It is understood that when the computer device itself has the image capturing function, the image capturing may be performed by the computer device itself in accordance with the predicted at least one sound source direction. When the computer device does not have the image capturing function, the computer device may perform image capturing in the predicted at least one sound source direction by controlling the image capturing device.
It should be noted that the computer device may be a device integrating the image capturing function and the sound capturing function. The computer device may also be a device that has an integrated image capture function or a sound capture function alone. The computer device may also be a device which does not have an integrated image capturing function and sound capturing function by itself, but captures an image by controlling the image capturing device and captures voice data by controlling the sound capturing device.
It will be appreciated that the target object is speaking in the environment, so after predicting at least one sound source direction from the speech data, the computer device will then perform image acquisition for that sound source direction while maintaining the acquisition of the speech data. Therefore, if the target object still speaks, the speaking target object can be captured, and the speaking content can be acquired by collecting voice data.
It is to be understood that the predicted at least one sound source direction may be at least two. Therefore, the computer device may perform image acquisition according to each of the predicted at least one sound source directions.
In particular, the computer device may identify the target object from the acquired image. When the target object is identified, the language expression part of the target object is positioned from the image area corresponding to the target object, and then the posture characteristic of the language expression part is extracted.
It is to be understood that when the target object is identified from the acquired image, step S306 is performed. When the target object is not recognized from the acquired image, it may be determined that the voice data is noise data, and voice detection may be continued to detect whether there is voice input in the environment, and if there is voice input, step S302 and subsequent steps may be re-executed.
In one embodiment, when the acquired images are at least two consecutive images (i.e., a sequence of images), the computer device may then perform step S306 for each acquired image separately.
In one embodiment, at least two target objects may be included in the same acquired image, in which case the computer device may then perform step S306 for each target object. That is, from the captured image, the posture characteristics of the speech expression portion of each target object are recognized. For example, if a plurality of persons are included in one image, the morphological characteristics of the linguistic expression portion of each person can be identified from the image.
In one embodiment, the step S302 of acquiring voice data collected from the environment includes: acquiring voice data from the environment through a sound acquisition equipment array to obtain at least two paths of voice data sent from the same sound source; the array of sound collection devices includes at least two sound collection devices. In this embodiment, the image capturing in the predicted at least one sound source direction in step S306 includes: and in the process of keeping the sound collection equipment array collecting the voice data, controlling the image collection equipment to collect images according to the predicted at least one sound source direction, and synchronously collecting the voice data from the environment through the sound collection equipment array.
The sound source collecting device array refers to an array obtained by arranging at least two sound collecting devices according to a preset rule. Each sound collection device in the sound collection device array is equivalent to a sound collection channel so as to collect voice from the environment and obtain at least two paths of voice data.
In one embodiment, the array of sound collection devices may be an array of microphones. The microphone array includes microphones. It is to be understood that the microphone array may comprise at least two sound collection channels.
It can be understood that the phase difference between the voice data is used to represent the phase difference of the same sound source reaching the sound collection channels at different positions in the array of sound collection devices.
Fig. 4 is a schematic diagram of a microphone array in one embodiment. Referring to FIG. 4, the sound collection device 402 integrates multiple microphones and therefore includes multiple sound collection channels. As can be seen from FIG. 4, for the user 404, there is a phase difference between the voice data collected by the sound collection channels located in area S1 and the voice data collected by the sound collection channels located in area S2, because the distances from the position of the user 404 to areas S1 and S2 are different. It is understood that there may also be a phase difference between the voice data collected by sound collection channels at different locations within the same area.
In the above embodiment, the sound collection device array collects the multiple paths of voice data, so that the phase difference between the multiple paths of voice data has certain regularity, thereby improving the efficiency of predicting the direction of at least one sound source. Moreover, compared with a single sound collection device, the accuracy of collecting the voice data is improved, and the accuracy of predicting the direction of at least one sound source is further improved.
In one embodiment, the image capture device may be a panoramic camera. The panoramic camera is a camera which can take an acquisition point as a center and acquire a panoramic image of 360 degrees around the camera. In other embodiments, the image capturing apparatus may not be a panoramic camera, but may be a single camera capable of rotating by a preset angle (for example, 360 degrees), or at least two cameras capable of rotating. In this way, it may still be sufficient to be able to perform image acquisition in accordance with at least one sound source direction that is arbitrarily predicted. It can be understood that the image is collected by the panoramic camera, the integrity of the image can be improved, and the accuracy of sound source positioning is improved.
FIG. 5 is a schematic diagram of an image acquisition device in one embodiment. Referring to fig. 5, the panoramic camera can acquire images of 360 degrees around, and no dead angle exists in image acquisition.
And S308, determining a final sound source direction from the at least one predicted sound source direction according to the matching degree between the posture characteristics and the voice data.
It can be understood that the morphological characteristics of the speech expression portion vary with the utterance of the target object, i.e., the morphological characteristics of the speech expression portion are different when the target object utters different sounds. Therefore, the morphological feature of the speech expression portion can reflect the sound content of the target object.
Therefore, the computer device can match the posture characteristics of the language expression part with the voice data to judge whether the sound content represented by the posture characteristics of the language expression part is matched with the voice data or not, or judge the similarity between the sound content represented by the posture characteristics of the language expression part and the voice data so as to obtain the matching degree. In one embodiment, the degree of matching may include a success of the matching or a failure of the matching. In one embodiment, the degree of matching may also include a degree of similarity between the morphological feature of the linguistic expression portion and the speech data.
It is understood that, when the speech expression portion is a lip, the above-described processing of matching the morphological feature of the speech expression portion with the speech data corresponds to lip recognition processing. That is, it is equivalent to recognizing lip language and matching the recognized lip language with voice data.
It can be understood that, when the at least one predicted sound source direction is at least two, the body state features of the speech expression portion corresponding to the image collected according to each of the at least one predicted sound source directions may be respectively matched with the speech data, so as to obtain the matching degree corresponding to each of the at least one predicted sound source directions.
In one embodiment, when there is only one predicted at least one sound source direction, the computer device may determine that the predicted at least one sound source direction is a final sound source direction when the matching degree is successful or the similarity is greater than or equal to a preset threshold. It is understood that when the matching degree is a matching failure or the similarity is smaller than a preset threshold, it is determined that the predicted at least one sound source direction does not belong to the final sound source direction. Then, the computer device may then re-perform step S304 based on the matching degree to re-correct the predicted at least one sound source direction.
In one embodiment, when the predicted at least one sound source direction is at least two, the computer device may determine, according to the matching degree, a probability that each predicted at least one sound source direction belongs to the final sound source direction, and further select the predicted at least one sound source direction with the highest probability as the final sound source direction.
It is to be understood that the computer device may determine the final sound source direction from the predicted at least one sound source direction based on the matching degree alone, or may determine the final sound source direction based on the matching degree in combination with other factors.
The sound direction positioning processing method predicts at least one sound source direction from the voice data collected from the environment, performs image collection according to the predicted at least one sound source direction, and determines the final sound source direction according to the matching degree between the posture features of the language expression part of the target object in the collected image and the voice data. That is, voice data and image data are combined, and the sound source direction is located using multimodal data. In this way, even in a noisy environment, the posture features of the language expression part in the image can assist in locating the sound source direction, so the positioning accuracy is improved compared with the traditional method of locating the sound direction from the sound data alone. Compared with the traditional method of using voice alone for localization, the accuracy is improved by 45%, and more than 90% of environmental noise interference is effectively filtered out.
In one embodiment, the predicted at least one sound source direction is at least two; determining a final sound source direction from the predicted at least one sound source direction according to a matching degree between the posture features and the voice data includes: acquiring predicted direction values corresponding to the predicted sound source directions respectively; determining a sound source direction value corresponding to the sound source direction according to a predicted direction value corresponding to the predicted at least one sound source direction and the matching degree corresponding to the image collected according to the predicted at least one sound source direction; the predicted sound source direction corresponding to the maximum sound source direction value is selected as the final sound source direction.
Wherein the prediction direction value is used to characterize a probability that the speech data originates from the predicted direction of the at least one sound source. A sound source direction value for characterizing a probability that the predicted at least one sound source direction is the final sound source direction.
It is to be understood that, in step S304, the computer device may perform sound source direction prediction on the speech data, resulting in a predicted direction value for each predicted sound source direction. The predicted direction value is used to characterize the probability that the speech data originates from that predicted sound source direction. In one embodiment, the computer device may perform sound source direction prediction on the voice data with a DOA (Direction of Arrival) algorithm to obtain a predicted direction value for each predicted sound source direction.
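A hedged sketch of one possible DOA-style computation (an assumption, not the patent's implementation): scan a grid of candidate sound source directions with a simple delay-and-sum beamformer and normalise the per-direction power into predicted direction values, i.e. probabilities that the voice data originates from each direction.

    # Sketch: predicted direction values over a grid of candidate angles for a
    # linear microphone array; array geometry and method are assumptions.
    import numpy as np

    SPEED_OF_SOUND = 343.0
    SAMPLE_RATE = 16000

    def predicted_direction_values(signals: np.ndarray, mic_x: np.ndarray,
                                   angles_deg: np.ndarray) -> np.ndarray:
        """signals: (num_mics, num_samples) array; mic_x: microphone x-positions in
        metres for a linear array; returns one probability per candidate angle."""
        powers = []
        for angle in np.radians(angles_deg):
            # Per-microphone delay (in samples) for a plane wave from this direction.
            delays = np.round(mic_x * np.sin(angle) / SPEED_OF_SOUND * SAMPLE_RATE).astype(int)
            aligned = [np.roll(sig, -d) for sig, d in zip(signals, delays)]
            beam = np.mean(aligned, axis=0)          # delay-and-sum beam
            powers.append(float(np.sum(beam ** 2)))  # steered response power
        powers = np.asarray(powers)
        return powers / powers.sum()                 # normalise into direction probabilities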
It can be understood that the matching degree corresponding to the image collected according to the predicted at least one sound source direction is the matching degree between the morphological characteristics of the language expression part of the target object in the image collected according to the predicted at least one sound source direction and the voice data.
The computer device may determine a sound source direction value corresponding to the sound source direction according to a predicted direction value corresponding to the predicted at least one sound source direction and a matching degree corresponding to an image collected in the predicted at least one sound source direction. Thus, each predicted sound source direction corresponds to a sound source direction value. The computer device may determine therefrom a maximum sound source direction value and take the predicted sound source direction corresponding to the maximum sound source direction value as the final sound source direction.
In one embodiment, the matching degree is a similarity value between the morphological feature of the linguistic expression part and the voice data. The computer device can determine a sound source direction value corresponding to the sound source direction according to a predicted direction value corresponding to the sound source direction and a similarity value corresponding to an image collected according to the predicted sound source direction.
Specifically, the computer device may directly add the predicted direction value and the similarity value to obtain a sound source direction value corresponding to the sound source direction. The computer device may also perform weighted average calculation on the predicted direction value and the similarity value according to preset confidence coefficients, respectively, to obtain a sound source direction value corresponding to the sound source direction.
It can be understood that the preset confidence coefficient may be a confidence coefficient with a priori value, which is obtained by analyzing based on prior data and conforms to a natural law, and is not artificially and randomly specified.
In one embodiment, the linguistic expression site is a lip. It can be understood that the matching degree is the lip recognition result, and is equivalent to the similarity value obtained after the lip recognition processing. The computer device may determine a sound source direction value corresponding to the sound source direction according to the following formula:
Vd = DOA × W1 + Lip(face) × W2;
where Vd is the sound source direction value of a predicted sound source direction; DOA is the predicted direction value of that predicted sound source direction calculated by the DOA algorithm; Lip(face) represents the lip shape recognition result (i.e., the similarity value), where "face" indicates that the face region is first located from the image and the lips are then located for lip shape recognition; and W1 and W2 are the confidence coefficients corresponding to the predicted direction value and the lip shape recognition result, respectively.
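As a sketch of the weighted combination above, the following Python snippet computes Vd for each predicted sound source direction and selects the direction with the maximum value; the confidence coefficients W1 and W2 are illustrative placeholders, since the patent does not give their values.

    # Sketch: weighted fusion of the DOA predicted direction value and the lip
    # recognition similarity. W1/W2 below are assumed example coefficients.
    def sound_source_direction_value(doa_value: float, lip_similarity: float,
                                     w1: float = 0.6, w2: float = 0.4) -> float:
        """Vd = DOA * W1 + Lip(face) * W2 for one predicted sound source direction."""
        return doa_value * w1 + lip_similarity * w2

    def final_sound_source_direction(candidates):
        """candidates: iterable of (direction, doa_value, lip_similarity) tuples.
        Returns the predicted direction with the maximum sound source direction value."""
        return max(candidates,
                   key=lambda c: sound_source_direction_value(c[1], c[2]))[0]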
In the above embodiment, the sound source direction values corresponding to the predicted sound source directions are determined according to the predicted direction value corresponding to the predicted at least one sound source direction and the matching degree corresponding to the image collected according to the predicted at least one sound source direction, which is equivalent to that after the at least one sound source direction is predicted through voice recognition, the voice recognition result and the image recognition result are combined to perform advanced prediction, so that the accuracy of the finally determined sound source direction can be improved.
In one embodiment, the target object is a human object; the language expression part is lip; the morphological characteristics of the speech expression site are mouth shape characteristics. In this embodiment, identifying the morphological characteristics of the language expression portion of the target object from the acquired image includes: locating a face region of a person object from the acquired image; identifying a lip region in the face region; mouth shape features are extracted from the lip region.
It will be appreciated that other objects than human objects may be included in the captured image. Therefore, the computer device can locate the face region of the human object from the acquired image. Further, a lip region is identified from the located face region.
Specifically, the computer device may match objects in the captured image with a preset human object template, thereby locating a face region of the human object from the captured image. The computer device may also perform a convolution process on the image to locate the face region therefrom.
In one embodiment, the computer device may determine a corresponding region from the located face region according to a preset lip position, and use the determined region as the lip region. In another embodiment, the computer device may also convolve the face region image with a pre-trained convolutional neural network to identify the lip region therefrom.
Further, the computer device may extract a lip shape feature from the identified lip region.
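A minimal sketch of this face-to-lip pipeline is given below. It assumes an OpenCV Haar cascade as the face detector and takes the lip region from a preset position (the lower-middle part of the face box), as in the embodiment above; the flattened, normalized lip patch is only a stand-in for the mouth shape feature a trained model would produce.

    # Hypothetical sketch: face region -> lip region -> crude mouth-shape feature.
    import cv2
    import numpy as np

    face_detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def mouth_shape_feature(image_bgr):
        gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
        faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) == 0:
            return None                      # no person object in this direction
        x, y, w, h = faces[0]                # located face region
        # Preset lip position: lower-middle part of the face box (an assumption).
        lip = gray[y + int(0.65 * h): y + h, x + int(0.25 * w): x + int(0.75 * w)]
        lip = cv2.resize(lip, (32, 16))      # normalize the lip region size
        return lip.flatten().astype(np.float32) / 255.0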
It can be understood that when the captured image is a single image, the computer device may determine whether the human subject is speaking from the mouth shape feature of that image. In one embodiment, when the human subject is determined to be speaking, the mouth shape feature may be regarded as matching the voice data. In another embodiment, when the human subject is determined to be speaking, the computer device may further identify the person through face recognition and, based on that identity, look up the voiceprint feature data stored for the person. The computer device may then extract voiceprint feature data from the voice data and match it against the stored voiceprint feature data; if the two match, the mouth shape feature of the person object is determined to match the voice data.
In one embodiment, the acquired images are a continuous sequence of images; the extracted mouth shape features are continuous mouth shape features corresponding to the image sequence. In this embodiment, the step S308 of determining a final sound source direction from the at least one predicted sound source direction according to the matching degree between the posture characteristic and the speech data includes: recognizing continuous mouth shape characteristics to obtain a first sentence; carrying out voice recognition on the voice data to obtain a second statement; matching the first sentence with the second sentence to obtain the matching degree between the mouth shape characteristic and the voice data; and determining a final sound source direction from the at least one predicted sound source direction according to the matching degree.
Specifically, the computer device may perform sentence recognition processing (i.e., perform lip language recognition processing) on the continuous mouth shape features, resulting in a first sentence. The computer device may perform speech recognition processing on the speech data to obtain a second sentence. The computer device can perform matching processing on the first sentence and the second sentence to obtain the matching degree between the mouth shape characteristics and the voice data.
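The following sketch illustrates one possible form of this matching step, assuming both recognition results are plain text; an edit-distance based ratio stands in for whatever sentence matcher an implementation actually uses.

    # Hypothetical sketch: match the lip-reading sentence against the ASR sentence.
    import difflib

    def sentence_match_degree(lip_sentence: str, asr_sentence: str) -> float:
        """Similarity in [0, 1] between the first and second sentences."""
        return difflib.SequenceMatcher(None, lip_sentence, asr_sentence).ratio()

    first_sentence = "turn on the living room light"      # from continuous mouth shapes
    second_sentence = "turn on the living room lights"    # from speech recognition
    print(sentence_match_degree(first_sentence, second_sentence))  # close to 1.0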
In the above embodiment, the face region of the human object is located from the acquired image, the lip region is identified within the face region, and mouth shape features are extracted from the lip region. The lip features produced while speaking can thus be extracted accurately, which improves the accuracy of using the lip shape to assist in locating the sound production direction.
In one embodiment, the image collected in the final sound source direction includes at least two target objects, and the matching degree between the body state feature of the language expression part and the voice data is referred to as a first matching degree. As shown in fig. 6, the method further includes a step of identifying the sound-emitting object, which specifically includes the following steps:
s602, extracting the voiceprint feature data of the voice data.
Here, the voiceprint feature refers to a feature that is unique in the sound of each object and is different from other objects. The final sound source direction is the direction from which the sound is emitted that is finally determined. The sound-emitting object is the object emitting sound.
It can be understood that when the image collected in the final sound source direction includes only one target object, that target object may be directly determined as the sound-emitting object. When the image includes at least two target objects, the sound-emitting object can be identified from among them through the processing of steps S602 to S608.
S604, searching the voiceprint characteristic data of each target object from the stored voiceprint characteristic data.
The stored voiceprint feature data are pre-stored voiceprint feature data.
It can be understood that the stored voiceprint feature data may not contain voiceprint feature data for every target object. Therefore, when the voiceprint feature data of each target object is found among the stored voiceprint feature data, steps S606 to S608 are executed; when it is not found, no further processing is required.
In one embodiment, the computer device may first identify the identity of each target object through extrinsic feature recognition, and then search the voiceprint feature data of each target object from the stored voiceprint feature data according to the determined identity.
In one embodiment, the step S604 of searching the stored voiceprint feature data for each target object includes: for each target object, extracting extrinsic feature data of each target object from the image collected according to the final sound source direction; matching the extrinsic feature data with stored extrinsic feature data of the target object; and acquiring voiceprint characteristic data which are stored correspondingly to the matched extrinsic characteristic data to obtain the voiceprint characteristic data of each target object.
Specifically, the computer device may perform extrinsic feature recognition on each target object in the image collected in the final sound source direction, extract the extrinsic feature data of each target object, match the extracted extrinsic feature data with pre-stored extrinsic feature data, and obtain the voiceprint feature data stored in correspondence with the matched extrinsic feature data, thereby obtaining the voiceprint feature data of each target object. This is equivalent to looking up a person's pre-stored voiceprint features through the stored correspondence between extrinsic feature data and voiceprint data, so the voiceprint features can be determined quickly and accurately.
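A sketch of this lookup is given below, assuming face features (extrinsic features) and voiceprints are stored as fixed-length vectors; cosine similarity and the 0.8 threshold are illustrative choices, not values specified by this disclosure.

    # Hypothetical sketch: find a person's pre-stored voiceprint via face matching.
    import numpy as np

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def lookup_voiceprint(face_feature, enrolled, threshold=0.8):
        """enrolled: list of records {"face": vector, "voiceprint": vector}."""
        if not enrolled:
            return None
        best = max(enrolled, key=lambda e: cosine_similarity(face_feature, e["face"]))
        if cosine_similarity(face_feature, best["face"]) < threshold:
            return None                      # no stored voiceprint for this person
        return best["voiceprint"]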
In one embodiment, the target object is a human object and the extrinsic feature data includes face feature data. Extracting the extrinsic feature data of each target object from the image collected in the final sound source direction includes: locating the face region corresponding to each target object from the image collected in the final sound source direction; and performing face recognition on the face region to obtain face feature data.
And S606, matching the extracted voiceprint feature data with the searched voiceprint feature data to obtain second matching degrees corresponding to the target objects.
Specifically, for each target object, the computer device may perform similarity matching between the voiceprint feature data extracted for the target object and each searched voiceprint feature data to obtain a second matching degree corresponding to the target object.
It is to be understood that the second degree of match may comprise a similarity value. The second degree of matching may also include a success of the matching or a failure of the matching.
And S608, identifying the sound-producing object from the target objects according to the first matching degree and the second matching degree.
The first matching degree is the matching degree between the body state characteristics of the language expression part of the target object in the image collected according to the final sound source direction and the voice data. It will be appreciated that each target object in the image acquired in the final sound source direction has a corresponding first degree of matching.
In particular, the computer device may identify the sound-emitting object from the target object in combination with the first degree of matching and the second degree of matching.
It can be understood that the computer device may identify the sound-emitting object from the target objects based on the first matching degree and the second matching degree alone. The computer device may also bring in factors beyond these two matching degrees, such as the predicted direction value of the predicted sound source direction, when analyzing which target object is the sound-emitting object.
In this embodiment, multimodal data such as image recognition and voiceprint recognition results are combined to assist in locating the sound-emitting object, which improves the accuracy of locating it.
In one embodiment, identifying the sound-emitting object from the target object according to the first matching degree and the second matching degree comprises: obtaining a prediction direction value corresponding to the final sound source direction; determining the sound direction value of each target object according to the predicted direction value, the first matching degree and the second matching degree; and identifying the target object corresponding to the maximum sound direction value as a sound production object.
It can be understood that, because different sound-emitting objects influence the prediction of the sound source direction, the final sound source direction in turn carries some information about which object is emitting the sound. Therefore, the sound source direction can be taken into account when recognizing the sound-emitting object.
Specifically, the computer device may obtain a predicted direction value corresponding to the final sound source direction. For the same target object, the computer device may directly sum the predicted direction value and the first matching degree and the second matching degree corresponding to the target object to obtain the sound direction value of each target object. The computer device may also perform weighted average calculation on the predicted direction value and the first matching degree and the second matching degree corresponding to the target object according to preset confidence coefficients respectively for the same target object, so as to obtain the sound direction value of each target object.
It can be understood that when the sound-emitting object is identified, multi-dimensional, multimodal factors such as the predicted direction value of the sound source direction, the image recognition result and the voiceprint recognition result are all considered, which improves the accuracy of locating the sound-emitting object.
In one embodiment, the target object is a human object; the language expression site is the lips. The computer device may determine a sound direction value of the target object according to the following formula:
Pd = DOA × W1 + Lip(face) × W2 + Voiceprint × W3
Pd is the sound direction value of each target object. DOA is the predicted direction value of the final sound source direction, calculated by the DOA algorithm. Lip(face) represents the lip recognition result (i.e., the first matching degree); the face in brackets indicates that the face region is first located from the image and the lip is then located within it for lip recognition. Voiceprint represents the voiceprint recognition result (i.e., the second matching degree). W1, W2 and W3 are the confidence coefficients of the predicted direction value, the lip recognition result and the voiceprint recognition result, respectively.
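Purely as an illustration, the sketch below scores each target object in the final-direction image with this formula and returns the object with the largest Pd; the weights and the per-object lip and voiceprint scores are placeholder assumptions.

    # Hypothetical sketch: pick the sound-emitting object by the largest Pd.
    def identify_speaker(objects, doa_value, w1=0.2, w2=0.4, w3=0.4):
        """objects: list of {"id": str, "lip": float, "voiceprint": float}."""
        def pd(obj):
            return doa_value * w1 + obj["lip"] * w2 + obj["voiceprint"] * w3
        return max(objects, key=pd)["id"]

    candidates = [
        {"id": "person_a", "lip": 0.9, "voiceprint": 0.8},   # actually speaking
        {"id": "person_b", "lip": 0.2, "voiceprint": 0.3},
    ]
    print(identify_speaker(candidates, doa_value=0.7))        # -> person_a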
In one embodiment, the first matching degree may include a first similarity value between the morphological feature of the linguistic expression portion of the target object and the speech data. The second matching degree may include a second similarity value between the extracted voiceprint feature data for the target object and each searched voiceprint feature data.
It can be understood that this amounts to identifying the sound-emitting object by jointly considering the sound source direction, the voiceprint feature and the lip recognition result, which improves the accuracy of locating the sound-emitting object. In other embodiments, the computer device may further bring in face features: for example, a third matching degree between the face feature data extracted for a target object and the stored face feature data is computed, and the sound-emitting object is then identified from the target objects according to the predicted direction value, the first matching degree (lip recognition result), the second matching degree (voiceprint recognition result) and the third matching degree (face recognition result). This further improves the accuracy of locating the sound-emitting object.
It will be appreciated that, in one embodiment, the method further includes: extracting voiceprint feature data from the voice data; searching the stored voiceprint feature data for voiceprint feature data that matches it, and then looking up the extrinsic feature data stored in correspondence with the matched voiceprint feature data; performing extrinsic feature recognition on each target object in the image collected in the final sound source direction and extracting the corresponding extrinsic feature data; and matching each piece of extracted extrinsic feature data with the found extrinsic feature data to obtain an extrinsic feature matching degree. The computer device may then identify the sound-emitting object from the target objects according to the predicted direction value corresponding to the final sound source direction and the extrinsic feature matching degree.
Further, the computer device may also look up the stored voiceprint feature data of each target object according to the extracted extrinsic feature data of that object, and match the found voiceprint feature data with the voiceprint feature data extracted from the voice data to obtain a voiceprint recognition result.
Similarly, the computer device may identify the sound-generating object from the target object according to the predicted direction value corresponding to the final sound source direction, the body state identification result (for example, lip shape identification result) of the language expression portion, the degree of matching of external features, and the voiceprint identification result.
It should be noted that, as the different embodiments above show, extrinsic feature recognition and voiceprint recognition may be performed in either order; both orders achieve the required processing, and the order of processing is not limited here.
In one embodiment, the method further comprises: and aiming at the sound-producing object, storing the extracted external characteristic data and the extracted voiceprint characteristic data of the sound-producing object corresponding to the sound-producing object so as to update the stored external characteristic data and voiceprint characteristic data corresponding to the sound-producing object.
It can be understood that storing the newly extracted extrinsic feature data and voiceprint feature data in correspondence with the sound-emitting object labels them with that object. This helps correct the previously stored voiceprint feature data and extrinsic feature data, so that the stored data can be used to locate the sound-emitting object more quickly and accurately.
In one embodiment, the computer device may further perform speech enhancement on the voice data in the final sound source direction, attenuate noise from other unrelated directions, and save the enhanced voice data corresponding to the final sound source direction. The enhanced voice data can then be used for speech recognition processing, which improves the accuracy of speech recognition.
In one embodiment, the computer device may use a beamforming algorithm (a signal processing technique that controls the directionality with which an array transmits or receives signals) to enhance the voice data in the final sound source direction while filtering out noise from uncorrelated directions.
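One classical beamformer that fits this description is delay-and-sum. The sketch below steers a uniform linear microphone array toward a given angle; the 5 cm spacing, 16 kHz sampling rate, speed of sound and the sign convention of the steering delay are all assumptions of the sketch, not parameters taken from this disclosure.

    # Hypothetical sketch: delay-and-sum beamforming toward the final direction.
    import numpy as np

    def delay_and_sum(signals, angle_deg, mic_spacing=0.05, fs=16000, c=343.0):
        """signals: (num_mics, num_samples) array of synchronized mic recordings."""
        num_mics, num_samples = signals.shape
        freqs = np.fft.rfftfreq(num_samples, d=1.0 / fs)
        out = np.zeros(len(freqs), dtype=complex)
        for m in range(num_mics):
            # Plane-wave delay of mic m relative to mic 0 for a source at angle_deg.
            tau = m * mic_spacing * np.sin(np.deg2rad(angle_deg)) / c
            out += np.fft.rfft(signals[m]) * np.exp(2j * np.pi * freqs * tau)
        return np.fft.irfft(out / num_mics, n=num_samples)   # enhanced mono signal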
FIG. 7 is a simplified flowchart of the sound direction localization processing method in one embodiment. Referring to fig. 7, after the system is turned on, the microphone first detects whether there is voice input in the environment through a VAD (Voice Activity Detection) algorithm. If there is voice input, the DOA algorithm module is started to calculate the sound source angle and predict at least one sound source direction. Meanwhile, the camera synchronously captures images in the predicted sound source directions and determines whether a person is present; if so, the final sound source direction is located through lip recognition, and the speaker is located in combination with voiceprint recognition. The microphone array then starts the speech enhancement algorithm module to enhance the voice in the speaker's direction while attenuating noise in other unrelated directions, and records the final voice enhanced in the sound source direction, on which speech recognition can subsequently be performed. Face recognition can then be performed to verify the recognized speaker, and the face feature data and voiceprint feature data are stored in correspondence. Finally, the enhanced voice data, the voiceprint feature data, the face feature data, the lip feature data (i.e., the mouth shape feature data) and the final sound source direction can be stored in correspondence, and the stored data can assist in correcting the previously stored voiceprint, face and lip feature data.
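To make the flow of fig. 7 concrete, the skeleton below strings the stages together. Every lower-level function in it is a hypothetical placeholder rather than an API defined by this disclosure, and the trivial stubs exist only so the control flow can be executed end to end.

    # Hypothetical skeleton of the FIG. 7 pipeline (stubs stand in for real modules).
    def detect_voice(frames):            return True           # VAD stand-in
    def estimate_doa(frames):            return [30.0, 120.0]  # predicted directions (deg)
    def capture_image(direction):        return f"image@{direction}"
    def lip_match_degree(image, frames): return 0.9 if "120" in image else 0.1
    def identify_speaker(image, frames): return "person_a"     # lip + voiceprint fusion
    def enhance_towards(frames, angle):  return frames         # beamforming stand-in

    def locate_and_enhance(frames):
        if not detect_voice(frames):                           # 1. any speech at all?
            return None
        candidates = estimate_doa(frames)                      # 2. predict source directions
        scored = [(d, lip_match_degree(capture_image(d), frames)) for d in candidates]
        final_direction = max(scored, key=lambda item: item[1])[0]   # 3. lip-assisted pick
        speaker = identify_speaker(capture_image(final_direction), frames)  # 4. who spoke
        enhanced = enhance_towards(frames, final_direction)    # 5. enhance and record
        return final_direction, speaker, enhanced

    print(locate_and_enhance(frames="raw microphone frames"))  # -> (120.0, 'person_a', ...)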
As shown in fig. 8, in one embodiment, a sound direction positioning processing apparatus 800 is provided and disposed on a computer device. The computer device may be a terminal or a server. The apparatus 800 comprises: a direction prediction module 802, an image acquisition module 804, and a direction location module 806, wherein:
a direction prediction module 802 for obtaining voice data collected from an environment; at least one sound source direction is predicted based on the speech data.
And the image acquisition module 804 is used for acquiring images according to the predicted at least one sound source direction and identifying the body state characteristics of the language expression part of the target object from the acquired images.
And a direction positioning module 806, configured to determine a final sound source direction from the predicted at least one sound source direction according to a matching degree between the body state feature and the voice data.
In one embodiment, the voice data is at least two paths of voice data sent from the same sound source; the predicted sound source direction includes a sound emission angle. The direction prediction module 802 is further configured to determine a phase difference between each path of voice data; and predicting the corresponding sound production angle of the voice data according to the phase difference.
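For a two-microphone far-field model, the phase difference and the sound emission angle are related by delta_phi = 2*pi*f*d*sin(theta)/c. The sketch below solves this relation for theta; the 0.1 m microphone spacing and the single-frequency assumption are illustrative.

    # Hypothetical sketch: sound emission angle from an inter-microphone phase difference.
    import numpy as np

    def angle_from_phase_difference(phase_diff_rad, freq_hz, mic_spacing=0.1, c=343.0):
        """Plane-wave model: phase_diff = 2*pi*f*d*sin(theta)/c, solved for theta."""
        sin_theta = phase_diff_rad * c / (2.0 * np.pi * freq_hz * mic_spacing)
        sin_theta = np.clip(sin_theta, -1.0, 1.0)   # guard against noisy estimates
        return float(np.degrees(np.arcsin(sin_theta)))

    print(angle_from_phase_difference(phase_diff_rad=1.0, freq_hz=1000.0))  # about 33 degrees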
In one embodiment, the direction prediction module 802 is further configured to collect voice data from an environment through a sound collection device array, so as to obtain at least two paths of voice data from the same sound source; the sound collection equipment array comprises at least two sound collection equipment; and controlling the image acquisition equipment to acquire images according to the predicted at least one sound source direction in the process of keeping the sound acquisition equipment array to acquire the voice data.
In one embodiment, the predicted at least one sound source direction is at least two sound source directions. The direction positioning module 806 is further configured to obtain the predicted direction value corresponding to each predicted sound source direction, where the predicted direction value represents the probability that the voice data originates from that sound source direction; determine the sound source direction value corresponding to each predicted sound source direction according to its predicted direction value and the matching degree of the image collected in that direction, where the sound source direction value represents the probability that the predicted sound source direction is the final sound source direction; and select the predicted sound source direction corresponding to the maximum sound source direction value as the final sound source direction.
In one embodiment, the target object is a human object; the language expression part is lip; the morphological characteristics of the speech expression site are mouth shape characteristics. The direction positioning module 806 is further configured to position a face region of the person object from the acquired image; identifying a lip region in the face region; mouth shape features are extracted from the lip region.
In one embodiment, the acquired images are a continuous sequence of images; the extracted mouth shape features are continuous mouth shape features corresponding to the image sequence. The direction positioning module 806 is further configured to identify continuous mouth shape features to obtain a first sentence; carrying out voice recognition on the voice data to obtain a second statement; matching the first sentence with the second sentence to obtain the matching degree between the mouth shape characteristic and the voice data; and determining a final sound source direction from the at least one predicted sound source direction according to the matching degree.
In one embodiment, the image collected in the final sound source direction includes at least two target objects; the matching degree is a first matching degree. The device also includes:
a sound object recognition module 808, configured to extract voiceprint feature data of the voice data; searching the voiceprint characteristic data of each target object from the stored voiceprint characteristic data; matching the extracted voiceprint characteristic data with the searched voiceprint characteristic data to obtain second matching degrees corresponding to the target objects; and identifying the sound-producing object from the target objects according to the first matching degree and the second matching degree.
In one embodiment, the sound object identification module 808 is further configured to obtain a predicted direction value corresponding to the final sound source direction; determining the sound direction value of each target object according to the predicted direction value, the first matching degree and the second matching degree; and identifying the target object corresponding to the maximum sound direction value as a sound production object.
In one embodiment, the sound object recognition module 808 is further configured to, for each target object, extract the extrinsic feature data of that target object from the image collected in the final sound source direction, match the extrinsic feature data with stored extrinsic feature data, and obtain the voiceprint feature data stored in correspondence with the matched extrinsic feature data, thereby obtaining the voiceprint feature data of the target object.
In one embodiment, the target object is a human object and the extrinsic feature data includes face feature data. The sound object recognition module 808 is further configured to locate the face region corresponding to each target object from the image collected in the final sound source direction, and perform face recognition on the face region to obtain face feature data.
As shown in fig. 9, in one embodiment, the apparatus 800 further comprises: a spoken object identification module 808 and an update storage module 810; wherein:
the update storage module 810 is configured to, for the sound-generating object, store the extracted extrinsic feature data and the extracted voiceprint feature data of the sound-generating object corresponding to the sound-generating object, so as to update the stored extrinsic feature data and voiceprint feature data corresponding to the sound-generating object.
The sound direction positioning processing device predicts at least one sound source direction by voice data collected from the environment, and performs image collection according to the predicted at least one sound source direction; and determining the final sound source direction according to the matching degree between the body state characteristics of the language expression part of the target object in the acquired image and the voice data. That is, the voice data and the image data are combined, and the sound source direction is localized by the multimodal data. Therefore, even in a noisy environment, the sound source direction can be positioned in an auxiliary mode through the body state characteristics of the language expression part in the image, and therefore compared with the traditional method that the sound direction is positioned only according to the sound data, the positioning accuracy is improved.
FIG. 10 is a block diagram of a computer device in one embodiment. Referring to fig. 10, the computer device may be the terminal 110 of fig. 1. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device may store an operating system and a computer program. The computer program, when executed, causes a processor to perform a sound direction localization processing method. The processor of the computer device is used for providing calculation and control capability and supporting the operation of the whole computer device. The internal memory may store a computer program that, when executed by the processor, causes the processor to perform a sound direction localization processing method. The network interface of the computer device is used for network communication. The display screen of the computer device can be a liquid crystal display screen or an electronic ink display screen. The input device of the computer equipment can be a touch layer covered on a display screen, a key, a track ball or a touch pad arranged on a terminal shell, an external keyboard, a touch pad or a mouse and the like. The computer device may be a personal computer, a smart speaker, a mobile terminal or a vehicle-mounted device, and the mobile terminal includes at least one of a mobile phone, a tablet computer, a personal digital assistant or a wearable device.
Those skilled in the art will appreciate that the architecture shown in fig. 10 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, the sound direction localization processing apparatus provided in the present application may be implemented in the form of a computer program that is executable on a computer device as shown in fig. 10, and a nonvolatile storage medium of the computer device may store the respective program modules constituting the sound direction localization processing apparatus. Such as the direction prediction module 802, the image acquisition module 804, and the direction location module 806 shown in fig. 8. The computer program composed of the respective program modules is for causing the computer apparatus to execute the steps in the sound direction localization processing method of the respective embodiments of the present application described in the present specification.
For example, the computer device may obtain voice data collected from the environment through the direction prediction module 802 in the sound direction localization processing apparatus 800 shown in fig. 8; at least one sound source direction is predicted based on the speech data. The computer device may perform image acquisition in the predicted at least one sound source direction through the image acquisition module 804, and recognize a body state feature of the language expression part of the target object from the acquired image. The computer device may determine a final sound source direction from the predicted at least one sound source direction according to a degree of matching between the body state feature and the voice data through the direction localization module 806.
In one embodiment, there is provided a sound direction localization processing system, the system comprising: a sound collection device and an image collection device; wherein:
a sound collection device for acquiring voice data collected from an environment; at least one sound source direction is predicted based on the speech data.
And the image acquisition equipment is used for carrying out image acquisition according to the predicted at least one sound source direction and identifying the body state characteristics of the language expression part of the target object from the acquired image.
The sound collection device is further configured to determine a final sound source direction from the predicted at least one sound source direction according to a degree of matching between the morphological feature and the speech data.
In one embodiment, the voice data is at least two paths of voice data sent from the same sound source; the predicted sound source direction includes a sound emission angle; the sound acquisition equipment is also used for determining the phase difference between each path of voice data; and predicting the corresponding sound production angle of the voice data according to the phase difference.
In one embodiment, the sound collection device is an array of sound collection devices; wherein:
the sound acquisition equipment array is used for acquiring voice data from the environment to obtain at least two paths of voice data sent from the same sound source; the sound collection equipment array comprises at least two sound collection equipment; and controlling the image acquisition equipment to acquire images according to the predicted at least one sound source direction in the process of keeping the sound acquisition equipment array to acquire the voice data.
In one embodiment, the predicted at least one sound source direction is at least two sound source directions. The sound collection equipment is further used for obtaining the predicted direction value corresponding to each predicted sound source direction, where the predicted direction value represents the probability that the voice data originates from that sound source direction; determining the sound source direction value corresponding to each predicted sound source direction according to its predicted direction value and the matching degree of the image collected in that direction, where the sound source direction value represents the probability that the predicted sound source direction is the final sound source direction; and selecting the predicted sound source direction corresponding to the maximum sound source direction value as the final sound source direction.
In one embodiment, the target object is a human object; the language expression part is lip; the morphological characteristics of the language expression part are mouth shape characteristics; the image acquisition equipment is also used for positioning a human face area of the person object from the acquired image; identifying a lip region in the face region; mouth shape features are extracted from the lip region.
In one embodiment, the acquired images are a continuous sequence of images; the extracted mouth shape features are continuous mouth shape features corresponding to the image sequence; the sound acquisition equipment is also used for identifying continuous mouth shape characteristics to obtain a first sentence; carrying out voice recognition on the voice data to obtain a second statement; matching the first sentence with the second sentence to obtain the matching degree between the mouth shape characteristic and the voice data; and determining a final sound source direction from the at least one predicted sound source direction according to the matching degree.
In one embodiment, the image collected in the final sound source direction includes at least two target objects; the matching degree is a first matching degree; the sound acquisition equipment is also used for extracting voiceprint characteristic data of the voice data; searching the voiceprint characteristic data of each target object from the stored voiceprint characteristic data; matching the extracted voiceprint characteristic data with the searched voiceprint characteristic data to obtain second matching degrees corresponding to the target objects; and identifying the sound-producing object from the target objects according to the first matching degree and the second matching degree.
In one embodiment, the sound collection device is further configured to obtain a predicted direction value corresponding to the final sound source direction; determining the sound direction value of each target object according to the predicted direction value and the first matching degree and the second matching degree; and identifying the target object corresponding to the maximum sound direction value as a sound production object.
In one embodiment, the sound collection device is further configured to notify the image collection device to extract, for each target object, extrinsic feature data of each target object from the image collected in the final sound source direction; matching the extrinsic feature data with stored extrinsic feature data of the target object; the sound acquisition equipment is further used for acquiring voiceprint feature data stored corresponding to the matched extrinsic feature data according to the matched extrinsic feature data provided by the image acquisition equipment to obtain the voiceprint feature data of the target object.
In one embodiment, the target object is a human object and the extrinsic feature data comprises face feature data. The image acquisition equipment is further used for locating the face region corresponding to each target object from the image collected in the final sound source direction, and performing face recognition on the face region to obtain face feature data.
In one embodiment, the sound collection device is further configured to store, for the sound-generating object, the extracted extrinsic feature data and voiceprint feature data of the sound-generating object corresponding to the sound-generating object, so as to update the stored extrinsic feature data and voiceprint feature data corresponding to the sound-generating object.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the sound direction localization processing method described above. Here, the steps of the sound direction localization processing method may be the steps in the sound direction localization processing method of each of the above embodiments.
In one embodiment, a computer-readable storage medium is provided, in which a computer program is stored, which, when executed by a processor, causes the processor to carry out the steps of the sound direction localization processing method described above. Here, the steps of the sound direction localization processing method may be the steps in the sound direction localization processing method of each of the above embodiments.
It should be noted that "first" and "second" in the embodiments of the present application are used only for distinction, and are not used for limitation in terms of size, order, dependency, and the like.
It should be understood that the steps in the embodiments of the present application are not necessarily performed in the order indicated by the step numbers. Unless explicitly stated otherwise, these steps are not strictly limited to the order shown and described and may be performed in other orders. Moreover, at least some of the steps in various embodiments may include multiple sub-steps or stages, which are not necessarily completed at the same time but may be performed at different times, and which need not be performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware related to instructions of a computer program, and the program can be stored in a non-volatile computer readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above examples only show some embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (15)

1. A sound direction localization processing method, the method comprising:
acquiring voice data collected from an environment;
predicting at least one sound source direction based on the speech data;
collecting images according to at least one predicted sound source direction, and identifying the body state characteristics of the language expression part of the target object from the collected images;
and determining a final sound source direction from the at least one predicted sound source direction according to the matching degree between the posture characteristic and the voice data.
2. The method according to claim 1, wherein the voice data is at least two paths of voice data sent from the same sound source; the predicted sound source direction includes a sound emission angle;
the predicting at least one sound source direction according to the voice data includes:
determining phase differences among the voice data of each path;
and predicting the corresponding sound production angle of the voice data according to the phase difference.
3. The method of claim 1, wherein the obtaining speech data collected from an environment comprises:
acquiring voice data from the environment through a sound acquisition equipment array to obtain at least two paths of voice data sent from the same sound source; the sound collection equipment array comprises at least two sound collection equipment;
the image acquisition according to the predicted at least one sound source direction comprises:
and controlling the image acquisition equipment to acquire images according to the predicted at least one sound source direction in the process of keeping the sound acquisition equipment array to acquire the voice data.
4. The method of claim 1, wherein the predicted at least one sound source direction is at least two; the determining a final sound source direction from the predicted at least one sound source direction according to the matching degree between the posture features and the voice data includes:
acquiring predicted direction values corresponding to the predicted sound source directions respectively; the predicted direction value is used for representing the probability that the voice data is derived from the sound source direction;
determining a sound source direction value corresponding to the sound source direction according to a predicted direction value corresponding to the predicted sound source direction and the matching degree corresponding to an image collected according to the predicted sound source direction; the sound source direction value is used for representing the probability that the predicted sound source direction is the final sound source direction;
the predicted sound source direction corresponding to the maximum sound source direction value is selected as the final sound source direction.
5. The method of claim 1, wherein the target object is a human subject; the language expression part is lip; the morphological characteristics of the language expression part are mouth shape characteristics;
the recognizing the posture characteristics of the language expression part of the target object from the acquired image comprises the following steps:
locating a face region of a person object from the acquired image;
identifying a lip region in the face region;
mouth shape features are extracted from the lip region.
6. The method of claim 5, wherein the acquired images are a continuous sequence of images; the extracted mouth shape features are continuous mouth shape features corresponding to the image sequence;
the determining a final sound source direction from the predicted at least one sound source direction according to the matching degree between the posture features and the voice data includes:
recognizing the continuous mouth shape features to obtain a first sentence;
performing voice recognition on the voice data to obtain a second sentence;
matching the first sentence with the second sentence to obtain the matching degree between the mouth shape characteristic and the voice data;
and determining a final sound source direction from the at least one predicted sound source direction according to the matching degree.
7. The method according to any one of claims 1 to 6, wherein at least two target objects are included in the image acquired in the final sound source direction; the matching degree is a first matching degree;
the method further comprises the following steps:
extracting voiceprint feature data of the voice data;
searching the voiceprint characteristic data of each target object from the stored voiceprint characteristic data;
matching the extracted voiceprint characteristic data with the searched voiceprint characteristic data to obtain second matching degrees corresponding to the target objects;
and identifying a sound-producing object from the target object according to the first matching degree and the second matching degree.
8. The method of claim 7, wherein the identifying the sound-generating object from the target object according to the first degree of match and the second degree of match comprises:
obtaining a prediction direction value corresponding to the final sound source direction;
determining the sound direction value of each target object according to the predicted direction value, the first matching degree and the second matching degree;
and identifying the target object corresponding to the maximum sound direction value as a sound production object.
9. The method of claim 7, wherein the searching the stored voiceprint feature data for each target object comprises:
extracting extrinsic feature data of each target object from the image collected according to the final sound source direction aiming at each target object;
matching the extrinsic feature data with stored extrinsic feature data of the target object;
and acquiring voiceprint characteristic data which are stored correspondingly to the matched extrinsic characteristic data to obtain the voiceprint characteristic data of the target object.
10. The method of claim 9, wherein the target object is a human subject; the extrinsic feature data comprises face feature data;
the extracting extrinsic feature data of each target object from the image collected according to the final sound source direction includes:
positioning a face area corresponding to each target object from the image collected according to the final sound source direction;
and carrying out face recognition on the face area to obtain face characteristic data.
11. The method of claim 9, further comprising:
and aiming at the sound-producing object, storing the extracted extrinsic feature data and the extracted voiceprint feature data of the sound-producing object corresponding to the sound-producing object so as to update the stored extrinsic feature data and voiceprint feature data corresponding to the sound-producing object.
12. A sound direction localization processing apparatus, characterized in that the apparatus comprises:
the direction prediction module is used for acquiring voice data collected from the environment; predicting at least one sound source direction based on the speech data;
the image acquisition module is used for acquiring images according to the predicted at least one sound source direction and identifying the body state characteristics of the language expression part of the target object from the acquired images;
and the direction positioning module is used for determining the final sound source direction from at least one predicted sound source direction according to the matching degree between the posture characteristic and the voice data.
13. A sound direction localization processing system, the system comprising: a sound collection device and an image collection device;
the sound acquisition equipment is used for acquiring voice data acquired from the environment; predicting at least one sound source direction based on the speech data;
the image acquisition equipment is used for acquiring images according to the predicted at least one sound source direction and identifying the body state characteristics of the language expression part of the target object from the acquired images;
the sound collection device is further configured to determine a final sound source direction from the predicted at least one sound source direction according to a matching degree between the posture characteristic and the voice data.
14. A computer arrangement comprising a memory and a processor, the memory having stored therein a computer program which, when executed by the processor, causes the processor to carry out the steps of the method of any one of claims 1 to 11.
15. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 11.
CN201911311585.8A 2019-12-18 2019-12-18 Sound direction positioning processing method, device, system, computer equipment and storage medium Active CN111048113B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911311585.8A CN111048113B (en) 2019-12-18 2019-12-18 Sound direction positioning processing method, device, system, computer equipment and storage medium


Publications (2)

Publication Number Publication Date
CN111048113A true CN111048113A (en) 2020-04-21
CN111048113B CN111048113B (en) 2023-07-28

Family

ID=70237169

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911311585.8A Active CN111048113B (en) 2019-12-18 2019-12-18 Sound direction positioning processing method, device, system, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111048113B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111933136A (en) * 2020-08-18 2020-11-13 南京奥拓电子科技有限公司 Auxiliary voice recognition control method and device
CN112188363A (en) * 2020-09-11 2021-01-05 北京猎户星空科技有限公司 Audio playing control method and device, electronic equipment and readable storage medium
CN112562664A (en) * 2020-11-27 2021-03-26 上海仙塔智能科技有限公司 Sound adjusting method, system, vehicle and computer storage medium
CN112788278A (en) * 2020-12-30 2021-05-11 北京百度网讯科技有限公司 Video stream generation method, device, equipment and storage medium
CN113593572A (en) * 2021-08-03 2021-11-02 深圳地平线机器人科技有限公司 Method and apparatus for performing sound zone localization in spatial region, device and medium
CN114242072A (en) * 2021-12-21 2022-03-25 上海帝图信息科技有限公司 Voice recognition system for intelligent robot
CN115174959A (en) * 2022-06-21 2022-10-11 咪咕文化科技有限公司 Video 3D sound effect setting method and device
CN116030562A (en) * 2022-11-17 2023-04-28 北京声智科技有限公司 Data processing method, device, equipment and medium
EP4207186A4 (en) * 2020-09-30 2024-01-24 Huawei Tech Co Ltd Signal processing method and electronic device
CN111933136B (en) * 2020-08-18 2024-05-10 南京奥拓电子科技有限公司 Auxiliary voice recognition control method and device

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101452529A (en) * 2007-12-07 2009-06-10 索尼株式会社 Information processing apparatus and information processing method, and computer program
US20120035927A1 (en) * 2010-08-09 2012-02-09 Keiichi Yamada Information Processing Apparatus, Information Processing Method, and Program
CN103902963A (en) * 2012-12-28 2014-07-02 联想(北京)有限公司 Method and electronic equipment for recognizing orientation and identification
CN104598796A (en) * 2015-01-30 2015-05-06 科大讯飞股份有限公司 Method and system for identifying identity
US20160134785A1 (en) * 2014-11-10 2016-05-12 Echostar Technologies L.L.C. Video and audio processing based multimedia synchronization system and method of creating the same
CN105957521A (en) * 2016-02-29 2016-09-21 青岛克路德机器人有限公司 Voice and image composite interaction execution method and system for robot
RU174044U1 (en) * 2017-05-29 2017-09-27 Общество с ограниченной ответственностью ЛЕКСИ (ООО ЛЕКСИ) AUDIO-VISUAL MULTI-CHANNEL VOICE DETECTOR
CN107633627A (en) * 2017-09-22 2018-01-26 深圳怡化电脑股份有限公司 One kind is without card withdrawal method, apparatus, equipment and storage medium
CN108877787A (en) * 2018-06-29 2018-11-23 北京智能管家科技有限公司 Audio recognition method, device, server and storage medium
CN109644303A (en) * 2016-08-29 2019-04-16 Groove X 株式会社 Identify the autonomous humanoid robot of behavior of Sounnd source direction
CN109697976A (en) * 2018-12-14 2019-04-30 北京葡萄智学科技有限公司 A kind of pronunciation recognition methods and device
CN110210196A (en) * 2019-05-08 2019-09-06 北京地平线机器人技术研发有限公司 Identity identifying method and device
KR20190106921A (en) * 2019-08-30 2019-09-18 엘지전자 주식회사 Communication robot and method for operating the same
CN110276259A (en) * 2019-05-21 2019-09-24 平安科技(深圳)有限公司 Lip reading recognition methods, device, computer equipment and storage medium
CN110545396A (en) * 2019-08-30 2019-12-06 上海依图信息技术有限公司 Voice recognition method and device based on positioning and denoising


Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111933136A (en) * 2020-08-18 2020-11-13 南京奥拓电子科技有限公司 Auxiliary voice recognition control method and device
CN111933136B (en) * 2020-08-18 2024-05-10 南京奥拓电子科技有限公司 Auxiliary voice recognition control method and device
CN112188363A (en) * 2020-09-11 2021-01-05 北京猎户星空科技有限公司 Audio playing control method and device, electronic equipment and readable storage medium
EP4207186A4 (en) * 2020-09-30 2024-01-24 Huawei Tech Co Ltd Signal processing method and electronic device
CN112562664A (en) * 2020-11-27 2021-03-26 上海仙塔智能科技有限公司 Sound adjusting method, system, vehicle and computer storage medium
CN112788278A (en) * 2020-12-30 2021-05-11 北京百度网讯科技有限公司 Video stream generation method, device, equipment and storage medium
CN112788278B (en) * 2020-12-30 2023-04-07 北京百度网讯科技有限公司 Video stream generation method, device, equipment and storage medium
CN113593572A (en) * 2021-08-03 2021-11-02 深圳地平线机器人科技有限公司 Method and apparatus for performing sound zone localization in spatial region, device and medium
CN114242072A (en) * 2021-12-21 2022-03-25 上海帝图信息科技有限公司 Voice recognition system for intelligent robot
CN115174959A (en) * 2022-06-21 2022-10-11 咪咕文化科技有限公司 Video 3D sound effect setting method and device
CN115174959B (en) * 2022-06-21 2024-01-30 咪咕文化科技有限公司 Video 3D sound effect setting method and device
CN116030562A (en) * 2022-11-17 2023-04-28 北京声智科技有限公司 Data processing method, device, equipment and medium

Also Published As

Publication number Publication date
CN111048113B (en) 2023-07-28

Similar Documents

Publication Publication Date Title
CN111048113B (en) Sound direction positioning processing method, device, system, computer equipment and storage medium
JP6938784B2 (en) Object identification method and its computer equipment and computer equipment readable storage medium
US10621991B2 (en) Joint neural network for speaker recognition
CN107799126B (en) Voice endpoint detection method and device based on supervised machine learning
CN110082723B (en) Sound source positioning method, device, equipment and storage medium
CN110310623B (en) Sample generation method, model training method, device, medium, and electronic apparatus
WO2016150001A1 (en) Speech recognition method, device and computer storage medium
CN112037791B (en) Conference summary transcription method, apparatus and storage medium
CN110837758B (en) Keyword input method and device and electronic equipment
US11790900B2 (en) System and method for audio-visual multi-speaker speech separation with location-based selection
CN110941992B (en) Smile expression detection method and device, computer equipment and storage medium
KR20210052036A (en) Apparatus with convolutional neural network for obtaining multiple intent and method therof
CN114141230A (en) Electronic device, and voice recognition method and medium thereof
KR20210044475A (en) Apparatus and method for determining object indicated by pronoun
CN114556469A (en) Data processing method and device, electronic equipment and storage medium
CN109065026B (en) Recording control method and device
CN110728993A (en) Voice change identification method and electronic equipment
CN110322893A (en) Voice data processing method, device, computer equipment and storage medium
KR20210066774A (en) Method and Apparatus for Distinguishing User based on Multimodal
JP6916130B2 (en) Speaker estimation method and speaker estimation device
KR101171047B1 (en) Robot system having voice and image recognition function, and recognition method thereof
CN113707149A (en) Audio processing method and device
KR20240000474A (en) Keyword spotting method based on neural network
JP2013005195A (en) Information processing system
CN113077802A (en) Information processing method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40021520

Country of ref document: HK

GR01 Patent grant
GR01 Patent grant