CN110730378A - Information processing method and system - Google Patents

Information processing method and system

Info

Publication number
CN110730378A
CN110730378A
Authority
CN
China
Prior art keywords
information
target
scene
image
target scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911059546.3A
Other languages
Chinese (zh)
Inventor
董碧涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd filed Critical Lenovo Beijing Ltd
Priority to CN201911059546.3A priority Critical patent/CN110730378A/en
Publication of CN110730378A publication Critical patent/CN110730378A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439 Processing of audio elementary streams
    • H04N21/4394 Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The application discloses an information processing method and system. In response to a target scene generating audio information that satisfies a target condition, information about the target scene is collected using a first identification method to obtain azimuth information; a target space matching the azimuth information is then identified using a second identification method, and the target object that generated the audio information is determined; finally, multimedia information matching the target object is output, so that at least one object in the target scene can view the multimedia information of the target object. The target in the target scene is thereby located and accurately identified, the multimedia information of the identified target is made available to the objects in the target scene, and the experience of those objects is improved.

Description

Information processing method and system
Technical Field
The present application relates to the field of information processing technologies, and in particular, to an information processing method and system.
Background
With the application and development of internet technology, in scenes where multiple participants communicate or hold a work discussion, images of the participants are collected by an image collection device. However, in an existing scene such as a conference scene, a participant cannot obtain in real time the specific image information of an object that needs attention — for example, the facial expression of the conference speaker — so the participant's experience is poor.
Disclosure of Invention
In view of this, the present application provides the following technical solutions:
an information processing method comprising:
in response to a target scene generating audio information that satisfies a target condition, collecting information about the target scene using a first identification method to obtain azimuth information, wherein the target scene comprises at least one object and the azimuth information is associated with the object that generated the audio information;
identifying a target space matched with the azimuth information by using a second identification method, and determining a target object generating the audio information, wherein the target space comprises at least one object;
and outputting the multimedia information matched with the target object so that at least one object in the target scene can know the multimedia information of the target object.
Optionally, the method further comprises:
in response to a target scene not producing audio information satisfying the target condition, generating multimedia information matching the target scene;
and outputting the multimedia information matched with the target scene.
Optionally, the first recognition method represents a pickup system recognition method, where the pickup system includes a first pickup device and a second pickup device, where the acquiring information of the target scene by using the first recognition method to obtain the azimuth information includes:
acquiring audio information of the target scene by using the first sound pickup device and the second sound pickup device respectively to obtain first time information and second time information;
and calculating to obtain azimuth information according to the first time information and the second time information.
Optionally, the first identification method represents a position identification method, and acquiring information of the target scene by using the first identification method to obtain the azimuth information includes:
determining a first position of the target scene according to the audio information;
acquiring distance information between the first position and a preset reference position;
and calculating to obtain azimuth information by using the distance information.
Optionally, the first identification method represents an image recognition method, and acquiring information about the target scene by using the first identification method to obtain the azimuth information includes:
capturing an image of an object of the target scene to obtain a captured image;
in response to a change in a biometric feature of a first object in the captured image, obtaining, from the captured image matching the first object, the azimuth information of the first object.
Optionally, the identifying, by using a second identification method, the target space matching the azimuth information to determine the target object that generated the audio information includes:
obtaining a captured image of at least one object of the target space matching the azimuth information;
performing feature recognition on the captured image and determining a captured sub-image that satisfies a target feature condition;
and determining, from the captured sub-image, the target object that generated the audio information.
Optionally, the method further comprises:
and generating, by using the azimuth information, a control instruction for image capture, the control instruction being used to control the image capture device to capture an image of the target space matching the azimuth information, so that the image capture device outputs an image including the target object.
Optionally, the outputting the multimedia information matched with the target object includes:
and if the audio time length generated by the target object is greater than a time length threshold value, outputting multimedia information matched with the target object.
Optionally, the outputting the multimedia information matched with the target object includes:
acquiring identification information matched with the target object;
combining the information to be output corresponding to the target object with the identification information to obtain multimedia information matched with the target object;
and outputting the multimedia information.
An information processing system comprising:
an azimuth acquisition device, configured to respond to the target scene generating audio information that satisfies a target condition by collecting information about the target scene using a first identification method to obtain azimuth information, wherein the target scene comprises at least one object and the azimuth information is associated with the object that generated the audio information;
a target recognition device for recognizing a target space matched with the azimuth information by using a second recognition method, and determining a target object for generating the audio information, wherein the target space comprises at least one object;
and the information output equipment is used for outputting the multimedia information matched with the target object so that at least one object in the target scene can know the multimedia information of the target object.
The application discloses an information processing method and system. In response to a target scene generating audio information that satisfies a target condition, information about the target scene is collected using a first identification method to obtain azimuth information; a target space matching the azimuth information is then identified using a second identification method, and the target object that generated the audio information is determined; finally, multimedia information matching the target object is output, so that at least one object in the target scene can view the multimedia information of the target object. The target in the target scene is thereby located and accurately identified, the multimedia information of the identified target is made available to the objects in the target scene, and the experience of those objects is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed for the description of the embodiments are briefly introduced below. The drawings described below are merely embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart illustrating an information processing method provided in an embodiment of the present application;
fig. 2 is a schematic diagram illustrating a sound pickup system according to an embodiment of the present application;
fig. 3 is a schematic structural diagram illustrating another sound pickup system provided in an embodiment of the present application;
fig. 4 is a schematic flowchart illustrating a method for obtaining orientation information according to an embodiment of the present application;
FIG. 5 is a flowchart illustrating a method for determining a target object according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram illustrating a display of multimedia information provided by an embodiment of the present application;
fig. 7 shows a schematic structural diagram of an information processing system according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present application.
An information processing method is provided in an embodiment of the present application, and referring to fig. 1, a flowchart of an information processing method of an embodiment of the present application is shown, where the method may include the following steps:
s101, responding to the target scene to generate audio information meeting the target condition, and acquiring information of the target scene by using a first identification method to obtain azimuth information.
The target scene comprises at least one object that can perceive changes of things in the target scene, i.e., an object with visual perception. For example, if the target scene is a conference scene, the objects in the target scene are the conference participants. Because more than one piece of audio may be produced in the target scene, a target condition is set for selecting audio information, so that the audio of the target scene is monitored and the wanted audio can be obtained accurately, avoiding interference from irrelevant audio. The target condition may characterize a volume condition of the audio, e.g., a set volume threshold; it may characterize a timbre condition, e.g., a condition representing adult timbre; or it may characterize audio features set according to the object to be monitored.
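As an illustrative sketch (not part of the patent — the dBFS scale, threshold value, and function names are all assumptions), a volume-based target condition could be checked like this:

```python
import math

def rms_dbfs(samples):
    """RMS level of a block of float samples in [-1.0, 1.0], in dBFS."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(max(rms, 1e-12))  # floor avoids log(0)

def meets_target_condition(samples, threshold_dbfs=-30.0):
    """True if the captured block is loud enough to count as
    'audio information satisfying the target condition'."""
    return rms_dbfs(samples) >= threshold_dbfs
```

A timbre- or feature-based target condition would replace the level check with a classifier, but the gating structure stays the same.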
When audio information satisfying the target condition is detected in the target scene, information about the target scene is collected using the first identification method to obtain azimuth information. Since the azimuth information is collected at the moment the audio information is generated, the first identification method may, on the one hand, use a sound pickup system to locate the source within the target scene; on the other hand, after the audio information is obtained, the azimuth information may be determined from the distance relation between the probable position where the audio originated and a preset reference point. In other words, in this embodiment the azimuth information may be collected either by a sound pickup system capable of recognizing sound, or from distance information collected by a sensor capable of measuring distance. The specific determination of the azimuth information is explained in the following embodiments of the present application.
And S102, identifying the target space matched with the azimuth information by using a second identification method, and determining a target object generating the audio information.
The first identification method identifies the azimuth information after the target scene information has been collected. The second identification method re-identifies the target space corresponding to that azimuth information, mainly recognizing specific feature information. The target space corresponding to the azimuth information characterizes a region determined from the azimuth information and comprises at least one object.
In this embodiment, the target space can be determined directly from the azimuth information. For example, if the azimuth information represents a centre origin and a radius, a target circle can be determined from it, and the space corresponding to that target circle can be used as the target space for the next identification step. Further, still taking the target circle as an example, a space larger or smaller than the target circle may also be chosen as the target space. Using a space larger than the target circle expands the recognition range and avoids missed detections; using a space smaller than the target circle filters out interfering ranges, such as regions occupied by static objects, so that recognition remains accurate while recognition resources are not wasted.
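A minimal sketch of the target-circle idea, assuming 2-D coordinates and an illustrative scale factor for enlarging or shrinking the circle:

```python
import math

def in_target_space(point, center, radius, scale=1.0):
    """True if `point` lies inside the (scaled) target circle.

    scale > 1 enlarges the search region (avoids missed detections);
    scale < 1 shrinks it (filters out e.g. static-object regions).
    """
    dx = point[0] - center[0]
    dy = point[1] - center[1]
    return math.hypot(dx, dy) <= radius * scale
```

Azimuth information given as boundary coordinates would instead be tested with a point-in-polygon check; the role of the predicate is the same.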
In this embodiment, the target space is determined by using the position information as the center origin and the relevant information of the radius, but the target space may be determined according to other types of position information, for example, the position information may include boundary coordinates of the area, and then the target space is determined according to the boundary coordinates. Also, the target space in the embodiment of the present application refers to a spatial range including at least one object to be recognized.
Recognizing the target space with the second identification method mainly means recognizing the objects in the target space: the feature information of each object in the target space must be determined in order to decide accurately whether it is the target object that generated the audio information. The decision may be based on image features of the objects in the target space or on their biological features.
Taking a conference scene as an example: to identify the speaker, the speaker's azimuth information is first determined by the first identification method, giving a target space that includes the speaker; the second identification method then performs facial-expression recognition on the persons in that target space to find the person whose expression matches speaking, i.e., the target object.
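A hedged sketch of this two-stage filtering — the field names and the speaking-score threshold are illustrative assumptions, with the score standing in for a real expression-recognition model:

```python
def find_speaker(detections, in_space, speaking_threshold=0.5):
    """Among detected persons, keep those inside the target space and
    return the one whose expression score best matches 'speaking'.

    `detections`: list of dicts {"id", "pos", "speaking_score"}
    `in_space`:   predicate on a position (the first-stage result)
    """
    candidates = [d for d in detections if in_space(d["pos"])]
    candidates = [d for d in candidates
                  if d["speaking_score"] >= speaking_threshold]
    if not candidates:
        return None
    return max(candidates, key=lambda d: d["speaking_score"])
```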
And S103, outputting the multimedia information matched with the target object.
Multimedia information is media information in the form of text, images, video, sound, or animation, and generally refers to information obtained by storage and retrieval techniques, in particular digital information in a computer. In the present application, multimedia information may be information in formats such as images and video of the target object, or combined information of text, images, audio, and the like. After the multimedia information is output, at least one object in the target scene can view the multimedia information of the target object.
For example, suppose the current target scene is a lecture scene with multiple participants, and a sound is detected whose decibel value exceeds a preset threshold. The first identification method then collects the scene information — that is, determines the azimuth information from the audio — giving the approximate position of the person producing the sound; the second identification method further recognizes the objects within that approximate range and finds the speaker; finally, multimedia information corresponding to the speaker is output, for example a video showing the speaker's expression while speaking, so that the participants on site can follow the speaker's expression in real time. The participants can thus better understand the current speech, grasp the speaker's intent, and experience the speech content matched with the speaker.
According to the information processing method disclosed in this embodiment of the application, in response to a target scene generating audio information that satisfies a target condition, information about the target scene is collected using the first identification method to obtain azimuth information; the target space matching the azimuth information is identified using the second identification method, and the target object that generated the audio information is determined; multimedia information matching the target object is then output, so that at least one object in the target scene can view the multimedia information of the target object. The target in the target scene is thereby located and accurately identified, its multimedia information is made available to the objects in the scene, and their experience is improved.
In order to facilitate understanding of the embodiments of the present application, the first recognition method and the second recognition method used in the embodiments of the present application will be specifically described below.
The first identification method may be a pickup-system identification method. Sound pickup is the process of collecting sound; a pickup system usually comprises several pickup devices, the most common being the microphone (MIC). To obtain a more accurate pickup result, a microphone array is used: an array of omnidirectional microphones arranged at different spatial positions according to a shape rule, which spatially samples sound signals propagating in space, so that the collected signals carry spatial position information.
When the first identification method is a pickup-system identification method, and the pickup system includes a first pickup device and a second pickup device, obtaining the azimuth information with the first identification method in this embodiment specifically includes:
s201, respectively utilizing a first sound pickup device and a second sound pickup device to acquire audio information of a target scene, and acquiring first time information and second time information;
s202, calculating to obtain azimuth information according to the first time information and the second time information.
It should be noted that although the pickup system in the above embodiment includes a first pickup device and a second pickup device, this does not mean the system contains only two pickup devices; their number can be set flexibly according to the needs of the scene. Specifically, the first pickup device may stand for pickup devices in the same area, e.g., two devices arranged in the horizontal direction; correspondingly, the second pickup device may stand for several devices arranged in the vertical direction.
Taking a microphone array as the pickup device, the azimuth information is obtained with a localization algorithm based on time-delay estimation. The method first estimates the time differences with which the sound-source signal (i.e., the signal corresponding to the audio information in this embodiment) reaches the different microphones; from the geometry of the array and the positions of the microphones, a hyperboloid can be obtained for each pair, and the sound-source position is found by solving the resulting set of nonlinear hyperboloid equations.
Therefore, when the first pickup device and the second pickup device collect the audio information in this embodiment, the obtained first time information and second time information respectively represent the delay times with which the sound is picked up by the different devices in the pickup system.
Referring to fig. 2, a schematic diagram of a sound pickup system provided by an embodiment of the present application is shown. The system comprises three microphone elements, where a, b, and c denote the sound-source observation angles of the different elements, R1, R2, and R3 denote the distances from the elements to the sound source, and d1 and d2 denote the distances from element 1 to element 2 and from element 2 to element 3. After audio information satisfying the target condition is acquired, the pickup system obtains the initial position information of the audio, and the microphone elements are then used to determine the azimuth information more precisely. The delay information can be obtained by time-delay estimation: for example, the sound-pressure signal collected by each microphone is Fourier-transformed to obtain the power spectrum of each element's received signal; cross-spectrum time-delay estimation between the power spectra of elements 1, 2, and 3 then yields the signal delay t12 from element 1 to element 2 and the delay t23 from element 2 to element 3. Combining t12 and t23 with the inter-element distances d1 and d2 determines the steering vector for beamforming; the middle element is chosen as the origin of the observation coordinates, and a conventional beamforming algorithm gives the observation angle b from the array to the target sound source.
The target distance of the sound source can then be calculated from the delay information and beamforming quantities such as the auto-spectrum power, thereby localizing the sound source. For the specific sound-source localization algorithm, reference may be made to the common time-delay-estimation localization algorithm, which is not described in detail in this application.
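As a simplified stand-in for the cross-spectrum delay estimation described above, a plain time-domain cross-correlation can estimate the inter-element delay. This is a sketch for illustration only; practical systems use FFT-based generalized cross-correlation for efficiency and noise robustness:

```python
def estimate_delay(sig_a, sig_b, fs):
    """Estimate the delay (seconds) of sig_b relative to sig_a by
    locating the peak of their cross-correlation.

    sig_a, sig_b: equal-length lists of samples; fs: sample rate (Hz).
    A positive result means sig_b lags sig_a.
    """
    n = len(sig_a)
    best_lag, best_val = 0, float("-inf")
    for lag in range(-n + 1, n):
        s = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                s += sig_a[i] * sig_b[j]
        if s > best_val:
            best_lag, best_val = lag, s
    return best_lag / fs
```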
Referring to fig. 3, a schematic structural diagram of another pickup system in the embodiment of the present application is shown. The system comprises 5 microphones, Mic0, Mic1, Mic2, Mic3, and Mic4, where Mic1, Mic2, and Mic0 are fixed at a distance a from one another (a preferred value of a, obtained in advance by experiment, may be greater than or equal to 3 cm). When a person speaks, the pickup system calculates the angles between Mic0, Mic1, Mic2 and the speaker from the delays with which the different microphones pick up the sound; once the angles are known, the perpendicular distance d from the speaker to the plane of Mic1 and Mic2 can be found. Similarly, the perpendicular distance to the plane of Mic3 and Mic4 can be calculated from the angles of Mic0, Mic3, and Mic4 to the speaker. In this way, the pickup system can preliminarily determine the sound source that probably produced the audio information. The specific calculation follows the description in the embodiment above.
The embodiments corresponding to fig. 2 and fig. 3 show how azimuth information is obtained by a pickup system. In this embodiment of the application, the first identification method may instead be a position identification method; correspondingly, collecting information about the target scene with the first identification method to obtain the azimuth information — see fig. 4, a flow diagram of an azimuth-information acquisition method provided in this embodiment — may include the following steps:
s301, determining a first position of a target scene according to the audio information;
s302, collecting distance information between the first position and a preset reference position;
and S303, calculating to obtain azimuth information by using the distance information.
After the audio information is obtained, an initial position where it was probably generated can be taken as the first position; the first position may be the position of an object in the target scene. The distance information between the first position and the reference position may be calculated using the position of a TOF (Time of Flight) sensor as the reference position: the sensor emits modulated near-infrared light, which is reflected when it meets the object at the first position; the sensor then converts the time difference or phase difference between emission and reflection into the distance of the object, giving the distance between the sensor and the object. The azimuth information thus obtained represents the distance between the object and the sensor and can be used to locate, preliminarily, the object producing the audio information.
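For pulse-based TOF measurement, the distance conversion described above reduces to d = c * t / 2 (the light travels to the object and back); a minimal sketch:

```python
SPEED_OF_LIGHT = 299_792_458.0  # m/s

def tof_distance(round_trip_seconds):
    """Distance (metres) to the reflecting object, from the round-trip
    time of the emitted light: d = c * t / 2."""
    return SPEED_OF_LIGHT * round_trip_seconds / 2.0
```

Phase-based TOF sensors measure a phase shift of the modulation instead of a raw round trip, but the same divide-by-two geometry applies.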
In another possible implementation, the distance information between the first position and a preset reference position — here the position of the image capturing device — may be collected by the image capturing device itself. The device may include one or more multi-lens (e.g., binocular, trinocular, etc.) cameras, which can determine the angular orientation and distance of a given object in space relative to the device from the captured optical images.
In some target scenes, since the audio information is produced by a living being such as a person, a biometric feature of that person changes as the audio information is produced. For example, in a class or lecture scenario requiring interaction, a speaker may change motion characteristics when speaking, e.g., from sitting to standing. In these scenarios, the embodiment of the present application may implement, via an image capturing device, the process of obtaining the azimuth information with the first identification method; the process may include the following steps:
S401: capturing images of the objects in the target scene to obtain captured images;
S402: in response to a change in a biometric feature of a first object in the captured images, obtaining orientation information of the first object from the captured images matching the first object.
In this process, the image capture device may include a camera, which may be capable of 360-degree rotation, and a processor. Image acquisition is performed in real time, so a corresponding captured image is retained for each moment. When the processor detects that the biometric feature of some object in the captured images, say a first object, has changed, the captured images matching that first object are recorded, and the feature information in those images is then analyzed to obtain the orientation information of the first object.
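The real-time tracking described above can be sketched as a per-object record that notes the frame at which the observed state changes; the `ObjectTrack` structure and state labels below are illustrative assumptions, since the application does not prescribe a data model:

```python
from dataclasses import dataclass, field

@dataclass
class ObjectTrack:
    object_id: int
    last_state: str = ""                # e.g. "sitting" or "standing"
    change_frames: list = field(default_factory=list)

def update_track(track: ObjectTrack, frame_index: int, state: str) -> None:
    # Record the frame index at which the object's observed biometric
    # state changes; the first observation only seeds last_state.
    if track.last_state and state != track.last_state:
        track.change_frames.append(frame_index)
    track.last_state = state
```

The recorded frames then identify which captured images match the first object and should be analyzed for orientation.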
Specifically, the biometric features are features of the objects in the target scene and may include limb features, head features, and the like. A change in a biometric feature may be characterized as the first object changing from a first state to a second state, the state change being realized through the change in the feature; for example, a head turn or a change of posture by an object in the target scene may be judged to be a change in that object's biometric features.
Since more than one object may exhibit a biometric change, this embodiment may also set a target condition on the change, for example a condition on the magnitude of the change or on the time interval over which it occurs. For example, in a conference scenario, if a participant stands up because of an uncomfortable seat and sits back down immediately after adjusting it, that participant will not be recognized as the first object: the time from sitting to standing to sitting again is too short to satisfy the target condition for a biometric change.
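The target condition can be sketched as a simple predicate over the magnitude and duration of the change; the 0.5 magnitude floor and 2-second hold below are assumed illustrative thresholds, not values from this application:

```python
def meets_target_condition(change_magnitude: float,
                           hold_duration_s: float,
                           min_magnitude: float = 0.5,
                           min_hold_s: float = 2.0) -> bool:
    # A biometric change qualifies only if it is both large enough and
    # sustained long enough; a participant who stands and immediately
    # sits back down fails the duration test.
    return change_magnitude >= min_magnitude and hold_duration_s >= min_hold_s
```

The seat-adjustment example in the text corresponds to a large magnitude but a sub-threshold hold time, so it is rejected.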
After the first object whose biometric feature has changed is determined, captured images of that object are acquired, and the image features they contain are analyzed to obtain the object's orientation information. The positional relationship between the target object and a reference position can be derived from feature information that encodes the object's coordinates in the captured image, such as seat coordinate information. For example, the orientation information may be determined from the spatial relationship between the first object and the position of the conference host, i.e., the angle and distance of the first object relative to the host may be obtained.
The above embodiments describe different methods of obtaining the orientation information, but the embodiments of the present application are not limited to them. The orientation information of an object producing audio information may also be obtained from a trigger instruction issued by that object, for example an instruction to switch on a microphone: the instruction triggers acquisition of the microphone's position, from which the object's orientation information is then determined.
The purpose of obtaining the orientation information of the object that may produce the audio information is to identify and locate that object more accurately. Accordingly, an embodiment of the present application further includes a method of identifying, by a second identification method, the target space matching the orientation information so as to determine the target object producing the audio information. Referring to fig. 5, which shows a flowchart of a method for determining a target object according to an embodiment of the present application, the method may include the following steps:
S501: acquiring a captured image of at least one object in the target space matching the orientation information;
S502: performing feature recognition on the captured image and determining a captured sub-image satisfying a target feature condition;
S503: determining, from the captured sub-image, the target object producing the audio information.
To identify the object accurately, a target space needs to be determined after the orientation information of the possible audio source is obtained. The target space is determined from the orientation information; its determination process is explained in the embodiment of fig. 1 and is not repeated in this embodiment. Since the target space contains at least one object, further image feature recognition is required: by recognizing facial feature information of the objects in the image, a captured sub-image can be obtained that satisfies a target feature condition characterizing a user's facial features while the target audio is produced, such as a change in the mouth.
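As one hedged illustration of the target feature condition (mouth movement while speaking), the sketch below scores each detected face by the variance of a per-frame mouth-openness signal and picks the most active one; the variance floor and the scoring scheme are assumptions, since the application leaves the concrete recognition method open:

```python
def find_speaker(mouth_series: dict):
    # mouth_series maps object id -> list of per-frame mouth-openness
    # scores (e.g. normalized lip distance from a face detector).
    # A speaking face shows sustained mouth movement, i.e. high variance;
    # a silent face barely moves, so its variance stays near zero.
    MIN_VARIANCE = 1e-3  # assumed floor, not a value from this application
    best_id, best_var = None, MIN_VARIANCE
    for object_id, series in mouth_series.items():
        mean = sum(series) / len(series)
        var = sum((v - mean) ** 2 for v in series) / len(series)
        if var >= best_var:
            best_id, best_var = object_id, var
    return best_id  # None if no face moved enough
```

The returned id selects the captured sub-image passed on to step S503.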
For example, in a conference scene, once someone speaks and the orientation information of the possible audio source has been determined, a high-precision camera in the scene is triggered to look toward that orientation for the speaker's face. From the speaker's facial expression and the mouth movements of speech, the system judges that the person is speaking, and thereby finds the speaker among the participants of the current conference. Once the speaker is found, the camera can precisely capture the speaker's current state and surroundings and show them on a display screen, so that the current speaker is recognized more easily and the participants have a more immersive experience.
It should be noted that, to identify the target object accurately, the feature recognition of the captured image in the embodiment of the present application may be performed using Artificial Intelligence (AI) techniques. For example, a facial expression recognition model may be trained, through a machine learning algorithm, on the facial expressions of people while speaking; the captured image is then fed to the model, which automatically yields the captured sub-image satisfying the target feature condition, and person recognition is performed on that sub-image to obtain the target object producing the audio information.
After the orientation information is obtained, a control instruction for image acquisition may be generated from it. The control instruction controls the image capture device to capture images of the target space matching the orientation information, so that the device outputs an image including the target object.
This method is suitable for tracking and capturing the target object while the audio information is being output, i.e., the captured image of the target object covered by the orientation information is output in real time using the located orientation. When a single object is currently outputting audio, the face-focused shooting function of the image capture device can be triggered based on the orientation information, so that the output image is the face image of the current speaker and the other participating objects in the target scene can see information such as the target object's facial expression.
On the basis of the embodiment of fig. 1, the embodiment of the present application further provides that, if the identified target object is not unique, the multimedia information to be output can be determined from the duration of the generated audio: if the audio duration produced by a target object exceeds a duration threshold, the multimedia information matching that object is output.
For example, in a conference with a host, the host usually invites the speaker into the speaking area and introduces the speaker with a short guidance phrase. Because that introduction is brief, the host can simply be ignored, and the invited speaker becomes the target object of the finally output multimedia information, whose multimedia content is then output. This also helps suppress interference from environmental noise, making the located target object more accurate.
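The duration-based disambiguation can be sketched as a filter-then-select step; the 3-second threshold and tuple layout below are illustrative assumptions:

```python
def select_target(candidates, threshold_s: float = 3.0):
    # candidates: iterable of (object_id, speech_duration_s).
    # Short utterances (a host's one-line introduction, noise picked up
    # as speech) are discarded; the longest qualifying speaker wins.
    kept = [(duration, object_id) for object_id, duration in candidates
            if duration > threshold_s]
    return max(kept)[1] if kept else None
```

In the hosted-conference example, the host's brief introduction falls under the threshold and the invited speaker is selected.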
Referring to fig. 6, which shows a display diagram of multimedia information provided in an embodiment of the present application, the multimedia information output in fig. 6(a) includes a face image of the target object. To let the participants in the target scene learn more about the current producer of the audio information, identification information of the target object may be output together with its face image. The specific process may include the following steps:
S601: acquiring identification information matching the target object;
S602: combining the information to be output corresponding to the target object with the identification information to obtain the multimedia information matching the target object;
S603: outputting the multimedia information.
After the orientation information is obtained, identification information can be derived for the possible object it determines. The identification information may represent a unique identifier of the current object, or an identifier matched with the object's output, for example the object's identity information or text derived from the output audio. Referring to fig. 6(b), the output multimedia information includes not only the facial image of the current audio producer but also the producer's name, "Zhang San". The participants thus obtain the speaker's expression while also matching the image to the person, so they can better understand who is currently speaking.
Of course, the multimedia information may also include text of the output audio: while the image information of the target object is output, text matching the target object's speech may be displayed below the image, so that the participants learn the specific content of the speech in time.
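Steps S601 to S602 amount to bundling the face image with the identification information before output; the sketch below is a minimal, assumed data-model illustration (the application does not prescribe how the pieces are combined for rendering):

```python
def build_multimedia(face_image: bytes, name: str, subtitle: str = "") -> dict:
    # Bundle the target object's face image with its identification
    # information; a renderer would draw the name and optional subtitle
    # alongside the image, as in fig. 6(b).
    media = {"image": face_image, "label": name}
    if subtitle:
        media["subtitle"] = subtitle
    return media
```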
In addition, the embodiments above determine the orientation information and identify the target object in response to the target scene producing audio information satisfying the target condition. If the target scene does not produce such audio information, multimedia information matching the target scene is generated and then output. That information may be pre-stored scene introduction information, such as introductory images or videos of the current target scene, or a real-time panoramic image generated from the scene. For example, if no audio information satisfying the target condition exists in a conference scene, introduction information about the conference, or a panoramic image of the current conference, is shown on the display device in the scene.
There is also provided in another embodiment of the present application an information processing system, see fig. 7, including:
the orientation acquisition device 10, configured to, in response to the target scene generating audio information satisfying the target condition, acquire information of the target scene using a first identification method to obtain orientation information; the target scene comprises at least one object, and the orientation information is associated with the object producing the audio information;
a target recognition device 20 for recognizing a target space matched with the azimuth information by using a second recognition method, and determining a target object generating the audio information, wherein the target space comprises at least one object;
and the information output device 30, configured to output the multimedia information matching the target object, so that at least one object in the target scene can learn the multimedia information of the target object.
It should be noted that, in the embodiment of the information processing system of the present application, the orientation acquisition device may include a sound pickup device, a TOF sensor, a ranging camera, a processor, and an auxiliary light source; similarly, the target recognition device may include an image capture device, an image recognition device, a processor, and other related components to determine the target object; and the information output device may include an audio output module, a video output module, a display module, and the like. For the selection of devices and modules in a specific apparatus, reference may be made to the description of the embodiments of the information processing method provided in the present application, which is not repeated here.
On the basis of the above embodiment of the information processing system, the system further includes:
a scene information output unit for generating multimedia information matched with a target scene in response to the target scene not generating audio information satisfying the target condition; and outputting the multimedia information matched with the target scene.
On the basis of the above embodiment of the information processing system, the first recognition method in the orientation capturing apparatus represents a pickup system recognition method, the pickup system includes a first pickup device and a second pickup device, wherein the orientation capturing apparatus includes:
the time acquisition unit is used for acquiring audio information of the target scene by using the first sound pickup device and the second sound pickup device respectively to acquire first time information and second time information;
and the first calculation unit is used for calculating and obtaining the azimuth information according to the first time information and the second time information.
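The first calculation unit's conversion of the two arrival times into azimuth information is, in the far-field case, a classical TDOA computation; the sketch below assumes two pickups a known distance apart and room-temperature sound speed (both assumptions for illustration, not values from this application):

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees C (assumed)

def tdoa_bearing(t1_s: float, t2_s: float, mic_spacing_m: float) -> float:
    # Far-field approximation: the extra path to the later pickup is
    # spacing * sin(theta), so theta = asin(c * dt / spacing).
    path_diff_m = SPEED_OF_SOUND * (t1_s - t2_s)
    ratio = max(-1.0, min(1.0, path_diff_m / mic_spacing_m))  # clamp noise
    return math.asin(ratio)  # radians; 0 means broadside to the pair
```

Equal arrival times place the source broadside to the pickup pair; a full spacing's worth of path difference places it on the pair's axis.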
On the basis of the above-mentioned embodiment of the information processing system, the first identification method in the orientation acquisition device characterizes a position identification method, and the orientation acquisition device further includes:
the first determining unit is used for determining a first position of the target scene according to the audio information;
the first acquisition unit is used for acquiring distance information between the first position and a preset reference position;
and the second calculation unit is used for calculating and obtaining the azimuth information by utilizing the distance information.
On the basis of the above-mentioned embodiment of the information processing system, the first identification method in the orientation acquisition device characterizes a method of image identification, and the orientation acquisition device further includes:
the second acquisition unit is used for acquiring images of the object of the target scene to obtain an acquired image;
an orientation acquisition unit configured to acquire orientation information of a first object from a captured image that matches the first object in response to a change in a biometric characteristic of the first object in the captured image.
On the basis of the above-described embodiment of the information processing system, the object recognition apparatus includes:
an image acquisition unit for acquiring a collected image of at least one object of a target space matched with the orientation information;
the characteristic identification unit is used for carrying out characteristic identification on the acquired image and determining an acquired sub-image meeting a target characteristic condition;
and the second determining unit is used for determining a target object for generating the audio information according to the acquired sub-images.
On the basis of the above embodiment of the information processing system, the system further includes:
and the instruction generating unit is used for generating a control instruction of image acquisition by using the azimuth information, and the control instruction is used for controlling the image acquisition equipment to acquire an image of a target space matched with the azimuth information so that the image acquisition equipment outputs an image including a target object.
On the basis of the above embodiment of the information processing system, the information output device is specifically configured to:
and if the audio time length generated by the target object is greater than a time length threshold value, outputting multimedia information matched with the target object.
On the basis of the above-described embodiment of the information processing system, the information output apparatus includes:
the identification acquisition unit is used for acquiring identification information matched with the target object;
the information combination unit is used for combining the information to be output corresponding to the target object with the identification information to obtain multimedia information matched with the target object;
and the information output unit is used for outputting the multimedia information.
An embodiment of the present application provides a storage medium having a program stored thereon, which when executed by a processor implements the information processing method.
An embodiment of the present application provides a processor configured to run a program which, when run, performs the information processing method described above.
An embodiment of the present application provides an electronic device including a processor, a memory, and a program stored on the memory and runnable on the processor, the processor implementing the following steps when executing the program:
in response to a target scene generating audio information satisfying a target condition, acquiring information of the target scene by using a first identification method to obtain orientation information; the target scene comprises at least one object, the orientation information being associated with the object from which the audio information was generated;
identifying a target space matched with the azimuth information by using a second identification method, and determining a target object generating the audio information, wherein the target space comprises at least one object;
and outputting the multimedia information matched with the target object so that at least one object in the target scene can know the multimedia information of the target object.
Further, the method further comprises:
in response to a target scene not producing audio information satisfying the target condition, generating multimedia information matching the target scene;
and outputting the multimedia information matched with the target scene.
Further, the first recognition method represents a pickup system recognition method, the pickup system includes a first pickup device and a second pickup device, wherein the acquiring information of the target scene by using the first recognition method to obtain the azimuth information includes:
acquiring audio information of the target scene by using the first sound pickup device and the second sound pickup device respectively to obtain first time information and second time information;
and calculating to obtain azimuth information according to the first time information and the second time information.
Further, the first identification method represents a position identification method, and acquiring information of the target scene by using the first identification method to obtain the azimuth information includes:
determining a first position of the target scene according to the audio information;
acquiring distance information between the first position and a preset reference position;
and calculating to obtain azimuth information by using the distance information.
Further, the first recognition method represents an image recognition method, and acquiring information of the target scene by using the first recognition method to obtain the orientation information includes:
acquiring an image of an object of the target scene to obtain an acquired image;
in response to a change in a biometric characteristic of a first object in the captured image, orientation information of the first object is obtained from the captured image that matches the first object.
Further, the identifying the target space matched with the orientation information by using the second identification method, and determining the target object generating the audio information, includes:
obtaining a captured image of at least one object of a target space that matches the orientation information;
carrying out feature recognition on the collected image, and determining a collected sub-image meeting a target feature condition;
and determining a target object for generating the audio information according to the acquisition sub-images.
Further, the method further comprises:
and generating a control instruction for image acquisition by using the orientation information, wherein the control instruction is used for controlling the image acquisition equipment to acquire an image of a target space matched with the orientation information, so that the image acquisition equipment outputs an image including a target object.
Further, the outputting the multimedia information matched with the target object includes:
and if the audio time length generated by the target object is greater than a time length threshold value, outputting multimedia information matched with the target object.
Further, the outputting the multimedia information matched with the target object includes:
acquiring identification information matched with the target object;
combining the information to be output corresponding to the target object with the identification information to obtain multimedia information matched with the target object;
and outputting the multimedia information.
The electronic device herein may be a server, a PC, a tablet computer, a mobile phone, or the like.
The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute a program initializing the steps of the information processing method described above.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a mobile storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially implemented or portions thereof contributing to the prior art may be embodied in the form of a software product stored in a storage medium, and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
The emphasis of each embodiment in the present specification is on the difference from the other embodiments, and the same and similar parts among the various embodiments may be referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. An information processing method comprising:
in response to a target scene generating audio information satisfying a target condition, acquiring information of the target scene by using a first identification method to obtain orientation information; the target scene comprises at least one object, the orientation information being associated with the object from which the audio information was generated;
identifying a target space matched with the azimuth information by using a second identification method, and determining a target object generating the audio information, wherein the target space comprises at least one object;
and outputting the multimedia information matched with the target object so that at least one object in the target scene can know the multimedia information of the target object.
2. The method of claim 1, further comprising:
in response to a target scene not producing audio information satisfying the target condition, generating multimedia information matching the target scene;
and outputting the multimedia information matched with the target scene.
3. The method of claim 1, wherein the first recognition method represents a pickup system recognition method, the pickup system comprises a first pickup device and a second pickup device, and wherein the acquiring information of the target scene by using the first recognition method to obtain the azimuth information comprises:
acquiring audio information of the target scene by using the first sound pickup device and the second sound pickup device respectively to obtain first time information and second time information;
and calculating to obtain azimuth information according to the first time information and the second time information.
4. The method of claim 1, wherein the first recognition method represents a position recognition method, and the acquiring information of the target scene by using the first recognition method to obtain the orientation information comprises:
determining a first position of the target scene according to the audio information;
acquiring distance information between the first position and a preset reference position;
and calculating to obtain azimuth information by using the distance information.
5. The method of claim 1, wherein the first recognition method represents a method of image recognition, and the acquiring information of the target scene by using the first recognition method to obtain the orientation information comprises:
acquiring an image of an object of the target scene to obtain an acquired image;
in response to a change in a biometric characteristic of a first object in the captured image, orientation information of the first object is obtained from the captured image that matches the first object.
6. The method of claim 1, wherein identifying the target space matched with the orientation information using a second identification method, determining a target object that produced the audio information, comprises:
obtaining a captured image of at least one object of a target space that matches the orientation information;
carrying out feature recognition on the collected image, and determining a collected sub-image meeting a target feature condition;
and determining a target object for generating the audio information according to the acquisition sub-images.
7. The method of claim 1, further comprising:
and generating a control instruction for image acquisition by using the orientation information, wherein the control instruction is used for controlling the image acquisition equipment to acquire an image of a target space matched with the orientation information, so that the image acquisition equipment outputs an image including a target object.
8. The method of claim 1, wherein outputting the multimedia information matching the target object comprises:
and if the audio time length generated by the target object is greater than a time length threshold value, outputting multimedia information matched with the target object.
9. The method of claim 1, wherein outputting the multimedia information matching the target object comprises:
acquiring identification information matched with the target object;
combining the information to be output corresponding to the target object with the identification information to obtain multimedia information matched with the target object;
and outputting the multimedia information.
10. An information processing system comprising:
the azimuth acquisition device, configured to, in response to the target scene generating audio information satisfying the target condition, acquire information of the target scene using a first identification method to obtain azimuth information; the target scene comprises at least one object, and the azimuth information is associated with the object producing the audio information;
a target recognition device for recognizing a target space matched with the azimuth information by using a second recognition method, and determining a target object for generating the audio information, wherein the target space comprises at least one object;
and the information output equipment is used for outputting the multimedia information matched with the target object so that at least one object in the target scene can know the multimedia information of the target object.
CN201911059546.3A 2019-11-01 2019-11-01 Information processing method and system Pending CN110730378A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911059546.3A CN110730378A (en) 2019-11-01 2019-11-01 Information processing method and system


Publications (1)

Publication Number Publication Date
CN110730378A true CN110730378A (en) 2020-01-24

Family

ID=69223600

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911059546.3A Pending CN110730378A (en) 2019-11-01 2019-11-01 Information processing method and system

Country Status (1)

Country Link
CN (1) CN110730378A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112839165A (en) * 2020-11-27 2021-05-25 深圳市捷视飞通科技股份有限公司 Method and device for implementing face-tracking camera shooting, computer equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105611167A (en) * 2015-12-30 2016-05-25 联想(北京)有限公司 Focusing plane adjusting method and electronic device
CN105744208A (en) * 2014-12-11 2016-07-06 北京视联动力国际信息技术有限公司 Video conference control system and control method
CN205490942U (en) * 2016-03-16 2016-08-17 上海景瑞信息技术有限公司 Automatic positioning system of camera based on speech recognition
CN107820037A (en) * 2016-09-14 2018-03-20 南京中兴新软件有限责任公司 Audio signal and image processing methods, devices and systems
CN109492506A (en) * 2017-09-13 2019-03-19 华为技术有限公司 Image processing method, device and system
CN110082723A (en) * 2019-05-16 2019-08-02 浙江大华技术股份有限公司 Sound source localization method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN112088315B (en) Multi-mode speech localization
US10074012B2 (en) Sound and video object tracking
JP6464449B2 (en) Sound source separation apparatus and sound source separation method
JP5456832B2 (en) Apparatus and method for determining relevance of an input utterance
EP3855731B1 (en) Context based target framing in a teleconferencing environment
CN102903362B Integrated local and cloud-based speech recognition
CN112088402A (en) Joint neural network for speaker recognition
CN111432115B (en) Face tracking method based on voice auxiliary positioning, terminal and storage device
CN107820037B (en) Audio signal, image processing method, device and system
US10582117B1 (en) Automatic camera control in a video conference system
KR102463806B1 (en) Electronic device capable of moving and method for operating thereof
CN108877787A (en) Audio recognition method, device, server and storage medium
Kapralos et al. Audiovisual localization of multiple speakers in a video teleconferencing setting
CN111551921A (en) Sound source orientation system and method based on sound image linkage
CN111251307A (en) Voice acquisition method and device applied to robot and robot
CN110188179B (en) Voice directional recognition interaction method, device, equipment and medium
JP2004198656A (en) Robot audio-visual system
KR101976937B1 (en) Apparatus for automatic conference notetaking using mems microphone array
WO2021230180A1 (en) Information processing device, display device, presentation method, and program
US11460927B2 (en) Auto-framing through speech and video localizations
CN110730378A (en) Information processing method and system
CN104780341B Information processing method and information processing apparatus
JP6799510B2 (en) Scene recognition devices, methods, and programs
Zhang et al. Boosting-based multimodal speaker detection for distributed meetings
CN114040107A (en) Intelligent automobile image shooting system, method, vehicle and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200124