CN117676072B - AR-based multi-person complex interactive conference method and device

Info

Publication number
CN117676072B
Authority
CN
China
Prior art keywords
real
participant
time
information
participants
Legal status
Active
Application number
CN202410133163.0A
Other languages
Chinese (zh)
Other versions
CN117676072A
Inventor
曾铮
隋璐捷
赵婷
陈家璘
余铮
叶俊儒
胡晨
Current Assignee
Information and Telecommunication Branch of State Grid Hubei Electric Power Co Ltd
Original Assignee
Information and Telecommunication Branch of State Grid Hubei Electric Power Co Ltd
Application filed by Information and Telecommunication Branch of State Grid Hubei Electric Power Co Ltd
Priority to CN202410133163.0A
Publication of CN117676072A
Application granted
Publication of CN117676072B
Status: Active

Classifications

    • H04N 7/157: Conference systems defining a virtual conference space and using avatars or agents
    • G06V 40/171: Local features and components; facial parts; occluding parts, e.g. glasses; geometrical relationships
    • G10L 15/1815: Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G10L 17/02: Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L 17/04: Training, enrolment or model building
    • G10L 17/14: Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L 21/0272: Voice signal separating
    • H04L 65/1083: In-session procedures
    • H04L 65/403: Arrangements for multi-party communication, e.g. for conferences


Abstract

The invention relates to the field of digital technology and provides an AR-based multi-person complex interactive conference method and device. The method comprises the following steps: when a participant speaks alone, establishing an association relationship between that speaking participant and his or her voiceprint features; when multiple participants speak at the same time, performing feature matching on the real-time voice data using the first voiceprint features of each first participant who is speaking, and separating out and recognizing the semantic information of each first participant; and recognizing the real-time image to obtain each operation object in it, matching the semantic information of each first participant against each operation object, and marking each matched operation object with the same color as the corresponding participant. By separating the mixed multi-person speech, recognizing each separated personal audio track and marking the operation objects according to the recognition results, the invention enables the remote operator to quickly identify what each person is expressing when several people speak at the same time, which improves the efficiency of the multi-person conference and ensures that it runs normally and in an orderly manner.

Description

AR-based multi-person complex interactive conference method and device
Technical Field
The invention relates to the field of digital technology, and in particular to an AR-based multi-person complex interactive conference method and device.
Background
Internet-based video communication technology is widely used in video conferencing scenarios in both work and daily life. With the development of augmented reality hardware and software, video conferencing applications based on augmented reality technology have matured and are used by more and more people.
Augmented reality (AR) is a digital technology that computes the position and angle of camera images in real time and overlays corresponding virtual imagery, seamlessly integrating real-world information with virtual-world information. The real environment and virtual objects are superimposed in the same picture or space in real time and coexist there. Augmented reality therefore presents not only real-world information but also virtual information, with the two complementing and overlaying each other. In visual augmented reality, a user wearing augmented reality hardware such as a head-mounted display or AR glasses sees the surrounding real world combined with computer graphics. The technology draws on multimedia, three-dimensional modelling, real-time video display and control, multi-sensor fusion, real-time tracking and registration, scene fusion and other techniques, and can provide information beyond what humans can ordinarily perceive.
In the prior art, audio remains the main communication channel in an AR conference. When many people take part, however, participants often fail to coordinate their speaking turns and end up talking at the same time; a remote operator then cannot accurately tell which participant said what, which harms conference efficiency.
In view of this, overcoming the drawbacks of the prior art is a problem to be solved in the art.
Disclosure of Invention
The invention aims to provide an AR-based multi-person complex interactive conference method.
The invention adopts the following technical scheme:
In a first aspect, the present invention provides an AR-based multi-person complex interactive conference method. A first camera, a microphone and a display screen are disposed in a remote conference room: the first camera is used to collect real-time conference room images in the remote conference room, and the microphone is used to collect real-time voice data of each participant. A remote operator works while wearing an AR device on which a second camera is disposed; the second camera is used to collect real-time images of the scene in front of the remote operator. The method includes:
receiving the real-time conference room images and real-time voice data from the remote conference room side and the real-time images from the AR device side, transmitting the real-time images to the remote conference room side for display on the display screen, and transmitting the real-time voice data to the AR device side for playback there; matching a corresponding virtual head portrait to each participant, and transmitting each participant and the corresponding virtual head portrait to the remote conference room side and the AR device side, so that a participant list is displayed synchronously on the display screen and on the AR device;
performing face recognition on the real-time conference room image to obtain face information of each participant, and recognizing whether each participant is speaking or not according to the face information;
when a participant is identified as speaking alone, extracting features from the currently collected real-time voice data to obtain the voiceprint features of the speaking participant, and establishing an association relationship between the speaking participant and the voiceprint features;
when multiple participants are identified as speaking at the same time, identifying the several first participants who are speaking according to the face information, finding the first voiceprint features corresponding to each first participant from the association relationship, performing feature matching on the currently collected real-time voice data using each first voiceprint feature so as to separate the personal audio of each first participant from the real-time voice data, and performing semantic recognition on the personal audio of each first participant to obtain the semantic information of each first participant;
recognizing the real-time image to obtain each operation object in the real-time image, matching the semantic information of each first participant against each operation object in the real-time image, generating guidance information according to the matching result, and transmitting the guidance information to the AR device side, wherein the guidance information comprises information on each participant who is speaking and information on the operation objects matched to each such participant;
marking, by the AR device and according to the guidance information, the participants who are speaking in the participant list with different colors, and marking each operation object matched to a participant with the same color as the corresponding participant, so that the remote operator can identify, from the different marks, which participants are speaking and which operation objects they have mentioned.
Preferably, the feature matching is performed on the currently acquired real-time voice data by using each first voiceprint feature so as to separate personal audio of each first participant from the real-time voice data, which specifically includes:
performing a short-time Fourier transform on the currently collected real-time voice data to obtain a mixed spectrum feature;
splicing the mixed spectrum feature with the corresponding first voiceprint feature to obtain a reference spectrum feature, and inputting the reference spectrum feature into a dilated convolution layer to obtain a basic feature;
inputting the basic features into a voice separation model, outputting to obtain a spectrum mask, multiplying the spectrum mask by the mixed spectrum features to obtain personal spectrums of corresponding first participants, and recovering the personal spectrums by using the phase spectrums of the real-time voice data acquired currently to obtain personal audios of the corresponding first participants.
Preferably, identifying the plurality of first participants who are speaking according to the face information specifically includes:
after face information is identified, selecting each face region, detecting a plurality of key feature points of the face region by using a feature point detection model, calculating to obtain a feature value of the face region by using the position relation among the plurality of key feature points, and identifying whether the face region is in a mouth opening state according to the feature value;
if the frame number ratio of the corresponding face area in the mouth opening state is higher than a preset ratio in the continuous multi-frame real-time conference room images, and the variance of the characteristic value of the face area in the multi-frame real-time conference room images in the mouth opening state is larger than a first preset value, identifying that the participant corresponding to the face area is speaking.
Preferably, the plurality of key feature points include a lip left edge feature point P1, a lip right edge feature point P2, an upper lip left lip peak highest feature point P3, an upper lip right lip peak highest feature point P4, a lower lip left side feature point P5 opposite to the upper lip left lip peak highest feature point, and a lower lip right side feature point P6 opposite to the upper lip right lip peak highest feature point; the calculating to obtain the feature value of the face region by using the position relation among the plurality of key feature points specifically comprises:
using the difference between P2 and P1 as a first difference, the difference between P3 and P5 as a second difference, and the difference between P4 and P6 as a third difference;
dividing the result obtained by adding the second difference value and the third difference value by the first difference value, and multiplying the result by a preset coefficient to obtain a characteristic value;
Identifying whether the face region is in a mouth-open state according to the feature value specifically includes: when the feature value is greater than a first preset value, recognizing that the face region is in the mouth-open state.
Preferably, matching the semantic information of each first participant with each operation object in the real-time image specifically includes:
after an operation object is recognized, selecting the operation object vector corresponding to that operation object from a semantic network;
calculating, in the semantic network, the distance between each semantic vector in the semantic information and each operation object vector;
selecting, from the operation object vectors whose distance to the corresponding semantic vector is smaller than a preset distance, the one with the smallest distance; the operation object corresponding to that operation object vector is the operation object matched to the semantic information of the first participant.
Preferably, the method further comprises:
when the semantic information of one participant matches a plurality of operation objects, sorting the operation objects according to the order in which their corresponding semantic vectors appear in the semantic information, and transmitting the resulting order information to the AR device;
the AR device marks the plurality of operation objects mentioned by the same participant with the same color as that participant and, according to the order information, marks a sequence number beside each operation object; the sequence numbers represent the order in which the operation objects were mentioned.
Preferably, recognizing the real-time image to obtain each operation object in the real-time image is based on contour analysis of the real-time image, each contour obtained from the analysis being matched in a preset object library; when a first contour is not matched to any object in the preset object library, a blind-area identifier is marked at the position of the first contour in the real-time image, to remind the remote operator to identify the first contour manually.
Preferably, the manual identification of the first contour by the remote operator specifically includes:
when the gesture of the remote operator is recognized as pointing at the position of the first contour and voice data of the remote operator is collected, performing semantic recognition on the voice data to obtain the operation object corresponding to the first contour.
Preferably, the method further comprises:
displaying the semantic information of a participant beside the virtual head portrait corresponding to that participant, and playing the personal audio of the participant when the gesture of the remote operator is recognized as pointing at the semantic information of the corresponding participant.
In a second aspect, the present invention further provides an AR-based multi-person complex interactive conference device, configured to implement the AR-based multi-person complex interactive conference method of the first aspect, where the device includes:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor for performing the AR-based multi-person complex interactive conferencing method of the first aspect.
In a third aspect, the present invention also provides a non-volatile computer storage medium storing computer-executable instructions for execution by one or more processors to perform the AR-based multi-person complex interactive conferencing method of the first aspect.
By separating multi-person speech, recognizing each separated personal audio track and marking the operation objects according to the recognition results, the invention enables the remote operator to quickly identify what each person is expressing when several people speak at the same time, which improves the efficiency of the multi-person conference and ensures that it runs normally and in an orderly manner.
Drawings
In order to more clearly illustrate the technical solution of the embodiments of the present invention, the drawings that are required to be used in the embodiments of the present invention will be briefly described below. It is evident that the drawings described below are only some embodiments of the present invention and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
Fig. 1 is a schematic diagram of the architecture among the remote conference room, the server and the AR device in an AR-based multi-person complex interactive conference method according to an embodiment of the present invention;
Fig. 2 is a schematic flow chart of a first AR-based multi-person complex interactive conference method according to an embodiment of the present invention;
Fig. 3 is an interaction schematic diagram among the remote conference room, the server and the AR device in an AR-based multi-person complex interactive conference method according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of the guidance information in an AR-based multi-person complex interactive conference method according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of the AR device side in a first AR-based multi-person complex interactive conference method according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of the AR device side in a second AR-based multi-person complex interactive conference method according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of the guidance information in another AR-based multi-person complex interactive conference method according to an embodiment of the present invention;
Fig. 8 is a flow chart of a second AR-based multi-person complex interactive conference method according to an embodiment of the present invention;
Fig. 9 is a schematic flow chart of a third AR-based multi-person complex interactive conference method according to an embodiment of the present invention;
Fig. 10 is a schematic diagram of yet another AR-based multi-person complex interactive conference method provided by an embodiment of the present invention;
Fig. 11 is a schematic diagram of yet another AR-based multi-person complex interactive conference method provided by an embodiment of the present invention;
Fig. 12 is a flow chart of a fourth AR-based multi-person complex interactive conference method according to an embodiment of the present invention;
Fig. 13 is a flow chart of a fifth AR-based multi-person complex interactive conference method according to an embodiment of the present invention;
Fig. 14 is a flow chart of a sixth AR-based multi-person complex interactive conference method according to an embodiment of the present invention;
Fig. 15 is a schematic diagram of the AR device side in a third AR-based multi-person complex interactive conference method according to an embodiment of the present invention;
Fig. 16 is a schematic diagram of the cement pump guniting machine in the initial factory-building environment model in an AR-based multi-person complex interactive conference method according to an embodiment of the present invention;
Fig. 17 is a schematic diagram of the cement pump guniting machine in the current actual factory-building image in an AR-based multi-person complex interactive conference method according to an embodiment of the present invention;
Fig. 18 is a schematic diagram of another AR-based multi-person complex interactive conference method provided by an embodiment of the present invention;
Fig. 19 is a schematic diagram of the AR device side in a fourth AR-based multi-person complex interactive conference method according to an embodiment of the present invention;
Fig. 20 is a schematic diagram of the AR device side in a fifth AR-based multi-person complex interactive conference method according to an embodiment of the present invention;
Fig. 21 is a schematic diagram of the AR device side in a sixth AR-based multi-person complex interactive conference method according to an embodiment of the present invention;
Fig. 22 is a schematic architecture diagram of an AR-based multi-person complex interactive conference device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The terms "first," "second," and the like herein are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first", "a second", etc. may explicitly or implicitly include one or more such feature. In the description of the present application, unless otherwise indicated, the meaning of "a plurality" is two or more.
In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
Example 1:
Embodiment 1 of the invention provides an AR-based multi-person complex interactive conference method. A first camera, a microphone and a display screen are disposed in the remote conference room: the first camera is used to collect real-time conference room images in the remote conference room, and the microphone is used to collect real-time voice data of each participant. The remote operator works while wearing an AR device on which a second camera is disposed; the second camera is used to collect real-time images of the scene in front of the remote operator. In actual use the AR device may be a pair of AR glasses. A host is disposed in the remote conference room, and the host and the AR device are both connected to a server to enable communication; the AR device may be connected to the server directly, as shown in Fig. 1, or connected to a terminal device over Bluetooth and reach the server through that terminal device. As shown in Fig. 2, the method in this embodiment includes:
In step 201, the real-time conference room images and real-time voice data are received from the remote conference room side and the real-time images are received from the AR device side; the real-time images are transmitted to the remote conference room side for display on the display screen, and the real-time voice data are transmitted to the AR device side for playback there. A corresponding virtual head portrait is matched to each participant, and each participant together with the corresponding virtual head portrait is transmitted to the remote conference room side and the AR device side, so that the participant list is displayed synchronously on the display screen and on the AR device.
In step 202, the real-time conference room image is subjected to face recognition to obtain face information of each participant, and whether each participant is speaking or not is recognized according to the face information.
In step 203, when a participant is identified as speaking alone, features are extracted from the currently collected real-time voice data to obtain the voiceprint features of the speaking participant, and an association relationship is established between the speaking participant and the voiceprint features. This embodiment takes into account that, in actual use, participants often introduce themselves at the beginning of a conference and at that time each participant usually speaks alone; the voiceprint features of each participant are therefore collected during this period and the association relationship between each participant and the voiceprint features (in practice a mapping between the participant's id and the voiceprint features) is established, in preparation for the normal conduct of the rest of the conference. The voiceprint features may be one or more features such as the magnitude spectrum, the mel cepstrum, and so on; extracting such features is prior art and is not described here.
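A minimal sketch of this enrollment step is given below. The patent does not name a feature extractor or library, so the use of librosa, a mean MFCC vector as the voiceprint and the 16 kHz sample rate are illustrative assumptions; any magnitude-spectrum or mel-cepstrum based embedding would fit the description equally well.

```python
import numpy as np
import librosa  # assumed feature-extraction backend; the patent does not prescribe one

# participant id -> voiceprint feature, built while participants speak one at a time
voiceprint_registry: dict[str, np.ndarray] = {}

def extract_voiceprint(audio: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Illustrative voiceprint: the mean MFCC vector of the utterance."""
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=20)
    return mfcc.mean(axis=1)

def enroll(participant_id: str, audio: np.ndarray, sr: int = 16000) -> None:
    """Called when the face analysis of step 202 shows exactly one participant speaking."""
    voiceprint_registry[participant_id] = extract_voiceprint(audio, sr)
```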
In step 204, when it is identified that there are multiple participants speaking at the same time, multiple first participants speaking are identified according to the face information, first voiceprint features corresponding to the first participants are found from the association relationship, and feature matching is performed on the currently acquired real-time voice data by using the first voiceprint features, so as to separate personal audio of the first participants from the real-time voice data, and semantic recognition is performed on the personal audio of the first participants to obtain semantic information of the first participants.
In step 205, the real-time image is recognized to obtain each operation object in it, the semantic information of each first participant is matched against each operation object in the real-time image, guidance information is generated according to the matching result, and the guidance information is transmitted to the AR device side. The guidance information comprises information on each participant who is speaking and information on the operation objects matched to each such participant. In actual use the guidance information is also transmitted to the remote conference room side, so that it can be displayed on the display screen there and the AR device side stays synchronized with the display screen of the remote conference room side, as shown in Fig. 3. In actual use the guidance information is structured as shown in Fig. 4: it contains the number of participants and the related information of each participant (i.e., the related information of participant 1, participant 2, ..., participant n in Fig. 4). Each piece of participant-related information contains at least the participant id, the number of operation objects and a terminator; when the corresponding participant is not speaking, the number of operation objects is 0, and when the corresponding participant is speaking, it also contains the related information of each operation object (i.e., the related information of operation object 1, operation object 2, ..., operation object n in Fig. 4). Each piece of operation object information contains an operation object serial number, a contour information length and the contour information, where the contour information length is the number of bits occupied by the contour information, so that the receiving side (the AR device side or the remote conference room side) can parse out the corresponding contour information according to that length. The contour information is the contour of the operation object in the latest real-time image, and it is used to mark the operation object along its contour in the subsequent step 206. In the guidance information, the numbers of bits occupied by the number of participants, the participant id, the number of operation objects and the operation object serial number are preset by those skilled in the art according to empirical analysis; the length of each piece of contour information is determined by the number of bits it occupies, and different pieces of contour information may have different lengths.
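The packet layout of Fig. 4 can be sketched as follows. The patent leaves the field widths and the terminator value to empirical choice, so the 2-byte and 4-byte fields and the 0xFFFF terminator below are assumptions; only the field order (participant count, then per-participant id, operation object count, per-object serial number, contour length and contour bytes, then terminator) follows the description.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class OperationObjectInfo:
    serial_number: int       # order in which the object was mentioned (see step 701)
    contour: bytes           # contour of the object in the latest real-time image

@dataclass
class ParticipantInfo:
    participant_id: int
    objects: List[OperationObjectInfo] = field(default_factory=list)  # empty while silent

def encode_guidance(participants: List[ParticipantInfo]) -> bytes:
    """Serialize the guidance information; field widths are assumed, not specified by the patent."""
    out = bytearray()
    out += len(participants).to_bytes(2, "big")              # number of participants
    for p in participants:
        out += p.participant_id.to_bytes(2, "big")            # participant id
        out += len(p.objects).to_bytes(2, "big")              # number of operation objects (0 = silent)
        for obj in p.objects:
            out += obj.serial_number.to_bytes(2, "big")       # operation object serial number
            out += len(obj.contour).to_bytes(4, "big")        # contour information length
            out += obj.contour                                 # contour information
        out += b"\xff\xff"                                     # terminator of this participant's block
    return bytes(out)
```

The receiving side (AR device or remote conference room) reads the contour length first and then consumes exactly that many bytes, which is why the length field precedes the contour data.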
In step 206, the AR device marks the participants who are speaking in the participant list with different colors according to the guidance information, and marks each operation object matched to a participant with the same color as that participant, so that the remote operator can identify, from the different marks, which participants are speaking and which operation objects each of them has mentioned.
For example, as shown in Fig. 5, suppose the remote operator is running a test in a laboratory. The laboratory environment image collected by the AR device is transmitted to the server as the real-time image, and the server recognizes several operation objects in it: an iron stand, a reagent bottle, an alcohol lamp, a beaker and a glass test tube. By analysing the real-time conference room image the server finds that two participants, person 1 and person 3, are currently speaking at the same time. It separates the personal audio of person 1 and of person 3 and performs semantic recognition, obtaining the semantic information of person 1, "take the reagent out of the reagent bottle and pour it into the beaker", and the semantic information of person 3, "light the alcohol lamp". The server then sends guidance information to the AR device side as shown in Fig. 7, in which the related information of participant 1 contains two pieces of operation object information (the reagent bottle and the beaker) and the related information of participant 3 contains one piece of operation object information (the alcohol lamp).
After receiving the guidance information, the AR device side parses it and marks the participants whose number of operation objects is not zero with different colors. In Fig. 5, for example, participant 1 and participant 3 are marked; the operation objects corresponding to participant 1 (the reagent bottle and the beaker) are marked in the same form as participant 1, the operation object corresponding to participant 3 (the alcohol lamp) is marked in the same form as participant 3, and in each participant's semantic information the words matched to operation objects are highlighted.
It should be noted that Fig. 5 is only a schematic representation of the technical solution of the invention and not the actual image seen by the remote operator: in actual use the remote operator sees a three-dimensional view of the real environment through the AR device rather than the two-dimensional plane shown in Fig. 5. In actual use each participant is marked with a different color; in Fig. 5 the marks are labelled "color one" and "color two" to indicate that the colors differ.
In actual use an operation object is not necessarily an independent object; it may also be an assembly part of a piece of equipment. In that case, recognizing the real-time image to obtain each operation object specifically includes: first recognizing the relatively independent piece of equipment, and then recognizing each part of that equipment according to a preset part contour library corresponding to it. Independent objects or pieces of equipment are recognized by matching against a preset object library. For example, in Fig. 6, when a cement pump guniting machine appears in the real-time image, part matching is performed in the part contour library corresponding to the cement pump guniting machine so that each part is recognized. When the semantic information of person 1 is "check whether there is sediment in the hopper and whether the vibrator is running normally" while the semantic information of person 3 is "check whether the guniting pipeline is normal", the contours of the operation objects "hopper", "vibrator" and "guniting pipeline" are identified accordingly.
In this embodiment, the multi-person speech is separated, each separated personal audio track is recognized, and the operation objects are marked according to the recognition results, so that the remote operator can quickly identify what each person is expressing when several people speak at the same time; this improves the efficiency of the multi-person conference and ensures that it runs normally and in an orderly manner.
In an optional embodiment, performing feature matching on the currently collected real-time voice data using each first voiceprint feature, so as to separate the personal audio of each first participant from the real-time voice data, as shown in Fig. 8, specifically includes:
In step 301, a short-time Fourier transform is performed on the currently collected real-time voice data to obtain a mixed spectrum feature.
In step 302, the mixed spectrum feature is spliced with the corresponding first voiceprint feature to obtain a reference spectrum feature, and the reference spectrum feature is input into a dilated convolution layer to obtain a basic feature. The dilated convolution layer is used to capture low-level audio features and comprises one or more convolutional neural networks, for example an 8-layer convolutional neural network.
In step 303, the basic feature is input into a voice separation model, which outputs a spectrum mask; the spectrum mask is multiplied by the mixed spectrum feature to obtain the personal spectrum of the corresponding first participant, and the personal spectrum is restored using the phase spectrum of the currently collected real-time voice data to obtain the personal audio of the corresponding first participant. The voice separation model is trained in advance by those skilled in the art and may be a deep learning model, such as a PIDNet model.
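A minimal PyTorch sketch of steps 301-303 follows. The layer widths, the sigmoid mask, the 8-layer dilation pattern and the STFT parameters are illustrative assumptions; the patent only fixes the overall flow of STFT, splicing with the voiceprint feature, dilated convolution, mask estimation and phase-based reconstruction.

```python
import torch
import torch.nn as nn

class SpeakerConditionedSeparator(nn.Module):
    """Sketch of steps 302-303; all sizes are illustrative, not the patent's values."""
    def __init__(self, n_freq: int = 257, emb_dim: int = 128, channels: int = 64, n_layers: int = 8):
        super().__init__()
        layers, in_ch = [], n_freq + emb_dim          # reference feature = mixed spectrum + voiceprint
        for i in range(n_layers):                     # dilated conv stack capturing low-level audio cues
            layers += [nn.Conv1d(in_ch, channels, 3, dilation=2 ** i, padding=2 ** i), nn.ReLU()]
            in_ch = channels
        self.dilated = nn.Sequential(*layers)
        self.mask_head = nn.Conv1d(channels, n_freq, 1)

    def forward(self, mix_mag: torch.Tensor, voiceprint: torch.Tensor) -> torch.Tensor:
        # mix_mag: (batch, n_freq, frames); voiceprint: (batch, emb_dim)
        vp = voiceprint.unsqueeze(-1).expand(-1, -1, mix_mag.shape[-1])
        basic = self.dilated(torch.cat([mix_mag, vp], dim=1))   # basic features
        mask = torch.sigmoid(self.mask_head(basic))             # spectrum mask in [0, 1]
        return mask * mix_mag                                   # personal spectrum

def separate(mix_wave: torch.Tensor, voiceprint: torch.Tensor,
             model: SpeakerConditionedSeparator, n_fft: int = 512, hop: int = 128) -> torch.Tensor:
    spec = torch.stft(mix_wave, n_fft, hop_length=hop, return_complex=True)        # step 301
    personal_mag = model(spec.abs().unsqueeze(0), voiceprint.unsqueeze(0))[0]       # steps 302-303
    personal_spec = torch.polar(personal_mag, spec.angle())                         # reuse mixture phase
    return torch.istft(personal_spec, n_fft, hop_length=hop)                        # personal audio
```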
In an actual application scenario, identifying the plurality of first participants who are speaking according to the face information, as shown in Fig. 9, specifically includes:
in step 401, after face information is identified, each face region is selected, a feature point detection model is used to detect a plurality of key feature points of the face region, a position relationship among the plurality of key feature points is used to calculate a feature value of the face region, and whether the face region is in a mouth opening state is identified according to the feature value.
In step 402, if, over consecutive frames of the real-time conference room image, the proportion of frames in which the corresponding face region is in the mouth-open state is higher than a preset proportion, and the variance of the feature value of that face region over the frames in which it is in the mouth-open state is greater than a first preset value, the participant corresponding to the face region is identified as speaking. The preset proportion and the first preset value are obtained by those skilled in the art from empirical analysis. The consecutive frames are determined from the collected real-time voice data, specifically: the speaking start time t1 and speaking end time t2 are determined from the amplitude of the real-time voice data, and steps 401-402 are executed on the real-time conference room images between t1 and t2. When executing step 402, a sliding window of preset size w and preset step s is used to select a number of windows, and step 402 is executed on the consecutive frames inside each window. As shown in Fig. 10, a sliding window is selected every preset step s and the window size is w; the two windows formed nearest to t1 in Fig. 10 are win1 and win2. When, within a window, the variance of the feature value of the corresponding face region is greater than the first preset value and the proportion of frames in which the face region is in the mouth-open state is higher than the preset proportion, the participant corresponding to that face region is considered to be speaking during the window's time period. When a participant is found to be speaking continuously over several consecutive windows, the whole time period covered by those windows is taken as the analysis unit for extracting the personal audio and performing semantic recognition. The preset size w and the preset step s are obtained by those skilled in the art according to empirical analysis.
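The window-level decision of step 402 can be sketched as follows; the 60 % frame ratio, the variance threshold and the window parameters w and s are placeholder values, since the description leaves all of them to empirical tuning.

```python
import numpy as np

def speaking_in_window(feature_values: np.ndarray, open_flags: np.ndarray,
                       min_open_ratio: float = 0.6, min_variance: float = 0.01) -> bool:
    """Step 402 inside one sliding window: enough mouth-open frames and enough lip movement."""
    open_ratio = open_flags.mean()                                 # fraction of mouth-open frames
    variance = feature_values[open_flags].var() if open_flags.any() else 0.0
    return open_ratio > min_open_ratio and variance > min_variance

def speaking_windows(feature_values: np.ndarray, open_flags: np.ndarray,
                     w: int = 15, s: int = 5) -> list:
    """Slide a window of w frames with step s over the t1..t2 segment (see Fig. 10)."""
    hits = []
    for start in range(0, len(feature_values) - w + 1, s):
        if speaking_in_window(feature_values[start:start + w], open_flags[start:start + w]):
            hits.append((start, start + w))
    return hits
```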
Determining the speaking start time t1 and the speaking end time t2 from the amplitude of the real-time voice data specifically includes:
when the amplitude at a position is detected to be higher than a first preset amplitude, a sliding window of size w starting at that position is selected and the average amplitude inside the window is calculated; if the average amplitude is higher than a second preset amplitude, that position is the speaking start time t1. Similarly, once speaking has been detected, if the amplitude at a position is detected to be lower than a third preset amplitude, a sliding window of size w starting at that position is selected and the average amplitude inside it is calculated; if the average amplitude is lower than a fourth preset amplitude, that position is the speaking end time t2. The first, second, third and fourth preset amplitudes are all obtained by those skilled in the art according to empirical analysis.
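A sketch of this start/end detection is shown below. The four preset amplitudes a1..a4, and the choice of operating on per-sample (or per-frame) amplitude values, are assumptions; the description only fixes the two-stage test of an instantaneous threshold followed by a windowed average.

```python
import numpy as np

def find_speech_span(amplitude: np.ndarray, w: int, a1: float, a2: float, a3: float, a4: float):
    """Return (t1, t2) as indices into the amplitude sequence, or None if no span is found."""
    t1 = None
    for i in range(len(amplitude) - w):
        window_mean = amplitude[i:i + w].mean()
        if t1 is None and amplitude[i] > a1 and window_mean > a2:
            t1 = i                                  # speaking start time t1
        elif t1 is not None and amplitude[i] < a3 and window_mean < a4:
            return t1, i                            # speaking end time t2
    return None
```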
In an alternative embodiment, the feature point detection model may be the shape_predictor_68_face_landmarks model of the dlib library. As shown in Fig. 11, the plurality of key feature points comprises a lip left-edge feature point P1, a lip right-edge feature point P2, an upper-lip left lip-peak highest feature point P3, an upper-lip right lip-peak highest feature point P4, a lower-lip left feature point P5 opposite the upper-lip left lip-peak highest feature point, and a lower-lip right feature point P6 opposite the upper-lip right lip-peak highest feature point. Calculating the feature value of the face region using the positional relationships among the plurality of key feature points, as shown in Fig. 12, specifically includes:
In step 501, the difference between P2 and P1 is used as a first difference, the difference between P3 and P5 is used as a second difference, and the difference between P4 and P6 is used as a third difference.
In step 502, the result obtained by adding the second difference and the third difference is divided by the first difference and multiplied by a preset coefficient to obtain the feature value. The preset coefficient is obtained by those skilled in the art through empirical analysis; in an alternative embodiment its value ranges from 1.5 to 2.5. Expressed as a formula:
v = k * ((P3 - P5) + (P4 - P6)) / (P2 - P1)
where v is the feature value and k is the preset coefficient.
Identifying whether the face region is in the mouth-open state according to the feature value specifically includes: when the feature value is greater than the first preset value, recognizing that the face region is in the mouth-open state.
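A sketch of the per-frame test of steps 501-502 follows. The coefficient k = 2.0 is an assumed value inside the stated 1.5-2.5 range, the threshold 0.6 is a placeholder for the first preset value, and the correspondence of P1..P6 to particular indices of dlib's 68-point model is left to the implementer.

```python
import numpy as np

def mouth_feature(points: dict, k: float = 2.0) -> float:
    """points maps 'P1'..'P6' to (x, y) landmark coordinates as numpy arrays."""
    first = np.linalg.norm(points["P2"] - points["P1"])     # first difference: mouth width
    second = np.linalg.norm(points["P3"] - points["P5"])    # second difference: left lip-peak opening
    third = np.linalg.norm(points["P4"] - points["P6"])     # third difference: right lip-peak opening
    return k * (second + third) / first

def mouth_open(points: dict, threshold: float = 0.6, k: float = 2.0) -> bool:
    """Mouth-open test against the first preset value (threshold is a placeholder)."""
    return mouth_feature(points, k) > threshold
```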
In an actual application scenario, matching the semantic information of each first participant with each operation object in the real-time image, as shown in Fig. 13, specifically includes:
In step 601, after an operation object is recognized, the operation object vector corresponding to that operation object is selected from a semantic network; the operation object vector is the vector corresponding to the name of the operation object in the semantic network.
In step 602, the distance between each semantic vector in the semantic information and each operation object vector is calculated in the semantic network; the semantic information comprises a plurality of semantic vectors, each representing one word.
In step 603, from the operation object vectors whose distance is smaller than a preset distance, the one with the smallest distance is selected; the operation object corresponding to that vector is the operation object matched to the semantic information of the first participant. The preset distance is obtained by those skilled in the art from empirical analysis.
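A sketch of steps 601-603 is given below. The use of Euclidean distance, the preset distance value and the idea of representing the semantic network by pretrained word embeddings are assumptions; returning the matched operation objects in the order their words appear also prepares the ordering used in step 701.

```python
import numpy as np

def match_operation_objects(semantic_vectors: list, object_vectors: dict, max_distance: float = 1.0) -> list:
    """semantic_vectors: word vectors of the participant's utterance, in utterance order.
    object_vectors: name -> vector of each recognized operation object in the semantic network."""
    matched = []
    for sem_vec in semantic_vectors:                 # one word of the semantic information
        best_name, best_dist = None, max_distance    # only candidates below the preset distance count
        for name, obj_vec in object_vectors.items():
            dist = float(np.linalg.norm(sem_vec - obj_vec))
            if dist < best_dist:
                best_name, best_dist = name, dist
        if best_name is not None and best_name not in matched:
            matched.append(best_name)                # preserves the order in which objects are mentioned
    return matched
```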
In a preferred embodiment, as shown in Fig. 14, the method further comprises:
In step 701, when the semantic information of one participant matches a plurality of operation objects, the operation objects are sorted according to the order in which their corresponding semantic vectors appear in the semantic information, and the resulting order information is transmitted to the AR device.
In step 702, the AR device marks the plurality of operation objects mentioned by the same participant with the same color as that participant and, according to the order information, marks a sequence number beside each operation object; the sequence numbers represent the order in which the operation objects were mentioned, as shown in Fig. 5.
The real-time image is recognized, and each operation object in it is obtained, on the basis of contour analysis of the real-time image, each contour obtained from the analysis being matched in a preset object library. In some cases the corresponding operation object does not exist in the object library, or the object is not recognized correctly because of the viewing angle. This embodiment therefore also provides a preferred implementation: when a first contour is not matched to any object in the preset object library, a blind-area identifier is marked at the position of that first contour in the real-time image, to remind the remote operator to identify the first contour manually. As shown in Fig. 15, when the second operation object from the left fails to match, a blind-area identifier (a question mark) is shown at its position. The preset object library is set in advance by those skilled in the art after analysing the requirements of the conference's application scenario. The manual identification of the first contour by the remote operator specifically includes: when the gesture of the remote operator is recognized as pointing at the position of the first contour and voice data of the remote operator is collected, semantic recognition is performed on that voice data to obtain the operation object corresponding to the first contour. In other words, the remote operator states what the corresponding operation object is; after the statement, the remote operator's semantic information is converted into a semantic vector, and that semantic vector is the operation object vector corresponding to the stated operation object.
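The contour analysis and library matching can be sketched with OpenCV as below. The Otsu binarisation step and the matchShapes score threshold are assumptions; the description only requires that each extracted contour be compared against a preset object library and that an unmatched contour be flagged for blind-area marking.

```python
import cv2
import numpy as np

def identify_operation_objects(image: np.ndarray, object_library: dict, max_score: float = 0.1) -> list:
    """object_library maps an object name to a reference contour.
    Returns (contour, name) pairs; name is None for a blind-area candidate."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    results = []
    for contour in contours:
        best_name, best_score = None, max_score
        for name, ref_contour in object_library.items():
            score = cv2.matchShapes(contour, ref_contour, cv2.CONTOURS_MATCH_I1, 0.0)
            if score < best_score:                       # lower matchShapes score = more similar
                best_name, best_score = name, score
        results.append((contour, best_name))             # None triggers the question-mark marker
    return results
```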
In some embodiments, the method further comprises: displaying the semantic information of a participant beside the virtual head portrait corresponding to that participant, and, when the gesture of the remote operator is recognized as pointing at the semantic information of a participant, playing that participant's personal audio so as to guide the remote operator's operation.
Example 2:
On the basis of Embodiment 1, this embodiment provides a further preferred embodiment. An initial factory-building environment model is preset in the factory server (i.e., the server of Embodiment 1). The remote operator works in the factory building wearing an AR device; the AR device is connected to the remote operator's terminal over Bluetooth, and the terminal connects to the factory server after security verification. The AR device may be a pair of AR glasses. The initial factory-building environment model is created when a new factory building is constructed or when new equipment is installed in it, and it contains the initial shape of each piece of equipment (including a model of the equipment in a three-dimensional environment and the size information of the corresponding critical positions in that model), the name of each piece of equipment, the position of each piece of equipment in the factory building, the traffic paths in the factory building, and so on. The security verification method is provided by the manufacturer, for example certificate issuing, message encryption and decryption, user login and the like. The method in this embodiment includes:
The factory server receives the current actual factory-building image (i.e., the real-time image of Embodiment 1) from the terminal, performs primary matching between the current actual factory-building image and each piece of initial equipment in the initial factory-building environment model, and determines the actual equipment being operated by the remote operator according to the primary matching result.
Secondary matching is performed between the actual equipment and the corresponding initial equipment in the initial factory-building environment model; whether sediment exists on the working surface of the actual equipment is judged from the secondary matching result, and when sediment exists its thickness is calculated. The primary matching checks whether the contour similarity between the actual equipment and the corresponding initial equipment is greater than a preset similarity; once the contour similarity between the actual equipment and a first piece of initial equipment is found to be greater than the preset similarity, secondary matching is performed between the actual equipment and that first initial equipment. The secondary matching aligns each edge of the actual equipment with the first initial equipment and then calculates the distance from the deposition surface of the sediment to the corresponding edge, and the thickness of the sediment is calculated from that distance. The contour similarity may be obtained using the matchShapes function.
The preset similarity is obtained by those skilled in the art from empirical analysis. When the actual equipment matches the first initial equipment in the primary matching, the actual equipment is considered to be the first initial equipment in its used state. The secondary matching can be regarded as matching between the initial state of the equipment (i.e., its unused state, represented as the first initial equipment in the initial factory-building environment model) and its current state (i.e., its state after use, represented as the actual equipment in the current actual factory-building image).
When the thickness of the sediment is greater than a preset thickness, it is judged that sediment removal is required, and prompt information is displayed on the AR device and/or a prompt voice message is issued to remind the remote operator to remove the sediment before operating. The preset thickness is obtained by those skilled in the art through empirical analysis, and different preset thicknesses may be set for different types of actual equipment.
For example, in the initial factory-building environment model a three-dimensional model of an unused cement pump guniting machine on a certain construction site is built, as shown in Fig. 16. Its working principle is as follows: mixed slurry is added into a hopper; a vibrator mounted on the side wall of the hopper and blades at the bottom of the hopper push the slurry towards the outlet; and the slurry at the outlet is sprayed onto the position to be covered by a high-pressure pump or by compressed air. The slurry is usually a mixture of cement, sand, water and other additives. If the slurry in the hopper is not cleaned out and the hopper dredged in time after the previous job, part of the slurry is likely to have set inside the hopper by the next use, reducing the usable volume of the hopper and possibly even jamming the blades so that they cannot rotate normally.
When a cement pump guniting machine is present in the current actual factory-building image, edge matching is performed between the cement pump guniting machine in that image (i.e., the actual equipment) and the unused cement pump guniting machine in the initial factory-building environment model (i.e., the first initial equipment): the machine in the current image is enlarged or reduced accordingly so that its edges align with those of the machine in the initial factory-building environment model. The parts of the machine in the current image that protrude relative to the machine in the initial model are the sediment, and the size information of the cement pump guniting machine in the initial factory-building environment model is obtained according to the scaling applied to the current factory-building image.
When the cement pump guniting machine in the current factory-building image is as shown in Fig. 17, after its edges are aligned with those of the machine in the initial factory-building environment model, as shown in Fig. 18, sediment is found on the inner wall of the hopper in the current image. The maximum inner-wall diameter of the hopper is measured as L2, while the maximum inner-wall diameter of the hopper in the initial factory-building environment model is L1, so the thickness of the sediment is T = (L1 - L2)/2. When the calculated thickness is greater than the preset thickness set by those skilled in the art, a prompt is displayed on the AR device, as shown in Fig. 19.
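As a small worked example of the hopper measurement, assuming the edge alignment yields a pixel-to-millimetre scale for the current image, the ring-shaped sediment thickness follows directly from the two diameters; the 600 mm and 0.5 mm-per-pixel figures below are invented purely for illustration.

```python
def sediment_thickness(l1_model_mm: float, l2_pixels: float, mm_per_pixel: float) -> float:
    """L1: clear inner-wall diameter of the hopper in the initial factory-building environment model.
    L2: diameter measured in the current image, converted to millimetres via the alignment scale.
    Treating the sediment as a uniform ring gives T = (L1 - L2) / 2."""
    l2_mm = l2_pixels * mm_per_pixel
    return (l1_model_mm - l2_mm) / 2.0

# e.g. a 600 mm hopper diameter that now measures 540 mm implies a 30 mm sediment layer
print(sediment_thickness(600.0, 1080.0, 0.5))  # -> 30.0
```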
In this scheme the factory server obtains the image from the AR device side, matches it against the initial factory-building environment model stored on the factory server, and detects sediment on the equipment obtained by matching. This gives the remote operator an effective technical reference so that the experts can subsequently give the remote operator normal technical guidance. Because the initial factory-building environment model and the detection and recognition of sediment are both handled by the factory server, the manufacturer's information security is ensured and leakage of the model during transmission is avoided.
In a preferred embodiment, after it is determined that sediment removal is required, the method further comprises: finding the corresponding sediment-removal equipment and/or sediment-removal reagent from the initial factory-building environment model; identifying the type of the sediment; generating one or more sediment-removal schemes for the remote operator according to the type of the sediment, the thickness of the sediment and the sediment-removal equipment or sediment-removal reagent; and displaying the one or more sediment-removal schemes on the AR device.
When the remote operator selects one of the sediment-removal schemes, navigation information to the corresponding sediment-removal equipment and/or sediment-removal reagent is displayed in the AR device. The type of the sediment is first estimated from the type of the actual equipment; if this cannot narrow the sediment down to a single type, it is estimated from the historically collected operation records of the operators.
In actual use, one device may be able to process several materials, different materials adhere with different firmness, and the corresponding deposition removal schemes may also differ. The type of the deposit is therefore first narrowed down through the type of the actual equipment; for a cement pump guniting machine, for example, the processed material is cement. When the type of the deposit cannot be limited to a single type through the type of the actual equipment, the judgment is made from the historically collected operation records of operators. In this embodiment, all historical factory building images captured while operators wearing AR devices worked in the factory building are transmitted to the factory server, and the factory server analyzes these historical images to determine which materials the operators fetched and which materials were added to the actual equipment; the material most recently added to the actual equipment is then taken as its deposit. The types of deposit that can be treated and the corresponding deposition removal schemes are preset and stored by a person skilled in the art; limiting the type of deposit to a single type does not mean limiting it to a single compound, but to one of the deposit types preset and stored by a person skilled in the art.

If the actual equipment is a cement pump guniting machine, the deposit is determined to be cement, and the deposition removal schemes preset by a person skilled in the art for cement of different thicknesses are obtained. For example, when the deposited cement is no more than 0.5 cm thick (namely the preset thickness), the deposit is considered to have little influence on normal operation and no treatment is needed. When the deposited cement is more than 0.5 cm and less than 1.5 cm thick, the deposit is considered thin; the hopper can be knocked manually with a tool (such as a hammer), or cement slurry can be introduced, the vibrator started and the blades run repeatedly forward and in reverse, so that the attached cement comes off the inner wall of the hopper. When the deposited cement is from 1.5 cm to 5 cm thick, manual knocking with a hammer, introducing cement slurry, starting the vibrator and running the blades forward and in reverse may no longer suffice, and a pneumatic pick may be needed to cut the cement off the inner wall of the hopper. When the deposited cement is more than 5 cm thick, it is considered that simple physical tools or the cement pump guniting machine itself cannot deal with the attached cement, which is instead cut off with a pneumatic pick or dissolved with a cement dissolving agent. After the thickness of the deposit is calculated, the server selects the corresponding deposition removal scheme and transmits it to the AR device for display; when the thickness T of the cement deposited in the hopper of the cement pump guniting machine is, for example, 3 cm, three alternative deposition removal schemes are displayed, as shown in fig. 19.
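For illustration only, the mapping from cement deposit thickness to candidate deposition removal schemes described above can be sketched as follows; the thresholds follow the example values in this embodiment, while the function name and the abbreviated scheme texts are illustrative assumptions.

def cement_removal_schemes(thickness_cm):
    """Return candidate deposition removal schemes for a cement deposit of the
    given thickness (cm), using the example thresholds 0.5 cm / 1.5 cm / 5 cm."""
    if thickness_cm <= 0.5:
        # Deposit too thin to affect normal operation; no treatment needed.
        return []
    if thickness_cm < 1.5:
        return [
            "knock the hopper manually with a tool such as a hammer",
            "introduce cement slurry, start the vibrator and run the blades "
            "repeatedly forward and in reverse",
        ]
    if thickness_cm <= 5.0:
        return [
            "hammer / vibrator / cement slurry, as for a thin deposit",
            "cut the attached cement off the inner wall with a pneumatic pick",
        ]
    return [
        "cut the attached cement off the inner wall with a pneumatic pick",
        "dissolve the attached cement with a cement dissolving agent",
    ]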
The remote operator selects one of the deposition removal schemes by gesture recognition or by a hardware button provided on the AR device. Displaying, on the AR device, navigation information to the corresponding deposition removing equipment or deposition removing reagent specifically includes:
the factory server determines the position of the remote operator according to the primary matching result, and finds from the initial factory building environment model the position of the deposition removing equipment and/or deposition removing reagent corresponding to the selected deposition removal scheme; determining the position of the remote operator according to the primary matching result means taking the position of the first initial equipment in the initial factory building environment model as the position of the remote operator.
A shortest path to the deposition removing equipment and/or deposition removing reagent is generated according to its position and the position of the remote operator; the shortest path is sent to the terminal, so that the terminal displays the corresponding travelling direction on the AR device according to the shortest path and the real-time position of the remote operator.
For example, if a pneumatic pick was placed at its designated position when the factory building was first built, the position of the pneumatic pick is also built into the initial factory building environment model. The deposition removing equipment and/or deposition removing reagent required by each deposition removal scheme can be preset and stored by a person skilled in the art; alternatively, after the corresponding deposition removal scheme is obtained, semantic information can be extracted from it, the required deposition removing equipment and/or deposition removing reagent obtained by analyzing that semantic information, and the initial factory building environment model searched to see whether the equipment and/or reagent exists. If it exists, the shortest path is generated according to the position of the deposition removing equipment and/or deposition removing reagent, the real-time position of the remote operator, and the environment and equipment layout in the initial factory building environment model. For example, when the remote operator selects scheme 3 in fig. 19, the complete navigation route is displayed in the upper left corner of the AR device together with the start position, the end position and the current position of the remote operator, while an arrow indicating the travelling direction is displayed above the AR device, as shown in fig. 20. In an alternative embodiment, when the factory server identifies from the remote operator's front actual factory building image that the operator has reached the position of the deposition removing equipment and/or deposition removing reagent, the equipment and/or reagent is identified in the front actual factory building image and marked, so that the remote operator can recognize it easily.
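For illustration only, one simple way to generate such a shortest path is a breadth-first search over a rasterized plant layout; the grid representation and the function below are illustrative assumptions, since the embodiment does not fix a particular path-planning algorithm.

from collections import deque

def shortest_path(grid, start, goal):
    """Breadth-first search over a 4-connected occupancy grid.

    grid:  2-D list where 0 = walkable floor and 1 = blocked (walls, equipment),
           rasterized from the initial factory building environment model.
    start: (row, col) of the remote operator, taken from the matching result.
    goal:  (row, col) of the deposition removing equipment and/or reagent.
    Returns the list of cells from start to goal, or None if unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    previous = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk back to the start
                path.append(cell)
                cell = previous[cell]
            return path[::-1]
        row, col = cell
        for nxt in ((row + 1, col), (row - 1, col), (row, col + 1), (row, col - 1)):
            r, c = nxt
            if 0 <= r < rows and 0 <= c < cols and grid[r][c] == 0 and nxt not in previous:
                previous[nxt] = cell
                queue.append(nxt)
    return None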
In actual use, considering that experts need to guide the remote operator, this embodiment further provides a preferred implementation in which a host, a first camera, a microphone and a display screen are arranged in an expert conference room, and the host is connected to the factory server after security verification. The method further includes:
the host performs face recognition on the real-time conference room image to obtain face information of each participant, and identifies from the face information whether each participant is speaking.
When it is judged that a participant is speaking, semantic recognition is performed on the currently collected real-time voice data to obtain the semantic information of that participant.
The semantic information of the participant is transmitted to the factory server, which matches the semantic information against each initial equipment in the initial factory building environment model to obtain the first initial equipment mentioned by the participant, and judges from the primary matching result whether a first actual equipment corresponding to the first initial equipment exists in the front actual factory building image.
If a first actual equipment corresponding to the first initial equipment exists in the front actual factory building image, the factory server sends guide information to the terminal, so that the terminal marks and displays the first actual equipment on the AR device according to the guide information, as shown in fig. 21; the voice information of the participant is also displayed on the AR device, and the words matched with the first actual equipment are highlighted. The guide information comprises information on the participant who is speaking and information related to the first actual equipment.
In actual use, the guide information is also transmitted to the remote conference room side so as to be displayed on its display screen, so that the image on the AR device side and the display screen on the remote conference room side remain synchronized. In actual use, the guide information comprises the participant id, the serial number of the first actual equipment, the length of the contour information of the first actual equipment and the contour information itself, where the length is the number of bits occupied by the contour information; when the guide information is received by the receiving side (namely the terminal or the host), the corresponding contour information is parsed out according to this length. The contour information is the contour corresponding to the first actual equipment in the front actual factory building image, which makes it convenient to mark the contour of the first actual equipment.
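For illustration only, the guide information layout described above can be sketched as a simple serialization; the 4-byte little-endian field widths, the function names and the use of a byte count (the embodiment speaks of a number of bits) are illustrative assumptions.

import struct

def pack_guide_info(participant_id, device_serial, contour_bytes):
    """Serialize guide information: participant id, serial number of the first
    actual equipment, contour length, then the contour data itself."""
    header = struct.pack("<III", participant_id, device_serial, len(contour_bytes))
    return header + contour_bytes

def unpack_guide_info(payload):
    """Parse the fixed header first, then read exactly `length` bytes of contour
    data, as the receiving side (terminal or host) is described to do."""
    participant_id, device_serial, length = struct.unpack_from("<III", payload, 0)
    contour = payload[12:12 + length]
    return participant_id, device_serial, contour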
The method of example 1 is applicable to this example.
In a preferred embodiment, if no first actual equipment corresponding to the first initial equipment exists in the front actual factory building image, the method further includes: the factory server determines the position of the remote operator according to the primary matching result and generates second guide information from that position and the position of the first initial equipment in the initial factory building environment model, namely the shortest path to the first initial equipment; the shortest path is sent to the terminal, so that the terminal displays the corresponding travelling direction on the AR device according to the shortest path and the real-time position of the remote operator.
In an actual application scenario, the factory server matches the semantic information against each initial equipment in the initial factory building environment model as follows: the initial equipment vectors corresponding to the initial equipment are selected from a semantic network; the distance between each semantic vector in the semantic information and each initial equipment vector in the semantic network is calculated, where the semantic information comprises a plurality of semantic vectors and each semantic vector represents a word; among the initial equipment vectors whose distance from the corresponding semantic vector is smaller than a preset distance, the one with the smallest distance is selected, and the initial equipment corresponding to that vector is the equipment matched with the semantic information. The preset distance is obtained by empirical analysis by a person skilled in the art.
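For illustration only, this vector matching can be sketched as follows; the Euclidean metric, the function name and the data layout are illustrative assumptions, since the embodiment does not fix the distance measure.

import numpy as np

def match_initial_equipment(semantic_vectors, equipment_vectors, preset_distance):
    """Match a participant's semantic vectors against initial-equipment vectors.

    semantic_vectors:  dict mapping each word to its embedding (one semantic
                       vector per word in the semantic information).
    equipment_vectors: dict mapping each initial equipment name to its embedding
                       taken from the same semantic network.
    Returns the name of the closest initial equipment whose distance to some
    semantic vector is below preset_distance, or None if no equipment qualifies.
    """
    best_name, best_distance = None, float("inf")
    for word_vector in semantic_vectors.values():
        for name, equipment_vector in equipment_vectors.items():
            distance = float(np.linalg.norm(
                np.asarray(word_vector) - np.asarray(equipment_vector)))
            if distance < preset_distance and distance < best_distance:
                best_name, best_distance = name, distance
    return best_name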
In an alternative embodiment, when several devices appear in the front actual factory building image and their contour similarity with several initial devices in the initial factory building environment model is greater than the preset similarity, it may be difficult to distinguish to which initial device each device in the front actual factory building image corresponds. To solve this problem, the invention provides a preferred embodiment that specifically includes:
when the primary matching finds that the contour similarity between n actual devices and m initial devices is greater than the preset similarity, screening the m initial devices with the condition that their mutual distances in the initial factory building environment model are smaller than a preset threshold, to obtain n initial devices; the preset threshold is obtained by empirical analysis by a person skilled in the art;
and performing one-to-one matching according to the position distribution of the n actual devices in the front actual factory building image and the position distribution of the n initial devices in the initial factory building environment model.
In this embodiment, considering that in actual use the viewing angle of the remote operator is limited, the AR device usually only captures the surroundings of the remote operator, so several devices appearing in the same front actual factory building image are necessarily located in the same area, that is, the distances between them are smaller than a preset threshold. Screening by the positional relationship between the devices therefore eliminates the initial devices in the initial factory building environment model that cannot be the match, after which each actual device is matched to an initial device according to the relative positional relationship between the devices.
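For illustration only, the one-to-one matching by position distribution can be sketched as follows; centering both point sets and minimizing the total displacement over all permutations is one possible realization and is not prescribed by the embodiment.

from itertools import permutations
import math

def match_by_layout(actual_positions, initial_positions):
    """One-to-one matching of n actual devices to n candidate initial devices
    using only their relative position distributions: both point sets are
    centered on their centroids and the assignment with the smallest total
    displacement is chosen (fine for the small n seen in one camera view)."""
    def centered(points):
        cx = sum(x for x, _ in points) / len(points)
        cy = sum(y for _, y in points) / len(points)
        return [(x - cx, y - cy) for x, y in points]

    actual = centered(actual_positions)
    initial = centered(initial_positions)
    best_assignment, best_cost = None, math.inf
    for assignment in permutations(range(len(initial))):
        cost = sum(math.dist(actual[i], initial[j])
                   for i, j in enumerate(assignment))
        if cost < best_cost:
            best_assignment, best_cost = assignment, cost
    # Pairs of (actual device index, matched initial device index).
    return list(enumerate(best_assignment))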
When the primary matching finds that the contour similarity between a single actual device and a plurality of initial devices is greater than the preset similarity, the initial device with the largest contour similarity is taken as the first initial equipment matched with that actual device.
Example 3:
fig. 22 is a schematic diagram of an architecture of an AR-based multi-person complex interactive conference device according to an embodiment of the present invention. The AR-based multi-person complex interactive conference device of the present embodiment includes one or more processors 21 and a memory 22. In fig. 22, a processor 21 is taken as an example.
The processor 21 and the memory 22 may be connected by a bus or in another manner; fig. 22 takes connection by a bus as an example.
The memory 22, as a non-volatile computer readable storage medium, may be used to store non-volatile software programs and non-volatile computer executable programs, such as the program corresponding to the AR-based multi-person complex interactive conference method in embodiment 1. The processor 21 performs the AR-based multi-person complex interactive conference method by running the non-volatile software programs and instructions stored in the memory 22.
The memory 22 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, memory 22 may optionally include memory located remotely from processor 21, which may be connected to processor 21 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The program instructions/modules are stored in the memory 22 and, when executed by the one or more processors 21, perform the AR-based multi-person complex interactive conference method in embodiment 1 described above.
It should be noted that, because the information interaction and execution process between the modules and units in the above-mentioned device and system are based on the same concept as the method embodiments of the present invention, for specific content reference may be made to the description in the method embodiments, which will not be repeated here.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the embodiments may be implemented by a program instructing associated hardware, and the program may be stored on a computer readable storage medium, which may include: read-only memory (ROM), random access memory (RAM), magnetic disk, optical disk, and the like.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims (10)

1. An AR-based multi-person complex interactive conference method, characterized in that a first camera, a microphone and a display screen are arranged in a remote conference room, the first camera is used for collecting real-time conference room images in the remote conference room, the microphone is used for collecting real-time voice data of each participant, a remote operator wears an AR device to operate, a second camera is arranged on the AR device, and the second camera is used for collecting real-time images in front of the remote operator; the method comprises the following steps:
Receiving real-time conference room images and real-time voice data from the remote conference room side, receiving real-time images from the AR device side, transmitting the real-time images to the remote conference room side so that they are displayed on the display screen, and transmitting the real-time voice data to the AR device side so that it is played on the AR device side; matching a corresponding virtual head portrait for each participant, and transmitting each participant and the corresponding virtual head portrait to the remote conference room side and the AR device side so that a participant list is displayed synchronously on the display screen and the AR device;
performing face recognition on the real-time conference room image to obtain face information of each participant, and recognizing whether each participant is speaking or not according to the face information;
when it is identified that a single participant is speaking alone, extracting features from the currently collected real-time voice data to obtain the voiceprint features of the speaking participant, and establishing an association relationship between the speaking participant and the voiceprint features;
when a plurality of participants are identified to speak at the same time, identifying a plurality of first participants who speak according to the face information, finding first voiceprint features corresponding to the first participants from the association relationship, performing feature matching on the currently acquired real-time voice data by using the first voiceprint features so as to separate personal audio of the first participants from the real-time voice data, and performing semantic identification on the personal audio of the first participants to obtain semantic information of the first participants;
Identifying the real-time image to obtain each operation object in the real-time image, matching the semantic information of each first participant with each operation object in the real-time image, generating guide information according to the matching result, and transmitting the guide information to the AR device side; wherein the guide information comprises information of each participant who is speaking and the operation object information matched with each participant;
the AR device marks the speaking participants in the participant list with different colors according to the guide information, and marks each operation object matched with a participant with the same color as the corresponding participant, so that the remote operator can identify, from the different marks, which participants are speaking and which operation objects each participant mentions.
2. The AR-based multi-person complex interactive conference method according to claim 1, wherein the feature matching is performed on the currently collected real-time voice data by using each first voiceprint feature to separate personal audio of each first participant from the real-time voice data, and the method specifically comprises:
Performing short-time Fourier transform on the real-time voice data acquired currently to obtain a mixed spectrum characteristic;
splicing the mixed spectrum features with the corresponding first voiceprint features to obtain reference spectrum features, and inputting the reference spectrum features into an expansion convolution layer to obtain basic features;
inputting the basic features into a voice separation model, outputting to obtain a spectrum mask, multiplying the spectrum mask by the mixed spectrum features to obtain personal spectrums of corresponding first participants, and recovering the personal spectrums by using the phase spectrums of the real-time voice data acquired currently to obtain personal audios of the corresponding first participants.
3. The AR-based multi-person complex interactive conference method according to claim 1, wherein the identifying the first participants who are speaking according to the face information specifically comprises:
after face information is identified, selecting each face region, detecting a plurality of key feature points of the face region by using a feature point detection model, calculating to obtain a feature value of the face region by using the position relation among the plurality of key feature points, and identifying whether the face region is in a mouth opening state according to the feature value;
If the frame number ratio of the corresponding face area in the mouth opening state is higher than a preset ratio in the continuous multi-frame real-time conference room images, and the variance of the characteristic value of the face area in the multi-frame real-time conference room images in the mouth opening state is larger than a first preset value, identifying that the participant corresponding to the face area is speaking.
4. The AR-based multi-person complex interactive conference method according to claim 3, wherein the plurality of key feature points includes a lip left edge feature point P1, a lip right edge feature point P2, an upper lip left lip peak highest feature point P3, an upper lip right lip peak highest feature point P4, a lower lip left feature point P5 opposite to the upper lip left lip peak highest feature point, and a lower lip right feature point P6 opposite to the upper lip right lip peak highest feature point; the calculating to obtain the feature value of the face region by using the position relation among the plurality of key feature points specifically comprises:
using the difference between P2 and P1 as a first difference, the difference between P3 and P5 as a second difference, and the difference between P4 and P6 as a third difference;
dividing the result obtained by adding the second difference value and the third difference value by the first difference value, and multiplying the result by a preset coefficient to obtain a characteristic value;
And identifying whether the face area is in a mouth opening state according to the characteristic value, wherein the method specifically comprises the following steps of: and when the characteristic value is larger than a first preset value, recognizing that the face area is in a mouth opening state.
5. The AR-based multi-person complex interactive conference method according to claim 1, wherein said matching the semantic information of each first participant with each operation object in said real-time image specifically comprises:
after identifying and obtaining an operation object, selecting an operation object vector corresponding to the operation object from a semantic network;
calculating the distance between each semantic vector in the semantic information and each operation object vector in the semantic network;
selecting, from the operation object vectors whose distance from the corresponding semantic vector is smaller than a preset distance, the operation object vector with the smallest distance, wherein the operation object corresponding to that vector is the operation object matched with the semantic information of the first participant.
6. The AR-based multi-person complex interactive conferencing method as in claim 5, wherein the method further comprises:
when the semantic information of one first participant matches a plurality of operation objects, sorting the operation objects according to the order of the semantic vectors corresponding to the operation objects in the semantic information, and transmitting the resulting sequence information to the AR device;
the AR device marks the plurality of operation objects mentioned by the same participant with the same color as that participant, and marks serial numbers around the operation objects according to the sequence information; wherein the serial numbers represent the order in which the respective operation objects are mentioned.
7. The AR-based multi-person complex interactive conference method according to claim 1, wherein said identifying the real-time image to obtain each operation object in the real-time image comprises performing contour analysis on the real-time image and matching each contour obtained by the analysis against a preset object library; when a first contour is not matched with any object in the preset object library, a dead zone mark is placed at the position of the first contour in the real-time image so as to remind the remote operator to manually identify the first contour.
8. The AR-based multi-person complex interactive conference method of claim 7, wherein the remote operator manually identifying the first contour comprises:
when it is recognized that the gesture of the remote operator points to the position of the first contour and voice data of the remote operator is collected, performing semantic recognition on the voice data to obtain the operation object corresponding to the first contour.
9. The AR-based multi-person complex interactive conferencing method as in claim 1, wherein the method further comprises:
displaying the semantic information of each participant beside the virtual head portrait corresponding to that participant, and playing the personal audio of a participant when it is recognized that the gesture of the remote operator points to the semantic information of that participant.
10. An AR-based multi-person complex interactive conferencing device, comprising:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor for performing the AR-based multi-person complex interactive conferencing method of any of claims 1-9.
CN202410133163.0A 2024-01-31 2024-01-31 AR-based multi-person complex interactive conference method and device Active CN117676072B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410133163.0A CN117676072B (en) 2024-01-31 2024-01-31 AR-based multi-person complex interactive conference method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410133163.0A CN117676072B (en) 2024-01-31 2024-01-31 AR-based multi-person complex interactive conference method and device

Publications (2)

Publication Number Publication Date
CN117676072A CN117676072A (en) 2024-03-08
CN117676072B true CN117676072B (en) 2024-04-09

Family

ID=90064543

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410133163.0A Active CN117676072B (en) 2024-01-31 2024-01-31 AR-based multi-person complex interactive conference method and device

Country Status (1)

Country Link
CN (1) CN117676072B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019061594A (en) * 2017-09-28 2019-04-18 株式会社野村総合研究所 Conference support system and conference support program
CN110266992A (en) * 2019-06-24 2019-09-20 苏芯物联技术(南京)有限公司 A kind of long-distance video interactive system and method based on augmented reality
CN113783305A (en) * 2021-09-27 2021-12-10 国能陕西水电有限公司 AR-based power station integrated management method, system and server
CN115131405A (en) * 2022-07-07 2022-09-30 沈阳航空航天大学 Speaker tracking method and system based on multi-mode information

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10142596B2 (en) * 2015-02-27 2018-11-27 The United States Of America, As Represented By The Secretary Of The Navy Method and apparatus of secured interactive remote maintenance assist
US20180338119A1 (en) * 2017-05-18 2018-11-22 Visual Mobility Inc. System and method for remote secure live video streaming
US20220354440A1 (en) * 2021-05-04 2022-11-10 Willis Dennis Grajales Worldwide vision screening and visual field screening booth, kiosk, or exam room using artificial intelligence, screen sharing technology, and telemedicine video conferencing system to interconnect patient with eye doctor anywhere around the world via the internet using ethernet, 4G, 5G, 6G or Wifi for teleconsultation and to review results

Also Published As

Publication number Publication date
CN117676072A (en) 2024-03-08

Similar Documents

Publication Publication Date Title
CN112184705B (en) Human body acupuncture point identification, positioning and application system based on computer vision technology
CN102110399B (en) A kind of assist the method for explanation, device and system thereof
EP2529355B1 (en) Voice-body identity correlation
CN113228124B (en) Image processing method and device, electronic equipment and storage medium
Kessous et al. Multimodal emotion recognition in speech-based interaction using facial expression, body gesture and acoustic analysis
JP4212274B2 (en) Speaker identification device and video conference system including the speaker identification device
KR101636716B1 (en) Apparatus of video conference for distinguish speaker from participants and method of the same
KR20120072009A (en) Interaction recognition apparatus for multiple user and method thereof
CN108596148B (en) System and method for analyzing labor state of construction worker based on computer vision
CN108874126A (en) Exchange method and system based on virtual reality device
CN110545396A (en) Voice recognition method and device based on positioning and denoising
EP2538372A1 (en) Dynamic gesture recognition process and authoring system
CN113885700B (en) Remote assistance method and device
AU2020309094B2 (en) Image processing method and apparatus, electronic device, and storage medium
CN108318042A (en) Navigation mode-switching method, device, terminal and storage medium
CN117676072B (en) AR-based multi-person complex interactive conference method and device
CN117292601A (en) Virtual reality sign language education system
Cirik et al. Following formulaic map instructions in a street simulation environment
Jayagopi et al. The vernissage corpus: A multimodal human-robot-interaction dataset
CN117978950B (en) AR-based field video conference method and device
CN113570732A (en) Shield maintenance auxiliary method and system based on AR technology
KR20110125524A (en) System for object learning through multi-modal interaction and method thereof
CN112951236A (en) Voice translation equipment and method
CN111881807A (en) VR conference control system and method based on face modeling and expression tracking
CN113965550B (en) Intelligent interactive remote auxiliary video system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant