CN110807370A - Multimode-based conference speaker identity noninductive confirmation method - Google Patents

Multimode-based conference speaker identity noninductive confirmation method Download PDF

Info

Publication number
CN110807370A
Authority
CN
China
Prior art keywords
word
speaker
algorithm
data
conference
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910968323.2A
Other languages
Chinese (zh)
Other versions
CN110807370B (en)
Inventor
杨理想
王云甘
周亚
孙振平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Xingyao Intelligent Technology Co ltd
Original Assignee
Nanjing Shixing Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Shixing Intelligent Technology Co Ltd filed Critical Nanjing Shixing Intelligent Technology Co Ltd
Priority to CN201910968323.2A priority Critical patent/CN110807370B/en
Publication of CN110807370A publication Critical patent/CN110807370A/en
Application granted granted Critical
Publication of CN110807370B publication Critical patent/CN110807370B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/55 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/08 Use of distortion metrics or a particular distance between probe pattern and reference templates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Game Theory and Decision Science (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Business, Economics & Management (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a multimode-based method for noninductive (non-intrusive) confirmation of conference speaker identity, which confirms the speaker's identity by recognizing the speaker's expression, voice, and speaking style from images, audio, and text during a multimodal conference. The whole process is automated by artificial-intelligence models, requires no manual intervention, greatly improves meeting and office efficiency, and achieves high accuracy.

Description

Multimode-based conference speaker identity noninductive confirmation method
Technical Field
The invention belongs to the field of natural language processing and in particular relates to a multimode-based method for noninductive confirmation of conference speaker identity.
Background
As the economy develops, efficient office work increasingly depends on conference systems, and at the present stage many conference systems need to record the speech content of each speaker so that it can be conveniently summarized and reported. An intelligent and fast method for distinguishing speakers is therefore needed.
Current conference systems mostly record speech content through microphones. To distinguish different speakers, one microphone must be allocated to each speaker, but allocating many microphones can cause crosstalk: because the speakers sit close together, several microphones pick up one person's speech at the same time, so the other microphones have to be muted while one person is speaking in order to tell the speakers apart. A non-intrusive speaker-identity confirmation method that works multimodally on images, voice, text, and so on is therefore needed.
Disclosure of Invention
To avoid the cumbersome procedure in conventional conferences, where microphones at different positions must be repeatedly muted, unmuted, and repositioned in order to distinguish different speakers, the invention provides a multimode-based method for noninductive confirmation of conference speaker identity. The method automatically identifies and distinguishes conference speakers from three aspects, namely their expressions, voices, and speaking styles, and comprises an expression recognition method based on a deep learning model, a voice recognition method based on an artificial intelligence algorithm, and a speech-content recognition method based on a text clustering algorithm.
As an improvement, in the expression recognition method based on the deep learning model, photos of the speakers' faces at the conference site are first collected and preprocessed with operations such as random perturbation, deformation, and rotation; several groups of training sets are then generated with a GAN, the sample data are trained with a Faster R-CNN model, and the deep learning model is finally generated.
As an improvement, the voice recognition method comprises the following specific steps:
(1) data acquisition and processing
Conference-site voice data are acquired in real time and segmented at intervals of 4-8 seconds; each segment is treated as a processing unit and denoised;
(2) building models and training
Suppose the training data contain several utterances from several persons, and define the j-th utterance of the i-th person as X_ij. The model is constructed as X_ij = μ + F h_i + G w_ij + ε_ij, where μ is the mean of the data, F h_i and G w_ij are the speaker-space and within-speaker-space terms (with F and G the corresponding feature matrices), and ε_ij is the noise term; after construction, the model is trained by iterating the EM algorithm;
(3) model testing
Whether two utterances belong to the same speaker is decided by the likelihood that they were generated by the same speaker-space feature h_i; the score is computed as a log-likelihood ratio:

score = log [ p(η1, η2 | H_s) / ( p(η1 | H_d) · p(η2 | H_d) ) ]

where η1 and η2 denote the two test utterances, H_s is the hypothesis that the two utterances come from the same space, and H_d is the hypothesis that they come from different spaces; p(η1, η2 | H_s) is the probability that η1 and η2 come from the same space, and p(η1 | H_d) and p(η2 | H_d) are the probabilities that they belong to respective different spaces.
As an improvement, the speech-content recognition method based on the text clustering algorithm comprises sentence vector representation and text clustering: sentence vectors are first computed for all sentences, and all sentence vectors are then clustered with the DBSCAN algorithm.
As an improvement, word vectors are trained on the text with the Skip-gram model of the word2vec tool, forming a word vector matrix X ∈ R^(m×n), where x_i ∈ R^m is the word vector of feature word i in the m-dimensional space. The Euclidean distance between two word vectors is d(w_i, w_j) = |x_i − x_j|_2, where d(w_i, w_j) is the semantic distance between feature word i and feature word j, and x_i and x_j are the word vectors corresponding to feature words w_i and w_j.
As an improvement, the Skip-gram model comprises an input layer, a projection layer, and an output layer. The input layer is the current feature word, whose word vector is denoted w_t ∈ R^m; the output layer gives the probabilities of the words occurring in the context window of the feature word; the projection layer is used to maximize the value of the objective function L.
As an improvement, assume a word sequence w_1, w_2, …, w_N. The objective function is

L = (1/N) Σ_{j=1}^{N} Σ_{−c ≤ k ≤ c, k ≠ 0} log p(w_{j+k} | w_j)

where N is the length of the word sequence, c is the context length of the current feature word (5-10 words), and p(w_{j+k} | w_j) is the probability that the context feature word w_{j+k} occurs given the known current word w_j.
As an improvement, when all sentence vectors are clustered with the DBSCAN algorithm and the number of speakers is known, the algorithm's neighborhood radius and minimum-point-count parameters are adjusted until the number of clusters equals the number of speakers; the resulting text clusters then separate the speech content of the different speakers.
Beneficial effects: the invention provides a multimode-based method for noninductive confirmation of conference speaker identity, which confirms a speaker's identity by recognizing the speaker's expression, voice, and speaking style from images, audio, and text during a multimodal conference.
Drawings
FIG. 1 is a flow chart of the principle structure of the present invention.
FIG. 2 is a schematic diagram of DBSCAN algorithm of the present invention.
Detailed Description
The invention is further described below with reference to the figures and embodiments.
The invention works on multimodal conference data (images, voice, and text) and confirms a speaker's identity by recognizing the speaker's expression, voice, and speaking style, so that the whole process runs automatically without manual intervention. Specifically:
(1) speaker expression recognition
A deep learning model recognizes the participants' facial expressions from the real-time conference video, judges which participants are speaking, and thereby confirms the speaker;
(2) speaker voice recognition
Each person's voice differs greatly in frequency and timbre; an artificial-intelligence algorithm distinguishes speakers from the real-time conference audio and thus determines the speaker's identity;
(3) speaker speech style recognition
Each person has their own speaking style. When the two recognition modes above perform poorly, a clustering algorithm can be applied to the text of the speech content obtained from speech recognition: given the known number of speakers, the paragraphs are grouped into the corresponding number of categories, so that the utterances and identities are distinguished. A sketch of this fallback order across the three modes is given after this list.
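As an illustration of the fallback order described above, the following Python sketch shows one way the three recognizers could be chained. It is only a sketch under stated assumptions: the recognizer objects, their method names (identify_by_expression, identify_by_voice, cluster_by_text), and the confidence threshold are hypothetical placeholders and are not part of the original disclosure.

```python
from typing import List, Optional

CONFIDENCE_THRESHOLD = 0.8  # hypothetical cut-off, not specified in the patent


def confirm_speaker(frame, audio_segment, transcript_sentences: List[str],
                    expression_model, voice_model, text_clusterer) -> Optional[int]:
    """Return a speaker id using expression first, then voice, then text clustering.

    The three model objects and their methods are assumed interfaces; only the
    fallback order (expression -> voice -> speaking style) follows the description.
    """
    # 1) Expression recognition on the current video frame.
    speaker_id, conf = expression_model.identify_by_expression(frame)
    if speaker_id is not None and conf >= CONFIDENCE_THRESHOLD:
        return speaker_id

    # 2) Voice recognition on the current audio segment.
    speaker_id, conf = voice_model.identify_by_voice(audio_segment)
    if speaker_id is not None and conf >= CONFIDENCE_THRESHOLD:
        return speaker_id

    # 3) Speaking-style recognition: cluster the transcribed sentences into as many
    #    groups as there are known speakers and return the cluster of the last sentence.
    labels = text_clusterer.cluster_by_text(transcript_sentences)
    return labels[-1] if labels else None
```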
For speaker expression recognition, the collected photos of speakers' faces at the conference site are preprocessed with operations such as random perturbation, deformation, and rotation; several groups of training sets are then generated with a GAN, the sample data are trained with a Faster R-CNN model, and the deep learning model is finally generated.
Example 1
About 1,000 pictures of speakers' faces at the conference site are collected and manually labeled into two categories, speaking and not speaking. A GAN, together with basic operations such as random perturbation, deformation, and rotation, then generates additional training data, yielding a data set about 10 times the size of the source data set. A Faster R-CNN model is then trained on the sample data, and the final model accuracy reaches 85%.
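For illustration only, the following Python sketch shows how such a two-class (speaking / not speaking) face detector could be set up with torchvision's Faster R-CNN. It is a minimal sketch, not the patent's actual pipeline: the GAN-based augmentation described above is not reproduced here, the dummy image and box are placeholders, and weights="DEFAULT" assumes torchvision 0.13 or later.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# 3 classes: background + "speaking" + "not speaking"
NUM_CLASSES = 3


def build_speaking_face_detector() -> torch.nn.Module:
    # Start from a COCO-pretrained Faster R-CNN and swap in a new box predictor
    # for the two conference-specific classes (plus background).
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)
    return model


def train_one_step(model, images, targets, optimizer):
    # `images` is a list of CHW float tensors; `targets` is a list of dicts with
    # "boxes" (N x 4) and "labels" (N), as required by torchvision detection models.
    model.train()
    loss_dict = model(images, targets)
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)


if __name__ == "__main__":
    model = build_speaking_face_detector()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
    # One dummy image/target pair just to show the expected input format.
    image = torch.rand(3, 480, 640)
    target = {"boxes": torch.tensor([[100.0, 80.0, 220.0, 240.0]]),
              "labels": torch.tensor([1])}  # 1 = "speaking" (assumed label id)
    print(train_one_step(model, [image], [target], optimizer))
```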
For speaker voice recognition, the specific implementation of the invention is as follows. 1) Data acquisition: voice data are collected at the conference site in real time and segmented every 4-8 seconds, preferably every 5 seconds, with each segment used as a processing unit. 2) Data processing: because conference speech is relatively standard (mostly Mandarin) and the site is quiet with little noise, essentially no additional data processing is needed. 3) Model construction: assume the training speech comes from I speakers, each with J distinct utterances, and define the j-th utterance of the i-th speaker as X_ij. Following factor analysis, the generative model of X_ij is:

X_ij = μ + F h_i + G w_ij + ε_ij

where μ is the mean of the data, F h_i and G w_ij are the speaker-space and within-speaker-space terms (with F and G the corresponding feature matrices), and ε_ij is the noise term. The model can be viewed as two parts: the first two terms on the right-hand side depend only on the speaker, not on any particular utterance of that speaker, and are called the signal part, which describes the differences between speakers; the last two terms on the right-hand side describe the differences between different utterances of the same speaker and are called the noise part.
Two latent variables are used to describe the structure of an utterance. The middle two terms on the right-hand side are each a matrix multiplied by a vector, which is the other core part of factor analysis. The matrices F and G contain the basic factors of the respective latent-variable spaces and can be regarded as the eigenvectors of those spaces: each column of F corresponds to an eigenvector of the between-class (speaker) space, and each column of G corresponds to an eigenvector of the within-class space. The vectors h_i and w_ij are the coordinates of an utterance in those spaces; for example, h_i can be regarded as the representation of X_ij in the speaker space. At the identification-scoring stage, the greater the likelihood that the h_i features of two utterances are the same, the more certain it is that the two utterances belong to the same speaker. 4) Model training: the parameters μ, F, G, and the noise term are solved iteratively with the EM algorithm. 5) Model testing: whether two utterances belong to the same speaker is decided by the likelihood that they were generated by the same speaker-space feature h_i; the score is a log-likelihood ratio, computed as follows:
score = log [ p(η1, η2 | H_s) / ( p(η1 | H_d) · p(η2 | H_d) ) ]

where η1 and η2 denote the two test utterances, H_s is the hypothesis that the two utterances come from the same space, and H_d is the hypothesis that they come from different spaces; p(η1, η2 | H_s) is the probability that η1 and η2 come from the same space, and p(η1 | H_d) and p(η2 | H_d) are the probabilities that they belong to respective different spaces. The log-likelihood ratio measures the similarity of the two utterances: the higher the score, the more likely the two utterances belong to the same speaker.
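A minimal numerical sketch of this log-likelihood-ratio score is given below, assuming a Gaussian PLDA-style model whose parameters (mean mu, speaker matrix F, channel matrix G, noise covariance) have already been trained; the EM training itself is not shown, and the helper name plda_llr_score and the toy dimensions are illustrative only.

```python
import numpy as np
from scipy.stats import multivariate_normal


def plda_llr_score(x1, x2, mu, F, G, noise_cov):
    """Log-likelihood ratio: same-speaker hypothesis vs. different-speaker hypothesis.

    Model assumed: x = mu + F h + G w + eps, with h, w ~ N(0, I) and eps ~ N(0, noise_cov).
    """
    sigma_b = F @ F.T                      # between-speaker covariance (speaker space)
    sigma_w = G @ G.T + noise_cov          # within-speaker covariance (channel + noise)
    sigma_tot = sigma_b + sigma_w          # marginal covariance of a single utterance

    # Joint covariance of [x1; x2] when both utterances share the same speaker factor h.
    cov_same = np.block([[sigma_tot, sigma_b],
                         [sigma_b, sigma_tot]])

    joint = np.concatenate([x1, x2])
    joint_mu = np.concatenate([mu, mu])

    log_p_same = multivariate_normal.logpdf(joint, joint_mu, cov_same)
    log_p_diff = (multivariate_normal.logpdf(x1, mu, sigma_tot)
                  + multivariate_normal.logpdf(x2, mu, sigma_tot))
    return log_p_same - log_p_diff


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, q = 8, 2                             # toy feature and factor dimensions
    mu = np.zeros(d)
    F = rng.normal(size=(d, q))
    G = 0.3 * rng.normal(size=(d, q))
    noise_cov = 0.1 * np.eye(d)

    h = rng.normal(size=q)                  # one shared speaker factor
    x1 = mu + F @ h + G @ rng.normal(size=q) + rng.multivariate_normal(np.zeros(d), noise_cov)
    x2 = mu + F @ h + G @ rng.normal(size=q) + rng.multivariate_normal(np.zeros(d), noise_cov)
    x3 = mu + F @ rng.normal(size=q) + G @ rng.normal(size=q) + rng.multivariate_normal(np.zeros(d), noise_cov)

    print("same speaker :", plda_llr_score(x1, x2, mu, F, G, noise_cov))
    print("diff speakers:", plda_llr_score(x1, x3, mu, F, G, noise_cov))
```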
For recognition of a speaker's speaking style, the speech content is recognized with a text clustering algorithm in two stages, sentence vector representation and text clustering: sentence vectors are first computed for all sentences, and all sentence vectors are then clustered with the DBSCAN algorithm.
1) Sentence vector representation
The invention uses the Skip-gram model of the word2vec tool to train word vectors on the text. The model is built on a Huffman tree for hierarchical softmax and, from large-scale unlabeled text, predicts the occurrence probability of the surrounding context words given the currently input word; in other words, the words appearing around the current word can be predicted from it. Based on the principle of word co-occurrence within a window, co-occurrence probabilities between words are computed with a sliding window, so the word vector generated for each feature word contains a certain amount of text-structure and semantic information.
The Skip-gram model includes an input layer, a projection layer, and an output layer. The input layer is the current feature word with word vector w_t ∈ R^m; the output layer gives the probabilities of the words occurring in the context window of the feature word; the purpose of the projection layer is to maximize the value of the objective function L. Assuming a word sequence w_1, w_2, …, w_N, the objective function is

L = (1/N) Σ_{j=1}^{N} Σ_{−c ≤ k ≤ c, k ≠ 0} log p(w_{j+k} | w_j)

where N is the length of the word sequence, c is the context length of the current feature word (generally 5-10 words), and p(w_{j+k} | w_j) is the probability that the context feature word w_{j+k} occurs given the known current word w_j.
All word vectors obtained from Skip-gram training form a word vector matrix X ∈ R^(m×n), with x_i ∈ R^m denoting the word vector of feature word i in the m-dimensional space. The similarity between feature words can be measured by the distance between the corresponding word vectors. The Euclidean distance between two vectors is

d(w_i, w_j) = |x_i − x_j|_2

where d(w_i, w_j) is the semantic distance between feature word i and feature word j, and x_i and x_j are the word vectors corresponding to feature words w_i and w_j. The smaller the value of d(w_i, w_j), the smaller the semantic distance between the two feature words and the more similar their meanings. Finally, sentence vectors are obtained by summing the word vectors of each sentence.
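A minimal sketch of this sentence-vector step using the gensim implementation of word2vec is shown below; it assumes gensim 4 or later, pre-tokenized sentences, and summation of word vectors as described above. The sample sentences and helper names are invented placeholders.

```python
import numpy as np
from gensim.models import Word2Vec

# Pre-tokenized transcript sentences (placeholder data).
sentences = [
    ["we", "should", "increase", "the", "budget"],
    ["the", "budget", "needs", "review", "first"],
    ["model", "accuracy", "improved", "after", "training"],
]

# Skip-gram (sg=1) with hierarchical softmax (hs=1, negative=0), as in the description;
# window=5 falls in the stated 5-10 word context range.
model = Word2Vec(sentences, vector_size=100, window=5, sg=1, hs=1, negative=0, min_count=1)


def sentence_vector(tokens, w2v):
    # Sum the word vectors of the tokens found in the vocabulary.
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.sum(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)


def semantic_distance(word_i, word_j, w2v):
    # Euclidean distance d(w_i, w_j) = |x_i - x_j|_2 between two word vectors.
    return float(np.linalg.norm(w2v.wv[word_i] - w2v.wv[word_j]))


sent_vecs = np.vstack([sentence_vector(s, model) for s in sentences])
print(sent_vecs.shape)                       # (3, 100)
print(semantic_distance("budget", "review", model))
```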
2) Text clustering
For clustering all the sentence vector representations, the density-based DBSCAN algorithm is adopted. DBSCAN divides the sample points (here, the sentence vectors) into three classes. Core points: points whose neighborhood contains at least the minimum number of samples, where the neighborhood is the area within a specified radius. Edge points: points that are not core points but have a core point in their neighborhood. Noise points: all points that are neither core points nor edge points. FIG. 2 shows the three types of points, where A is a core point, B and C are edge points, and N is a noise point.
The first step: classify the samples into core points and non-core points according to the number of samples in each point's neighborhood.
The second step: divide the non-core points into edge points and noise points according to whether a core point exists in their neighborhood.
The third step: initialize one cluster for each point.
The fourth step: select a core point, traverse the samples in its neighborhood, and merge the clusters of the core point and those samples.
The fifth step: repeat the fourth step until all core points have been visited.
When the number of speakers is known, the algorithm's neighborhood radius and minimum-point-count parameters are adjusted until the number of clusters equals the number of speakers; the resulting text clusters then separate the speech content of the different speakers.
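The parameter adjustment described here could look like the following scikit-learn sketch. It is a minimal illustration, assuming the sentence-vector matrix from the previous step, a fixed min_samples, and a simple sweep over the neighborhood radius eps; the sweep range, step, and stopping rule are not specified in the patent and are chosen only for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def cluster_by_speaker_count(sentence_vectors: np.ndarray, n_speakers: int,
                             min_samples: int = 3):
    """Sweep eps until DBSCAN yields exactly n_speakers clusters (noise excluded)."""
    for eps in np.linspace(0.1, 20.0, 200):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(sentence_vectors)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # label -1 marks noise
        if n_clusters == n_speakers:
            return labels, float(eps)
    return None, None  # no eps in the sweep produced the desired cluster count


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-in for sentence vectors of three speakers.
    vectors = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 5)) for c in (0.0, 3.0, 6.0)])
    labels, eps = cluster_by_speaker_count(vectors, n_speakers=3)
    if labels is not None:
        values, counts = np.unique(labels, return_counts=True)
        print("chosen eps:", round(eps, 3))
        print("cluster sizes:", dict(zip(values.tolist(), counts.tolist())))
    else:
        print("no eps in the sweep matched the speaker count")
```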
The above embodiments express only several implementations of the present invention; although described specifically and in detail, they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and these fall within the scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (8)

1. A multimode-based conference speaker identity noninductive confirmation method, characterized in that: the method automatically identifies and distinguishes conference speakers from three aspects, namely the speakers' expressions, voices, and speaking styles, and comprises an expression recognition method based on a deep learning model, a voice recognition method based on an artificial intelligence algorithm, and a speech-content recognition method based on a text clustering algorithm.
2. The method of claim 1, wherein: in the expression recognition method based on the deep learning model, photos of the faces of persons speaking at the conference site are first collected and preprocessed with operations including random perturbation, deformation, and rotation; several groups of training sets are then generated with a GAN, the sample data are trained with a Faster R-CNN model, and the deep learning model is finally generated.
3. The method of claim 1, wherein the method comprises: the voice recognition method comprises the following specific steps:
(1) data acquisition and processing
Acquiring conference site voice data in real time, segmenting the data at intervals of 4-8 seconds, and taking each segment as a processing unit and performing denoising processing on the data;
(2) building models and training
Suppose the training data contain multiple utterances from multiple persons, and define the j-th utterance of the i-th person as X_ij. The model is constructed as X_ij = μ + F h_i + G w_ij + ε_ij, where μ is the mean of the data, F h_i and G w_ij are the speaker-space and within-speaker-space terms, and ε_ij is the noise term; after construction, the model is trained by iterating the EM algorithm;
(3) model testing
Whether two utterances belong to the same speaker is decided by the likelihood that they were generated by the same speaker-space feature h_i; the score is computed as a log-likelihood ratio:

score = log [ p(η1, η2 | H_s) / ( p(η1 | H_d) · p(η2 | H_d) ) ]

where η1 and η2 denote the two test utterances, H_s and H_d are the hypotheses that the two utterances come from the same space and from different spaces respectively, p(η1, η2 | H_s) is the probability that η1 and η2 come from the same space, and p(η1 | H_d) and p(η2 | H_d) are the probabilities that η1 and η2 belong to respective different spaces.
4. The method of claim 1, wherein: the speech-content recognition method based on the text clustering algorithm comprises sentence vector representation and text clustering, wherein sentence vectors are first computed for all sentences and all sentence vectors are then clustered with the DBSCAN algorithm.
5. The multimode-based conference speaker identity noninductive confirmation method as claimed in claim 4, wherein: word vectors are trained on the text with the Skip-gram model of the word2vec tool, forming a word vector matrix X ∈ R^(m×n), where x_i ∈ R^m is the word vector of feature word i in the m-dimensional space; the Euclidean distance between two word vectors is d(w_i, w_j) = |x_i − x_j|_2, where d(w_i, w_j) is the semantic distance between feature word i and feature word j, and x_i and x_j are the word vectors corresponding to feature words w_i and w_j.
6. The multimode-based conference speaker identity noninductive confirmation method as claimed in claim 5, wherein: the Skip-gram model comprises an input layer, a projection layer, and an output layer; the input layer is the current feature word, whose word vector is denoted w_t ∈ R^m; the output layer gives the probabilities of the words occurring in the context window of the feature word; and the projection layer is used to maximize the value of the objective function L.
7. The multimode-based conference speaker identity noninductive confirmation method as claimed in claim 6, wherein: assuming a word sequence w_1, w_2, …, w_N, the objective function is

L = (1/N) Σ_{j=1}^{N} Σ_{−c ≤ k ≤ c, k ≠ 0} log p(w_{j+k} | w_j)

where N is the length of the word sequence, c is the context length of the current feature word (5-10 words), and p(w_{j+k} | w_j) is the probability that the context feature word w_{j+k} occurs given the known current word w_j.
8. The multimode-based conference speaker identity noninductive confirmation method as claimed in claim 4, wherein: when all sentence vectors are clustered with the DBSCAN algorithm and the number of speakers is known, the algorithm's neighborhood radius and minimum-point-count parameters are adjusted until the number of clusters equals the number of speakers; the resulting text clusters then separate the speech content of the different speakers.
CN201910968323.2A 2019-10-12 2019-10-12 Conference speaker identity noninductive confirmation method based on multiple modes Active CN110807370B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910968323.2A CN110807370B (en) 2019-10-12 2019-10-12 Conference speaker identity noninductive confirmation method based on multiple modes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910968323.2A CN110807370B (en) 2019-10-12 2019-10-12 Conference speaker identity noninductive confirmation method based on multiple modes

Publications (2)

Publication Number Publication Date
CN110807370A true CN110807370A (en) 2020-02-18
CN110807370B CN110807370B (en) 2024-01-30

Family

ID=69488298

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910968323.2A Active CN110807370B (en) 2019-10-12 2019-10-12 Conference speaker identity noninductive confirmation method based on multiple modes

Country Status (1)

Country Link
CN (1) CN110807370B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113746822A (en) * 2021-08-25 2021-12-03 安徽创变信息科技有限公司 Teleconference management method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110046941A1 (en) * 2009-08-18 2011-02-24 Manuel-Devados Johnson Smith Johnson Advanced Natural Language Translation System
CN107993665A (en) * 2017-12-14 2018-05-04 科大讯飞股份有限公司 Spokesman role determines method, intelligent meeting method and system in multi-conference scene
CN109960743A (en) * 2019-01-16 2019-07-02 平安科技(深圳)有限公司 Conference content differentiating method, device, computer equipment and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110046941A1 (en) * 2009-08-18 2011-02-24 Manuel-Devados Johnson Smith Johnson Advanced Natural Language Translation System
CN107993665A (en) * 2017-12-14 2018-05-04 科大讯飞股份有限公司 Spokesman role determines method, intelligent meeting method and system in multi-conference scene
CN109960743A (en) * 2019-01-16 2019-07-02 平安科技(深圳)有限公司 Conference content differentiating method, device, computer equipment and storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113746822A (en) * 2021-08-25 2021-12-03 安徽创变信息科技有限公司 Teleconference management method and system

Also Published As

Publication number Publication date
CN110807370B (en) 2024-01-30

Similar Documents

Publication Publication Date Title
CN108766414B (en) Method, apparatus, device and computer-readable storage medium for speech translation
US9875743B2 (en) Acoustic signature building for a speaker from multiple sessions
CN114694076A (en) Multi-modal emotion analysis method based on multi-task learning and stacked cross-modal fusion
CN111524527B (en) Speaker separation method, speaker separation device, electronic device and storage medium
Sahoo et al. Emotion recognition from audio-visual data using rule based decision level fusion
CN111128128B (en) Voice keyword detection method based on complementary model scoring fusion
Shi et al. H-vectors: Utterance-level speaker embedding using a hierarchical attention model
CN116110405B (en) Land-air conversation speaker identification method and equipment based on semi-supervised learning
CN108735200A (en) A kind of speaker's automatic marking method
CN113628612A (en) Voice recognition method and device, electronic equipment and computer readable storage medium
Bellagha et al. Speaker naming in tv programs based on speaker role recognition
CN111653270A (en) Voice processing method and device, computer readable storage medium and electronic equipment
US20120116763A1 (en) Voice data analyzing device, voice data analyzing method, and voice data analyzing program
CN110807370B (en) Conference speaker identity noninductive confirmation method based on multiple modes
Birla A robust unsupervised pattern discovery and clustering of speech signals
CN114970695B (en) Speaker segmentation clustering method based on non-parametric Bayesian model
CN108597497B (en) Subtitle voice accurate synchronization system and method and information data processing terminal
CN108629024A (en) A kind of teaching Work attendance method based on voice recognition
CN110265003B (en) Method for recognizing voice keywords in broadcast signal
Abd El-Moneim et al. Effect of reverberation phenomena on text-independent speaker recognition based deep learning
CN112434516B (en) Self-adaptive comment emotion analysis system and method for merging text information
CN112820274B (en) Voice information recognition correction method and system
CN111933187B (en) Emotion recognition model training method and device, computer equipment and storage medium
US20240160849A1 (en) Speaker diarization supporting episodical content
Mingliang et al. Chinese dialect identification using clustered support vector machine

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210311

Address after: 210000 rooms 1201 and 1209, building C, Xingzhi Science Park, Qixia Economic and Technological Development Zone, Nanjing, Jiangsu Province

Applicant after: Nanjing Xingyao Intelligent Technology Co.,Ltd.

Address before: Room 1211, building C, Xingzhi Science Park, 6 Xingzhi Road, Nanjing Economic and Technological Development Zone, Jiangsu Province, 210000

Applicant before: Nanjing Shixing Intelligent Technology Co.,Ltd.

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant