CN116246610A - Conference record generation method and system based on multi-mode identification - Google Patents

Conference record generation method and system based on multi-mode identification

Info

Publication number
CN116246610A
CN116246610A (application CN202211727305.3A)
Authority
CN
China
Prior art keywords
speaker
voiceprint
voice information
identity
conference
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211727305.3A
Other languages
Chinese (zh)
Inventor
陈纪旸
马思乐
陈振学
吴书胜
张建成
梁田
鹿全礼
郭峰
郭锐
宋丽华
许志国
李士宽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Center Information Technology Ltd By Share Ltd
Shandong University
Original Assignee
Shandong Center Information Technology Ltd By Share Ltd
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Center Information Technology Ltd By Share Ltd, Shandong University filed Critical Shandong Center Information Technology Ltd By Share Ltd
Priority to CN202211727305.3A priority Critical patent/CN116246610A/en
Publication of CN116246610A publication Critical patent/CN116246610A/en
Pending legal-status Critical Current

Classifications

    • G: Physics
    • G10L: Speech analysis or synthesis; speech recognition; speech or voice processing; speech or audio coding or decoding
    • G10L 15/02: Feature extraction for speech recognition; selection of recognition unit
    • G10L 25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to a conference record generation method and system based on multi-modal recognition, comprising the following steps: acquiring voice information during a conference and extracting voiceprint features; if the voiceprint features are in the voiceprint library, determining the identity of the speaker in the voice information and storing the voice information; if the voiceprint features are not in the voiceprint library, locating the speaker's position by sound source localization, determining the speaker's identity from the speaker's facial action images and voice information, and writing the voiceprint features into the voiceprint library; performing semantic analysis on the voice information, labeling the speaker's identity, and outputting the result as a text file. By applying multi-modal recognition (lip-language recognition, sound source localization and voiceprint recognition) to the speaker's facial action images and voice information, the conference record can be generated and the voice data classified by the identities of different people. This removes the pain point of manually distinguishing speaker identities in the recording after the conference ends and greatly reduces the workload.

Description

Conference record generation method and system based on multi-mode identification
Technical Field
The invention relates to the technical field of data processing, in particular to a conference record generation method and system based on multi-mode identification.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
A conference record is a document of the conference contents. At present it is either taken down live by a stenographer or produced manually from a recording made by recording equipment; the recorded content can then be punctuated and segmented afterwards by means of voiceprint recognition.
In the manual approach, on-site stenographers mainly note down keywords, then search the conference recording for those keywords and retrieve the audio around them, expanding the keywords into the conference record. Because the correspondence between keywords and the recording is weak, the stenographer has to locate passages by repeated manual searching, which consumes a great deal of time and effort. In addition, the same keyword may occur many times, so manually locating a keyword in the recording can pick the wrong occurrence and introduce errors into the conference record.
Analyzing the recorded conference content with voiceprint recognition alone is only suitable for small conference venues. If the conference is large and has many participants, the voiceprint features of each person cannot be distinguished accurately when several people speak at the same time. A noisy conference site also degrades the recording quality and the voiceprint recognition accuracy. Moreover, voiceprint recognition requires comparison against a voiceprint library, so voiceprints must be collected before the conference starts, which causes a series of inconveniences.
Disclosure of Invention
To solve the technical problems described in the background, the invention provides a conference record generation method and system based on multi-modal recognition, which produce the conference record through an autonomously learning multi-modal recognition approach. The multi-modal recognition comprises video lip-language recognition, sound source localization and voiceprint recognition; these three technologies together perform conference recording, audio acquisition and identity recognition.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a first aspect of the present invention provides a conference record generating method based on multi-modal identification, including the steps of:
acquiring voice information during a conference and extracting voiceprint features;
if the voiceprint features are in the voiceprint library, determining the identity of the speaker in the voice information and storing the voice information; if the voiceprint features are not in the voiceprint library, locating the speaker's position by sound source localization, determining the speaker's identity from the speaker's facial action images and voice information, and writing the voiceprint features into the voiceprint library;
and performing semantic analysis on the voice information, labeling the speaker's identity, and outputting the result as a text file.
If the voiceprint features are in the voiceprint library, determining the identity of the speaker in the voice information and storing the voice information specifically comprises: comparing the collected voiceprint features with the voiceprints stored in the voiceprint library, and if the voiceprint features are in the voiceprint library, determining the identity of the speaker and storing the voice information.
After the position of the speaker is determined, lip-language recognition is performed on the speaker's facial action images and compared with the acquired voice information to determine the speaker's identity and write the speaker's voiceprint features into the voiceprint library.
Semantic analysis of the voice information includes loading the voice information and identifying semantics in the voice information.
The semantic analysis of the voice information further comprises lip language recognition based on the facial action image of the speaker to strengthen the semantic recognition effect.
Performing semantic analysis on the voice information and labeling the identity of the speaker comprises the step of labeling the speaker's identity according to the voiceprint features in the voice information.
Outputting as a text file comprises converting the video file formed from the speaker's facial action images and the voice file formed from the voice information into a text file.
A second aspect of the present invention provides a system for implementing the above method, comprising:
a voice acquisition module configured to: acquiring voice information during a conference and extracting voiceprint features;
a voiceprint recognition module configured to: if the voiceprint features are in the voiceprint library, determine the identity of the speaker in the voice information and store the voice information; if the voiceprint features are not in the voiceprint library, locate the speaker's position by sound source localization, determine the speaker's identity from the speaker's facial action images and voice information, and write the voiceprint features into the voiceprint library;
a voice-to-text conversion module configured to: perform semantic analysis on the voice information, label the speaker's identity, and output the result as a text file.
A third aspect of the present invention provides a computer-readable storage medium.
A computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the steps of the conference record generation method based on multi-modal recognition described above.
A fourth aspect of the invention provides a computer device.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps in the multimodal recognition based meeting record generation method as described above when the program is executed.
Compared with the prior art, the above technical scheme has the following beneficial effects:
1. By applying multi-modal recognition (lip-language recognition, sound source localization and voiceprint recognition) to the speaker's facial action images and voice information, the conference record can be generated, the speaker's identity recognized, the voice data classified by the identities of different people, and an organized conference record generated automatically. This spares the record-keeping staff the pain point of manually distinguishing speaker identities in the recording after the conference ends, reducing the workload.
2. Lip language is recognized from the speaker's facial action images and, combined with voiceprint recognition applied to the conference audio and video, the identities of speakers among the participants can be acquired and recognized automatically. Voiceprint features of the participants do not need to be collected before the conference starts, simplifying the pre-conference workflow and workload.
3. By combining lip-language recognition with voice acquisition, the accuracy of voice acquisition and of converting speech into text files can be maximized.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a schematic diagram of a multi-modal identification-based meeting record generation flow provided by one or more embodiments of the present invention;
FIG. 2 is a schematic diagram of a conference recording text file export process provided by one or more embodiments of the present invention;
FIG. 3 is a schematic diagram of a voiceprint registration process provided by one or more embodiments of the present invention;
FIG. 4 is a schematic diagram of a basic voiceprint verification process provided by one or more embodiments of the present invention.
Detailed Description
The invention will be further described with reference to the drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
As described in the background, manually producing the meeting summary is error-prone, and voiceprint recognition alone is vulnerable to interference that degrades its accuracy.
The following embodiments therefore provide a conference record generation method and system based on multi-modal recognition, which use a microphone matrix combined with a TDOA (Time Difference Of Arrival) algorithm for sound source localization, a 360-degree video acquisition module formed from three groups of wide-angle lenses to acquire and analyze facial actions, and an end-to-end voiceprint recognition method that runs while the audio and video are being collected.
Embodiment one:
as shown in fig. 1-4, the conference record generating method based on multi-modal identification includes the following steps:
acquiring voice information during a conference and extracting voiceprint features;
if the voiceprint features are in the voiceprint library, determining the identity of the speaker in the voice information and storing the voice information; if the voiceprint features are not in the voiceprint library, locating the speaker's position by sound source localization, determining the speaker's identity from the speaker's facial action images and voice information, and writing the voiceprint features into the voiceprint library;
and performing semantic analysis on the voice information, labeling the speaker's identity, and outputting the result as a text file.
Specific:
Multi-modal recognition includes video lip-language recognition, sound source localization and voiceprint recognition. This multi-modal approach effectively addresses poor audio pickup, inaccurate voiceprint recognition and unconfirmed identity information in conference records when there are many participants, the conference room is large, or the environment is noisy. Audio acquisition and lip-language recognition assist each other during voice collection, maximizing the accuracy of the conference voice record and guaranteeing the quality of the conference record. Video acquisition and voiceprint recognition automatically distinguish the identities of participants and store their voiceprints, so no voiceprint collection is needed before the conference starts, avoiding unnecessary trouble. Combining lip-language recognition with speech recognition improves the accuracy of speech-to-text conversion and reduces the workload of proofreading the record after the conference.
The method comprises three parts: sound source localization and voice collection; lip-language recognition and voiceprint recognition; and speech-to-text conversion.
Sound source localization methods include beamforming, super-resolution spectral estimation and the TDOA algorithm, which convert the relationship between the sound source and the array into a spatial beam, a spatial spectrum and a time difference of arrival respectively, and localize from the corresponding information. The speaker's position is located from the sound source direction and the geometry of the sound source and the microphone matrix using sound source localization technology.
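To make the TDOA idea concrete, the following sketch implements the widely used GCC-PHAT estimator, which recovers the arrival-time difference between two microphones from the phase of their cross-power spectrum; with a known microphone spacing, the delay yields the source direction. This is not taken from the patent; the function name, sampling rate and parameters are illustrative.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the time difference of arrival (seconds) between two
    microphone signals via the GCC-PHAT cross-correlation."""
    n = sig.shape[0] + ref.shape[0]            # zero-pad to avoid circular wrap
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12                     # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift  # lag of the correlation peak
    return shift / fs
```

For a signal that arrives 40 samples later at one microphone than the other, the estimator returns roughly 40 / fs seconds, which can then be mapped to an angle of arrival.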
After the position of the speaker is determined, the facial actions of the speaker are collected, the lip language of the speaker is identified by utilizing a lip language identification technology, the lip language is compared with the collected voice, and the identity of the speaker is confirmed by utilizing a voiceprint identification technology and is stored in a voiceprint library.
The lip-language recognition technology may adopt P2Pnet-P2Cnet (from research on, and implementation of, lip-reading applications based on deep learning). Drawing on the nature of lip reading and the pronunciation rules of Chinese, the Chinese lip-language recognition process is divided into two sub-problems to reduce its difficulty: mapping consecutive lip picture frames to a pinyin sequence, i.e. "LipPic to Pinyin" (P2P for short), and translating the pinyin sequence into a Chinese character sentence, i.e. "Pinyin to Chinese-Character" (P2CC for short).
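The two-stage decomposition can be sketched as a pipeline in which each network is replaced by an illustrative stub: in the cited work both P2P and P2CC are neural networks, while here P2P pretends each frame is already classified into a pinyin token and P2CC is a toy lookup table invented for the example.

```python
def p2p(lip_frames):
    # Stand-in for the "LipPic to Pinyin" (P2P) network: assume each lip
    # frame has already been classified into a pinyin token.
    return [frame["pinyin"] for frame in lip_frames]

# Toy vocabulary for the "Pinyin to Chinese-Character" (P2CC) stage;
# purely illustrative, not part of the patent.
P2CC_TABLE = {"kai": "开", "hui": "会"}

def p2cc(pinyin_tokens):
    # Stand-in for the P2CC sequence-to-sequence network.
    return "".join(P2CC_TABLE.get(t, "?") for t in pinyin_tokens)

def lip_read(lip_frames):
    # Full pipeline: lip frames -> pinyin sequence -> character sentence.
    return p2cc(p2p(lip_frames))
```

The value of the decomposition is that each stage is simpler than direct frame-to-character recognition, and the pinyin sequence gives a natural intermediate representation to compare against the audio.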
Voiceprint recognition comprises two processes. The first is voiceprint registration: voiceprint features are extracted from the collected voice signal, the collected image information of the speaker is matched with the extracted voiceprint using lip-language recognition and sound source localization, and the voiceprint is stored in the voiceprint library under the identity information. The second is voiceprint verification: subsequently acquired voice is compared with the voiceprint features stored in the library, and if the comparison succeeds the speaker's identity is confirmed directly. If no corresponding voiceprint information is found in the library, the speaker's identity and voiceprint features are obtained through sound source localization and lip-language recognition and stored in the voiceprint library.
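A minimal sketch of this registration/verification flow, assuming voiceprint embeddings are already produced as fixed-length vectors by some end-to-end extractor (not shown); matching is done by cosine similarity against the stored library. The class name and threshold are illustrative, not from the patent.

```python
import numpy as np

class VoiceprintLibrary:
    """Registration stores a speaker's embedding under an identity;
    verification compares a new embedding against all stored ones by
    cosine similarity and reports a match only above a threshold."""

    def __init__(self, threshold=0.75):
        self.threshold = threshold
        self.prints = {}                    # identity -> unit-norm embedding

    def register(self, identity, embedding):
        v = np.asarray(embedding, dtype=float)
        self.prints[identity] = v / np.linalg.norm(v)

    def verify(self, embedding):
        """Return (identity, score) of the best match, or (None, score)
        when no stored voiceprint clears the threshold."""
        v = np.asarray(embedding, dtype=float)
        v = v / np.linalg.norm(v)
        best, best_score = None, -1.0
        for name, ref in self.prints.items():
            score = float(ref @ v)          # cosine similarity of unit vectors
            if score > best_score:
                best, best_score = name, score
        if best_score < self.threshold:
            return None, best_score
        return best, best_score
```

A `None` result corresponds to the branch in the text where no voiceprint is found and the identity must instead be established via sound source localization and lip reading before registering.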
The multi-mode identification method can accurately identify the identity information of the speaker, so that the voice and the personnel identity are matched. The accuracy of lip language recognition and voice recognition can be improved through autonomous learning training.
In this embodiment, the speaker's identity may include information such as the image acquired by the video, the identity and the name. The image can be acquired automatically; the identity and name are configured information and can be added manually.
The speaker's position is located by sound source localization, the video acquisition module captures the speaker's facial actions, lip-language recognition identifies the content of the speech, and the result is compared with the information acquired from the audio. If they match, the speaker's information is recorded and the speaker's image information and voiceprint information are written into the voiceprint library; information such as the speaker's identity and name can be entered manually.
After the conference starts, the microphone matrix may be used to continuously collect voice information. Taking one section of voice as an example, as shown in fig. 1:
1. when a voice signal is acquired, voiceprint feature extraction is carried out on the voice signal;
2. comparing the collected voiceprint characteristics with voiceprints stored in a voiceprint library;
3. if the voiceprint features are in the voiceprint library, directly confirming the identity of a speaker and recording voice information;
4. if the collected voiceprint features are not recorded in the voiceprint library, sound source localization is used to locate the speaker's position, and the speaker's facial actions are captured on video;
5. the identity of the speaker is further confirmed by comparing the lip language identification with the voice information;
6. extracting voiceprint features in voice, registering the voiceprint, and writing the voiceprint features into a voiceprint library;
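Steps 1-6 above can be sketched as a single dispatch function. Every collaborator (voiceprint library, sound source localization, video capture, lip reading, voiceprint extraction, identity confirmation) is passed in as a stand-in callable, since the patent does not prescribe concrete interfaces; the names are invented for the example.

```python
def handle_utterance(audio, library, locate_speaker, capture_face,
                     lip_read, extract_voiceprint, ask_identity):
    """Process one segment of conference speech per steps 1-6:
    verify against the voiceprint library first, and fall back to
    sound-source localization plus lip reading to establish and
    register a new identity."""
    feat = extract_voiceprint(audio)                # step 1: extract voiceprint
    identity, _score = library.verify(feat)         # step 2: compare with library
    if identity is not None:                        # step 3: known voiceprint
        return identity
    position = locate_speaker(audio)                # step 4: localize the speaker
    face_frames = capture_face(position)            #          and capture the face
    lip_text = lip_read(face_frames)                # step 5: lip reading vs. audio
    identity = ask_identity(lip_text, audio)
    library.register(identity, feat)                # step 6: register the voiceprint
    return identity
```

On the first utterance from an unknown speaker the fallback path runs and registers the voiceprint; a later utterance from the same speaker is then resolved directly in step 3.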
after the conference is over, it can be manually selected whether to export the text file of the conference record. The conference recording text file export flow is as follows, as shown in fig. 2:
1. loading a conference voice file and identifying voice;
2. the meaning is analyzed according to the context to prevent homophone errors;
3. for passages where the speech is unclear or heavily affected by interference, lip-language recognition is used to reinforce the speech recognition;
4. using voiceprint recognition technology, according to voiceprint characteristics, performing dialogue management on the voice file recorded by the conference, and marking the identity of a speaker;
5. synthesizing a complete voice file according to the voice file and the video file;
6. and converting the voice file into a text file and outputting the text file of the conference record.
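The labelling-and-export steps can be illustrated by a small function that merges consecutive segments from the same speaker into labelled turns, assuming speech recognition and voiceprint recognition have already produced per-segment text and identities; the segment format is invented for the example.

```python
def export_transcript(segments):
    """Each segment carries a speaker identity (from voiceprint
    recognition) and recognized text (from speech recognition,
    reinforced by lip reading where the audio was unclear).
    Consecutive segments from the same speaker are merged into one
    labelled turn, mirroring the dialogue-management step."""
    turns = []
    for seg in segments:
        if turns and turns[-1][0] == seg["speaker"]:
            turns[-1][1].append(seg["text"])        # same speaker: extend turn
        else:
            turns.append([seg["speaker"], [seg["text"]]])
    return "\n".join(f"[{spk}] {' '.join(texts)}" for spk, texts in turns)
```

The returned string is the content of the exported conference-record text file, one speaker-labelled turn per line.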
The speaker recognition application addresses the "one-to-many" problem, i.e. determining which of several speakers uttered a given piece of speech. The voiceprint registration process is:
1. Initiating voiceprint registration;
2. voice activity detection (Voice Activity Detection, VAD), collecting voice data;
3. carrying out enhancement processing on the collected voice data and amplifying voice signals;
4. detecting the quality of voice data, and extracting effective voice for analysis;
5. and extracting voiceprint features, matching the voiceprint features with the identity of the speaker, and numbering.
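Step 2 above (voice activity detection) can be illustrated with a toy energy-based detector: a frame counts as speech when its mean squared amplitude exceeds a threshold. Production VAD front ends use adaptive thresholds or learned models; the frame length and threshold here are arbitrary choices for the example.

```python
import numpy as np

def energy_vad(signal, frame_len=400, threshold=0.01):
    """Toy voice activity detection: split the signal into fixed-length
    frames and flag each frame as speech (True) when its mean squared
    amplitude exceeds the threshold."""
    n_frames = len(signal) // frame_len
    flags = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        flags.append(float(np.mean(frame ** 2)) > threshold)
    return flags
```

Frames flagged True would be passed on to the enhancement, quality-check and voiceprint-extraction steps; silent frames are discarded.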
The speaker's identity can be recognized automatically, the voice data classified by the identities of different people, and an organized conference record generated automatically. This spares the record-keeping staff the pain point of manually distinguishing speaker identities in the recording after the conference ends, greatly reducing the workload.
By processing the conference audio and video with image-based lip-language recognition and voiceprint recognition, the identities of speakers among the participants can be acquired and recognized automatically; voiceprint features of the participants do not need to be collected before the conference starts, simplifying the pre-conference workflow and workload.
By combining the lip language recognition technology and the voice acquisition technology, the accuracy of voice acquisition and voice conversion into text files can be improved to the greatest extent.
Embodiment two:
the system for realizing the method comprises the following steps:
a voice acquisition module configured to: acquire voice information during the conference and extract voiceprint features;
a voiceprint recognition module configured to: if the voiceprint features are in the voiceprint library, determine the identity of the speaker in the voice information and store the voice information; if the voiceprint features are not in the voiceprint library, locate the speaker's position by sound source localization, determine the speaker's identity from the speaker's facial action images and voice information, and write the voiceprint features into the voiceprint library;
a voice-to-text conversion module configured to: perform semantic analysis on the voice information, label the speaker's identity, and output the result as a text file.
Embodiment III:
the present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps in the multimodal recognition based conference record generating method as described in the above embodiment.
Embodiment four:
the present embodiment provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps in the method for generating a conference record based on multimodal recognition according to the above embodiment when executing the program.
The steps or modules in the second to fourth embodiments correspond to the first embodiment, and the detailed description of the first embodiment may be referred to in the related description section of the first embodiment. The term "computer-readable storage medium" should be taken to include a single medium or multiple media including one or more sets of instructions; it should also be understood to include any medium capable of storing, encoding or carrying a set of instructions for execution by a processor and that cause the processor to perform any one of the methods of the present invention.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. The conference record generation method based on multi-mode identification is characterized by comprising the following steps of:
acquiring voice information during a conference and extracting voiceprint features;
if the voiceprint features are in the voiceprint library, determining the identity of the speaker in the voice information and storing the voice information; if the voiceprint features are not in the voiceprint library, locating the speaker's position by sound source localization, determining the speaker's identity from the speaker's facial action images and voice information, and writing the voiceprint features into the voiceprint library;
and performing semantic analysis on the voice information, labeling the speaker's identity, and outputting the result as a text file.
2. The conference record generation method based on multi-modal identification according to claim 1, wherein, if the voiceprint features are in the voiceprint library, determining the identity of the speaker in the voice information and storing the voice information specifically comprises: comparing the collected voiceprint features with the voiceprints stored in the voiceprint library, and if the voiceprint features are in the voiceprint library, determining the identity of the speaker and storing the voice information.
3. The conference record generating method based on multi-modal identification as claimed in claim 1, wherein after the speaker's position is determined, lip language identification is performed based on the speaker's facial motion image, and the speaker's identity is determined and its voiceprint characteristics are written into the voiceprint library by comparing with the acquired voice information.
4. The multi-modal recognition-based conference recording generation method of claim 1 wherein performing semantic analysis on the voice information includes loading the voice information and recognizing semantics in the voice information.
5. The method for generating a conference recording based on multimodal recognition according to claim 1, wherein the semantic analysis is performed on the voice information, and further comprising performing lip recognition based on the facial motion image of the speaker to enhance the semantic recognition effect.
6. The method for generating a conference recording based on multi-modal identification as claimed in claim 1, wherein the semantic analysis is performed on the voice information and the identity of the speaker is noted, including the step of noting the identity of the speaker based on voiceprint features in the voice information.
7. The conference record generation method based on multi-modal identification according to claim 1, wherein outputting as a text file comprises: forming a video file from the speaker's facial action images and the voice information, extracting the voice file from the video file, and converting the voice file into a text file.
8. A conference record generation system based on multi-modal identification, comprising:
a voice acquisition module configured to: acquire voice information during a conference and extract voiceprint features;
a voiceprint recognition module configured to: if the voiceprint features are in the voiceprint library, determine the identity of the speaker in the voice information and store the voice information; if the voiceprint features are not in the voiceprint library, locate the speaker's position by sound source localization, determine the speaker's identity from the speaker's facial action images and the voice information, and write the speaker's voiceprint features into the voiceprint library; and
a speech-to-text conversion module configured to: perform semantic analysis on the voice information, label the speaker's identity, and output the result as a text file.
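The final text-file output of the system claim can be pictured as one identity-labeled line per utterance. A minimal sketch of that formatting step — the `[MM:SS]` timestamp layout and the tuple shape are assumptions for illustration, not prescribed by the patent:

```python
def format_transcript(entries):
    """Render (seconds_offset, speaker, text) tuples as the text-file body,
    one identity-labeled line per utterance."""
    lines = []
    for seconds, speaker, text in entries:
        minutes, secs = divmod(int(seconds), 60)
        lines.append(f"[{minutes:02d}:{secs:02d}] {speaker}: {text}")
    return "\n".join(lines)
```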
9. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the conference record generation method based on multi-modal identification according to any one of claims 1-7.
10. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the conference record generation method based on multi-modal identification according to any one of claims 1-7.
CN202211727305.3A 2022-12-30 2022-12-30 Conference record generation method and system based on multi-mode identification Pending CN116246610A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211727305.3A CN116246610A (en) 2022-12-30 2022-12-30 Conference record generation method and system based on multi-mode identification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211727305.3A CN116246610A (en) 2022-12-30 2022-12-30 Conference record generation method and system based on multi-mode identification

Publications (1)

Publication Number Publication Date
CN116246610A true CN116246610A (en) 2023-06-09

Family

ID=86625261

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211727305.3A Pending CN116246610A (en) 2022-12-30 2022-12-30 Conference record generation method and system based on multi-mode identification

Country Status (1)

Country Link
CN (1) CN116246610A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116612766A (en) * 2023-07-14 2023-08-18 北京中电慧声科技有限公司 Conference system with voiceprint registration function and voiceprint registration method
CN116612766B (en) * 2023-07-14 2023-11-17 北京中电慧声科技有限公司 Conference system with voiceprint registration function and voiceprint registration method
CN117312612A (en) * 2023-10-07 2023-12-29 广东鼎尧科技有限公司 Multi-mode-based teleconference data recording method, system and medium
CN117312612B (en) * 2023-10-07 2024-04-02 广东鼎尧科技有限公司 Multi-mode-based teleconference data recording method, system and medium
CN117174092A (en) * 2023-11-02 2023-12-05 北京语言大学 Mobile corpus transcription method and device based on voiceprint recognition and multi-modal analysis
CN117174092B (en) * 2023-11-02 2024-01-26 北京语言大学 Mobile corpus transcription method and device based on voiceprint recognition and multi-modal analysis

Similar Documents

Publication Publication Date Title
US10878824B2 (en) Speech-to-text generation using video-speech matching from a primary speaker
CN109493850B (en) Growing type dialogue device
JP3848319B2 (en) Information processing method and information processing apparatus
CN116246610A (en) Conference record generation method and system based on multi-mode identification
CN110517689B (en) Voice data processing method, device and storage medium
US6434520B1 (en) System and method for indexing and querying audio archives
US7143033B2 (en) Automatic multi-language phonetic transcribing system
CN112037791B (en) Conference summary transcription method, apparatus and storage medium
WO2020043123A1 (en) Named-entity recognition method, named-entity recognition apparatus and device, and medium
CN109686383B (en) Voice analysis method, device and storage medium
US7739110B2 (en) Multimedia data management by speech recognizer annotation
CN108305618B (en) Voice acquisition and search method, intelligent pen, search terminal and storage medium
CN111785275A (en) Voice recognition method and device
CN111402892A (en) Conference recording template generation method based on voice recognition
US20230089308A1 (en) Speaker-Turn-Based Online Speaker Diarization with Constrained Spectral Clustering
CN112925945A (en) Conference summary generation method, device, equipment and storage medium
CN113920986A (en) Conference record generation method, device, equipment and storage medium
CN112507311A (en) High-security identity verification method based on multi-mode feature fusion
US10762375B2 (en) Media management system for video data processing and adaptation data generation
CN111180025A (en) Method and device for representing medical record text vector and inquiry system
CN110782902A (en) Audio data determination method, apparatus, device and medium
US20050125224A1 (en) Method and apparatus for fusion of recognition results from multiple types of data sources
CN113691382A (en) Conference recording method, conference recording device, computer equipment and medium
CN113889081A (en) Speech recognition method, medium, device and computing equipment
CN111933131A (en) Voice recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination