CN113691382A - Conference recording method, conference recording device, computer equipment and medium

Conference recording method, conference recording device, computer equipment and medium

Info

Publication number
CN113691382A
CN113691382A (application number CN202110978838.8A)
Authority
CN
China
Prior art keywords
voice
conference
audio
feature
voices
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110978838.8A
Other languages
Chinese (zh)
Inventor
何春梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An International Smart City Technology Co Ltd
Original Assignee
Ping An International Smart City Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An International Smart City Technology Co Ltd filed Critical Ping An International Smart City Technology Co Ltd
Priority to CN202110978838.8A
Publication of CN113691382A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00 Data switching networks
    • H04L12/02 Details
    • H04L12/16 Arrangements for providing special services to substations
    • H04L12/18 Arrangements for providing special services to substations for broadcast or conference, e.g. multicast
    • H04L12/1813 Arrangements for providing special services to substations for broadcast or conference, e.g. multicast for computer conferences, e.g. chat rooms
    • H04L12/1831 Tracking arrangements for later retrieval, e.g. recording contents, participants activities or behavior, network status
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/14 Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Game Theory and Decision Science (AREA)
  • Business, Economics & Management (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Child & Adolescent Psychology (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Hospice & Palliative Care (AREA)
  • General Physics & Mathematics (AREA)
  • Psychiatry (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application relates to the technical field of artificial intelligence and provides a conference recording method, apparatus, computer equipment and medium. The method comprises: performing voiceprint recognition and emotion recognition on each conference voice to obtain the corresponding voice speaker and speaking emotion feature; marking the voice text corresponding to each conference voice according to its voice speaker and speaking emotion feature, and writing the marked voice texts into a conference record; and determining the project information of each conference voice according to its voice text, and marking the conference record according to the project information. By performing voiceprint recognition and emotion recognition on each conference voice, the voice speaker and speaking emotion feature of each conference voice are determined, the voice texts are automatically marked on this basis and written into the conference record; no manual marking of the conference record is needed, and the efficiency of conference record generation is improved.

Description

Conference recording method, conference recording device, computer equipment and medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a conference recording method, apparatus, computer device, and medium.
Background
In the existing office meeting process, an automatic meeting-recording function is generally provided: a sound pickup device is arranged in the meeting room to pick up the voices of the meeting participants, and the meeting content is then recorded through speech-to-text conversion to form a meeting record for the participants to use.
In the existing conference recording process, however, only the corresponding text is recorded; the speaker must be marked in the conference record manually, so conference record generation is inefficient.
Disclosure of Invention
In view of this, embodiments of the present application provide a conference recording method, apparatus, computer device and medium, to solve the problem that conference record generation is inefficient because the speaker is usually marked manually in the existing conference recording process.
A first aspect of an embodiment of the present application provides a conference recording method, including:
collecting voices of conference personnel in a conference room to obtain conference voices, and respectively carrying out voiceprint recognition and emotion recognition on each conference voice to obtain voice speakers and speaking emotion characteristics corresponding to the corresponding conference voices;
according to the voice speakers and speaking emotion characteristics corresponding to the conference voices respectively, marking voice texts corresponding to the corresponding conference voices, and writing the marked voice texts into conference records;
and determining project information of corresponding conference voice according to the voice text, and performing information marking on the conference record according to the determined project information, wherein the project information is used for representing the information of the project described by the corresponding conference voice.
Further, the voice print recognition and emotion recognition are respectively performed on each conference voice to obtain voice speakers and speaking emotion characteristics corresponding to the corresponding conference voice, and the method includes the following steps:
acquiring sample entropy characteristics of each conference voice, and performing silence detection according to the sample entropy characteristics;
performing voice filtering on the conference voices according to the silence detection result, and acquiring the audio features of the conference voices after the voice filtering;
determining the speaker type of the corresponding conference voice according to the audio characteristics, wherein the speaker type comprises a single-person speech type and a multi-person speech type, and performing voice separation on the corresponding conference voice according to the audio characteristics and the speaker type to obtain separated audio;
determining voice speakers of corresponding separated audios according to the audio features, and enabling the voice speakers corresponding to the same audio features to form corresponding relations with the separated audios;
and performing feature fusion on the audio features and the sample entropy features to obtain fusion features, and performing emotion classification on the fusion features to obtain the speaking emotion features.
Further, the determining the speaker type of the corresponding conference voice according to the audio feature includes:
and matching the audio features with a pre-stored feature query table to obtain the speaker type, wherein the feature query table stores different preset audio features and corresponding relations between preset feature combinations and corresponding speaker types.
Further, the performing voice separation on the corresponding conference voice according to the audio features and the speaker type to obtain a separated audio includes:
when the speaker type of the conference voice is a multi-person speaker type, acquiring a preset feature combination corresponding to the audio feature in the feature query table, and determining an audio sub-feature corresponding to the preset feature combination;
and respectively determining the voice position of each audio sub-feature in the corresponding conference voice, and performing voice separation on the conference voice according to the voice position to obtain the separated audio corresponding to each audio sub-feature.
Further, the obtaining of the audio features of the voice of each conference after the voice filtering includes:
and respectively extracting one or more combinations of frequency cepstrum coefficients, pitch periods, zero crossing rates, energy root-mean-square coefficients or spectrum flat coefficients of the conference voices after voice filtering to obtain the audio features.
Further, the information marking the meeting record according to the determined project information includes:
performing word segmentation on the voice text to obtain word segmentation vocabularies, and matching each word segmentation vocabulary with a pre-stored item query table, wherein the item query table stores the corresponding relation between a specified vocabulary and corresponding item information;
and if the word segmentation vocabulary is matched with the item query table, carrying out information marking on the corresponding voice text in the conference record by the matched item information.
Further, the marking the voice text corresponding to the corresponding conference voice according to the voice speaker and the speaking emotion characteristics respectively corresponding to each conference voice, and writing the marked voice text into the conference record includes:
respectively acquiring voice acquisition time of each conference voice, and sequencing voice texts corresponding to each conference voice according to the voice acquisition time to obtain the conference record;
and in the conference record, aiming at the same conference voice, carrying out speaker marking and emotion marking on a voice text corresponding to the corresponding conference voice according to the voice speaker and the speaking emotion characteristics.
A second aspect of an embodiment of the present application provides a conference recording apparatus, including:
the identification unit is used for collecting the voices of conference personnel in the conference room to obtain conference voices, and respectively carrying out voiceprint identification and emotion identification on each conference voice to obtain voice speakers and speaking emotion characteristics corresponding to the corresponding conference voices;
the text marking unit is used for marking the voice text corresponding to the corresponding conference voice according to the voice speaker and the speaking emotion characteristics respectively corresponding to each conference voice, and writing the marked voice text into the conference record;
and the item marking unit is used for determining item information of the corresponding conference voice according to the voice text and marking the conference record with information according to the determined item information, wherein the item information is used for representing the information of the item described by the corresponding conference voice.
A third aspect of embodiments of the present application provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the conference recording method provided in the first aspect when executing the computer program.
A fourth aspect of embodiments of the present application provides a computer-readable storage medium, which stores a computer program, and the computer program, when executed by a processor, implements the steps of the conference recording method provided by the first aspect.
According to the conference recording method, the conference recording device, the computer equipment and the conference recording medium, voiceprint recognition and emotion recognition are carried out on each conference voice to determine voice speakers and speaking emotion characteristics of each conference voice, corresponding voice texts are automatically marked based on the voice speakers and the speaking emotion characteristics, the marked voice texts are written into a conference record, manual marking of the conference record is not needed, and conference record generation efficiency is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
Fig. 1 is a flowchart of an implementation of a conference recording method according to an embodiment of the present application;
fig. 2 is a flowchart of an implementation of a conference recording method according to another embodiment of the present application;
fig. 3 is a block diagram of a structure of a conference recording apparatus according to an embodiment of the present application;
fig. 4 is a block diagram of a computer device according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The embodiments of the present application can acquire and process related data based on artificial intelligence technology. Artificial Intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, sense the environment, acquire knowledge, and use the knowledge to obtain the best results.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
In the embodiment of the application, the conference recording method is realized based on the artificial intelligence technology, and the conference recording is carried out in the conference process.
Referring to fig. 1, fig. 1 shows a flowchart of an implementation of a conference recording method provided in an embodiment of the present application, where the conference recording method is applied to any computer device, where the computer device may be a server, a mobile phone, a tablet, or a wearable smart device, and the conference recording method includes:
step S10, collecting the voices of conference personnel in the conference room to obtain conference voices, and respectively carrying out voiceprint recognition and emotion recognition on each conference voice to obtain voice speakers and speaking emotion characteristics corresponding to the corresponding conference voices;
wherein, be provided with pronunciation collection equipment in the meeting room, this pronunciation collection equipment is used for gathering meeting indoor meeting personnel's pronunciation, and is optional, in this embodiment, when the meeting adopted the mode of online meeting to develop, this pronunciation collection equipment still is used for gathering each meeting personnel's of online meeting pronunciation to obtain this meeting pronunciation.
Voiceprint recognition and emotion recognition are performed on each conference voice to determine the voice speaker and speaking emotion feature corresponding to that conference voice. The speaking emotion feature describes the emotion of the corresponding voice speaker when uttering the conference voice; the emotions described include anger, kindness, cynicism, calmness or excitement.
Step S20, according to the voice speakers and speaking emotion characteristics corresponding to the conference voices respectively, marking the voice texts corresponding to the conference voices and writing the marked voice texts into conference records;
the voice recognition operation of the conference voices can be performed by adopting a voice recognizer or a voice recognition model based on deep learning, so that the text conversion effect of the conference voices is achieved, and voice texts corresponding to the conference voices are obtained.
In this step, the voice text corresponding to each conference voice is marked with the voice speaker and speaking emotion feature of that conference voice, and the marked voice texts are written into the conference record, so that the conference record stores the correspondence between each conference voice and its voice text, voice speaker and speaking emotion feature. Speaker information therefore does not need to be marked manually, which improves conference record generation efficiency, and the emotion of each speaker during the conference can be effectively known from the speaking emotion features.
Step S30, determining the project information of the corresponding conference voice according to the voice text, and marking the conference record with information according to the determined project information;
the method and the device have the advantages that the project information of the corresponding conference voice is determined through the voice text, and the conference record is subjected to information marking according to the determined project information, so that the method and the device are effectively convenient for the follow-up user to check the corresponding project information of the speech of each speaker when the conference record is checked, and the project information is used for representing the project information described by the corresponding conference voice.
For example, when the project information determined for the voice text a1 of a conference voice is project information b1, the voice text a1 in the conference record is marked according to project information b1. Preferably, the mark used may be a text mark, a serial-number mark, an image mark or the like, each representing the corresponding project information, so that conference personnel, or other users who did not attend the conference, can see which project each voice text discusses when viewing the conference record.
Optionally, in this step, the marking a voice text corresponding to the conference voice according to the voice speaker and the speaking emotion feature respectively corresponding to each conference voice, and writing the marked voice text into the conference record includes:
respectively acquiring voice acquisition time of each conference voice, and sequencing voice texts corresponding to each conference voice according to the voice acquisition time to obtain the conference record;
in the conference record, aiming at the same conference voice, carrying out speaker marking and emotion marking on a voice text corresponding to the corresponding conference voice according to the voice speaker and the speaking emotion characteristics;
the voice speakers and the speaking emotion characteristics are used for carrying out speaker marking and emotion marking on the voice texts corresponding to the corresponding conference voices, so that a user viewing the conference records can effectively know the voice speakers corresponding to the voice texts and the emotion of the voice speakers when the voice speakers send the conference voices corresponding to the voice texts based on the conference records.
Optionally, in this step, the information marking of the meeting record according to the determined item information includes:
performing word segmentation on the voice text to obtain word segmentation vocabularies, and matching each word segmentation vocabulary with a pre-stored item query table;
the item query table stores a corresponding relationship between a specified vocabulary and corresponding item information, and the specified vocabulary can be set according to user requirements, for example, the specified vocabulary can be set as an item name of an item related to a current conference;
and if the word segmentation vocabulary is matched with the item query table, carrying out information marking on the corresponding voice text in the conference record by the matched item information.
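The word-segmentation-and-matching logic can be sketched as follows; the jieba segmenter and the contents of ITEM_LOOKUP are illustrative assumptions, since the embodiment only specifies word segmentation against a pre-stored item query table.

```python
import jieba  # a widely used Chinese word-segmentation library; any segmenter would do

# Hypothetical item query table: specified vocabulary -> corresponding item information.
ITEM_LOOKUP = {
    "智慧城市": "Item S-01: smart-city platform",
}

def match_item_information(voice_text: str) -> list[str]:
    # Segment the voice text and collect item information for every matched vocabulary.
    return [ITEM_LOOKUP[word] for word in jieba.cut(voice_text) if word in ITEM_LOOKUP]
```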
In the embodiment, the voice speaker and the speaking emotion characteristics of each conference voice are determined by performing voiceprint recognition and emotion recognition on each conference voice, the corresponding voice text is automatically marked based on the determined voice speaker and speaking emotion characteristics, a corresponding conference record is generated, manual marking of the conference record is not needed, and the conference record generation efficiency is improved.
Referring to fig. 2, fig. 2 is a flowchart illustrating an implementation of a conference recording method according to another embodiment of the present application. With respect to the embodiment of fig. 1, the conference recording method provided by this embodiment is used to further refine step S10 in the embodiment of fig. 1, and includes:
step S11, acquiring sample entropy characteristics of each conference voice, and carrying out silence detection according to the sample entropy characteristics;
the Sample Entropy (Sample Entropy) is similar to the physical meaning of the approximate Entropy, and the time series complexity is measured by measuring the probability of generating a new pattern in the signal. The silence detection is carried out through the sample entropy characteristics, the voice starting point and the voice starting point in each conference voice can be accurately identified, and the accuracy of voice filtering on each conference voice in the follow-up process is further improved.
Step S12, carrying out voice filtering on the conference voices according to the silence detection result, and acquiring the audio features of the conference voices after the voice filtering;
the voice filtering is carried out on the conference voices through the voice starting point and the voice starting point obtained from the silence detection result, so that the noise and silence in the conference voices can be effectively removed, and the accuracy of voice signals in the conference voices is improved.
Optionally, in this step, the obtaining of the audio features of the conference voices after the voice filtering includes:
and respectively extracting one or more combinations of frequency cepstrum coefficients, pitch periods, zero crossing rates, energy root-mean-square coefficients or spectrum flat coefficients of the conference voices after voice filtering to obtain the audio features.
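For illustration, the sketch below extracts these parameter types with the librosa library, reading "frequency cepstrum coefficients" as MFCCs; the per-frame averaging and the pitch-search range are assumptions made for the example.

```python
import numpy as np
import librosa

def extract_audio_features(wav_path: str) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)     # cepstral coefficients
    zcr = librosa.feature.zero_crossing_rate(y)            # zero-crossing rate
    rms = librosa.feature.rms(y=y)                         # energy root-mean-square
    flatness = librosa.feature.spectral_flatness(y=y)      # spectral flatness
    f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr)          # fundamental-frequency track
    pitch_period = 1.0 / float(np.mean(f0))                # pitch period in seconds
    # Average each per-frame feature over time into one fixed-length vector.
    return np.concatenate([mfcc.mean(axis=1),
                           [zcr.mean(), rms.mean(), flatness.mean(), pitch_period]])
```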
Step S13, determining the speaker type of the corresponding conference voice according to the audio characteristics, and performing voice separation on the corresponding conference voice according to the audio characteristics and the speaker type to obtain separated audio;
the speaker type comprises a single speaker type and a multi-person speaker type, when the speaker type of the conference voice is the single speaker type, it is judged that only one speaker exists in the conference voice, and when the speaker type of the conference voice is the multi-person speaker type, it is judged that a plurality of speakers exist in the conference voice.
In this step, when the speaker type of the conference voice is the single-person speaker type, the conference voice is directly set as a separated audio. When the speaker type is the multi-person speaker type, the audio corresponding to each speaker in the conference voice is determined, and the conference voice is separated according to the determination result to obtain the separated audios, where each separated audio contains the voice information of only one voice speaker.
Optionally, in this step, the determining, according to the audio feature, a speaker type of the corresponding conference voice includes:
matching the audio features with a pre-stored feature query table to obtain the speaker type;
the preset audio features can be set according to the audio features of the conference personnel participating in the current conference, the preset audio features are obtained by respectively obtaining the audio features of the conference personnel and combining the audio features of different conference personnel, and further, when the audio features of different conference personnel are combined, the number of the combined audio features can be set to be 2, 3 or 4 and the like.
For example, when the conference personnel of the current conference comprise conference staff c1, c2, c3 and c4, the audio features of c1, c2, c3 and c4 are acquired respectively, obtaining audio features d1, d2, d3 and d4. Combining two features at a time yields preset audio features e1 to e6: d1 with d2 gives e1, d1 with d3 gives e2, d1 with d4 gives e3, d2 with d3 gives e4, d2 with d4 gives e5, and d3 with d4 gives e6. Combining three features at a time yields preset audio features e7 to e10: (d1, d2, d3) gives e7, (d1, d2, d4) gives e8, (d1, d3, d4) gives e9, and (d2, d3, d4) gives e10. The single audio features d1, d2, d3 and d4 are set as preset audio features e11, e12, e13 and e14 respectively. The audio feature of each conference voice is then matched against these preset audio features: when the audio feature of a conference voice matches any of e1 to e10, the speaker type of that conference voice is determined to be the multi-person speaker type; when it matches any of e11 to e14, only one voice speaker is speaking in that conference voice, i.e. the speaker type is the single-person speaker type.
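The matching logic of this example can be sketched as follows; encoding each preset audio feature as a sorted tuple of the attendee features d1..d4 is an illustrative assumption, not the patent's stated data structure.

```python
from itertools import combinations

ATTENDEE_FEATURES = ("d1", "d2", "d3", "d4")

# Preset single features e11..e14 and preset combinations e1..e10 from the example.
SINGLE = {(f,) for f in ATTENDEE_FEATURES}
MULTI = {combo for k in (2, 3) for combo in combinations(ATTENDEE_FEATURES, k)}

def speaker_type(matched: tuple[str, ...]) -> str:
    key = tuple(sorted(matched))
    if key in SINGLE:
        return "single-person speaker type"
    if key in MULTI:
        return "multi-person speaker type"
    raise KeyError(f"no entry in the feature query table for {key}")
```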
Further, in this step, the performing voice separation on the corresponding conference voice according to the audio feature and the speaker type to obtain a separated audio includes:
when the speaker type of the conference voice is a multi-person speaker type, acquiring a preset feature combination corresponding to the audio feature in the feature query table, and determining an audio sub-feature corresponding to the preset feature combination;
When the speaker type of the conference voice is the multi-person speaker type, the preset feature combination corresponding to the audio feature is obtained from the feature query table, from which the corresponding audio sub-features can be effectively queried, and the corresponding voice speakers can be determined based on the audio sub-features. For example, when the audio feature of the conference voice matches the preset audio feature e3, the determined audio sub-features are the audio feature d1 and the audio feature d4.
Respectively determining the voice position of each audio sub-feature in the corresponding conference voice, and performing voice separation on the conference voice according to the voice position to obtain a separated audio corresponding to each audio sub-feature;
the voice positions of the audio sub-features in the corresponding conference voice are respectively determined, and based on the determined voice positions, the effect of voice separation can be effectively achieved on the conference voice, so that the voice information (separated audio) of the audio sub-features corresponding to the voice speakers can be obtained.
Step S14, determining the voice speakers corresponding to the separated audios according to the audio features, and forming corresponding relations between the voice speakers corresponding to the same audio features and the separated audios;
the corresponding relation between the voice speakers corresponding to the same audio features and the separated audio is formed, so that the corresponding relation between the conference voices and the corresponding voice speakers is effectively determined conveniently.
Step S15, performing feature fusion on the audio features and the sample entropy features to obtain fusion features, and performing emotion classification on the fusion features to obtain the speaking emotion features;
the voice conference system comprises a voice conference system, a sample entropy characteristic acquisition system, a voice conference system and a voice recognition system, wherein the voice conference system is used for acquiring a speech emotion characteristic of each conference voice, the voice emotion characteristic acquisition system is used for acquiring a sample entropy characteristic of the voice conference voice, the sample entropy characteristic acquisition system is used for acquiring a fusion characteristic, emotion classification is performed based on the similarity between the fusion characteristic and a preset emotion characteristic, and the preset emotion characteristic can be set according to requirements and is used for representing the characteristic of a corresponding speech emotion on the voice frequency.
Optionally, in this embodiment, when the speaker type of the conference voice is a multi-person speaker type, the voice text corresponding to each separated audio is recorded in the conference record, and the voice speaker, the speaking emotion feature, and the item information are marked for each separated audio, so that the accuracy of the conference record is improved.
In this embodiment, the sample entropy features of each conference voice are acquired and silence detection is performed on them; based on the voice start point and voice end point obtained from the silence detection result, noise and silence in each conference voice can be effectively removed, improving the accuracy of the voice signals. The speaker type of each conference voice is then determined from its audio features, and voice separation is performed according to the speaker type, improving the accuracy of voice separation and yielding the separated audio corresponding to each voice speaker. The voice speaker corresponding to each audio feature is associated with the corresponding separated audio, which makes it straightforward to determine the correspondence between each conference voice and its voice speaker. Finally, the audio features and the sample entropy features are fused into fusion features, and emotion classification based on the similarity between the fusion features and the preset emotion features yields the speaking emotion feature of each conference voice.
Referring to fig. 3, fig. 3 is a block diagram of a conference recording apparatus 100 according to an embodiment of the present disclosure. The conference recording apparatus 100 in this embodiment includes units for executing the steps in the embodiments corresponding to fig. 1 and fig. 2. Please refer to fig. 1 and fig. 2 and the related descriptions in the embodiments corresponding to fig. 1 and fig. 2. For convenience of explanation, only the portions related to the present embodiment are shown. Referring to fig. 3, the conference recording apparatus 100 includes: a recognition unit 10, a text labeling unit 11 and an item labeling unit 12, wherein:
and the recognition unit 10 is configured to collect voices of conference persons in the conference room to obtain conference voices, and perform voiceprint recognition and emotion recognition on each conference voice to obtain voice speakers and speaking emotion characteristics corresponding to the corresponding conference voices.
Wherein the identification unit 10 is further configured to: acquiring sample entropy characteristics of each conference voice, and performing silence detection according to the sample entropy characteristics;
performing voice filtering on the conference voices according to the silence detection result, and acquiring the audio features of the conference voices after the voice filtering;
determining the speaker type of the corresponding conference voice according to the audio characteristics, wherein the speaker type comprises a single-person speech type and a multi-person speech type, and performing voice separation on the corresponding conference voice according to the audio characteristics and the speaker type to obtain separated audio;
determining voice speakers of corresponding separated audios according to the audio features, and enabling the voice speakers corresponding to the same audio features to form corresponding relations with the separated audios;
and performing feature fusion on the audio features and the sample entropy features to obtain fusion features, and performing emotion classification on the fusion features to obtain the speaking emotion features.
Optionally, the identification unit 10 is further configured to: and matching the audio features with a pre-stored feature query table to obtain the speaker type, wherein the feature query table stores different preset audio features and corresponding relations between preset feature combinations and corresponding speaker types.
Further, the identification unit 10 is further configured to: when the speaker type of the conference voice is a multi-person speaker type, acquiring a preset feature combination corresponding to the audio feature in the feature query table, and determining an audio sub-feature corresponding to the preset feature combination;
and respectively determining the voice position of each audio sub-feature in the corresponding conference voice, and performing voice separation on the conference voice according to the voice position to obtain the separated audio corresponding to each audio sub-feature.
Further, the identification unit 10 is further configured to: and respectively extracting one or more combinations of frequency cepstrum coefficients, pitch periods, zero crossing rates, energy root-mean-square coefficients or spectrum flat coefficients of the conference voices after voice filtering to obtain the audio features.
And the text marking unit 11 is configured to mark a voice text corresponding to the conference voice according to the voice speaker and the speaking emotion feature respectively corresponding to each conference voice, and write the marked voice text into the conference record.
And an item marking unit 12, configured to determine item information of the corresponding conference voice according to the voice text, and perform information marking on the conference record according to the determined item information, where the item information is used to represent information of an item described by the corresponding conference voice.
Wherein the item tagging unit 12 is further configured to: performing word segmentation on the voice text to obtain word segmentation vocabularies, and matching each word segmentation vocabulary with a pre-stored item query table, wherein the item query table stores the corresponding relation between a specified vocabulary and corresponding item information;
and if the word segmentation vocabulary is matched with the item query table, carrying out information marking on the corresponding voice text in the conference record by the matched item information.
Optionally, the item tagging unit 12 is further configured to: respectively acquiring voice acquisition time of each conference voice, and sequencing voice texts corresponding to each conference voice according to the voice acquisition time to obtain the conference record;
and in the conference record, aiming at the same conference voice, carrying out speaker marking and emotion marking on a voice text corresponding to the corresponding conference voice according to the voice speaker and the speaking emotion characteristics.
In the embodiment, the voice speaker and the speaking emotion characteristics of each conference voice are determined by performing voiceprint recognition and emotion recognition on each conference voice, the corresponding voice text is automatically marked based on the voice speaker and the speaking emotion characteristics, the marked voice text is written into the conference record, manual marking of the conference record is not needed, and the conference record generation efficiency is improved.
Fig. 4 is a block diagram of a computer device 2 according to another embodiment of the present application. As shown in fig. 4, the computer device 2 of this embodiment includes: a processor 20, a memory 21 and a computer program 22, such as a program of a conference recording method, stored in said memory 21 and executable on said processor 20. The processor 20, when executing the computer program 22, implements the steps in the embodiments of the conference recording methods described above, such as S10-S30 shown in fig. 1, or S11-S15 shown in fig. 2. Alternatively, when the processor 20 executes the computer program 22, the functions of the units in the embodiment corresponding to fig. 3, for example, the functions of the units 10 to 12 shown in fig. 3, are implemented, for which reference is specifically made to the relevant description in the embodiment corresponding to fig. 3, which is not repeated herein.
Illustratively, the computer program 22 may be divided into one or more units, which are stored in the memory 21 and executed by the processor 20 to accomplish the present application. The one or more units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 22 in the computer device 2. For example, the computer program 22 may be divided into a recognition unit 10, a text tagging unit 11 and an item tagging unit 12, each of which functions specifically as described above.
The computer device may include, but is not limited to, a processor 20, a memory 21. Those skilled in the art will appreciate that fig. 4 is merely an example of a computer device 2 and is not intended to limit the computer device 2 and may include more or fewer components than shown, or some components may be combined, or different components, e.g., the computer device may also include input output devices, network access devices, buses, etc.
The processor 20 may be a Central Processing Unit (CPU), other general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 21 may be an internal storage unit of the computer device 2, such as a hard disk or a memory of the computer device 2. The memory 21 may also be an external storage device of the computer device 2, such as a plug-in hard disk, a Smart Memory Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the computer device 2. Further, the memory 21 may also include both an internal storage unit and an external storage device of the computer device 2. The memory 21 is used for storing the computer program and other programs and data required by the computer device. The memory 21 may also be used to temporarily store data that has been output or is to be output.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated module, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in a computer-readable storage medium. The computer-readable storage medium may be non-volatile or volatile. Based on this understanding, all or part of the flow in the methods of the embodiments described above can be realized by a computer program, which can be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable storage medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable storage medium may be appropriately increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, the computer-readable storage medium does not include electrical carrier signals and telecommunications signals.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A conference recording method, comprising:
collecting voices of conference personnel in a conference room to obtain conference voices, and respectively carrying out voiceprint recognition and emotion recognition on each conference voice to obtain voice speakers and speaking emotion characteristics corresponding to the corresponding conference voices;
according to the voice speakers and speaking emotion characteristics corresponding to the conference voices respectively, marking voice texts corresponding to the corresponding conference voices, and writing the marked voice texts into conference records;
and determining project information of corresponding conference voice according to the voice text, and performing information marking on the conference record according to the determined project information, wherein the project information is used for representing the information of the project described by the corresponding conference voice.
2. The conference recording method according to claim 1, wherein the performing voiceprint recognition and emotion recognition on each conference voice respectively to obtain voice speakers and speaking emotion characteristics corresponding to the corresponding conference voice comprises:
acquiring sample entropy characteristics of each conference voice, and performing silence detection according to the sample entropy characteristics;
performing voice filtering on the conference voices according to the silence detection result, and acquiring the audio features of the conference voices after the voice filtering;
determining the speaker type of the corresponding conference voice according to the audio characteristics, wherein the speaker type comprises a single-person speech type and a multi-person speech type, and performing voice separation on the corresponding conference voice according to the audio characteristics and the speaker type to obtain separated audio;
determining voice speakers of corresponding separated audios according to the audio features, and enabling the voice speakers corresponding to the same audio features to form corresponding relations with the separated audios;
and performing feature fusion on the audio features and the sample entropy features to obtain fusion features, and performing emotion classification on the fusion features to obtain the speaking emotion features.
3. The conference recording method according to claim 2, wherein said determining a speaker type of a corresponding conference voice according to the audio feature comprises:
and matching the audio features with a pre-stored feature query table to obtain the speaker type, wherein the feature query table stores different preset audio features and corresponding relations between preset feature combinations and corresponding speaker types.
4. The conference recording method according to claim 2, wherein said voice-separating the corresponding conference voice according to the audio feature and the speaker type to obtain a separated audio comprises:
when the speaker type of the conference voice is a multi-person speaker type, acquiring a preset feature combination corresponding to the audio feature in the feature query table, and determining an audio sub-feature corresponding to the preset feature combination;
and respectively determining the voice position of each audio sub-feature in the corresponding conference voice, and performing voice separation on the conference voice according to the voice position to obtain the separated audio corresponding to each audio sub-feature.
5. The conference recording method according to claim 2, wherein the obtaining the audio feature of the conference voices after voice filtering comprises:
and respectively extracting one or more combinations of frequency cepstrum coefficients, pitch periods, zero crossing rates, energy root-mean-square coefficients or spectrum flat coefficients of the conference voices after voice filtering to obtain the audio features.
6. The method of claim 1, wherein the information tagging of the meeting record according to the determined item information comprises:
performing word segmentation on the voice text to obtain word segmentation vocabularies, and matching each word segmentation vocabulary with a pre-stored item query table, wherein the item query table stores the corresponding relation between a specified vocabulary and corresponding item information;
and if the word segmentation vocabulary is matched with the item query table, carrying out information marking on the corresponding voice text in the conference record by the matched item information.
7. The conference recording method according to any one of claims 1 to 6, wherein the marking a voice text corresponding to the corresponding conference voice according to the voice speaker and the speaking emotion feature respectively corresponding to each conference voice, and writing the marked voice text into the conference recording comprises:
respectively acquiring voice acquisition time of each conference voice, and sequencing voice texts corresponding to each conference voice according to the voice acquisition time to obtain the conference record;
and in the conference record, aiming at the same conference voice, carrying out speaker marking and emotion marking on a voice text corresponding to the corresponding conference voice according to the voice speaker and the speaking emotion characteristics.
8. A conference recording apparatus, comprising:
the identification unit is used for collecting the voices of conference personnel in the conference room to obtain conference voices, and respectively carrying out voiceprint identification and emotion identification on each conference voice to obtain voice speakers and speaking emotion characteristics corresponding to the corresponding conference voices;
the text marking unit is used for marking the voice text corresponding to the corresponding conference voice according to the voice speaker and the speaking emotion characteristics respectively corresponding to each conference voice, and writing the marked voice text into the conference record;
and the item marking unit is used for determining item information of the corresponding conference voice according to the voice text and marking the conference record with information according to the determined item information, wherein the item information is used for representing the information of the item described by the corresponding conference voice.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202110978838.8A 2021-08-25 2021-08-25 Conference recording method, conference recording device, computer equipment and medium Pending CN113691382A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110978838.8A CN113691382A (en) 2021-08-25 2021-08-25 Conference recording method, conference recording device, computer equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110978838.8A CN113691382A (en) 2021-08-25 2021-08-25 Conference recording method, conference recording device, computer equipment and medium

Publications (1)

Publication Number Publication Date
CN113691382A true CN113691382A (en) 2021-11-23

Family

ID=78582285

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110978838.8A Pending CN113691382A (en) 2021-08-25 2021-08-25 Conference recording method, conference recording device, computer equipment and medium

Country Status (1)

Country Link
CN (1) CN113691382A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101291239A (en) * 2008-05-20 2008-10-22 华为技术有限公司 Method and apparatus for enhancing effect of meeting
CN108922538A (en) * 2018-05-29 2018-11-30 平安科技(深圳)有限公司 Conferencing information recording method, device, computer equipment and storage medium
CN109388701A (en) * 2018-08-17 2019-02-26 深圳壹账通智能科技有限公司 Minutes generation method, device, equipment and computer storage medium
WO2020218664A1 (en) * 2019-04-25 2020-10-29 이봉규 Smart conference system based on 5g communication and conference support method using robotic processing automation
CN111243590A (en) * 2020-01-17 2020-06-05 中国平安人寿保险股份有限公司 Conference record generation method and device
CN111666746A (en) * 2020-06-05 2020-09-15 中国银行股份有限公司 Method and device for generating conference summary, electronic equipment and storage medium
CN112017632A (en) * 2020-09-02 2020-12-01 浪潮云信息技术股份公司 Automatic conference record generation method
CN111933144A (en) * 2020-10-09 2020-11-13 融智通科技(北京)股份有限公司 Conference voice transcription method and device for post-creation of voiceprint and storage medium
CN112489625A (en) * 2020-10-19 2021-03-12 厦门快商通科技股份有限公司 Voice emotion recognition method, system, mobile terminal and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李公法, 蒋国璋, 孔建益, 江都, 陶波: "Human-Machine Interaction Technology of Robotic Dexterous Hands and Its Stable Control", Huazhong University of Science and Technology Press, 31 July 2020, page 13 *
王远昌: "The Age of Artificial Intelligence: Research on Electronic Product Design and Production", Chengdu: University of Electronic Science and Technology of China Press, 31 January 2019, pages 124-125 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115828907A (en) * 2023-02-16 2023-03-21 南昌航天广信科技有限责任公司 Intelligent conference management method, system, readable storage medium and computer equipment
CN115828907B (en) * 2023-02-16 2023-04-25 南昌航天广信科技有限责任公司 Intelligent conference management method, system, readable storage medium and computer device

Similar Documents

Publication Publication Date Title
Schuller et al. The INTERSPEECH 2021 computational paralinguistics challenge: COVID-19 cough, COVID-19 speech, escalation & primates
US10977299B2 (en) Systems and methods for consolidating recorded content
CN109741732B (en) Named entity recognition method, named entity recognition device, equipment and medium
CN110457432B (en) Interview scoring method, interview scoring device, interview scoring equipment and interview scoring storage medium
US10403282B2 (en) Method and apparatus for providing voice service
CN111785275A (en) Voice recognition method and device
JP2019053126A (en) Growth type interactive device
Sahoo et al. Emotion recognition from audio-visual data using rule based decision level fusion
CN109192194A (en) Voice data mask method, device, computer equipment and storage medium
CN112750442B (en) Crested mill population ecological system monitoring system with wavelet transformation and method thereof
CN111048095A (en) Voice transcription method, equipment and computer readable storage medium
CN110970036A (en) Voiceprint recognition method and device, computer storage medium and electronic equipment
US20220238118A1 (en) Apparatus for processing an audio signal for the generation of a multimedia file with speech transcription
CN111651497A (en) User label mining method and device, storage medium and electronic equipment
CN111402892A (en) Conference recording template generation method based on voice recognition
CN111145903A (en) Method and device for acquiring vertigo inquiry text, electronic equipment and inquiry system
CN109947971A (en) Image search method, device, electronic equipment and storage medium
CN116246610A (en) Conference record generation method and system based on multi-mode identification
Wagner et al. Applying cooperative machine learning to speed up the annotation of social signals in large multi-modal corpora
CN113923521B (en) Video scripting method
KR20170086233A (en) Method for incremental training of acoustic and language model using life speech and image logs
CN113691382A (en) Conference recording method, conference recording device, computer equipment and medium
CN109213970B (en) Method and device for generating notes
JP3664499B2 (en) Voice information processing method and apparatus
CN112687280B (en) Biodiversity monitoring system with frequency spectrum-time space interface

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination