CN111798871B - Session link identification method, device and equipment and storage medium


Info

Publication number
CN111798871B
Authority
CN
China
Prior art keywords
audio
model
paragraph
speaking
vector
Prior art date
Legal status
Active
Application number
CN202010933549.1A
Other languages
Chinese (zh)
Other versions
CN111798871A (en)
Inventor
魏海巍
万菲
Current Assignee
Gongdao Network Technology Co ltd
Original Assignee
Gongdao Network Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Gongdao Network Technology Co ltd
Priority to CN202010933549.1A
Publication of CN111798871A
Application granted
Publication of CN111798871B
Status: Active
Anticipated expiration


Classifications

    • G10L25/51 — Speech or voice analysis techniques (not restricted to a single one of groups G10L15/00–G10L21/00) specially adapted for comparison or discrimination
    • G10L15/08 — Speech recognition; speech classification or search
    • G10L17/06 — Speaker identification or verification; decision making techniques, pattern matching strategies
    • G10L25/03 — Speech or voice analysis characterised by the type of extracted parameters
    • G10L25/27 — Speech or voice analysis characterised by the analysis technique
    • G10L25/57 — Speech or voice analysis specially adapted for comparison or discrimination for processing of video signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a session link identification method, apparatus, device, and storage medium, which can determine the session link corresponding to an audio paragraph in a conversation audio. The method includes the following steps: obtaining the speech content and the audio feature information of the speaking roles in a target audio paragraph to be identified in the conversation audio, where the speech in the target audio paragraph comes from one or more speaking roles; determining a target feature vector corresponding to the target audio paragraph according to the speech content and the audio feature information; and inputting the target feature vector into a trained session link identification model to obtain the session link, output by the model, that corresponds to the target audio paragraph.

Description

Session link identification method, device and equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a session link identification method, apparatus, device, and storage medium.
Background
In some conversation scenes, the conversation process is recorded as audio and video, so that the whole conversation can be reviewed by playing back the recording. For example, in judicial adjudication, a full-process video (namely, the court trial video) is usually produced during a court trial and is paired with synchronized audio (the court trial audio) to record the activities of all trial participants. The court trial audio and video are very important for the judges, court clerks, and other participants in the trial; recordings that are allowed to be made public are also valuable case references for judges, lawyers, legal service workers, judicial practitioners, and mediation organizations that did not take part in the trial, and they can play a positive educational and guiding role for the general public.
However, the playing time of such audio and video is generally long. If a viewer wants to watch a certain key link or a link of interest, the viewer has to drag the progress bar bit by bit to find it, which is inefficient. If it could be determined in advance which paragraphs of the audio and video correspond to which links, viewers could be helped to quickly locate the link they need to watch.
Disclosure of Invention
In view of this, the present invention provides a session link identification method, apparatus, device, and storage medium, which can determine the session link corresponding to an audio paragraph in a conversation audio.
The first aspect of the present invention provides a session link identification method, including:
obtaining speaking content and audio characteristic information of a speaking role in a target audio paragraph to be identified from the conversation audio, wherein the speaking of the target audio paragraph is from one or more speaking roles;
determining a target characteristic vector corresponding to the target audio paragraph according to the speaking content and the audio characteristic information;
and inputting the target feature vector into a trained session link identification model to obtain the session link, output by the model, that corresponds to the target audio paragraph.
According to an embodiment of the present invention, obtaining the speaking content of the speaking role in a target audio paragraph from the target audio paragraph to be identified in a conversation audio includes:
performing audio recognition on the target audio paragraph to obtain at least one recognized reference sentence;
inputting the reference sentence into a trained error correction model, wherein the error correction model is used for correcting the error content in the reference sentence to obtain a candidate sentence output by the correction model;
determining the utterance content based on the candidate sentence.
According to an embodiment of the present invention, acquiring audio feature information of a speaking role in a target audio paragraph from the target audio paragraph to be identified in a conversation audio includes:
inputting the target audio paragraph to a trained audio feature extractor to obtain audio feature information output by the audio feature extractor; the audio feature extractor at least comprises an extraction layer used for extracting audio features from an input audio paragraph and an embedding layer used for embedding and expressing the audio features and outputting the expressed audio feature information;
and determining the audio characteristic information output by the audio characteristic extractor as the audio characteristic information of the speaking role in the target audio paragraph.
According to an embodiment of the present invention, determining a target feature vector corresponding to the target audio paragraph according to the utterance content and the audio feature information includes:
inputting the speaking content and the audio characteristic information into a trained vector model, and determining and outputting a corresponding characteristic vector based on the input speaking content and the audio characteristic information by the vector model;
and determining the feature vector output by the vector model as the target feature vector.
According to one embodiment of the invention, the vector model is trained by:
obtaining a first set of audio paragraph samples comprising audio paragraph samples partitioned from a plurality of conversational audios, a speech in each audio paragraph sample being from a speech role;
for each audio paragraph sample in the first audio paragraph sample set, acquiring the speech content and the audio feature information of the speaking role in the audio paragraph sample from the audio paragraph sample, combining the speech content and the audio feature information of the speaking role in the audio paragraph sample into first sample data, and calibrating the speaking role corresponding to the audio paragraph sample as the label information corresponding to the first sample data;
and training the vector model by using each first sample data and the corresponding label information.
According to an embodiment of the present invention, training the vector model by using each first sample data and the corresponding label information includes:
establishing a first model and a second model;
selecting at least one first sample data from the first sample data, inputting the selected first sample data into the first model, extracting a feature vector from the input first sample data by the first model, and outputting the feature vector to the second model, wherein the second model predicts and outputs a speaking role based on the feature vector output by the first model;
optimizing the first model according to the label information corresponding to the selected first sample data and the speaking role output by the second model; and when the training completion condition is not met currently, returning to the step of selecting at least one first sample data from the first sample data, and when the training completion condition is met currently, determining the first model as the vector model.
According to one embodiment of the present invention, the session identification model is obtained by training in the following manner:
obtaining a second set of audio paragraph samples comprising audio paragraph samples partitioned from a plurality of conversational audios, each audio paragraph sample having speech from one or more speech roles and each audio paragraph sample corresponding to a conversational link;
for each audio paragraph sample in the second audio paragraph sample set, obtaining the speech content and the audio feature information of the speaking roles in the audio paragraph sample from the audio paragraph sample, inputting the speech content and the audio feature information into the vector model to obtain a feature vector output by the vector model as second sample data, and calibrating the session link corresponding to the audio paragraph sample as the class label corresponding to the second sample data;
and training the session link identification model by using each second sample data and the corresponding class label.
A second aspect of the present invention provides a session link identification apparatus, including:
the information acquisition module is used for acquiring the speaking content and the audio characteristic information of a speaking role in a target audio paragraph to be identified from the conversation audio, wherein the speaking of the target audio paragraph is from one or more speaking roles;
a target feature vector determining module, configured to determine a target feature vector corresponding to the target audio paragraph according to the speech content and the audio feature information;
and the session link identification module is used for inputting the target characteristic vector to a trained session link identification model so as to obtain a session link corresponding to the target audio paragraph output by the session link identification model.
According to an embodiment of the present invention, when the information obtaining module obtains the speech content of the speech role in the target audio paragraph from the target audio paragraph to be identified in the conversation audio, the information obtaining module is specifically configured to:
performing audio recognition on the target audio paragraph to obtain at least one recognized reference sentence;
inputting the reference sentence into a trained error correction model, wherein the error correction model is used for correcting the error content in the reference sentence to obtain a candidate sentence output by the correction model;
determining the utterance content based on the candidate sentence.
According to an embodiment of the present invention, when the information obtaining module obtains the audio feature information of the speaking role in the target audio paragraph from the target audio paragraph to be identified in the conversation audio, the information obtaining module is specifically configured to:
inputting the target audio paragraph to a trained audio feature extractor to obtain audio feature information output by the audio feature extractor; the audio feature extractor at least comprises an extraction layer used for extracting audio features from an input audio paragraph and an embedding layer used for embedding and expressing the audio features and outputting the expressed audio feature information;
and determining the audio characteristic information output by the audio characteristic extractor as the audio characteristic information of the speaking role in the target audio paragraph.
According to an embodiment of the present invention, when the target feature vector determining module determines the target feature vector corresponding to the target audio paragraph according to the speech content and the audio feature information, the target feature vector determining module is specifically configured to:
inputting the speaking content and the audio characteristic information into a trained vector model, and determining and outputting a corresponding characteristic vector based on the input speaking content and the audio characteristic information by the vector model;
and determining the feature vector output by the vector model as the target feature vector.
According to one embodiment of the present invention, the vector model is trained by:
a first set obtaining module, configured to obtain a first audio paragraph sample set, where the first audio paragraph sample set includes audio paragraph samples divided from a plurality of conversation audios, the speech in each audio paragraph sample comes from one speaking role, and each audio paragraph sample corresponds to that speaking role;
a first sample data obtaining module, configured to: for each audio paragraph sample in the first audio paragraph sample set, obtain the speech content and the audio feature information of the speaking role in the audio paragraph sample from the audio paragraph sample, combine the speech content and the audio feature information of the speaking role in the audio paragraph sample into first sample data, and calibrate the speaking role corresponding to the audio paragraph sample as the label information corresponding to the first sample data;
and the vector model training module is used for training the vector model by utilizing each first sample data and the corresponding label information.
According to an embodiment of the present invention, when the vector model training module trains the vector model by using each first sample data and the corresponding label information, the vector model training module is specifically configured to:
establishing a first model and a second model;
selecting at least one first sample data from the first sample data, inputting the selected first sample data into the first model, extracting a feature vector from the input first sample data by the first model, and outputting the feature vector to the second model, wherein the second model predicts and outputs a speaking role based on the feature vector output by the first model;
optimizing the first model according to the label information corresponding to the selected first sample data and the speaking role output by the second model; and when the training completion condition is not met currently, returning to the step of selecting at least one first sample data from the first sample data, and when the training completion condition is met currently, determining the first model as the vector model.
According to one embodiment of the invention, the session link identification model is obtained through training by means of the following modules:
a second set obtaining module, configured to obtain a second audio paragraph sample set, where the second audio paragraph sample set includes audio paragraph samples divided from a plurality of conversation audios, where a speech in each audio paragraph sample is from one or more speech roles, and each audio paragraph sample corresponds to a conversation link;
a second sample data obtaining module, configured to: for each audio paragraph sample in the second audio paragraph sample set, obtain the speech content and the audio feature information of the speaking roles in the audio paragraph sample from the audio paragraph sample, input the speech content and the audio feature information into the vector model to obtain a feature vector output by the vector model as second sample data, and calibrate the session link corresponding to the audio paragraph sample as the category label corresponding to the second sample data;
and the session link identification model training module is used for training the session link identification model by using each second sample data and the corresponding class label.
A third aspect of the invention provides an electronic device comprising a processor and a memory; the memory stores a program that can be called by the processor; when the processor executes the program, the session link identification method in the foregoing embodiments is implemented.
A fourth aspect of the present invention provides a machine-readable storage medium on which a program is stored; when the program is executed by a processor, the session link identification method described in the foregoing embodiments is implemented.
The embodiment of the invention has the following beneficial effects:
In the embodiment of the invention, the corresponding target feature vector is determined based on the obtained speech content and audio feature information of the speaking roles in the target audio paragraph. Because the target feature vector is a vector representation that integrates the speech content and the audio feature information, the speech content can represent what is said in the target audio paragraph, and the audio feature information can represent which speaking roles are speaking in it. In practice, knowing which speech was spoken by which speaking role is enough to determine in which session link that speech occurs. Based on this characteristic, a session link identification model can be trained in advance; after the target feature vector is input into the trained model, the session link corresponding to the target audio paragraph can be identified by the model. A viewer can thus quickly learn which audio paragraphs correspond to which session links without searching manually, which helps the viewer quickly locate the session link to be viewed and improves the viewing experience.
Drawings
Fig. 1 is a schematic flow chart of a session identification method according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating the division of audio segments in conversational audio according to an embodiment of the invention;
fig. 3 is a block diagram of a session identification apparatus according to an embodiment of the present invention;
fig. 4 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one type of device from another. For example, a first device may also be referred to as a second device, and similarly, a second device may also be referred to as a first device, without departing from the scope of the present invention. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining", depending on the context.
In order to make the description of the present invention clearer and more concise, some technical terms in the present invention are explained below:
NLP: natural Language Processing, which is a sub-field of computer science, information engineering and artificial intelligence, focuses on human-computer interaction, especially Processing and analysis of large-scale Natural Language data, and is a study on how to make a computer understand human Language, understand the meaning of Natural Language text by computer mechanism, and express given deep intentions, ideas and the like by Natural Language text.
Transformer: a deep learning model integrated with a self-attention mechanism can learn the context relationship between words (or sub-words) in a text.
BERT: short for Bidirectional Encoder Representations from Transformers; an NLP pre-training model, specifically a bidirectional language model based on the Transformer.
VGGish: a model pre-trained on the large audio data set AudioSet, where "VGG" refers to the Visual Geometry Group of Oxford University; it supports extracting 128-dimensional embedding feature vectors with semantics from audio waveforms.
The session link identification method according to the embodiments of the present invention is described in more detail below, but the invention is not limited thereto.
In one embodiment, referring to fig. 1, a session link identification method applied to an electronic device may include the following steps:
s100: obtaining speaking content and audio characteristic information of a speaking role in a target audio paragraph to be identified from the conversation audio, wherein the speaking of the target audio paragraph is from one or more speaking roles;
s200: determining a target characteristic vector corresponding to the target audio paragraph according to the speaking content and the audio characteristic information;
s300: and inputting the target feature vector into a trained session link identification model to obtain the session link, output by the model, that corresponds to the target audio paragraph.
In the embodiment of the present invention, the execution subject of the session identification method may be an electronic device. The electronic device may be, for example, a computer device or an embedded device. Of course, the specific type of the electronic device is not limited, and the electronic device may have a certain processing capability.
The embodiment of the invention can be applied to the playing of conversation audio and/or conversation video. It can identify the session link corresponding to an audio paragraph in the conversation audio and, when the conversation audio and the conversation video are played synchronously, determine the session link corresponding to the corresponding video paragraph, which helps a viewer quickly locate a key session link or a link of interest.
For example, the embodiments of the present invention may be applied to playing of court trial audio and/or court trial video, and may identify which audio segment corresponds to which court trial link, so as to facilitate the viewer to locate the required court trial link.
Before step S100 is executed, the conversation audio may be divided according to a set audio paragraph division manner to obtain a plurality of audio paragraphs.
Optionally, the speech in each divided audio segment is from the same speech role, and the speech in two adjacent audio segments is from different speech roles, for example, three speech roles a1, a2, and A3 speak successively in the conversation audio, and three audio segments are correspondingly divided, where the speech in the three audio segments is from speech roles a1, a2, and A3, and the three audio segments do not have overlapping portions.
Alternatively, the division may still follow the speaking roles, but the speech in each divided audio paragraph comes from more than one speaking role, and two adjacent audio paragraphs overlap. For example, referring to fig. 2, three speaking roles a1, a2, and A3 speak successively in the conversation audio, and three audio paragraphs are divided accordingly: the speech in audio paragraph B1 comes from speaking roles a1 and a2 (a complete speech of a1 plus part of a2's speech), the speech in audio paragraph B2 comes from speaking roles a1, a2, and A3 (a complete speech of a2 plus parts of a1's and A3's speech), and the speech in audio paragraph B3 comes from speaking roles a2 and A3 (a complete speech of A3 plus part of a2's speech).
Here, only three speaking roles a1, a2, and A3 are used as an example; in practice there may be more speaking roles. For example, in a court trial scenario, the speaking roles may include the judge, the plaintiff, the plaintiff's agent, the defendant, the defendant's agent, and so on, and each of these speaking roles may speak multiple times.
It is to be understood that the above division manners are only examples, and there may be others; for example, the conversation audio may be divided into a plurality of audio paragraphs of equal duration, and the like, which is not limited by the present invention.
The divided audio segments may be saved, and a timestamp corresponding to the audio segment may be saved, where the timestamp may be a starting time of the corresponding audio segment in the conversation audio.
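To make the division and timestamp bookkeeping concrete, the following sketch (all names are hypothetical) assumes the conversation audio has already been diarized into (start, end, speaking role) turns and groups consecutive turns of the same speaking role into non-overlapping audio paragraphs, keeping each paragraph's start time as its timestamp:

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class AudioParagraph:
        start: float   # timestamp: start time of the paragraph in the conversation audio (seconds)
        end: float     # end time (seconds)
        role: str      # the speaking role whose speech makes up this paragraph

    def split_by_role(turns: List[Tuple[float, float, str]]) -> List[AudioParagraph]:
        """First division scheme above: one role per paragraph, no overlap between paragraphs."""
        paragraphs: List[AudioParagraph] = []
        for start, end, role in turns:
            if paragraphs and paragraphs[-1].role == role:
                paragraphs[-1].end = end    # same role keeps talking: extend the current paragraph
            else:
                paragraphs.append(AudioParagraph(start, end, role))
        return paragraphs

    # Example: A1, A2 and A3 speak in turn, giving three non-overlapping paragraphs.
    print(split_by_role([(0.0, 30.5, "A1"), (30.5, 55.0, "A2"), (55.0, 80.0, "A3")]))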
In step S100, from a target audio paragraph to be identified in a conversation audio, speech content and audio feature information of a speech role in the target audio paragraph are acquired, where the speech of the target audio paragraph is from one or more speech roles.
The target audio paragraph may be any audio paragraph divided from the conversation audio, and its speech may come from one or more speaking roles, for example a complete speech of one speaking role, or a complete speech of one speaking role plus partial speech of other speaking roles. Preferably, the speech in the target audio paragraph comes from one, two, or three speaking roles.
For example, in a court trial scenario, the speaking roles in the session may include the judge, the plaintiff, the plaintiff's agent, the defendant, and the defendant's agent; of course, the speaking roles are not limited thereto and may also include witnesses, people's assessors, the court clerk, and so on.
The utterance content of the speaking roles in the target audio paragraph is obtained from the target audio paragraph, for example by means of audio recognition (Automatic Speech Recognition, ASR). The utterance content may be represented as text or in other forms.
In one embodiment, in step S100, obtaining the speaking content of the speaking role in the target audio paragraph from the target audio paragraph to be identified in the conversation audio may include the following steps:
s101: performing audio recognition on the target audio paragraph to obtain at least one recognized reference sentence;
s102: inputting the reference sentence into a trained error correction model, wherein the error correction model is used for correcting the error content in the reference sentence to obtain a candidate sentence output by the correction model;
s103: determining the utterance content based on the candidate sentence.
Audio recognition is a technology for converting human speech into text and may be implemented with a deep learning algorithm, which is not specifically limited. When the audio is recognized, if the speech contains only one sentence, one reference sentence is obtained; if the speech contains a plurality of sentences, it can be automatically split into a plurality of reference sentences, one reference sentence per sentence.
Although current audio recognition technology is mature, problems such as incorrect sentence breaks, ill-formed sentences, and wrong words cannot be completely avoided, so the reference sentences can be corrected after they are obtained.
The error correction process may be implemented by a trained error correction model, which is pre-trained and stored in the electronic device or other device and invoked when needed. The error correction model can be obtained by training some sentence samples with known error content, and details are not repeated.
The reference sentences are input into the error correction model so that the error correction model corrects the errors in them, including broken sentences, ill-formed sentences, wrong words, and the like, to obtain the candidate sentences.
When the utterance content is determined based on the candidate sentences output by the error correction model, the candidate sentences may be directly combined into the utterance content. Alternatively, when downstream sentence length is limited, a candidate sentence whose length exceeds the set length may be split into several sentences whose lengths do not exceed the set length, and the split sentences together with the candidate sentences not exceeding the set length are combined into the utterance content. For example, if sentence length is limited to 512 characters and a candidate sentence is longer than 512, the candidate sentence may be cut into at least two sentences of no more than 512 characters each, so that the utterance content consists of sentences no longer than 512 characters.
It is to be understood that the above manner of determining the utterance content is only a preferred manner, and the present invention is not limited thereto. For example, when the accuracy requirement is not high, the reference sentences obtained by audio recognition may be directly combined into the utterance content; or, when there is a sentence length limit, over-long reference sentences may be truncated before being combined into the utterance content.
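A minimal sketch of steps S101 to S103; the callables asr and error_correct are hypothetical stand-ins for the trained audio recognition engine and error correction model (not a specific library API), and the 512-character limit follows the example above:

    def get_speech_content(audio_paragraph, asr, error_correct, max_len=512):
        """Return the utterance content of an audio paragraph as a list of sentences."""
        reference_sentences = asr(audio_paragraph)                      # S101: audio recognition
        candidates = [error_correct(s) for s in reference_sentences]    # S102: error correction
        content = []                                                    # S103: assemble content
        for sentence in candidates:
            if len(sentence) <= max_len:
                content.append(sentence)
            else:
                # cut an over-long candidate sentence into pieces of at most max_len characters
                content.extend(sentence[i:i + max_len]
                               for i in range(0, len(sentence), max_len))
        return content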
In addition to the utterance content, audio feature information of the speaking roles in the target audio paragraph is also obtained from the target audio paragraph. The audio feature information can characterize the voiceprints of the speaking roles in the target audio paragraph. A voiceprint is both specific and relatively stable, and the speaking role from which a speech comes can be determined by its voiceprint, so the audio feature information can also represent the speaking roles in the target audio paragraph.
In one embodiment, in step S100, obtaining audio feature information of a speaking character in a target audio paragraph to be identified in the conversation audio, may include the following steps:
s104: inputting the target audio paragraph to a trained audio feature extractor to obtain audio feature information output by the audio feature extractor; the audio feature extractor at least comprises an extraction layer used for extracting audio features from an input audio paragraph and an embedding layer used for embedding and expressing the audio features and outputting the expressed audio feature information;
s105: and determining the audio characteristic information output by the audio characteristic extractor as the audio characteristic information of the speaking role in the target audio paragraph.
Steps S104 and S105 may be executed before steps S101 to S103, or after steps S101 and S103, or both, and the specific sequence is not limited.
The audio feature extractor is pre-trained and may be stored in the electronic device or other device and invoked when needed.
The audio feature extractor may include an extraction layer for extracting audio features from an input audio paragraph, and an embedding layer for performing an embedded expression of the audio features and outputting the expressed audio feature information. The extraction layer may be composed of a plurality of processing sublayers and may include, for example, a sampling layer, a short-time Fourier transform (STFT) layer, a filter layer, and the like, which are not specifically limited. The embedding layer performs the embedded expression of the audio features and outputs the expressed audio feature information, where embedded expression refers to converting (dimension-reducing) the data into a fixed-size feature representation (vector) to facilitate processing and computation (such as distance computation), for example converting the audio features into a 128-dimensional embedding feature vector with semantics.
In one example, a VGGish model may be used as the audio feature extractor, although not particularly limited thereto.
And inputting the target audio paragraph into an audio feature extractor, wherein the audio feature extractor can extract the audio feature of the target audio paragraph and perform embedded expression on the audio feature to obtain an embedding feature vector with semantics as audio feature information.
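The toy sketch below mirrors the two-part structure of the audio feature extractor (an extraction layer of framing, short-time Fourier transform, and pooling, followed by an embedding layer mapping to a 128-dimensional vector) using only NumPy; a real system would instead load a pretrained VGGish model, and the random projection here merely stands in for learned embedding weights:

    import numpy as np

    def extract_audio_feature(waveform: np.ndarray, n_fft: int = 1024,
                              hop: int = 512, embed_dim: int = 128) -> np.ndarray:
        """Toy audio feature extractor: extraction layer + embedding layer."""
        # extraction layer: windowed frames -> magnitude spectrogram -> pooled log spectrum
        frames = [waveform[i:i + n_fft] * np.hanning(n_fft)
                  for i in range(0, len(waveform) - n_fft, hop)]
        spec = np.abs(np.fft.rfft(np.stack(frames), axis=1))
        log_spec = np.log1p(spec).mean(axis=0)
        # embedding layer: project to a fixed-size "semantic" vector (weights are random here)
        projection = np.random.default_rng(0).standard_normal((log_spec.shape[0], embed_dim))
        return log_spec @ projection

    audio = np.sin(2 * np.pi * 220 * np.linspace(0, 1, 16000))  # 1 s of dummy audio at 16 kHz
    print(extract_audio_feature(audio).shape)                   # (128,)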
The audio characteristic information can represent the speaking role and can be used as an additional characteristic of speaking content, and information with semantic representation is provided for identification of a conversation link.
In step S200, a target feature vector corresponding to the target audio paragraph is determined according to the speaking content and the audio feature information.
The speaking content and the audio characteristic information can be processed to obtain a vector suitable for machine learning model input, namely a target characteristic vector. That is, the utterance content and the audio feature information are integrated and expressed by a target feature vector.
The target feature vector is a multi-dimensional vector, and the specific dimension is not limited. The data in each dimension in the target feature vector may have a certain value range, for example, the data in each dimension may be in a range of 0 to 1 (of course, this range is only an example, and may also be other ranges), and the complexity of the machine learning model calculation may be reduced.
The method for determining the target feature vector is not limited, for example, the speech content and the audio feature information may be fused together after vectorization to obtain the target feature vector, and of course, the specific method is not limited thereto.
In one embodiment, in step S200, determining a target feature vector corresponding to the target audio paragraph according to the speaking content and the audio feature information may include the following steps:
s201: inputting the speaking content and the audio characteristic information into a trained vector model, and determining and outputting a corresponding characteristic vector based on the input speaking content and the audio characteristic information by the vector model;
s202: and determining the feature vector output by the vector model as the target feature vector.
The vector model respectively extracts the features of the input speech content and the input audio feature information to obtain a speech content feature vector and an audio feature vector, and the speech content feature vector and the audio feature vector are fused to obtain a corresponding feature vector to be output.
The fusing the speech content feature vector and the audio feature vector may include: splicing the speech content feature vector and the audio feature vector to obtain a corresponding feature vector; for example, the speech content feature vector is a 512-dimensional vector, the audio feature vector is a 128-dimensional vector, and the two vectors are spliced to obtain a corresponding feature vector.
Or, fusing the speech content feature vector and the audio feature vector may include: under the condition that the dimensionalities of the speech content feature vector and the audio feature vector are the same, summing the speech content feature vector and the audio feature vector to obtain corresponding feature vectors; under the condition that the dimensionalities of the speech content characteristic vector and the audio characteristic vector are different, the vector with the smaller dimensionality in the speech content characteristic vector and the audio characteristic vector can be expanded into the dimensionality which is consistent with the dimensionality of the other vector, and after the expansion, the two vectors are summed to obtain the corresponding characteristic vector.
Of course, the above fusion method is only an example, and there may be other methods actually, for example, the summation may also be weighted summation, and the weight coefficient of the speech content feature vector may be larger, which is not limited specifically.
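The fusion options just described can be sketched as follows; zero-padding is one assumed way to expand the lower-dimensional vector, and the 0.7 weight on the speech-content vector is purely illustrative:

    import numpy as np

    def fuse(text_vec: np.ndarray, audio_vec: np.ndarray,
             mode: str = "concat", text_weight: float = 0.7) -> np.ndarray:
        """Fuse the speech-content feature vector and the audio feature vector."""
        if mode == "concat":
            return np.concatenate([text_vec, audio_vec])        # e.g. 512 + 128 -> 640 dimensions
        # "sum": pad the shorter vector with zeros, then take a weighted sum
        size = max(text_vec.size, audio_vec.size)
        t = np.pad(text_vec, (0, size - text_vec.size))
        a = np.pad(audio_vec, (0, size - audio_vec.size))
        return text_weight * t + (1.0 - text_weight) * a

    print(fuse(np.ones(512), np.ones(128)).shape)               # (640,)
    print(fuse(np.ones(512), np.ones(128), mode="sum").shape)   # (512,)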
The feature vector output by the vector model is determined as a target feature vector, and the target feature vector is determined based on the speaking content and the audio feature information, so that the target feature vector is a multi-modal feature vector, can realize the complementation of information between the speaking content and the audio feature and is used for representing the speaking content of the corresponding speaking role.
In step S300, the target feature vector is input into the trained session link identification model to obtain the session link corresponding to the target audio paragraph output by the model.
Taking a court trial session scene as an example, the whole court trial process may include the following court trial links: announcing the court discipline, verifying the identities of the parties, statements and replies by the plaintiff and the defendant, presentation and cross-examination of evidence, court debate, and the like; other court trial links may be added according to the actual situation. These links may be further subdivided, which is not specifically limited. The session scene involved may be the judicial scene of a real case, or a judicial scene constructed by legal modeling experts in a crowdsourcing manner, which is not specifically limited.
The conversation link recognition model is trained in advance, stored in the electronic equipment or other equipment and called when needed.
The session link identification model is used to identify the session link corresponding to an audio paragraph, that is, the session link to which the speech in the audio paragraph belongs. After the target feature vector is input into the session link identification model, the model can identify and output the session link corresponding to the target audio paragraph. For example, if in the target audio paragraph the defendant's agent is cross-examining the evidence presented by the plaintiff's agent in terms of its authenticity, legality, relevance, and the existence and weight of its probative force, then the session link corresponding to the target audio paragraph is the evidence presentation and cross-examination link.
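A minimal sketch of step S300 under two assumptions: link_model is the trained session link identification model and exposes a scikit-learn-style predict method (not a fixed API of the patent), and the class names follow the court trial links listed above:

    LINKS = ["announce court discipline", "verify party identities",
             "plaintiff and defendant statements and replies",
             "evidence presentation and cross-examination", "court debate"]

    def identify_link(target_feature_vector, link_model):
        class_index = link_model.predict([target_feature_vector])[0]   # step S300
        return LINKS[class_index]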
In the embodiment of the invention, the corresponding target feature vector is determined based on the obtained speech content and audio feature information of the speaking roles in the target audio paragraph. Because the target feature vector is a vector representation that integrates the speech content and the audio feature information, the speech content can represent what is said in the target audio paragraph, and the audio feature information can represent which speaking roles are speaking in it. In practice, knowing which speech was spoken by which speaking role is enough to determine in which session link that speech occurs. Based on this characteristic, a session link identification model can be trained in advance; after the target feature vector is input into the trained model, the session link corresponding to the target audio paragraph can be identified by the model. A viewer can thus quickly learn which audio paragraphs correspond to which session links without searching manually, which helps the viewer quickly locate the session link to be viewed and improves the viewing experience.
Optionally, after determining the session links corresponding to the audio paragraphs in the conversation audio, the start time of each session link in the conversation audio may be further determined based on the timestamps corresponding to the audio paragraphs; that is, each session link in the conversation audio is located. For example, for the audio paragraphs corresponding to each session link, the earliest of their timestamps is taken as the start time of that session link. Similarly, the end time of each session link in the conversation audio may be determined based on the timestamps corresponding to the audio paragraphs, which is not described again.
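For illustration, locating the start time of each session link from the saved paragraph timestamps can be as simple as taking the earliest timestamp per link; the paragraph data below is dummy:

    from collections import defaultdict

    def locate_link_starts(paragraphs):
        """paragraphs: iterable of (timestamp_seconds, session_link) pairs, where the timestamp
        is the start time of the corresponding audio paragraph in the conversation audio."""
        starts = defaultdict(lambda: float("inf"))
        for timestamp, link in paragraphs:
            starts[link] = min(starts[link], timestamp)
        return dict(starts)

    print(locate_link_starts([(0, "verify party identities"),
                              (95, "evidence presentation and cross-examination"),
                              (60, "verify party identities")]))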
Alternatively, the start time and/or end time of a session link may also be applied to the session video synchronized with the conversation audio; that is, the start time and/or end time of each session link in the session video is the same as in the conversation audio.
Optionally, after each session link in the conversation audio has been located, the session information to be displayed may be supplemented with related information about the audio paragraphs corresponding to a session link. For example, according to the speaking roles corresponding to the audio paragraphs in a certain session link, it may be determined whether the defendant has made a valid answer in court, thereby enriching the session information.
In one embodiment, the vector model is trained by:
t101: obtaining a first set of audio paragraph samples comprising audio paragraph samples partitioned from a plurality of conversational audios, a speech in each audio paragraph sample being from a speech role;
t102: for each audio paragraph sample in the first audio paragraph sample set, acquiring the speech content and the audio feature information of the speaking role in the audio paragraph sample from the audio paragraph sample, combining the speech content and the audio feature information of the speaking role in the audio paragraph sample into first sample data, and calibrating the speaking role corresponding to the audio paragraph sample as the label information corresponding to the first sample data;
t103: and training the vector model by using each first sample data and the corresponding label information.
The plurality of conversation audios can be acquired in real judicial scenes or simulated judicial scenes. Each conversation audio can be divided according to speaking role to obtain a plurality of audio paragraph samples, the speech in each audio paragraph sample coming from one speaking role. Preferably, the speech in each audio paragraph sample is a complete speech of one speaking role (i.e., a continuous speech not interleaved with the speech of other speaking roles).
In this embodiment, audio paragraph samples whose speech comes from a single speaking role are used instead of audio paragraph samples containing multiple speaking roles, which better helps the vector model learn a correct vector expression and helps improve the accuracy of the vector model's output when it is used.
After the audio paragraph samples are divided, each audio paragraph sample may be stored in correspondence with the speaking role that speaks in it. For each audio paragraph sample in the first audio paragraph sample set, after the speech content and the audio feature information of the audio paragraph sample are obtained, they are combined into first sample data, and the speaking role corresponding to the audio paragraph sample is calibrated as the label information corresponding to the first sample data. The first sample data and its corresponding label information may, for example, be expressed as <speaking role, speech content + audio feature information>.
The manner of obtaining the speech content and the audio feature information in the audio paragraph sample may be the same as the manner of obtaining the speech content and the audio feature information in the target audio paragraph in the foregoing embodiment, and details are not repeated here.
The vector model is trained by using each first sample data and the corresponding label information: the first sample data is used as input data, the label information corresponding to the first sample data (i.e., the speaking role) is used as supervision information, and the vector model is obtained through supervised training.
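A compact sketch of assembling the first sample data and label information; the two extractor callables passed in correspond to the (hypothetical) helpers sketched earlier:

    def build_first_sample_set(paragraph_role_pairs, get_speech_content, extract_audio_feature):
        """paragraph_role_pairs: iterable of (audio_paragraph, speaking_role) pairs."""
        samples = []
        for paragraph, role in paragraph_role_pairs:
            content = get_speech_content(paragraph)        # utterance content of the sample
            audio_feat = extract_audio_feature(paragraph)  # audio feature information of the sample
            samples.append(((content, audio_feat), role))  # (<first sample data>, <label information>)
        return samples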
In one embodiment, the training of the vector model in step T103 by using each first sample data and the corresponding label information may include the following steps:
establishing a first model and a second model;
selecting at least one first sample data from the first sample data, inputting the selected first sample data into the first model, extracting a feature vector from the input first sample data by the first model, and outputting the feature vector to the second model, wherein the second model predicts and outputs a speaking role based on the feature vector output by the first model;
optimizing the first model according to the label information corresponding to the selected first sample data and the speaking role output by the second model; and when the training completion condition is not met currently, returning to the step of selecting at least one first sample data from the first sample data, and when the training completion condition is met currently, determining the first model as the vector model.
In one example, the first model may employ a bert model, and the second model may employ a classifier, although not specifically limited thereto.
After the first sample data is input to the first model, the first model may perform feature extraction on speech content and audio feature information in the input first sample data to obtain a speech content feature vector and an audio feature vector, and fuse the speech content feature vector and the audio feature vector to obtain a corresponding feature vector, and then output the feature vector to the second model, and the second model may predict a speech role based on the feature vector output by the first model and output the predicted speech role.
Optimizing the first model according to the label information corresponding to the selected first sample data and the speaking role output by the second model may, for example, include: optimizing the first model, specifically the network parameters in the first model, according to the difference between the label information corresponding to the first sample data and the speaking role output by the second model.
Of course, when the first model is optimized, the second model may also be optimized at the same time, and when training, the first model and the second model may use the same loss function, or may use different loss functions for optimization.
After optimizing the first model, it may be checked whether training completion conditions are currently met, such as: whether first sample data which are not selected exist currently can be checked, if yes, the training completion condition is not met currently, and if not, the training completion condition is met currently; or, whether the current training frequency reaches a preset training frequency or not can be checked, if not, the training completion condition is not met currently, and if so, the training completion condition is met currently; alternatively, it may be checked whether the performance of the first model meets a specified requirement, for example, whether the accuracy rate reaches 97%, if not, the training completion condition is not currently met, and if so, the training completion condition is currently met.
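As a minimal sketch of the training loop just described (one iteration shown), linear layers stand in for the first and second models so the loop stays readable; in the text the first model would be a BERT-like network and the second model a classifier, and the 640-dimensional input, 256-dimensional feature vector, and five speaking roles are illustrative assumptions only:

    import torch
    from torch import nn

    first_model = nn.Linear(640, 256)   # stand-in "first model": first sample data -> feature vector
    second_model = nn.Linear(256, 5)    # stand-in "second model": feature vector -> speaking-role logits
    optimizer = torch.optim.Adam(list(first_model.parameters()) +
                                 list(second_model.parameters()), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    def train_step(batch_inputs, batch_role_labels):
        features = first_model(batch_inputs)            # first model extracts feature vectors
        role_logits = second_model(features)            # second model predicts speaking roles
        loss = loss_fn(role_logits, batch_role_labels)  # compare prediction with label information
        optimizer.zero_grad()
        loss.backward()                                 # optimize both models together here
        optimizer.step()
        return loss.item()

    # one dummy iteration: 8 fused input vectors, 5 possible speaking roles;
    # after each iteration the training completion condition would be checked.
    print(train_step(torch.randn(8, 640), torch.randint(0, 5, (8,))))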
Optionally, in order to verify the performance of the vector model, a plurality of first sample data may be obtained for validation, and these may be different from the first sample data used for training. The ratio of the number of first sample data used for validation to the number used for training may, for example, be 3:7. Validation may be performed once every several iterations during training; the validation result is not used to optimize the model, but allows the training personnel to monitor training, or to confirm after training whether the trained model meets the standard.
Optionally, a plurality of first sample data may be obtained to perform a test, after the training is completed, the trained vector model is tested by using the first sample data for test, and the test result may be used to determine the accuracy of the output result of the vector model, for example, so that a tester can know the performance of the vector model obtained by the training.
In one embodiment, the session link identification model is trained by:
t201: obtaining a second set of audio paragraph samples comprising audio paragraph samples partitioned from a plurality of conversational audios, each audio paragraph sample having speech from one or more speech roles and each audio paragraph sample corresponding to a conversational link;
t202: for each audio paragraph sample in the second audio paragraph sample set, obtaining the speech content and the audio feature information of the speaking roles in the audio paragraph sample from the audio paragraph sample, inputting the speech content and the audio feature information into the vector model to obtain a feature vector output by the vector model as second sample data, and calibrating the session link corresponding to the audio paragraph sample as the class label corresponding to the second sample data;
t203: and training the session link identification model by using each second sample data and the corresponding class label.
In this embodiment, the utterance in each audio paragraph sample in the second set of audio paragraph samples may be from one or more utterance roles. Preferably, speech in at least one audio paragraph sample of the second set of audio paragraph samples is from a plurality of speech roles.
Under the condition that the speech in the audio paragraph sample is from a plurality of speech roles, because a plurality of people in the speech have conversations, the contextual relevance is stronger, the method is more beneficial to the session link recognition model to carry out context understanding so as to recognize the session link, and the method is also beneficial to the learning and expression of the session link recognition model.
Thus, unlike when training the vector model, when training the session link identification model it is preferable to train with a second audio paragraph sample set in which the speech in at least one audio paragraph sample comes from a plurality of speaking roles.
The session link identification model may be trained after the vector model has been trained. When the session link identification model is trained, the speech content and the audio feature information of the speaking roles in each acquired audio paragraph sample can be input into the trained vector model to obtain the feature vector corresponding to each audio paragraph sample, and a corresponding class label is calibrated for each feature vector, where the class label indicates the session link corresponding to the audio paragraph sample; the second sample data is thus obtained.
Training the session link identification model using each second sample data and the corresponding class label may include:
establishing a third model;
selecting at least one second sample data from the second sample data, and inputting the selected second sample data into the third model so that the third model predicts the corresponding session link from the feature vector in the second sample data;
optimizing the third model according to the class label corresponding to the selected second sample data and the session link output by the third model; when the training completion condition is not yet met, returning to the step of selecting at least one second sample data from the second sample data, and when the training completion condition is met, determining the third model as the session link identification model.
The third model may be any multi-classification model that takes a vector as input; its specific form is not limited.
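As one concrete possibility for such a multi-classification model, the sketch below builds a small feed-forward classifier over feature vectors and trains it against the session-link class labels. The layer sizes, epoch count, optimiser and loss are illustrative assumptions, with a fixed epoch count standing in for the training completion condition.

    import torch
    from torch import nn

    class ThirdModel(nn.Module):
        # A plain multi-class classifier over feature vectors.
        def __init__(self, vec_dim, num_links, hidden_dim=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(vec_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, num_links),  # one logit per session link
            )

        def forward(self, feature_vec):
            return self.net(feature_vec)

    def train_third_model(second_sample_data, vec_dim, num_links, epochs=20, lr=1e-3):
        model = ThirdModel(vec_dim, num_links)
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        vectors = torch.stack([vec for vec, _ in second_sample_data])
        labels = torch.tensor([label for _, label in second_sample_data])
        for _ in range(epochs):                     # fixed epochs stand in for the completion condition
            optimizer.zero_grad()
            loss = loss_fn(model(vectors), labels)  # compare predicted links with the class labels
            loss.backward()
            optimizer.step()                        # optimise the third model
        return model                                # kept as the session link identification model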
The training procedure of the session link identification model is similar to that of the vector model; reference may be made to the description in the foregoing embodiments, which is not repeated here.
The present invention also provides a session link identification apparatus. Referring to fig. 3, the session link identification apparatus 100 may include:
the information acquisition module 101 is configured to acquire the speaking content and audio feature information of a speaking role in a target audio paragraph to be identified in a conversation audio, where the speech in the target audio paragraph is from one or more speaking roles;
a target feature vector determining module 102, configured to determine a target feature vector corresponding to the target audio paragraph according to the speech content and the audio feature information;
a session link identification module 103, configured to input the target feature vector to a trained session link identification model, so as to obtain the session link corresponding to the target audio paragraph output by the session link identification model.
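Putting the three modules together, an inference pass over one target audio paragraph might look like the Python sketch below. The callables asr_fn and audio_feature_fn and the link_names list are assumed stand-ins for the information acquisition steps and the label set; they are not interfaces defined by the patent.

    import torch

    def identify_session_link(target_paragraph, asr_fn, audio_feature_fn,
                              vector_model, link_model, link_names):
        content = asr_fn(target_paragraph)                # information acquisition: speaking content
        audio_feats = audio_feature_fn(target_paragraph)  # information acquisition: audio features
        with torch.no_grad():
            target_vec = vector_model(content, audio_feats)  # target feature vector determination
            logits = link_model(target_vec)                  # session link identification
        return link_names[int(logits.argmax())]              # name of the predicted session link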
In an embodiment, when the information acquisition module obtains the speaking content of the speaking role in the target audio paragraph from the target audio paragraph to be identified in the conversation audio, the information acquisition module is specifically configured to:
performing audio recognition on the target audio paragraph to obtain at least one recognized reference sentence;
inputting the reference sentence into a trained error correction model, wherein the error correction model is used for correcting erroneous content in the reference sentence to obtain a candidate sentence output by the error correction model;
determining the utterance content based on the candidate sentence.
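A compact sketch of this two-step acquisition is shown below, assuming asr_fn stands for any speech recogniser that returns reference sentences and error_correction_model for the trained corrector; both are hypothetical interfaces, not concrete library calls.

    def get_speaking_content(target_paragraph, asr_fn, error_correction_model):
        reference_sentences = asr_fn(target_paragraph)           # audio recognition yields reference sentences
        candidate_sentences = [error_correction_model(sentence)  # correct erroneous content in each sentence
                               for sentence in reference_sentences]
        return " ".join(candidate_sentences)                     # speaking content determined from the candidate sentences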
In an embodiment, when the information acquisition module obtains the audio feature information of the speaking role in the target audio paragraph from the target audio paragraph to be identified in the conversation audio, the information acquisition module is specifically configured to:
inputting the target audio paragraph to a trained audio feature extractor to obtain audio feature information output by the audio feature extractor; the audio feature extractor at least comprises an extraction layer used for extracting audio features from an input audio paragraph and an embedding layer used for embedding and expressing the audio features and outputting the expressed audio feature information;
and determining the audio feature information output by the audio feature extractor as the audio feature information of the speaking role in the target audio paragraph.
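One way to realise such an extractor in PyTorch is sketched below: a convolutional extraction layer turns the raw waveform into frame-level audio features, and a linear embedding layer pools and re-expresses them as the audio feature information. The kernel size, stride and dimensions are illustrative assumptions, not values specified by the patent.

    import torch
    from torch import nn

    class AudioFeatureExtractor(nn.Module):
        def __init__(self, frame_dim=64, embed_dim=128):
            super().__init__()
            # extraction layer: 1-D convolution over the waveform producing frame-level features
            self.extract = nn.Conv1d(in_channels=1, out_channels=frame_dim,
                                     kernel_size=400, stride=160)
            # embedding layer: express the pooled frame features as a fixed-size embedding
            self.embed = nn.Linear(frame_dim, embed_dim)

        def forward(self, waveform):                      # waveform: (batch, num_samples)
            frames = self.extract(waveform.unsqueeze(1))  # (batch, frame_dim, num_frames)
            pooled = frames.mean(dim=-1)                  # average over time
            return self.embed(pooled)                     # audio feature information, (batch, embed_dim)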
In an embodiment, when the target feature vector determining module determines the target feature vector corresponding to the target audio paragraph according to the speech content and the audio feature information, the target feature vector determining module is specifically configured to:
inputting the speaking content and the audio feature information into a trained vector model, the vector model determining and outputting a corresponding feature vector based on the input speaking content and audio feature information;
and determining the feature vector output by the vector model as the target feature vector.
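A minimal sketch of such a vector model is given below, assuming the speaking content has already been encoded as a fixed-size text embedding; the fusion architecture and dimensions are assumptions for illustration, not the patent's prescribed design.

    import torch
    from torch import nn

    class VectorModel(nn.Module):
        def __init__(self, text_dim=256, audio_dim=128, vec_dim=256):
            super().__init__()
            self.fuse = nn.Sequential(
                nn.Linear(text_dim + audio_dim, vec_dim), nn.ReLU(),
                nn.Linear(vec_dim, vec_dim),
            )

        def forward(self, text_embedding, audio_features):
            # concatenate the two information sources and project them to the feature vector
            return self.fuse(torch.cat([text_embedding, audio_features], dim=-1))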
In one embodiment, the vector model is trained by the following modules:
a first set obtaining module, configured to obtain a first audio paragraph sample set, where the first audio paragraph sample set includes audio paragraph samples divided from a plurality of conversation audios, and the speech in each audio paragraph sample is from one speaking role;
a first sample data obtaining module, configured to obtain, for each audio paragraph sample in the first audio paragraph sample set, the speaking content and audio feature information of the speaking role in the audio paragraph sample from the audio paragraph sample, combine the speaking content and the audio feature information into first sample data, and determine the speaking role corresponding to the audio paragraph sample as the label information corresponding to the first sample data;
and the vector model training module is used for training the vector model by utilizing each first sample data and the corresponding label information.
In an embodiment, when the vector model training module trains the vector model by using each first sample data and the corresponding label information, the vector model training module is specifically configured to:
establishing a first model and a second model;
selecting at least one first sample data from the first sample data, inputting the selected first sample data into the first model, extracting a feature vector from the input first sample data by the first model, and outputting the feature vector to the second model, wherein the second model predicts and outputs a speaking role based on the feature vector output by the first model;
optimizing the first model according to the label information corresponding to the selected first sample data and the speaking role output by the second model; and when the training completion condition is not met currently, returning to the step of selecting at least one first sample data from the first sample data, and when the training completion condition is met currently, determining the first model as the vector model.
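The first-model/second-model scheme above can be sketched as follows: an encoder (first model) produces feature vectors, a role classifier (second model) predicts the speaking role from them, both are optimised against the role labels, and only the encoder is kept as the vector model. The dimensions, number of roles, stopping rule and sample layout are assumptions for illustration.

    import torch
    from torch import nn

    def train_vector_model(first_sample_data, text_dim=256, audio_dim=128,
                           vec_dim=256, num_roles=4, epochs=20, lr=1e-3):
        # first_sample_data is assumed to be a list of (text_embedding, audio_features, role_id).
        first_model = nn.Sequential(                       # extracts feature vectors
            nn.Linear(text_dim + audio_dim, vec_dim), nn.ReLU(),
            nn.Linear(vec_dim, vec_dim))
        second_model = nn.Linear(vec_dim, num_roles)       # predicts the speaking role
        optimizer = torch.optim.Adam(
            list(first_model.parameters()) + list(second_model.parameters()), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        inputs = torch.stack([torch.cat([t, a]) for t, a, _ in first_sample_data])
        roles = torch.tensor([r for _, _, r in first_sample_data])
        for _ in range(epochs):                            # fixed epochs as the completion condition
            optimizer.zero_grad()
            loss = loss_fn(second_model(first_model(inputs)), roles)
            loss.backward()
            optimizer.step()                               # optimise via the label information
        return first_model                                 # the first model is kept as the vector model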
In one embodiment, the session link identification model is trained by the following modules:
a second set obtaining module, configured to obtain a second audio paragraph sample set, where the second audio paragraph sample set includes audio paragraph samples divided from a plurality of conversation audios, where a speech in each audio paragraph sample is from one or more speech roles, and each audio paragraph sample corresponds to a conversation link;
a second sample data obtaining module, configured to obtain, for each audio paragraph sample in the second audio paragraph sample set, the speaking content and audio feature information of the speaking role in the audio paragraph sample from the audio paragraph sample, input the speaking content and the audio feature information into the vector model to obtain a feature vector output by the vector model as second sample data, and mark the session link corresponding to the audio paragraph sample as the class label corresponding to the second sample data;
and the session link identification model training module is used for training the session link identification model by using each second sample data and the corresponding class label.
The implementation of the functions and effects of each unit in the above apparatus is described in detail in the implementation of the corresponding steps of the above method, and is not repeated here.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, wherein the units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units.
The invention also provides an electronic device, which comprises a processor and a memory; the memory stores a program that can be called by the processor; and when the processor executes the program, the session link identification method in the foregoing embodiments is implemented.
The embodiment of the session link identification apparatus can be applied to an electronic device. Taking a software implementation as an example, the apparatus, as a logical device, is formed by the processor of the electronic device where it is located reading the corresponding computer program instructions from the non-volatile memory into memory for execution. From a hardware perspective, fig. 4 shows a hardware structure diagram of an electronic device in which the session link identification apparatus 100 is located, according to an exemplary embodiment of the present invention. In addition to the processor 510, the memory 530, the interface 520, and the non-volatile memory 540 shown in fig. 4, the electronic device in which the session link identification apparatus 100 is located may also include other hardware according to its actual functions, which is not described again.
The present invention also provides a machine-readable storage medium on which a program is stored, and the program, when executed by a processor, implements the session link identification method described in the foregoing embodiments.
The present invention may take the form of a computer program product embodied on one or more storage media including, but not limited to, disk storage, CD-ROM, optical storage, and the like, having program code embodied therein. Machine-readable storage media include both permanent and non-permanent, removable and non-removable media, and the storage of information may be accomplished by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of machine-readable storage media include, but are not limited to: phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium may be used to store information that may be accessed by a computing device.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (9)

1. A session link identification method, characterized by comprising the following steps:
obtaining speaking content and audio characteristic information of a speaking role in a target audio paragraph to be identified from the conversation audio, wherein the speaking of the target audio paragraph is from one or more speaking roles;
determining a target feature vector corresponding to the target audio paragraph according to the speaking content and the audio characteristic information;
inputting the target feature vector into a trained session link identification model to obtain a session link corresponding to the target audio paragraph output by the session link identification model;
the session link identification model is obtained by training in the following way:
obtaining a second set of audio paragraph samples comprising audio paragraph samples partitioned from a plurality of conversational audios, each audio paragraph sample having speech from one or more speech roles and each audio paragraph sample corresponding to a conversational link;
for each audio paragraph sample in the second audio paragraph sample set, obtaining the speaking content and the audio characteristic information of the speaking role in the audio paragraph sample from the audio paragraph sample, inputting the speaking content and the audio characteristic information into a vector model to obtain a characteristic vector output by the vector model as second sample data, and marking the session link corresponding to the audio paragraph sample as the class label corresponding to the second sample data, wherein the vector model is used for obtaining and outputting the corresponding characteristic vector based on the speaking content and the audio characteristic information;
and training the session link identification model by using each second sample data and the corresponding class label.
2. The session link identification method according to claim 1, wherein obtaining the speaking content of the speaking role in the target audio paragraph from the target audio paragraph to be identified in the conversation audio comprises:
performing audio recognition on the target audio paragraph to obtain at least one recognized reference sentence;
inputting the reference sentence into a trained error correction model, wherein the error correction model is used for correcting the error content in the reference sentence to obtain a candidate sentence output by the error correction model;
determining the utterance content based on the candidate sentence.
3. The session link identification method according to claim 1, wherein obtaining the audio feature information of a speaking role in a target audio paragraph to be identified from the conversation audio comprises:
inputting the target audio paragraph to a trained audio feature extractor to obtain audio feature information output by the audio feature extractor; the audio feature extractor at least comprises an extraction layer used for extracting audio features from an input audio paragraph and an embedding layer used for embedding and expressing the audio features and outputting the expressed audio feature information;
and determining the audio characteristic information output by the audio characteristic extractor as the audio characteristic information of the speaking role in the target audio paragraph.
4. The session link identification method according to claim 1, wherein determining the target feature vector corresponding to the target audio paragraph according to the speaking content and the audio feature information comprises:
inputting the speaking content and the audio characteristic information into a trained vector model, and determining and outputting a corresponding characteristic vector based on the input speaking content and the audio characteristic information by the vector model;
and determining the feature vector output by the vector model as the target feature vector.
5. The session link identification method according to claim 4, wherein the vector model is trained by:
obtaining a first set of audio paragraph samples comprising audio paragraph samples partitioned from a plurality of conversational audios, a speech in each audio paragraph sample being from a speech role;
for each audio paragraph sample in the first audio paragraph sample set, acquiring the speaking content and the audio characteristic information of the speaking role in the audio paragraph sample from the audio paragraph sample, forming the speaking content and the audio characteristic information of the speaking role in the audio paragraph sample into first sample data, and determining the speaking role corresponding to the audio paragraph sample as the label information corresponding to the first sample data;
and training the vector model by using each first sample data and the corresponding label information.
6. The session link identification method according to claim 5, wherein training the vector model using each first sample data and the corresponding label information comprises:
establishing a first model and a second model;
selecting at least one first sample data from the first sample data, inputting the selected first sample data into the first model, extracting a feature vector from the input first sample data by the first model, and outputting the feature vector to the second model, wherein the second model predicts and outputs a speaking role based on the feature vector output by the first model;
optimizing the first model according to the label information corresponding to the selected first sample data and the speaking role output by the second model; and when the training completion condition is not met currently, returning to the step of selecting at least one first sample data from the first sample data, and when the training completion condition is met currently, determining the first model as the vector model.
7. A session link identification apparatus, comprising:
the information acquisition module is used for acquiring the speaking content and the audio characteristic information of a speaking role in a target audio paragraph to be identified from the conversation audio, wherein the speaking of the target audio paragraph is from one or more speaking roles;
a target feature vector determining module, configured to determine a target feature vector corresponding to the target audio paragraph according to the speech content and the audio feature information;
the session link identification module is used for inputting the target feature vector to a trained session link identification model so as to obtain a session link corresponding to the target audio paragraph output by the session link identification model;
the session link identification model is obtained by training through the following modules:
an obtaining module, configured to obtain a second set of audio paragraph samples, where the second set of audio paragraph samples includes audio paragraph samples partitioned from a plurality of conversation audios, where a speech in each audio paragraph sample is from one or more speech roles, and each audio paragraph sample corresponds to a conversation link;
a determining module, used for acquiring, from each audio paragraph sample in the second audio paragraph sample set, the speech content and the audio characteristic information of the speaking role in the audio paragraph sample, inputting the speech content and the audio characteristic information into the vector model to obtain a characteristic vector output by the vector model as second sample data, and marking the session link corresponding to the audio paragraph sample as the class label corresponding to the second sample data, wherein the vector model is used for obtaining and outputting the corresponding characteristic vector based on the speech content and the audio characteristic information;
and the training module is used for training the session link identification model by utilizing each second sample data and the corresponding class label.
8. An electronic device comprising a processor and a memory; the memory stores a program that can be called by the processor; wherein the processor, when executing the program, implements the session link identification method according to any one of claims 1 to 6.
9. A machine-readable storage medium, having stored thereon a program which, when executed by a processor, implements the session link identification method according to any one of claims 1 to 6.
CN202010933549.1A 2020-09-08 2020-09-08 Session link identification method, device and equipment and storage medium Active CN111798871B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010933549.1A CN111798871B (en) 2020-09-08 2020-09-08 Session link identification method, device and equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010933549.1A CN111798871B (en) 2020-09-08 2020-09-08 Session link identification method, device and equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111798871A CN111798871A (en) 2020-10-20
CN111798871B (en) 2020-12-29

Family

ID=72834290

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010933549.1A Active CN111798871B (en) 2020-09-08 2020-09-08 Session link identification method, device and equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111798871B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113705250B (en) * 2021-10-29 2022-02-22 北京明略昭辉科技有限公司 Session content identification method, device, equipment and computer readable medium
CN114186559B (en) * 2021-12-09 2022-09-13 北京深维智信科技有限公司 Method and system for determining role label of session body from sales session

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20060066483A (en) * 2004-12-13 2006-06-16 엘지전자 주식회사 Method for extracting feature vectors for voice recognition
US9892110B2 (en) * 2013-09-09 2018-02-13 Ayasdi, Inc. Automated discovery using textual analysis
US9418660B2 (en) * 2014-01-15 2016-08-16 Cisco Technology, Inc. Crowd sourcing audio transcription via re-speaking
US10522151B2 (en) * 2015-02-03 2019-12-31 Dolby Laboratories Licensing Corporation Conference segmentation based on conversational dynamics
CN107562760B (en) * 2016-06-30 2020-11-17 科大讯飞股份有限公司 Voice data processing method and device
CN107578769B (en) * 2016-07-04 2021-03-23 科大讯飞股份有限公司 Voice data labeling method and device
CN108305616B (en) * 2018-01-16 2021-03-16 国家计算机网络与信息安全管理中心 Audio scene recognition method and device based on long-time and short-time feature extraction
CN108764114B (en) * 2018-05-23 2022-09-13 腾讯音乐娱乐科技(深圳)有限公司 Signal identification method and device, storage medium and terminal thereof
CN110415704A (en) * 2019-06-14 2019-11-05 平安科技(深圳)有限公司 Data processing method, device, computer equipment and storage medium are put down in court's trial

Also Published As

Publication number Publication date
CN111798871A (en) 2020-10-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant