CN113936637A - Voice self-adaptive completion system based on multi-mode knowledge graph - Google Patents
- Publication number
- CN113936637A (application CN202111207821.9A)
- Authority
- CN
- China
- Prior art keywords
- voice
- text
- lip
- data
- phoneme
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
- G06F40/35—Discourse or dialogue representation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/06—Elementary speech units used in speech synthesisers; Concatenation rules
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
Abstract
A speech adaptive completion system based on a multi-modal knowledge graph comprises a data receiver, a data analyzer, and a data reasoner. The data receiver preprocesses the received audio and video data and outputs the result to the data analyzer. The data analyzer analyzes the speech and images to extract waveform time-series features and lip-trajectory features, and obtains a phoneme sequence through multi-modal joint characterization. The data reasoner performs domain-session modeling and candidate-text prediction from the historical text, combines the phoneme sequence for text inference to obtain semantically coherent sentences, and synthesizes the complete speech from the waveform features. The system performs phoneme recognition through a phoneme inference model when the speech modality is missing, models the domain session over the historical text produced from the available speech according to the semantic relations among entities in the multi-modal knowledge graph, infers and generates semantically coherent text, and synthesizes speech matching the user's waveform features to form the completed audio.
Description
Technical Field
The invention relates to a technology in the field of speech processing, in particular to a speech adaptive completion system based on a multi-modal knowledge graph for mobile terminals.
Background
Real-time audio and video technology is widely used for video chat, video conferencing, remote education, smart homes, and similar scenarios. In practice, however, data packets may arrive out of order or be lost during network transmission, and the resulting jitter severely degrades call quality; the receiving end therefore typically recreates audio and video data through a packet-loss repair system to fill the audio gaps caused by packet loss or network delay. Audio completion in mobile audio-video communication still faces two problems. First, audio generation is dominated by deep-learning methods whose opaque inference process gives them low interpretability, making them difficult to design or tune for a given scenario. Second, current techniques rely mainly on single-modality data for model inference and ignore the mobile terminal's ability to perceive multiple modalities, so the system's perception of the data and information is incomplete, creating a cognitive limitation.
Disclosure of Invention
To address the defects in the prior art, the invention provides a speech adaptive completion system based on a multi-modal knowledge graph. It performs phoneme recognition through a phoneme inference model when the speech modality is missing, models the domain session over the historical text produced from the available speech according to the semantic relations among entities in the multi-modal knowledge graph, thereby inferring and generating semantically coherent text, and synthesizes speech matching the user's waveform features to form the completed audio.
The invention is realized by the following technical scheme:
the invention relates to a voice self-adaptive completion system based on a multi-mode knowledge graph, which comprises the following steps: a data receiver, a data analyzer, and a data reasoner, wherein: the data receiver performs preprocessing according to the received audio and video data and outputs the preprocessed audio and video data to the data analyzer; the data analyzer analyzes the voice and the image to extract a waveform time sequence characteristic and a lip track characteristic, and a phoneme sequence is obtained through multi-modal combined characterization; and the data inference device carries out domain conversation modeling and candidate text prediction according to the historical text, carries out text inference by combining the phoneme sequence to obtain sentences with semantics, and synthesizes complete voice according to the waveform characteristics.
The system further comprises a multi-modal data aggregation module, which stores and associates the results of the data receiver and the data analyzer and provides data support for the data analyzer and the data reasoner.
The system further comprises a model management module, which provides model invocation and updating for the data receiver, the data analyzer, and the data reasoner.
The data receiver comprises a data receiving module, a speech preprocessing module, and a video preprocessing module. The data receiving module receives and parses the application's audio and video data packets, outputting the speech packets to the speech preprocessing module and the video packets to the video preprocessing module. The speech preprocessing module collects and preprocesses the speech packets: taking the low-quality real-time audio with lost packets as input, it performs initial processing of the speech-modality data through speech-packet detection, speech framing, audio windowing, and endpoint detection, obtaining preprocessed waveforms that are output to the speech analysis module. The video preprocessing module collects and preprocesses the video packets: taking the continuous video images as input, it performs initial processing of the video-modality data through video framing, face/lip control-point detection, lip-region scale normalization, and time alignment in sequence, obtaining preprocessed images that are output to the image analysis module.
The data analyzer comprises a speech analysis module, a spatio-temporal image analysis module, and a multi-modal information fusion module. The speech analysis module extracts the historical text, waveform features, and waveform time-series features from the preprocessed waveforms and outputs them as speech-modality data to the multi-modal data aggregation module. The spatio-temporal image analysis module constructs a spatio-temporal graph over the lip control-point set of each preprocessed frame, builds a spatio-temporal graph convolutional neural network, extracts per-frame lip-motion features from each frame's context in the spatio-temporal graph, combines them into lip-trajectory features, and inputs them as video-modality data to the multi-modal data aggregation module. The multi-modal fusion module aligns the waveform time-series features and the lip-trajectory features through cross-modal interaction and trains a cross-modal conversion model; the hidden-state features produced while converting between lip-trajectory features and waveform time-series features serve as the joint representation of the two modalities. A phoneme prediction model is trained to convert the joint-representation information into phoneme information, strengthening the lip-feature modality's ability to represent phonemes; phoneme recognition is then performed on the speech packet-loss region from the lip-trajectory features, and the resulting phoneme sequence is passed as input to the semantic text reasoning module.
The data reasoner comprises a semantic text reasoning module and a speech completion module. For the speech packet-loss region, the semantic text reasoning module identifies the relevant knowledge domain from the historical text of the current session and predicts candidate texts based on the spatio-temporal knowledge graph, pruning the solution space to optimize text inference; it matches the recognized phonemes against the texts in the solution space, thereby infers and generates the completed text, and outputs it to the speech completion module. The speech completion module synthesizes the missing speech from the completed text and the collected waveform features of the user's voice, and splices the completed speech into the original speech to form a complete, natural speech segment.
The speech-packet detection yields the activity state of the speech data, which comprises a speech region, a silence region, and a packet-loss region. Distinguishing silence reduces unnecessary speech recognition and completion, while identifying the packet-loss region marks where speech completion will later be applied. The speech-packet detection performs a first classification of voice activity: each region is labeled TRUE or NONE according to whether the speech packet for the current moment was received. Regions labeled NONE are completed by the semantic text reasoning module and the speech completion module; regions labeled TRUE are further divided into speech and silence regions by endpoint detection.
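The first-pass labeling above can be sketched as follows; this is an illustrative reconstruction, not the patent's implementation, with per-frame boolean arrival flags assumed as input:

```python
# First-pass voice-activity labeling: frames whose packet arrived are labeled
# "TRUE" (refined later by endpoint detection); missing frames are labeled
# "NONE" and routed to semantic completion.
def label_voice_activity(packet_received):
    """packet_received: list of bools, one per frame slot."""
    return ["TRUE" if ok else "NONE" for ok in packet_received]

def packet_loss_regions(labels):
    """Group consecutive NONE frames into (start, end) ranges, end exclusive."""
    regions, start = [], None
    for i, lab in enumerate(labels):
        if lab == "NONE" and start is None:
            start = i
        elif lab != "NONE" and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(labels)))
    return regions
```

The packet-loss regions produced here are exactly the spans later handed to the semantic text reasoning and speech completion modules.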
Speech framing is needed because the time-varying nature of the speech signal makes the whole signal non-stationary, while the MFCC feature extraction in the speech analysis module uses the Fourier transform and requires a stationary input; the signal is therefore framed by exploiting its short-time stationarity. Framing uses an overlapping segmentation strategy, sampling each frame according to a preset frame length and overlap ratio (frame shift), so that each frame transitions smoothly into the next and sample continuity is preserved.
Audio windowing applies a Hamming window: the speech data in each frame is multiplied by the window function, emphasizing the data in the middle and attenuating the data at both ends, which effectively mitigates spectral leakage and thus supports the Fourier transform.
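The framing-and-windowing steps can be sketched together as below; the frame length and frame shift values are illustrative assumptions, not figures from the patent:

```python
import math

# Overlapping framing followed by Hamming windowing. Adjacent frames share
# frame_len - frame_shift samples, giving a smooth frame-to-frame transition.
def hamming(n):
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def frame_signal(samples, frame_len=400, frame_shift=160):
    """Split samples into overlapping frames and window each frame."""
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, frame_shift):
        frame = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames
```

With 16 kHz audio these defaults would correspond to 25 ms frames with a 10 ms shift, a common (assumed) MFCC configuration.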
Endpoint detection computes the short-time energy and short-time average zero-crossing rate of each frame and classifies speech versus silence regions in real time with a dual-threshold comparison, labeling each frame TRUE or FALSE; this locates the start and end points of the valid speech region within the overall voice activity and avoids the influence of silence and noise.
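A minimal sketch of the dual-threshold scheme follows; the threshold values are assumptions for illustration, not values given in the patent:

```python
# Dual-threshold endpoint detection: a frame is speech (TRUE) when its
# short-time energy exceeds a high threshold, or when its zero-crossing rate
# is high while energy clears a low threshold; otherwise silence (FALSE).
def short_time_energy(frame):
    return sum(s * s for s in frame)

def zero_crossing_rate(frame):
    return sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / max(len(frame) - 1, 1)

def endpoint_detect(frames, e_high=1.0, e_low=0.1, z_thresh=0.25):
    labels = []
    for f in frames:
        e, z = short_time_energy(f), zero_crossing_rate(f)
        labels.append("TRUE" if e > e_high or (e > e_low and z > z_thresh) else "FALSE")
    return labels
```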
Video framing converts the video into an image sequence at the same sampling frequency as the speech framing.
Face/lip control-point detection locates the control points of the human lips in each frame, one frame at a time, through an external face-recognition engine; the control points include the lip center coordinates, the upper-lip upper-boundary coordinates, the left mouth-corner coordinates, and so on.
Lip-region scale normalization: since the subsequent image analysis module only concerns the relative motion of the lip control points, the influence of lip size and of the face's deflection and inclination angles in the image must be reduced. A quadrilateral lip detection box is fitted from the left and right mouth corners, the lip center coordinates, and the upper and lower lip-boundary coordinates; the lip control points are then rotated and scaled to a uniform size through a perspective transformation, normalizing the scale of the lip region while preserving the continuity of the control-point trajectories.
Time alignment: to enable cross-modal interaction between the speech and video modalities, each audio frame must correspond to one image frame. Since a short-time lip control-point trajectory can be approximated by a simple curve, the lip control-point set corresponding to each audio frame is fitted with Lagrange interpolation, aligning the images to the audio in time.
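The interpolation step can be sketched as below: given one coordinate of a control point sampled at a few video timestamps, the Lagrange polynomial is evaluated at an audio-frame timestamp (the point counts and timestamps are illustrative):

```python
# Lagrange interpolation for image-to-audio time alignment: fit the polynomial
# through the sampled video timestamps and evaluate it at an audio timestamp.
def lagrange_interp(xs, ys, x):
    """Evaluate the Lagrange interpolating polynomial through (xs, ys) at x."""
    total = 0.0
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        term = yj
        for m, xm in enumerate(xs):
            if m != j:
                term *= (x - xm) / (xj - xm)
        total += term
    return total
```

Applied per coordinate of each control point, this yields the lip control-point set for every audio frame.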
The lip control-point spatio-temporal graph is constructed as follows: the lip control points of all input frames are collected; within each frame the control points are connected according to the natural connectivity of human lip control points, and each control point also forms a self-loop, yielding the spatial graph of each frame; the same control point in two adjacent frames is then connected by a temporal edge, which represents the lip's motion trajectory between the two moments. The spatial and temporal information of the lip control points is thus modeled simultaneously.
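The edge construction above can be sketched as follows; the 4-point "lip" topology in the test is purely illustrative, since the patent does not enumerate the control-point connectivity:

```python
# Spatio-temporal graph over lip control points: spatial edges follow an
# assumed lip connectivity, every node carries a self-loop, and temporal edges
# link the same control point across adjacent frames.
def build_st_graph(num_frames, spatial_edges, num_points):
    edges = set()
    for t in range(num_frames):
        base = t * num_points
        for v in range(num_points):               # self-loops
            edges.add((base + v, base + v))
        for u, v in spatial_edges:                # intra-frame lip structure
            edges.add((base + u, base + v))
        if t + 1 < num_frames:                    # temporal trajectory edges
            for v in range(num_points):
                edges.add((base + v, base + num_points + v))
    return edges
```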
The lip-motion features are extracted as follows. A spatio-temporal graph convolutional neural network is constructed. For the current frame, the input of the spatial graph convolutional network is a 3-dimensional tensor (C, T, V), where C is the feature dimension of the lip control points (the control-point coordinates serve as features), T covers the current frame and the previous T-1 frames, and V is the number of lip control points. Spatially, a graph partitioning strategy decomposes each frame's graph G into three subgraphs G1, G2, and G3, representing centripetal, centrifugal, and static motion of the control points respectively: in G1 each control point is connected to neighboring control points closer to the lip center than itself; in G2 each control point is connected to neighboring control points farther from the lip center than itself; and in G3 each control point is connected to itself. The number of convolution kernels of size (1, V, V) is therefore 3, and the local features of adjacent control points are obtained by weighted averaging. Temporally, to superimpose time-series features on the spatial features of the current frame, a temporal convolutional network fuses the features of the current frame and the previous T-1 frames for each lip control point using a convolution kernel of size (T, 1), yielding each control point's local feature over time. Through the spatial and temporal convolutions, the lip-motion features are extracted; the output of each frame is (1, V, N2), where N2 is the number of features extracted per control point, and the per-frame lip-motion features are concatenated into lip-trajectory features of shape (T, V, C2).
Data aggregation refers to the following: ontology types and their attributes and relations are defined, the ontology types including domain, text word, phoneme, waveform feature, waveform time-series feature, lip-trajectory feature, and so on. The input historical text, waveform features, and waveform time-series features of the speech modality and the lip-trajectory features of the video modality are treated as distinct entities, which are aggregated, stored, and associated on the multi-modal knowledge graph; the knowledge is continuously extended while the system runs, supporting the enhancement and verification of text reasoning in subsequent modules. After aggregation, the waveform time-series features and lip-trajectory features serve as input to the multi-modal fusion module, the historical text as input to the semantic text reasoning module, and the waveform features as input to the speech completion module.
The joint characterization is a Seq2Seq-based multi-modal joint characterization: the cross-modal interaction is built on a Seq2Seq model in which the cross-modal conversion model uses a BiLSTM as both encoder and decoder; by training on the translation from lip-trajectory features to waveform time-series features and the reverse translation from waveform time-series features to lip-trajectory features, a joint representation of the two modalities is obtained.
The lip-feature modality is strengthened as follows: a phoneme inference model with the same structure as the cross-modal conversion model receives the joint representation and outputs a time-series phoneme posterior probability matrix y = (y1, y2, ..., yt, ..., yT) of shape (T, |A|), where |A| is the size of the phoneme set A to be recognized and each row yt = (yt1, yt2, ..., yta, ..., yt|A|) gives the probability that the t-th frame is each phoneme a.
The phonemes to be recognized comprise all known phonemes plus BLANK, denoted "-"; when the LSTM output is converted into a phoneme sequence, BLANK distinguishes adjacent phonemes that are pronounced identically.
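The collapsing map B used in the next paragraph can be sketched directly: merge consecutive repeats of the same symbol, then drop BLANK. This shows why BLANK is needed — without it, a genuinely repeated phoneme would collapse into one:

```python
# CTC-style collapse B: per-frame path -> phoneme sequence.
def ctc_collapse(path, blank="-"):
    out, prev = [], None
    for p in path:
        if p != prev:           # merge runs of the same symbol
            if p != blank:
                out.append(p)   # keep everything except BLANK
        prev = p
    return out
```

For example, the path "aa-ab-b" collapses to the phoneme sequence a, a, b, b, whereas "aab" would collapse to a, b.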
The phoneme inference model is trained as follows: a CTC transcription layer is attached after the BiLSTM of the phoneme prediction model's decoder in order to increase the probability p(L|x) that the BiLSTM outputs the correct result L given an input x. Since one phoneme in L is composed of the predictions of multiple time slices in y, several paths π may map to the same L, i.e. B(π) = L, where B is the collapsing function. The CTC transcription layer therefore adjusts the LSTM parameters ω by gradient descent so that, over the input samples, p(L|x) = Σ over π ∈ B⁻¹(L) of p(π|x) is maximized.
Phoneme recognition means: for the speech packet-loss region, the lip-trajectory features extracted from the video are input into the phoneme inference model; each frame yields a phoneme inference vector of size |A|, in which each value P(a) is the probability that the frame corresponds to phoneme a.
Domain-session modeling means inferring the knowledge domain of the semantic context, such as the financial industry, travel activities, or everyday chat, from the historical text. A discrimination measure over domain keywords identifies the knowledge domain related to the historical text; a time-series relevance measure EMI(e, w) between text entities represents the likelihood of entity w appearing after entity e within a given text step; and the EMI(e, w) values above a threshold are concatenated into a domain text vector as output, realizing the modeling of the domain session. First, during training of the domain-session model, an initial text set for each domain is generated from session samples of the different domains, and domain entities are associated many-to-many with text entities in the multi-modal knowledge graph. Then, to produce discriminative domain keywords, for each text word j the frequency f_ij of its occurrences in the text set of domain i is computed, the maximum frequency max_f is recorded, and the number N_j of the N domains in which word j occurs is counted; a formula over f_ij, max_f, N, and N_j then computes the discrimination degree of text word j for domain i. Compared with the TF-IDF formula, this calculation both accounts for the length of the domain text set by converting term counts into frequencies and guarantees, through a normalizing transformation, that the TF-like part remains non-negative, realizing the discrimination measure of domain keywords; the relevant knowledge domain is then identified by searching the historical text for these keywords.
Second, within each knowledge domain, the mutual information between entities is estimated from the frequency with which two text entities occur in succession within a given text step, representing the relevance of entity2 following entity1 in the text sequence: when EMI is greater than zero, a larger value means a higher probability of co-occurrence; when EMI is less than zero, the two entities are mutually exclusive. This defines the relevance measure between text entities in each domain. Finally, while the system is in use, new text entities are extended and associated to the domain entities according to the real session data, and if the session repeatedly jumps among several domains, the domains are split and regenerated by an unsupervised clustering method.
Candidate-text prediction, i.e. candidate-text prediction based on the spatio-temporal knowledge graph, means: a spatio-temporal knowledge-graph network is formed from the time-series characteristics of the historical text, the joint probability P(w) of each current candidate text is inferred, and the P(w) values above a threshold are concatenated into a candidate text vector, pruning the solution space. Specifically: the historical text already formed can be viewed as a path walking among knowledge-graph entities, represented by a combination of several text-entity binary vectors (from, to). Because audio-video calls are real-time, the system cannot accurately predict the future at the current moment, and even with correct phoneme recognition it cannot guarantee predicting the semantically appropriate text entity; therefore several alternative paths with smaller weights are kept at the entity nodes preceding the path's terminal node, allowing semantic backtracking when an inference error occurs, and forming the path-walk graph G_t at the current moment t. On top of the multi-modal knowledge graph, the path-walk graphs of all previous moments are stacked along the time dimension to form the spatio-temporal knowledge graph G, where the "space" refers to the solution space.
To improve inference efficiency, one assumption is made: the path-walk graph at time t depends only on the path-walk graphs of the previous s time steps. The spatio-temporal knowledge-graph network is trained to optimize the inferred joint probability distribution of G, P(G) = Π over t of P(G_t | G_{t-s:t-1}), where P(G_t | G_{t-s:t-1}) can be further factorized into P(to_t | from_t, G_{t-s:t-1}) · P(from_t | G_{t-s:t-1}). Conditioning on all the neighbor entity nodes of from_t has two benefits: it covers the candidate-text space and supports frequent-pattern mining. For this factorization, a text-oriented spatio-temporal graph neural network is built on a recurrent neural network (RNN), and the factors are parameterized as follows: P(to_t | from_t, G_{t-s:t-1}) → exp((e_{from_t} : h_{t-1}(from_t))^T · ω_{to_t}), where e_{from_t} is a learnable vector associated with from_t, h_{t-1}(from_t) is the historical semantic vector of from_t, and ω_{to_t} is a classifier parameter; similarly, P(from_t | G_{t-s:t-1}) → exp(H_{t-1}^T · ω_{from_t}), where H_{t-1} is the historical semantic vector of the full walk path. On this basis, h_t(from_t) = RNN1(g(G_t(from_t)), h_{t-1}(from_t)) and H_t = RNN2(g(G_t), H_{t-1}); that is, the historical semantic vectors are updated recursively through the RNNs, g is an aggregation function with an attention mechanism, and the importance weight of each neighbor entity node relative to the from node is learned through an attention matrix.
The matching of the recognized phonemes with the text in the solution space means: according to the phonemes corresponding to each text, a Cartesian product is performed on the phoneme inference vector, the domain text vector and the candidate text vector to obtain the text solution space with intersecting phonemes; the probability of each text in the solution space is calculated through the formula P = P(a)·P(w)·EMI(e, w), the text with the maximum value is taken as the completed text for voice completion, and the texts ranked in the top three for the same phonemes are retained as alternative texts, together forming alternative paths with smaller weights to be used for semantic backtracking that may occur during the next text inference.
The semantic-based statement generation specifically includes: setting a threshold T and a confidence coefficient α, the probabilities of the T text entities at the end of each alternative path are summed; the path with the maximum sum is spliced to generate the sentence, while paths whose sum is smaller than α·T are contracted so that the graph only contains paths whose sum is larger than α·T, thereby realizing further pruning of the path walk graph at the current moment and optimizing the efficiency of the system's continuous reasoning.
The synthesis of the missing speech is as follows: converting the text sequence with the time stamp into a phoneme sequence with the time stamp, learning the mapping of phonemes to waveform characteristics according to the voice waveform characteristics of a user through a TTS model, and reversely converting the characteristics into waveforms through a vocoder.
The voice splicing means that: the existing voice waveform and the completed voice waveform are spliced after a stretch-and-fit adjustment near the junction point, so that the voice transition is smoother and more natural.
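By way of illustration only, a cross-fade near the junction point is one simple way to realize the stretch-and-fit splicing described above; the function name, overlap length and raised-cosine window below are assumptions, not the patent's implementation:

```python
import math

def splice_with_crossfade(existing, completed, overlap=64):
    """Splice an existing waveform and a synthesized completion waveform.

    Near the junction the two segments are cross-faded with a raised-cosine
    window (a stand-in for the described stretch-and-fit adjustment), so the
    transition is smoother than a hard cut. `overlap` is the number of
    samples blended.
    """
    overlap = min(overlap, len(existing), len(completed))
    head = existing[:len(existing) - overlap]
    out = list(head)
    for i in range(overlap):
        # fade weight rises smoothly from 0 to 1 across the overlap region
        w = 0.5 * (1 - math.cos(math.pi * i / max(overlap - 1, 1)))
        out.append((1 - w) * existing[len(existing) - overlap + i] + w * completed[i])
    out.extend(completed[overlap:])
    return out
```

A usage note: with an overlap of 10 samples, splicing a 100-sample segment onto another 100-sample segment yields 190 samples, since the overlap region is shared.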
Technical effects
Compared with the technical means commonly used in lip feature extraction, namely extracting image features from the lip motion video and feeding them into a recurrent neural network, the present method takes the lip control points in the image as the focus of feature extraction, which greatly reduces irrelevant variables in the feature extraction process; spatio-temporal graph modeling and feature extraction are then performed on the spatio-temporal relations between the lip control points, giving the feature extraction process higher interpretability.
Compared with the common technical means in speech completion, namely speech prediction based on historical speech, the method provided by the invention has the advantage that information relied on by inference is supplemented to a great extent through interaction and complementation of multi-modal characteristics.
The method is based on the preset and evolved reference information in the knowledge graph to carry out the text pruning reasoning mode combining the domain session modeling and the candidate text prediction, extracts the historical semantics of the context and the wandering probability of the text path, combines the domain knowledge, enables the semantics of the result to be more appropriate, and has higher accuracy.
Drawings
FIG. 1 is a process framework diagram of the present invention;
FIG. 2 is a system block diagram of an embodiment of the present invention;
in the figure: three dotted line boxes respectively represent system input, system internal function modules and system output, solid line connecting lines and arrows represent the direction of data flow in the interaction process of each module, solid line connecting lines without arrows represent internal models on which functions depend, and dotted line connecting lines represent external models on which function modules depend.
Detailed Description
As shown in fig. 1, the present embodiment relates to a speech adaptive completion system based on a multi-modal knowledge graph, which comprises: a voice preprocessing module, a voice analysis module, a video preprocessing module, a spatio-temporal-based image analysis module, a multi-modal data aggregation module, a multi-modal information fusion module, a semantic text reasoning module and a voice completion module, wherein: the voice preprocessing module collects and preprocesses the voice packets at the receiving end, taking the low-quality real-time audio with lost data packets as input; through voice data packet detection, voice framing, audio windowing and endpoint detection it performs preliminary processing on the voice modal data, obtaining preprocessed waveforms which are output to the voice analysis module. The video preprocessing module collects and preprocesses the video packets at the receiving end, taking continuous video images as input; through video framing, face lip control point detection, lip region scale normalization and time alignment in sequence it performs preliminary processing on the video modal data, obtaining preprocessed images which are output to the image analysis module. The voice analysis module extracts the historical text, waveform features and waveform time sequence features from the preprocessed waveforms and outputs them as voice modal data to the multi-modal data aggregation module. The spatio-temporal-based image analysis module constructs a spatio-temporal graph for each preprocessed frame's lip control point set, builds a spatio-temporal graph convolutional neural network, extracts the lip motion features of each frame from that frame's context in the spatio-temporal graph, and combines the lip motion features into lip trajectory features, which are input to the multi-modal data aggregation module as video modal data. The multi-modal data aggregation module aggregates, stores and associates the historical text, waveform features and waveform time sequence features of the voice modality and the lip trajectory features of the video modality, providing support for the fusion and reasoning of the subsequent modules. The multi-modal information fusion module realizes feature alignment of the waveform time sequence features and lip trajectory features through cross-modal interaction: a cross-modal conversion model is trained, the hidden-state features arising during the mutual conversion of lip trajectory features and waveform time sequence features are taken as the joint characterization between the two modalities, and the joint characterization information is converted into phoneme information by training a phoneme prediction model, enhancing the capability of the lip feature modality to characterize phoneme information; phoneme recognition is performed on the voice data packet loss area based on the lip trajectory features, and the phoneme sequence is spliced as the input of the semantic text reasoning module. The semantic text reasoning module, for the voice packet loss area, identifies the related knowledge domain from the historical text of the current conversation and predicts candidate texts based on the spatio-temporal knowledge graph, pruning the solution space to optimize text reasoning; it matches the recognized phonemes with the texts in the solution space, thereby inferring and generating the completed text, which is output to the voice completion module. The voice completion module synthesizes the missing voice according to the completed text and the collected waveform features of the user's voice, and fills the completed voice into the original voice through voice splicing, forming a complete and natural voice segment.
As shown in fig. 2, the embodiment is divided into a mobile terminal application, a voice adaptive completion system and an infrastructure layer, wherein the voice adaptive completion system is the core content of the whole framework, and includes a voice preprocessing module, a video preprocessing module, a voice analysis module, a spatio-temporal based video analysis module, a multi-modal data aggregation module, a multi-modal information fusion module, a semantic text reasoning module and a voice completion module, and supports the receiving, analysis, reasoning and completion of audio and video data through the interaction and cooperation among the modules.
The voice preprocessing module comprises: a voice data packet detection unit, a voice framing unit, an audio windowing unit and an endpoint detection unit, wherein: the voice data packet detection unit marks the voice data packet loss area according to whether the voice data packet at the current moment has been received, and the voice framing unit performs framing on the voice signal by using its short-time stationarity. The voice framing process adopts an overlapped segmentation strategy: each frame is sampled according to a preset frame length and overlap ratio (frame shift), so that the previous frame transitions smoothly into the next frame and the continuity of the samples is kept. The audio windowing unit adopts a Hamming window and multiplies the voice data in each frame by the window function, highlighting the data in the middle and attenuating the data at the two sides. The endpoint detection unit calculates the short-time energy and short-time average zero-crossing rate of each frame's data and classifies voice-active and voice-silent areas in real time through a double-threshold comparison method, thereby locating the start and end points of the effective voice areas within the whole voice activity.
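The overlapped framing, Hamming windowing and short-time statistics described above can be sketched as follows; the frame length, frame shift and the simplified either-threshold decision rule are illustrative assumptions (a full double-threshold detector additionally tracks state across frames):

```python
import math

def frame_signal(x, frame_len=400, frame_shift=160):
    """Overlapped segmentation: consecutive frames share frame_len - frame_shift samples."""
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, frame_shift)]

def hamming(n):
    """Hamming window coefficients: emphasize the middle, attenuate the edges."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def frame_stats(frame):
    """Short-time energy (of the windowed frame) and zero-crossing rate."""
    w = hamming(len(frame))
    windowed = [s * wk for s, wk in zip(frame, w)]
    energy = sum(s * s for s in windowed)
    zcr = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / len(frame)
    return energy, zcr

def is_speech(frame, e_thresh, z_thresh):
    """Simplified per-frame decision: active if either measure exceeds its threshold."""
    energy, zcr = frame_stats(frame)
    return energy > e_thresh or zcr > z_thresh
```

With a 400-sample frame and 160-sample shift (25 ms / 10 ms at 16 kHz), a 1600-sample signal yields 8 overlapping frames.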
The voice analysis module comprises: a speech recognition unit, an MFCC-based feature extraction unit and a BILSTM-based time sequence feature recognition unit, wherein: the speech recognition unit adopts an external STT model to obtain text from the voice as context for text inference; the MFCC-based (Mel-frequency cepstral coefficient) feature extraction unit extracts M-dimensional cepstral feature parameters from each frame of preprocessed voice data in the Mel-scale frequency domain, achieving feature extraction and dimensionality reduction, and splices the cepstral feature parameters of the current frame and the previous T-1 frames into a waveform feature map of size (1, T, M). However, a pronunciation is usually composed of many frames of speech data, and analyzing the feature vector of a single frame in isolation produces large errors, so the temporal context features of each frame must be combined: the BILSTM-based time sequence feature recognition unit uses a deep bidirectional LSTM network to further extract the time sequence features of each frame's context on the basis of the MFCC waveform feature map, outputting a waveform time sequence feature matrix of size (T, N_1), where N_1 is the number of features per frame.
The video preprocessing module comprises: a video framing unit, a face lip control point detection unit, a lip region scale normalization unit and a time alignment unit, wherein: the video framing unit converts the input video data packet into an image sequence using the same sampling frequency as voice framing; the face lip control point detection unit detects the control points of the human lips in each frame of image one by one through an external face recognition engine, the control points including the lip center coordinates, the upper-lip middle-upper boundary coordinates, the left mouth corner coordinates and the like; the lip region scale normalization unit fits a quadrilateral lip detection frame from the left and right mouth corner points, the lip center coordinates and the upper and lower lip boundary coordinates, and rotates and scales the lip control points to a uniform size through perspective transformation, realizing scale normalization of the lip area while keeping the continuity of the control point motion tracks; and the time alignment unit fits the lip control point set corresponding to each frame of audio by the Lagrange interpolation method, realizing time alignment from image to audio.
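The Lagrange interpolation used by the time alignment unit can be sketched as below; the function name and the example of interpolating one control point coordinate are illustrative (in practice each (x, y) coordinate of each control point would be interpolated at every audio-frame timestamp):

```python
def lagrange_interp(samples, t):
    """Evaluate the Lagrange interpolating polynomial through `samples`
    (a list of (time, value) pairs) at time t.

    Used here to fit, for each audio frame time, an approximate
    lip-control-point coordinate from the surrounding video frames; a short
    control-point track is assumed well approximated by a low-degree
    polynomial, as the time alignment unit requires.
    """
    total = 0.0
    for i, (ti, yi) in enumerate(samples):
        term = yi
        for j, (tj, _) in enumerate(samples):
            if i != j:
                # Lagrange basis polynomial l_i(t)
                term *= (t - tj) / (ti - tj)
        total += term
    return total
```

For collinear samples the interpolant reduces to the straight line through them, so intermediate audio-frame times recover the expected coordinate exactly.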
The spatiotemporal-based image analysis module comprises: a lip control point spatio-temporal graph construction unit and a lip motion feature extraction unit, wherein: the spatio-temporal graph construction unit collects the lip control points of all input frames and uses the natural connection relations between them, connecting the control points in each frame according to the anatomy of the human lips and adding a self-loop on each control point, to construct a spatial graph for each frame; the same control point in two adjacent frames is then connected to form a time sequence edge representing the lip motion track information between the two moments, so that the spatial information and the time sequence information of the lip control points are modeled simultaneously. The lip motion feature extraction unit constructs a spatio-temporal graph convolutional neural network; for the current frame, the input of the spatial graph convolutional network is a 3-dimensional matrix (C, T, V), where C is the feature dimension of the lip control points (the control point coordinates are used as features), T denotes the current frame and the previous T-1 frames, and V is the number of lip control points.
In space, a graph partitioning strategy is adopted: the graph G of each frame is decomposed into three subgraphs G1, G2 and G3, which respectively represent the centripetal, centrifugal and static motion characteristics of a control point. In G1 each control point is connected to the neighbor control points closer to the lip center than itself, in G2 to the neighbor control points farther from the lip center than itself, and in G3 to itself; accordingly, 3 convolution kernels of size (1, V, V) are used, and the local features of adjacent control points are obtained through weighted averaging. In time, in order to superimpose the time sequence features on the spatial features of the current frame, a temporal convolutional neural network is adopted: a convolution kernel of size (T, 1) fuses the features of the current frame and the previous T-1 frames for each lip control point, obtaining the local feature of each control point's change over time. Lip motion features are extracted with these spatial and temporal convolutions; the output for each frame is (1, V, C2), where C2 is the number of features extracted for each control point, and after splicing, the output is the lip trajectory feature of size (T, V, C2).
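The partition of each control point's neighborhood into the G1/G2/G3 subsets can be sketched as follows; the data layout (a dict of coordinates and an undirected edge list) is an assumption for illustration:

```python
def partition_neighbors(points, edges, center):
    """Split each control point's neighborhood into the three subsets of the
    partition strategy: centripetal (neighbor closer to the lip center),
    centrifugal (neighbor farther from it), and static (the self-loop).

    `points` maps node id -> (x, y); `edges` is a list of undirected pairs;
    `center` is the lip center coordinate. Returns
    {node: {'centripetal': [...], 'centrifugal': [...], 'static': [...]}}.
    """
    def dist(p):
        return ((p[0] - center[0]) ** 2 + (p[1] - center[1]) ** 2) ** 0.5

    neigh = {v: [] for v in points}
    for a, b in edges:
        neigh[a].append(b)
        neigh[b].append(a)

    parts = {}
    for v in points:
        g1 = [u for u in neigh[v] if dist(points[u]) < dist(points[v])]
        g2 = [u for u in neigh[v] if dist(points[u]) > dist(points[v])]
        parts[v] = {'centripetal': g1, 'centrifugal': g2, 'static': [v]}
    return parts
```

Each subset then gets its own (1, V, V) convolution kernel, matching the three-kernel scheme described above.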
The multi-modal data aggregation module defines the ontology types, attributes and relations of the multi-modal knowledge graph; the historical text, waveform features and waveform time sequence features of the voice modality and the lip trajectory features of the video modality are treated as different entities, which are aggregated, stored and associated based on the multi-modal knowledge graph, and knowledge is continuously expanded during the running of the system, providing support for the enhancement and verification of text reasoning in the subsequent modules. In addition, after the data is compiled, the waveform time sequence features and lip trajectory features are used as the input of the multi-modal fusion module, the historical text as the input of the semantic text reasoning module, and the waveform features as the input of the voice completion module.
The multi-modal information fusion module comprises: a Seq2Seq-based multi-modal joint characterization unit, a lip feature modality characterization enhancement unit and a lip-trajectory-based phoneme recognition unit, wherein: the Seq2Seq-based multi-modal joint characterization unit uses BILSTM as the encoder and decoder for cross-modal conversion, and obtains the joint characterization of the two modalities by training the translation from lip trajectory features to waveform time sequence features and the reverse translation from waveform time sequence features to lip trajectory features. The phoneme inference model of the lip feature modality characterization enhancement unit adopts the same structure as the cross-modal conversion model; its input is the joint characterization and its output is a time sequence phoneme posterior probability matrix y = (y_1, y_2, ..., y_T) of size (T, |A|), where |A| is the size of the phoneme set A to be recognized and each row y_t = (y_t1, y_t2, ..., y_t|A|) gives the probability that the t-th frame is each phoneme a. The phonemes to be recognized include all known phonemes and BLANK, represented as "-", which, when the output of the LSTM is converted into a phoneme sequence, distinguishes adjacent phonemes with the same pronunciation. During training, CTC is attached as a transcription layer after the BILSTM of the phoneme prediction model decoder, in order to maximize the probability p(L|x) that the BILSTM outputs the correct result L given an input x.
Since one phoneme in L is composed of the prediction results of multiple time slices in y, there may be multiple paths π that map to L, i.e. B(π) = L, where B is the mapping function that merges repeated symbols and removes BLANKs; then p(L|x) = Σ_{π ∈ B^{-1}(L)} p(π|x). The CTC transcription layer adjusts the parameters ω of the LSTM through the gradient ∂p(L|x)/∂ω, so that p(L|x), summed over π ∈ B^{-1}(L), is maximized for the input samples. For the voice region where the voice data packet is lost, the lip-trajectory-based phoneme recognition unit inputs the lip trajectory features extracted from the video into the phoneme inference model, obtaining for each frame a phoneme inference vector of size |A|, each element of which expresses the probability that the frame is a certain phoneme.
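The CTC mapping B described above, and a greedy best-path decoder that applies it, can be sketched as follows; greedy decoding is a simplifying assumption (training sums p(π|x) over all of B⁻¹(L), and stronger decoders use beam search):

```python
def ctc_collapse(path, blank='-'):
    """The CTC mapping B: merge consecutive repeated symbols, then drop
    BLANKs, so that many frame-level paths map onto one phoneme sequence L.
    BLANK between identical symbols keeps repeated phonemes distinct."""
    out = []
    prev = None
    for sym in path:
        if sym != prev:
            if sym != blank:
                out.append(sym)
            prev = sym
    return out

def greedy_decode(posteriors, phonemes):
    """Best-path decoding of a (T, |A|) posterior matrix: take the argmax
    phoneme per frame, then apply B. A sketch only; it approximates the
    p(L|x) objective rather than computing it exactly."""
    path = [phonemes[max(range(len(row)), key=row.__getitem__)] for row in posteriors]
    return ctc_collapse(path)
```

Note that the path ['a', 'a', '-', 'a'] collapses to ['a', 'a'], not ['a']: the BLANK between the runs is what preserves adjacent phonemes with the same pronunciation.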
The semantic text reasoning module comprises: the system comprises a domain session modeling unit, a candidate text prediction unit based on a space-time knowledge graph, a text inference unit and a sentence generation unit based on semantics, wherein: the domain conversation modeling unit identifies the related knowledge domain according to the historical text of the current conversation, and meanwhile, the candidate text prediction unit predicts the candidate text based on the space-time knowledge graph so as to prune and optimize the solution space size of text inference. The text reasoning unit matches the recognized phonemes with the text in the solution space, and the semantic-based sentence generating unit infers and generates a completed text as the input of the voice completing module.
The domain session modeling unit is as follows: it deduces the knowledge domain of the semantic context, such as the financial industry, travel activities or daily chat, from the historical text, mainly by defining a discrimination measure for domain keywords, combined with a time sequence relevance measure EMI(e, w) between text entities that represents the possibility of entity w appearing after entity e within a certain text step; the EMI(e, w) values larger than a certain threshold are spliced into a domain text vector as output, thereby realizing the modeling of the domain session. First, during training of the domain session model, an initial text set of each domain is generated from conversation samples of the different domains, and domain entities and text entities are associated many-to-many in the multi-modal knowledge graph. Then, in order to generate domain keywords with discrimination, the frequency f_ij with which each text word j occurs in the domain text set i is calculated, the maximum frequency max_f is counted, and the number n_j of the N domains in which each text word occurs is calculated; the discrimination degree of text word j for domain i is then computed by the formula. Compared with the TF-IDF formula, this calculation not only takes the length of the domain text set into account by converting the raw count into a frequency, but also ensures through a normalization conversion that the TF part is always non-negative, thereby realizing the discrimination measurement of domain keywords; the related knowledge domain is then identified by searching for these keywords in the historical text.
Secondly, in each knowledge domain, the mutual information EMI(entity1, entity2) between entities is obtained by calculating the frequency with which two text entities occur successively within a certain text step length, representing the relevance of entity2 to entity1 in the text sequence: when EMI is greater than zero, a larger value indicates a higher probability of co-occurrence; when EMI is less than zero, the two entities are mutually exclusive. This defines the relevance measure between text entities in each domain. Finally, during use of the system, new text entities are expanded and associated to the domain entity according to the real conversation data, and if the conversation repeatedly jumps among several domains, the domains are split and regenerated based on an unsupervised clustering method.
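A minimal sketch of the two measures above; the exact discrimination formula in the source is not reproduced, so the TF-IDF-style variant below is an assumption with the stated properties (frequency-normalized, always-non-negative TF part and a cross-domain IDF part), and EMI is rendered as a pointwise-mutual-information-style score:

```python
import math

def discrimination(f_ij, max_f, n_j, N):
    """Assumed discrimination score of text word j for domain i:
    tf is normalized by the maximum frequency max_f (so it is non-negative),
    and idf penalizes words occurring in many of the N domains."""
    tf = 0.5 + 0.5 * (f_ij / max_f)
    idf = math.log(N / n_j)
    return tf * idf

def emi(count_ew, count_e, count_w, total):
    """Assumed concrete form of EMI(e, w): relevance of entity w following
    entity e within a text step. Positive means they tend to co-occur,
    negative means they are mutually exclusive, zero means independence."""
    p_ew = count_ew / total
    p_e = count_e / total
    p_w = count_w / total
    return math.log(p_ew / (p_e * p_w))
```

For a word that appears at the maximum frequency in one domain only, the score reduces to log(N); for statistically independent entities, emi returns 0, matching the sign convention described above.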
The candidate text prediction unit based on the spatio-temporal knowledge graph specifically comprises: forming a spatio-temporal knowledge graph network by using the time sequence characteristics of the historical text, reasoning the joint probability representation P(w) of the current candidate texts, and splicing the texts whose P(w) exceeds a certain threshold into candidate text vectors, thereby realizing pruning of the solution space. Specifically: the historical text that has already been formed can be viewed as a path that walks between knowledge-graph entities, represented by a combination of several text entity binary vectors (from, to). Due to the real-time nature of audio and video calls, the future cannot be predicted accurately at the current moment, and even when phoneme recognition is correct, the text entity with truly appropriate semantics may not be predicted; therefore, several alternative paths with smaller weights are retained on the entity nodes preceding the path-end entity node, for semantic backtracking when inference errors occur, forming a path walk graph G_t at the current moment t. On the basis of the multi-modal knowledge graph, G_t is superimposed with the path walk graphs of each previous moment in the time dimension to form a spatio-temporal knowledge graph G, where "space" refers to the solution space. To improve reasoning efficiency, it is assumed that the path walk graph at moment t depends only on the path walk graphs of the previous s time steps, and the spatio-temporal knowledge graph network is trained to optimize the joint probability distribution P(G) = ∏_t P(G_t | G_{t-s:t-1}), where P(G_t | G_{t-s:t-1}) can be split into ∏_{(from_t, to_t) ∈ G_t} P(to_t | from_t, G_{t-s:t-1}) · P(from_t | G_{t-s:t-1}). Further, conditioning each prediction on N_t(from_t), the set of all neighbor entity nodes of from_t, has two benefits: the candidate text space is covered, and the capability of frequent pattern mining is provided.
For this formula, a text-oriented spatio-temporal graph neural network is established based on the recurrent neural network (RNN), and the formula is parameterized as: P(to_t | from_t, G_{t-s:t-1}) → exp([e_{from_t} ; h_{t-1}(from_t)]^T · ω_{to_t}), where e_{from_t} is a learnable vector associated with from_t, h_{t-1}(from_t) is the historical semantic vector of from_t, and ω_{to_t} is a classifier parameter; similarly, P(from_t | G_{t-s:t-1}) → exp(H_{t-1}^T · ω_{from_t}), where H_{t-1} is the historical semantic vector of the full walk path. On this basis, h_t(from_t) = RNN_1(g(N_t(from_t)), h_{t-1}(from_t)) and H_t = RNN_2(g(G_t), H_{t-1}); that is, the historical semantic vectors are updated recursively through the RNNs, g is an aggregation function with an attention mechanism, and the importance weight of each neighbor entity node to the from entity node is learned through an attention matrix.
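The factorized edge probability described above can be sketched as follows; every vector, dictionary layout and name here is illustrative (the unit's actual parameters are learned RNN states and classifier weights), and the softmax normalization stands in for the exp(·) scores:

```python
import math

def softmax(scores):
    """Normalize a dict of raw scores into a probability distribution."""
    m = max(scores.values())
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}

def edge_log_prob(from_t, to_t, e_vec, h_hist, H_hist, w_to, w_from, entities):
    """Sketch of log P(to_t | from_t, G) + log P(from_t | G): the to-score
    is the dot product of the concatenation [e_from ; h_{t-1}(from)] with a
    per-entity classifier vector, and the from-score uses the full-path
    history vector H_{t-1}; both are normalized over all candidate entities."""
    joint = e_vec[from_t] + h_hist[from_t]  # list concatenation = [e ; h]
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    p_to = softmax({v: dot(joint, w_to[v]) for v in entities})[to_t]
    p_from = softmax({v: dot(H_hist, w_from[v]) for v in entities})[from_t]
    return math.log(p_to) + math.log(p_from)
```

Summing these log-probabilities over the edges of G_t gives log P(G_t | G_{t-s:t-1}) in the factorization above.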
The text reasoning unit: according to the phonemes corresponding to each text, a Cartesian product is performed on the phoneme inference vector, the domain text vector and the candidate text vector to obtain the text solution space with intersecting phonemes; the probability of each text in the solution space is calculated through the formula P = P(a)·P(w)·EMI(e, w), the text with the maximum value is taken as the completed text for voice completion, and the texts ranked in the top three for the same phonemes are retained as alternative texts, together forming alternative paths with smaller weights to be used for semantic backtracking that may occur during the next text inference.
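The combination of the three evidence sources by the text reasoning unit can be sketched as below; the dictionary layout and the choice to keep exactly the next two ranked texts as alternatives are illustrative assumptions:

```python
def rank_candidates(cands):
    """Score each candidate text as P = P(a) * P(w) * EMI(e, w), where
    `cands` maps text -> (p_phoneme, p_candidate, emi). Returns the
    completed text (maximum score) plus the remaining top-ranked texts,
    kept as alternative texts for later semantic backtracking."""
    scored = sorted(cands,
                    key=lambda t: cands[t][0] * cands[t][1] * cands[t][2],
                    reverse=True)
    return scored[0], scored[1:3]
```

The highest-scoring text is spliced into the completion, while the alternatives form the smaller-weight alternative paths described above.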
The statement generating unit based on semantics specifically comprises: setting a threshold T and a confidence coefficient α, the probabilities of the T text entities at the end of each alternative path are summed; the path with the maximum sum is spliced to generate the sentence, while paths whose sum is smaller than α·T are contracted so that the graph only contains paths whose sum is larger than α·T, thereby realizing further pruning of the path walk graph at the current moment and optimizing the efficiency of the system's continuous reasoning.
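The pruning rule above can be sketched as follows; representing a path as a list of its entity probabilities is an assumption made for illustration:

```python
def prune_paths(paths, T, alpha):
    """Sentence-generation pruning: for each alternative path, sum the
    probabilities of its last T text entities. The best-scoring path is
    the one spliced into the sentence; paths whose sum falls below
    alpha * T are contracted (dropped), so only paths above that bound
    remain in the walk graph. `paths` maps path id -> list of probs."""
    sums = {k: sum(probs[-T:]) for k, probs in paths.items()}
    best = max(sums, key=sums.get)
    kept = {k: p for k, p in paths.items() if sums[k] >= alpha * T}
    return best, kept
```

Since each entity probability is at most 1, the sum over T entities is at most T, so α acts as a fraction of the best possible score.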
The voice completion module comprises: a speech synthesis unit and a speech concatenation unit, wherein: the speech synthesis unit converts the time-stamped text sequence into a time-stamped phoneme sequence, learns the mapping from phonemes to waveform features according to the user's voice waveform features through a TTS model, and converts the features back into waveforms through a vocoder; the speech concatenation unit splices the existing voice waveform and the completed voice waveform after a stretch-and-fit adjustment near the junction point, so that the voice transition is smooth and natural.
TABLE 1 comparison of technical characteristics
The system is based on the knowledge graph, a spatiotemporal graph with structural significance is constructed on the aspect of image feature extraction, a spatiotemporal knowledge graph with semantic association is constructed on the aspect of text reasoning, and the reasoning process is more interpretable; by adopting a cross-modal reasoning mode and taking the multi-modal knowledge graph as data support, the multi-modal information is converged, associated and complemented, and the method has higher integrity; the domain conversation and the time sequence text are modeled, and text reasoning with context is realized based on preset and evolution information in the knowledge graph, so that the accuracy of the result is improved.
Compared with the prior art, the method solves the problems of low voice completion accuracy, poor interpretability and low integrity of information and data in the mobile-terminal audio and video communication scene; it makes full use of each modality's data at the receiving terminal for text reasoning and voice completion, improves the intelligence level of the system, and provides strong technical support for repairing lost voice data packets.
The foregoing embodiments may be modified in many different ways by those skilled in the art without departing from the spirit and scope of the invention, which is defined by the appended claims and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
Claims (10)
1. A system for speech adaptive completion based on multimodal knowledge-graphs, comprising: a data receiver, a data analyzer, and a data reasoner, wherein: the data receiver performs preprocessing according to the received audio and video data and outputs the preprocessed audio and video data to the data analyzer; the data analyzer analyzes the voice and the image to extract a waveform time sequence characteristic and a lip track characteristic, and a phoneme sequence is obtained through multi-modal combined characterization; and the data inference device carries out domain conversation modeling and candidate text prediction according to the historical text, carries out text inference by combining the phoneme sequence to obtain sentences with semantics, and synthesizes complete voice according to the waveform characteristics.
2. The multi-modal knowledge-graph based speech adaptive completion system of claim 1, wherein said data receiver comprises: a data receiving module, a voice preprocessing module and a video preprocessing module, wherein: the data receiving module receives and analyzes the application's audio and video data packets, outputting the voice packets to the voice preprocessing module and the video packets to the video preprocessing module; the voice preprocessing module collects and preprocesses the voice packets, taking the low-quality real-time audio with lost data packets as input, and performs preliminary processing on the voice modal data through voice data packet detection, voice framing, audio windowing and endpoint detection, obtaining preprocessed waveforms which are output to the voice analysis module; the video preprocessing module collects and preprocesses the video packets, taking continuous video images as input, and performs preliminary processing on the video modal data through video framing, face lip control point detection, lip region scale normalization and time alignment in sequence, obtaining preprocessed images which are output to the image analysis module;
the voice data packet detection, and the obtained active state of the voice data comprises: a voice occurrence area, a voice silence area, and a voice packet loss area, wherein: distinguishing whether voice is silent is favorable for reducing unnecessary voice recognition and completion, identifying a voice data packet area for performing voice completion on the part of area in the follow-up process, detecting the voice data packet to classify voice activity for the first time, labeling the voice area at the moment as TRUE or NONE according to whether the voice data packet at the current moment is received, completing the area labeled as NONE through a semantic text reasoning module and a voice completion module, and further distinguishing the area where voice appears and voice disappears through an endpoint detection mode in the area labeled TRUE;
in the voice framing, the integral signal is unstable due to the time-varying characteristic of the voice signal, and the MFCC characteristic extraction in the voice analysis module uses Fourier transform and needs stable input signals, so that the voice signal is framed by using the short-time stationarity of the voice signal, and each frame is sampled according to the preset frame length and the preset overlap ratio (frame shift) in the voice framing process by adopting an overlap segmentation strategy, so that the previous frame is in smooth transition to the next frame, and the continuity of the samples is kept;
the video framing adopts the same sampling frequency as the voice framing to convert the video into an image sequence;
the face lip control point detection detects, frame by frame through an external face recognition engine, the control points of the lips in each image, the control points including the lip center coordinate, the upper and lower lip boundary coordinates, and the left and right mouth corner coordinates;
in the lip region scale normalization, since the subsequent image analysis module only focuses on the relative motion of the lip control points, the influence of lip size, face deflection angle and tilt angle in the image must be reduced; a quadrilateral lip detection frame is therefore fitted from the left and right mouth corner points, the lip center coordinate and the upper and lower lip boundary coordinates, and the lip control points are rotated and scaled to a uniform size through a perspective transformation, realizing scale normalization of the lip region while preserving the continuity of the control point motion tracks;
the time alignment makes each frame of audio correspond to one frame of image, to facilitate cross-modal interaction between the voice modality and the video modality; since a short-time lip control point track can be approximated by a simple curve, the lip control point set corresponding to each audio frame is fitted by Lagrange interpolation, realizing time alignment from image to audio.
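Lagrange interpolation over a short control-point track can be sketched as follows; applying it per coordinate of each control point at the audio-frame timestamps is an assumption about how the alignment would be wired up:

```python
def lagrange_interp(xs, ys, x):
    """Evaluate the Lagrange interpolating polynomial through the
    points (xs[i], ys[i]) at position x.

    For time alignment, xs would be video-frame timestamps, ys one
    coordinate of a lip control point, and x an audio-frame timestamp.
    """
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)  # Lagrange basis factor
        total += term
    return total
```

Because the polynomial degree grows with the number of nodes, this is only suitable for the short tracks the claim describes.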
3. The multi-modal knowledge-graph based speech adaptive completion system of claim 1, wherein said data analyzer comprises: a voice analysis module, a spatio-temporal image analysis module and a multi-modal information fusion module, wherein: the voice analysis module extracts the historical text, waveform features and waveform time-sequence features from the preprocessed waveforms, and outputs them as voice modal data to the multi-modal data aggregation module; the spatio-temporal image analysis module constructs a spatio-temporal graph for the lip control point set of each preprocessed frame, builds a spatio-temporal graph convolutional neural network, extracts the lip motion features of each frame from the preceding and following context of that frame in the spatio-temporal graph, combines the lip motion features into lip track features, and inputs them as video modal data to the multi-modal data aggregation module; the multi-modal fusion module aligns the waveform time-sequence features and the lip track features through cross-modal interaction, trains a cross-modal conversion model, takes the hidden state features arising during the mutual conversion of lip track features and waveform time-sequence features as the joint characterization of the two modalities, converts the joint characterization into phoneme information through a trained phoneme prediction model so as to enhance the capability of the lip feature modality to characterize phoneme information, performs phoneme recognition on the voice packet loss region based on the lip track features, and splices the resulting phoneme sequence as the input of the semantic text reasoning module;
the lip motion features are extracted as follows: a spatio-temporal graph convolutional neural network is constructed; for the current frame, the input of the spatial graph convolution is represented by a 3-dimensional matrix (C, T, V), where C is the feature dimension of the lip control points (the control point coordinates serve as features), T covers the current frame and the previous T-1 frames, and V is the number of lip control points; spatially, a graph partitioning strategy decomposes the graph G of each frame into three subgraphs G1, G2 and G3, which respectively represent the centripetal, centrifugal and static motion of the control points: in G1 each control point is connected to the neighbor control points closer to the lip center than itself, in G2 each control point is connected to the neighbor control points farther from the lip center than itself, and in G3 each control point is connected to itself; the graph convolution therefore uses convolution kernels of size (1, V, V) with a kernel number of 3, obtaining the local features of adjacent control points by weighted averaging; temporally, a temporal convolutional neural network superimposes time-sequence features on the spatial features of the current frame, using convolution kernels of size (T, 1) to fuse the features of the current frame and the previous T-1 frames of each lip control point, obtaining the local features of each control point as it changes over time; lip motion features are thus extracted by the spatial and temporal convolutions, the output of each frame being (1, V, N2), where N2 is the number of features extracted for each control point; the lip motion features of the frames are spliced and output as (T, V, C2) lip track features;
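The three-subgraph partition (centripetal G1, centrifugal G2, static G3) can be sketched as adjacency matrices; the function name, the edge-list input and the distance-to-center criterion are illustrative assumptions consistent with the partition strategy described above:

```python
import numpy as np

def partition_adjacency(points: np.ndarray, edges, center: np.ndarray):
    """Split a lip-control-point graph into three adjacency matrices.

    G1: directed edges to neighbours nearer the lip centre (centripetal),
    G2: directed edges to neighbours farther from the centre (centrifugal),
    G3: self-loops (static motion).
    points: (V, 2) control-point coordinates; edges: undirected pairs.
    """
    V = len(points)
    g1, g2, g3 = np.zeros((V, V)), np.zeros((V, V)), np.eye(V)
    dist = np.linalg.norm(points - center, axis=1)
    for i, j in edges:
        for a, b in ((i, j), (j, i)):
            if dist[b] < dist[a]:
                g1[a, b] = 1  # neighbour b is nearer the centre
            elif dist[b] > dist[a]:
                g2[a, b] = 1  # neighbour b is farther from the centre
    return g1, g2, g3
```

Each subgraph would then get its own spatial-convolution weights, matching the "kernel number of 3" in the claim.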
the data aggregation refers to: defining ontology types, attributes and relations, including domain, text word, phoneme, waveform feature, waveform time-sequence feature and lip track feature; the historical text of the voice modality and the waveform features, waveform time-sequence features and lip track features of the video modality are input as distinct entities, which are aggregated, stored and associated based on the multi-modal knowledge graph; knowledge is continuously expanded while the system runs, providing support for the enhancement and verification of text reasoning in subsequent modules; in addition, after the data are fully compiled, the waveform time-sequence features and lip track features serve as input to the multi-modal fusion module, the historical text as input to the semantic text reasoning module, and the waveform features as input to the voice completion module;
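A minimal sketch of such a typed entity-relation store follows; the class and method names are illustrative assumptions, showing only the aggregate-store-associate pattern the claim describes, not a full knowledge-graph backend:

```python
class MultiModalKG:
    """Toy multi-modal knowledge graph: typed entities plus relations
    that associate data across modalities."""

    def __init__(self):
        self.entities = {}   # entity id -> (ontology type, payload)
        self.relations = []  # (head id, relation name, tail id)

    def add_entity(self, eid, etype, payload=None):
        self.entities[eid] = (etype, payload)

    def relate(self, head, rel, tail):
        self.relations.append((head, rel, tail))

    def neighbours(self, eid, rel=None):
        """Entities associated with eid, optionally filtered by relation."""
        return [t for h, r, t in self.relations
                if h == eid and (rel is None or r == rel)]
```

New entities and relations can be added while the system runs, mirroring the continuous knowledge expansion described above.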
the joint characterization, namely the Seq2Seq-based multi-modal joint characterization, specifically means: the cross-modal interaction is based on a Seq2Seq model, in which the cross-modal conversion model uses a BiLSTM as both encoder and decoder; by training on the translation from lip track features to waveform time-sequence features and the reverse translation from waveform time-sequence features to lip track features, a joint characterization of the two modalities is obtained.
4. The multi-modal knowledge-graph based speech adaptive completion system of claim 1, wherein said data reasoner comprises: a semantic text reasoning module and a voice completion module, wherein: for the voice packet loss region, the semantic text reasoning module identifies the relevant knowledge domain from the historical text of the current conversation, predicts candidate texts based on the spatio-temporal knowledge graph so as to prune the solution space and optimize text reasoning, and matches the recognized phonemes against the texts in the solution space, thereby inferring and generating the completed text and outputting it to the voice completion module; the voice completion module synthesizes the missing voice from the completed text and the collected waveform features of the user's voice, and fills the completed voice into the original voice through voice splicing to form a complete and natural voice segment.
5. The multi-modal knowledge-graph based speech adaptive completion system according to claim 3, wherein the enhancement of the lip feature modality is: a phoneme inference model with the same structure as the cross-modal conversion model receives the joint characterization and outputs a (T, |A|) time-sequence phoneme posterior probability matrix y = (y1, y2, ..., yt, ..., yT), wherein: |A| is the size of the phoneme set A to be recognized, and each row yt = (yt1, yt2, ..., yta, ..., ytA) of y represents the probability that the t-th frame is each phoneme a;
the phoneme inference model is trained as follows: a CTC transcription layer is connected after the BiLSTM of the phoneme prediction model decoder, with the goal of raising the probability p(L|x) that the BiLSTM outputs the correct result L given input x; since one phoneme in L is composed of the prediction results of multiple time slices in y, there may be multiple paths π that form L, i.e. B(π) = L, where B is the mapping function; the CTC transcription layer then adjusts the LSTM parameters ω through the gradient ∂p(L|x)/∂ω so that, for the input samples, the paths π ∈ B⁻¹(L) maximize p(L|x);
the phoneme recognition is: for the voice packet loss region, the lip track features extracted from the video are input into the phoneme inference model, and each frame yields a phoneme inference vector of size |A|, in which each value P(a) represents the probability that the frame is phoneme a.
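The CTC mapping B referred to above (collapse repeats, drop blanks) can be sketched directly; the blank symbol "-" and single-character phonemes are illustrative simplifications:

```python
def ctc_collapse(path: str, blank: str = "-") -> str:
    """The CTC mapping B: merge consecutive repeats, then remove blanks.

    Many frame-level paths pi map to the same label string L, i.e.
    B(pi) = L; CTC training sums p(pi|x) over all such paths to raise
    p(L|x).
    """
    out = []
    prev = None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return "".join(out)
```

Greedy decoding would take the argmax phoneme per frame from the posterior matrix y and apply this collapse to the resulting path.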
6. The multi-modal knowledge-graph based speech adaptive completion system according to claim 1, wherein the domain session modeling is: inferring the knowledge domain of the semantic context, such as the financial industry, travel activities or everyday chat, from the historical text; the knowledge domain related to the historical text is inferred mainly by defining a discrimination measure key for domain keywords, combined with a time-sequence relevance measure EMI(e, w) between text entities that represents the likelihood of entity w appearing after entity e within a certain text step; the EMI(e, w) values greater than a certain threshold are spliced into a domain text vector as output, thereby realizing the modeling of the domain session; first, when the domain session model is trained, a domain entity is generated for each domain from the session samples of the different domains, and the domain entities and text entities are associated many-to-many in the multi-modal knowledge graph; then, to generate discriminative domain keywords, the frequency f_ij of each text word j in the domain text set i is calculated, the maximum frequency max_f is counted, and the number N_j of the N domains in which word j occurs is computed; the discrimination of text word j for domain i is then obtained by the formula key_ij = (f_ij / max_f) · log(N / N_j); compared with the TF-IDF formula, this calculation both accounts for the length of the domain text set by converting word counts into frequencies, and keeps the originally calculated TF part constantly non-negative through the normalization by max_f, realizing the discrimination measure of domain keywords, so that the relevant knowledge domain can be identified by searching the historical text for keywords; secondly, within each knowledge domain, the mutual information between entities is obtained by counting how often two text entities occur successively within a certain text step, EMI(entity1, entity2) = log( P(entity1, entity2) / (P(entity1) · P(entity2)) ), representing the correlation of entity2 following entity1 in the text sequence: when EMI is greater than zero, a larger value means a higher probability of co-occurrence, and when EMI is less than zero the two entities are mutually exclusive, thereby defining the relevance measure between text entities within each domain; finally, while the system is in use, new text entities are expanded and associated to the domain entities according to the real conversation data, and if a conversation repeatedly jumps between several domains, the domains are split and regenerated based on an unsupervised clustering method.
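The two measures can be sketched as follows; the exact formulas are rendered as images in the source, so these forms (a TF-IDF-like keyword discrimination and a pointwise-mutual-information-style EMI) are reconstructions based on the surrounding description, not verbatim from the patent:

```python
import math

def keyword_discrimination(f_ij: float, max_f: float, N: int, N_j: int) -> float:
    """Discrimination of word j for domain i (assumed reconstruction):
    normalised in-domain frequency times a cross-domain rarity term.
    f_ij / max_f is non-negative and in [0, 1]; log(N / N_j) is the
    IDF-like factor over the N domains."""
    return (f_ij / max_f) * math.log(N / N_j)

def emi(p_joint: float, p_e: float, p_w: float) -> float:
    """PMI-style relevance of entity w following entity e (assumed
    reconstruction): positive when they co-occur more than chance,
    negative when they are mutually exclusive."""
    return math.log(p_joint / (p_e * p_w))
```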
7. The multi-modal knowledge-graph based speech adaptive completion system according to claim 1, wherein the candidate text prediction, namely candidate text prediction based on the spatio-temporal knowledge graph, specifically comprises: forming a spatio-temporal knowledge graph network from the time-sequence characteristics of the historical text, inferring the joint probability characterization P(w) of the current candidate texts, and splicing the P(w) values greater than a certain threshold into a candidate text vector, thereby pruning the solution space; specifically: the accumulated historical text can be regarded as a path walking among the knowledge graph entities, represented by a combination of binary vectors (from, to) of several text entities; because audio and video calls are real-time, the system cannot accurately predict the future at the current moment, and even when phoneme recognition is correct it cannot guarantee predicting the semantically correct text entity, so several alternative paths with smaller weights are retained at the entity nodes before the path-end entity node, for semantic backtracking when an inference error occurs, forming the path walk graph G_t at the current moment t; on the basis of the multi-modal knowledge graph, G_t is superimposed in the time dimension with the path walk graphs of all past moments to form the spatio-temporal knowledge graph G, where the space refers to the solution space; to improve reasoning efficiency, one assumption is made: the path walk graph at moment t depends only on the path walk graphs of the previous s time steps, and the spatio-temporal knowledge graph network is trained to optimize the inferred joint probability distribution P(G) = ∏_t P(G_t | G_{t-s:t-1}), wherein P(G_t | G_{t-s:t-1}) can be split into ∏ P(to_t | from_t, G_{t-s:t-1}) · P(from_t | G_{t-s:t-1}); further,
representing all the neighbor entity nodes of from_t has two benefits: it covers the candidate text space, and it provides the capability of frequent pattern mining; for this formula, a text-oriented spatio-temporal graph neural network is built on the recurrent neural network RNN, and the formula is parameterized as follows:
P(to_t | from_t, G_{t-s:t-1}) → exp([e_from_t : h_{t-1}(from_t)]^T · ω_to_t), where e_from_t is a learnable vector associated with from_t, h_{t-1}(from_t) is the historical semantic vector of from_t, and ω_to_t are the classifier parameters; similarly, P(from_t | G_{t-s:t-1}) → exp(H_{t-1}^T · ω_from_t), where H_{t-1} is the historical semantic vector of the full walk path; on this basis, h_t(from_t) = RNN_1(g(G_t), h_{t-1}(from_t)) and H_t = RNN_2(g(G_t), H_{t-1}), i.e. the historical semantic vectors are recursively updated through the RNN, g is an aggregation function with an attention mechanism, and the importance weights of the neighbor entity nodes with respect to the from entity node are learned through an attention matrix.
8. The multi-modal knowledge-graph based speech adaptive completion system according to claim 4, wherein said matching of the recognized phonemes with the texts in the solution space is: according to the phonemes corresponding to the texts, a Cartesian product of the phoneme inference vector, the domain text vector and the candidate text vector is taken to obtain the text solution space intersected on phonemes; the probability of each text in the solution space is calculated by the formula P(a) · P(w) · EMI(e, w), the text with the maximum value is taken as the completed text for voice completion, and the texts ranked in the top three for the same phoneme are retained as alternative texts, which together form the alternative paths with smaller weights, serving as the candidates for semantic backtracking that may occur during the next text inference.
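The phoneme-matched scoring can be sketched as follows; the dict-based inputs, function name, and folding of P(w) · EMI(e, w) into a single per-word score are illustrative assumptions about the data layout, not the patent's representation:

```python
def best_completion(phoneme_probs, word_scores, phoneme_of):
    """Pick the completion text by P(a) * score(w), where score(w)
    stands for the P(w) * EMI(e, w) product of the claim.

    phoneme_probs: phoneme -> P(a) from the phoneme inference vector
    word_scores:   candidate word -> combined P(w) * EMI(e, w) score
    phoneme_of:    candidate word -> its phoneme
    Only words whose phoneme was actually recognised are scored.
    """
    scored = sorted(
        ((phoneme_probs[phoneme_of[w]] * s, w)
         for w, s in word_scores.items()
         if phoneme_of.get(w) in phoneme_probs),
        reverse=True)
    best = scored[0][1] if scored else None
    alternates = [w for _, w in scored[1:4]]  # runners-up kept for backtracking
    return best, alternates
```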
9. The multi-modal knowledge-graph based speech adaptive completion system according to claim 1, wherein the semantic sentences are obtained as follows: a threshold T and a confidence coefficient α are set; the probabilities of the T text entities at the end of each alternative path are summed, the path with the maximum sum is spliced to generate the sentence, and at the same time the paths whose sum is smaller than α·T are contracted so that the graph only contains paths whose sum is larger than α·T, thereby further pruning the path walk graph at the current moment and optimizing the efficiency of the system's continuous reasoning.
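The α·T pruning step can be sketched as follows; representing each alternative path as a list of its tail-entity probabilities is an illustrative assumption:

```python
def prune_paths(paths, T: int, alpha: float):
    """Keep only walk paths whose last T tail-entity probabilities sum
    above the confidence bar alpha * T; return the best path id and
    the surviving paths.

    paths: path id -> list of per-entity probabilities along the path.
    """
    bar = alpha * T
    kept = {pid: probs for pid, probs in paths.items()
            if sum(probs[-T:]) > bar}
    best = (max(kept, key=lambda pid: sum(kept[pid][-T:]))
            if kept else None)
    return best, kept
```

The best path's entities would then be spliced into the output sentence, while the contracted paths drop out of future reasoning.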
10. The multi-modal knowledge-graph based speech adaptive completion system according to claim 4, wherein the synthesis of the missing speech is: converting the timestamped text sequence into a timestamped phoneme sequence, learning the mapping from phonemes to waveform features from the waveform features of the user's voice through a TTS model, and converting the features back into waveforms through a vocoder;
the voice splicing means: the existing voice waveform and the completed voice waveform are stretch-fitted near the junction point and then spliced, so that the voice transition is smoother and more natural.
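A linear crossfade at the junction is one simple way to realize the smooth splice described above; this sketch omits the stretch-fitting step, and the function name and overlap length are illustrative assumptions:

```python
import numpy as np

def crossfade_splice(a: np.ndarray, b: np.ndarray, overlap: int) -> np.ndarray:
    """Splice waveform a (existing voice) and waveform b (completed
    voice) with a linear crossfade over `overlap` samples, so the
    junction has no audible discontinuity."""
    fade = np.linspace(0.0, 1.0, overlap)
    mixed = a[-overlap:] * (1 - fade) + b[:overlap] * fade
    return np.concatenate([a[:-overlap], mixed, b[overlap:]])
```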
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111207821.9A CN113936637A (en) | 2021-10-18 | 2021-10-18 | Voice self-adaptive completion system based on multi-mode knowledge graph |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113936637A true CN113936637A (en) | 2022-01-14 |
Family
ID=79280039
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114666449A (en) * | 2022-03-29 | 2022-06-24 | 深圳市银服通企业管理咨询有限公司 | Voice data processing method of calling system and calling system |
CN114817456A (en) * | 2022-03-10 | 2022-07-29 | 马上消费金融股份有限公司 | Keyword detection method and device, computer equipment and storage medium |
CN114880484A (en) * | 2022-05-11 | 2022-08-09 | 军事科学院系统工程研究院网络信息研究所 | Satellite communication frequency-orbit resource map construction method based on vector mapping |
CN115860152A (en) * | 2023-02-20 | 2023-03-28 | 南京星耀智能科技有限公司 | Cross-modal joint learning method oriented to character military knowledge discovery |
CN115953521A (en) * | 2023-03-14 | 2023-04-11 | 世优(北京)科技有限公司 | Remote digital human rendering method, device and system |
CN116071740A (en) * | 2023-03-06 | 2023-05-05 | 深圳前海环融联易信息科技服务有限公司 | Invoice identification method, computer equipment and storage medium |
CN116580701A (en) * | 2023-05-19 | 2023-08-11 | 国网物资有限公司 | Alarm audio frequency identification method, device, electronic equipment and computer medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||