CN113936637A - Voice self-adaptive completion system based on multi-mode knowledge graph - Google Patents

Voice self-adaptive completion system based on multi-mode knowledge graph

Info

Publication number: CN113936637A
Application number: CN202111207821.9A
Authority: CN (China)
Prior art keywords: voice, text, lip, data, phoneme
Legal status: Pending (assumed status; not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 蔡鸿明, 李琥, 于晗, 姜丽红
Current and original assignee: Shanghai Jiaotong University
Application filed by Shanghai Jiaotong University
Priority to CN202111207821.9A

Classifications

    • G10L13/02 — Speech synthesis; methods for producing synthetic speech; speech synthesisers
    • G10L13/06 — Elementary speech units used in speech synthesisers; concatenation rules
    • G10L13/08 — Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme-to-phoneme translation, prosody generation or stress or intonation determination
    • G06F16/367 — Information retrieval; creation of semantic tools; ontology
    • G06F18/253 — Pattern recognition; fusion techniques of extracted features
    • G06F40/35 — Handling natural language data; discourse or dialogue representation
    • G06N3/044 — Neural networks; recurrent networks, e.g. Hopfield networks
    • G06N3/045 — Neural networks; combinations of networks
    • G06N3/08 — Neural networks; learning methods
    • G06N5/04 — Knowledge-based models; inference or reasoning models


Abstract

A speech adaptive completion system based on a multi-modal knowledge graph comprises a data receiver, a data analyzer, and a data reasoner. The data receiver preprocesses the received audio and video data and outputs it to the data analyzer. The data analyzer analyzes the speech and images to extract waveform timing features and lip-trajectory features, and obtains a phoneme sequence through multi-modal joint characterization. The data reasoner performs domain-session modeling and candidate-text prediction from the historical text, combines the phoneme sequence for text inference to obtain semantically coherent sentences, and synthesizes the complete speech from the waveform features. A phoneme inference model performs phoneme recognition when the speech modality is missing; at the same time, domain-session modeling is performed on the historical text generated from the existing speech according to the semantic relations among the entities in the multi-modal knowledge graph, so that semantically coherent text is inferred and generated, and speech is synthesized in combination with the waveform features of the user's voice to form the completed audio.

Description

Voice self-adaptive completion system based on multi-mode knowledge graph
Technical Field
The invention relates to a technology in the field of speech processing, in particular to a speech adaptive completion system for mobile terminals based on a multi-modal knowledge graph.
Background
Real-time audio and video technology is widely used for real-time video chat, video conferencing, remote education, smart homes and the like. In actual use, however, data packets may arrive out of order or be lost during network transmission, and the resulting call jitter greatly reduces call quality; the receiving end therefore typically reconstructs audio and video data with a packet-loss repair system to fill the audio gaps caused by packet loss or network delay. Audio completion in mobile-terminal audio and video calls still faces the following problems. First, audio generation is dominated by deep-learning methods whose inference process is opaque, so the methods have low interpretability and are hard to design or tune for a given scene. Second, current techniques mainly use single-modality data as the basis for model inference and ignore a mobile terminal's ability to perceive information in multiple modalities, so the system's perception of data and information is incomplete, which creates cognitive limitations.
Disclosure of Invention
To address the defects in the prior art, the invention provides a speech adaptive completion system based on a multi-modal knowledge graph. A phoneme inference model performs phoneme recognition when the speech modality is missing; at the same time, domain-session modeling is performed on the historical text generated from the existing speech according to the semantic relations among the entities in the multi-modal knowledge graph, so that semantically coherent text is inferred and generated, and speech is synthesized in combination with the waveform features of the user's voice to form the completed audio.
The invention is realized by the following technical scheme:
the invention relates to a voice self-adaptive completion system based on a multi-mode knowledge graph, which comprises the following steps: a data receiver, a data analyzer, and a data reasoner, wherein: the data receiver performs preprocessing according to the received audio and video data and outputs the preprocessed audio and video data to the data analyzer; the data analyzer analyzes the voice and the image to extract a waveform time sequence characteristic and a lip track characteristic, and a phoneme sequence is obtained through multi-modal combined characterization; and the data inference device carries out domain conversation modeling and candidate text prediction according to the historical text, carries out text inference by combining the phoneme sequence to obtain sentences with semantics, and synthesizes complete voice according to the waveform characteristics.
The system is further provided with a multi-modal data aggregation module, which stores and associates the results of the data receiver and the data analyzer and provides data support for the data analyzer and the data reasoner.
The system is further provided with a model management module, which provides model invocation and updating for the data receiver, the data analyzer and the data reasoner.
The data receiver comprises a data receiving module, a voice preprocessing module and a video preprocessing module, wherein: the data receiving module receives and parses the application's audio and video data packets, and outputs the voice packets to the voice preprocessing module and the video packets to the video preprocessing module; the voice preprocessing module collects and preprocesses the voice packets, taking as input the low-quality real-time audio affected by packet loss, performs initial processing of the voice-modality data through voice packet detection, voice framing, audio windowing and endpoint detection, and outputs the preprocessed waveform to the voice analysis module; the video preprocessing module collects and preprocesses the video packets, taking continuous video images as input, performs initial processing of the video-modality data through video framing, face lip control-point detection, lip-region scale normalization and time alignment in sequence, and outputs the preprocessed images to the image analysis module.
The data analyzer comprises a voice analysis module, a spatio-temporal image analysis module and a multi-modal information fusion module, wherein: the voice analysis module extracts the historical text, waveform features and waveform timing features from the preprocessed waveform, and outputs them as voice-modality data to the multi-modal data aggregation module; the spatio-temporal image analysis module constructs a spatio-temporal graph for each preprocessed frame's lip control-point set, builds a spatio-temporal graph convolutional neural network, extracts the lip motion features of each frame from the preceding and following information of that frame in the spatio-temporal graph, combines the lip motion features into lip-trajectory features, and inputs them as video-modality data to the multi-modal data aggregation module; the multi-modal fusion module aligns the waveform timing features and the lip-trajectory features through cross-modal interaction, trains a cross-modal conversion model, uses the hidden-state features produced while translating between lip-trajectory features and waveform timing features as the joint representation of the two modalities, converts the joint-representation information into phoneme information by training a phoneme prediction model so as to enhance the ability of the lip-feature modality to represent phoneme information, performs phoneme recognition on the voice packet-loss region based on the lip-trajectory features, and concatenates the phoneme sequence as the input of the semantic text reasoning module.
The data reasoner comprises a semantic text reasoning module and a voice completion module, wherein: for the voice packet-loss region, the semantic text reasoning module identifies the relevant knowledge domain from the historical text of the current conversation, predicts candidate texts based on the spatio-temporal knowledge graph so as to prune and optimize the size of the solution space for text inference, matches the recognized phonemes with the texts in the solution space, and thereby infers and generates the completed text, which it outputs to the voice completion module; the voice completion module synthesizes the missing speech from the completed text and the collected waveform features of the user's voice, and fills the completed speech into the original speech through voice splicing to form a complete and natural speech segment.
The voice packet detection obtains the activity state of the voice data, which comprises a voice-occurrence region, a voice-silence region, and a voice packet-loss region. Distinguishing whether the voice is silent helps reduce unnecessary recognition and completion, and identifying the packet-loss region marks the part that will later be completed. Voice packet detection performs a first classification of voice activity: a voice region is labeled TRUE or NONE according to whether the voice data packet at the current moment has been received; regions labeled NONE are completed by the semantic text reasoning module and the voice completion module, while regions labeled TRUE are further divided into voice-occurrence and voice-silence regions by endpoint detection.
The voice framing exploits the short-time stationarity of the speech signal: the time-varying nature of speech makes the whole signal non-stationary, while the MFCC feature extraction in the voice analysis module uses the Fourier transform and requires a stationary input, so the signal is split into frames. The framing process adopts an overlapped-segmentation strategy and samples each frame according to a preset frame length and overlap ratio (frame shift), so that one frame transitions smoothly into the next and sample continuity is preserved.
The audio windowing uses a Hamming window: the voice data in each frame is multiplied by the window function, which emphasizes the data in the middle and attenuates the data at the two sides, effectively mitigating spectral leakage and thus supporting the Fourier transform.
The endpoint detection computes the short-time energy and short-time average zero-crossing rate of each frame and classifies voice-occurrence and voice-silence regions in real time by a dual-threshold comparison, labeling them TRUE or FALSE, so that the start and end points of the valid voice region are located within the overall voice activity and the influence of silent and noisy parts is avoided.
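As a concrete illustration of the preprocessing chain described above, the sketch below performs overlapped framing, Hamming windowing, and dual-threshold endpoint detection from short-time energy and zero-crossing rate. It is a minimal sketch, not the patented implementation; the frame length, frame shift and thresholds are assumed values for 16 kHz audio.

```python
import numpy as np

def frame_and_window(x, frame_len=400, frame_shift=160):
    """Overlapped framing (25 ms frames, 10 ms shift at 16 kHz) followed by Hamming windowing.
    Assumes len(x) >= frame_len."""
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    frames = np.stack([x[i * frame_shift: i * frame_shift + frame_len] for i in range(n_frames)])
    return frames * np.hamming(frame_len)              # emphasize the middle of each frame

def endpoint_labels(frames, e_hi=0.1, e_lo=0.02, zcr_lo=0.1):
    """Dual-threshold classification of each frame as speech (TRUE) or silence (FALSE)."""
    energy = (frames ** 2).mean(axis=1)                                 # short-time energy
    zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)   # short-time zero-crossing rate
    labels = []
    for e, z in zip(energy, zcr):
        if e > e_hi:                        # clearly voiced frame
            labels.append(True)
        elif e > e_lo or z > zcr_lo:        # weak speech or unvoiced consonant
            labels.append(True)
        else:
            labels.append(False)            # silence
    return labels
```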
The video framing adopts the same sampling frequency as the voice framing to convert the video into an image sequence.
The face lip control-point detection detects the control points of the human lips in each frame of the image, one frame at a time, through an external face-recognition engine; the control points include the lip center coordinates, the coordinates of the upper boundary of the upper lip, the coordinates of the left mouth corner, and so on.
The lip-region scale normalization is needed because the subsequent image analysis module only focuses on the relative movement of the lip control points, so the influence of lip size, face deflection angle and tilt angle in the image must be reduced. A quadrilateral lip detection box is fitted from the left and right mouth corners, the lip center coordinates and the upper and lower lip boundary coordinates, and the lip control points are rotated and scaled to a uniform size by perspective transformation, normalizing the scale of the lip region while preserving the continuity of the control-point motion trajectory.
The time alignment makes each audio frame correspond to one image frame, to facilitate cross-modal interaction between the voice modality and the video modality. Since a short-time lip control-point trajectory can be approximated by a simple curve, the lip control-point set corresponding to each audio frame is fitted by Lagrange interpolation, achieving time alignment from image to audio.
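The sketch below illustrates the time-alignment step with Lagrange interpolation: a short lip control-point trajectory is fitted per coordinate and resampled at the audio frame times. The number of control points and the frame rates are assumptions for the example.

```python
import numpy as np
from scipy.interpolate import lagrange

def align_lip_points_to_audio(video_times, lip_points, audio_times):
    """lip_points: (n_video_frames, n_points, 2) control-point coordinates.
    Returns one interpolated control-point set per audio frame time."""
    n_frames, n_points, _ = lip_points.shape
    aligned = np.zeros((len(audio_times), n_points, 2))
    for p in range(n_points):
        for c in range(2):                                   # x and y coordinates
            # short trajectories only: Lagrange polynomials oscillate for many nodes
            poly = lagrange(video_times, lip_points[:, p, c])
            aligned[:, p, c] = poly(audio_times)
    return aligned

# usage: 5 video frames at 25 fps aligned to audio frames at 100 fps (assumed rates, 20 assumed control points)
video_t = np.arange(5) / 25.0
audio_t = np.arange(20) / 100.0
points = np.random.rand(5, 20, 2)
print(align_lip_points_to_audio(video_t, points, audio_t).shape)   # (20, 20, 2)
```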
The lip control-point spatio-temporal graph is constructed as follows: the lip control points of all input frames are collected, and within each frame the control points are connected according to the natural connection relations of human lip control points, with each control point also forming a self-loop, which yields the spatial graph of that frame; the same control point in two adjacent frames is then connected to form a temporal edge, which represents the lip motion-trajectory information between the two moments. In this way the spatial information and the temporal information of the lip control points are modeled simultaneously.
The lip motion features are extracted as follows. A spatio-temporal graph convolutional neural network is constructed; for the current frame, the input of the spatial graph convolution is a 3-dimensional matrix (C, T, V), where C is the feature dimension of the lip control points (the control-point coordinates are used as features), T covers the current frame and the previous T-1 frames, and V is the number of lip control points. Spatially, a graph-partition strategy decomposes the graph G of each frame into three subgraphs G1, G2 and G3, representing the centripetal, centrifugal and static motion characteristics of a control point respectively: in G1 each control point is connected to the neighbouring control points closer to the lip center than itself, in G2 to the neighbouring control points farther from the lip center than itself, and in G3 to itself. Accordingly, three convolution kernels of size (1, V, V) are used, and the local features of adjacent control points are obtained by weighted averaging. Temporally, to superimpose timing features on the spatial features of the current frame, a temporal convolutional network with a kernel of size (T, 1) fuses the features of the current frame and the previous T-1 frames of each lip control point, giving the local feature of how each control point changes over time. Using the spatial and temporal convolutions, the lip motion features are extracted; the output of each frame is (1, V, N2), where N2 is the number of features extracted per control point, and the per-frame lip motion features are concatenated and output as a lip-trajectory feature of size (T, V, C2), with C2 = N2.
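A minimal sketch of one spatio-temporal graph-convolution block over lip control points, following the (C, T, V) layout and the three-partition idea described above. The identity adjacency matrices and dimensions are placeholders; the real centripetal/centrifugal/static partitions would be built from the lip geometry, so this is an illustration rather than the network of the invention.

```python
import torch
import torch.nn as nn

class LipSTGCNBlock(nn.Module):
    """Spatial graph convolution over 3 lip-graph partitions, then a temporal convolution."""
    def __init__(self, in_ch, out_ch, num_points, t_kernel=9):
        super().__init__()
        # A: (3, V, V) adjacency for the centripetal / centrifugal / static partitions.
        # Identity is used here only as a placeholder for the real lip-geometry partitions.
        self.A = nn.Parameter(torch.eye(num_points).repeat(3, 1, 1), requires_grad=False)
        self.spatial = nn.Conv2d(in_ch, out_ch * 3, kernel_size=1)       # one weight set per partition
        self.temporal = nn.Conv2d(out_ch, out_ch, kernel_size=(t_kernel, 1),
                                  padding=(t_kernel // 2, 0))            # fuse current and previous frames

    def forward(self, x):                       # x: (batch, C, T, V) control-point coordinates
        n, _, t, v = x.shape
        y = self.spatial(x).view(n, 3, -1, t, v)
        y = torch.einsum('nkctv,kvw->nctw', y, self.A)   # weighted aggregation of neighbours per partition
        return self.temporal(y)                 # (batch, out_ch, T, V) lip motion / trajectory features

feat = LipSTGCNBlock(in_ch=2, out_ch=16, num_points=20)(torch.randn(1, 2, 30, 20))
print(feat.shape)                               # torch.Size([1, 16, 30, 20])
```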
The data aggregation means: ontology types, attributes and relations are defined, the ontology types including domain, text word, phoneme, waveform feature, waveform timing feature, lip-trajectory feature and so on; the input historical text of the voice modality and the waveform features, waveform timing features and lip-trajectory features of the video modality are treated as different entities, which are aggregated, stored and associated based on the multi-modal knowledge graph; knowledge is continuously expanded while the system runs, providing support for enhancing and verifying the text inference of the subsequent modules. In addition, after the data is aggregated, the waveform timing features and the lip-trajectory features serve as the input of the multi-modal fusion module, the historical text serves as the input of the semantic text reasoning module, and the waveform features serve as the input of the voice completion module.
The joint characterization, i.e. the Seq2Seq-based multi-modal joint characterization, specifically means: the cross-modal interaction is based on a Seq2Seq model in which the cross-modal conversion model uses a BiLSTM as encoder and decoder, and the joint representation of the two modalities is obtained by training the translation from lip-trajectory features to waveform timing features and the reverse translation from waveform timing features to lip-trajectory features.
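The sketch below shows the shape of such a cross-modal Seq2Seq translator with a BiLSTM encoder and decoder; the encoder output plays the role of the joint representation passed on to the phoneme prediction model. Feature dimensions are illustrative assumptions, and training (in both translation directions, as described above) is omitted.

```python
import torch
import torch.nn as nn

class CrossModalTranslator(nn.Module):
    """Translate lip-trajectory features into waveform timing features (the reverse direction is symmetric);
    the encoder hidden states serve as the joint cross-modal representation."""
    def __init__(self, lip_dim=64, wave_dim=128, hidden=128):
        super().__init__()
        self.encoder = nn.LSTM(lip_dim, hidden, bidirectional=True, batch_first=True)
        self.decoder = nn.LSTM(2 * hidden, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, wave_dim)

    def forward(self, lip_seq):                   # lip_seq: (batch, T, lip_dim)
        joint, _ = self.encoder(lip_seq)          # (batch, T, 2*hidden) joint representation
        dec, _ = self.decoder(joint)
        return self.out(dec), joint               # predicted waveform timing features + joint features

wave_pred, joint = CrossModalTranslator()(torch.randn(2, 30, 64))
print(wave_pred.shape, joint.shape)               # (2, 30, 128) (2, 30, 256)
```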
The enhancement of the lip-feature modality works as follows: a phoneme inference model with the same structure as the cross-modal conversion model receives the joint representation and outputs a temporal phoneme posterior probability matrix y = (y1, y2, ..., yT) of size (T, |A|), where |A| is the size of the phoneme set A to be recognized and each row of y, (yt1, yt2, ..., ytA), gives the probability that the t-th frame is phoneme a.
The phonemes to be recognized comprise all known phonemes plus BLANK, denoted "-"; BLANK distinguishes adjacent phonemes with the same pronunciation when the LSTM output is converted into a phoneme sequence.
The phoneme inference model is trained as follows: a CTC transcription layer is attached after the BiLSTM of the phoneme prediction model's decoder in order to increase the probability p(L|x) that the BiLSTM outputs the correct result L given an input x. Since one phoneme in L is composed of the prediction results of multiple time slices in y, there may be multiple paths π that map to L, i.e. B(π) = L, where B is the mapping function; then

p(L|x) = Σ_{π ∈ B⁻¹(L)} p(π|x)

The CTC transcription layer adjusts the parameters ω of the LSTM along the gradient ∂p(L|x)/∂ω so that, for input samples with π ∈ B⁻¹(L), p(L|x) is maximized.
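A minimal sketch of attaching a CTC transcription layer to the phoneme-inference BiLSTM using the standard CTC loss (blank index 0); minimizing this loss maximizes p(L|x) over all paths π with B(π) = L. The phoneme-set size, sequence lengths and feature dimension are assumed values.

```python
import torch
import torch.nn as nn

A = 40                                      # assumed phoneme-set size; index 0 is BLANK ("-")
bilstm = nn.LSTM(input_size=256, hidden_size=128, bidirectional=True, batch_first=True)
classifier = nn.Linear(256, A)
ctc = nn.CTCLoss(blank=0)                   # the transcription layer

joint_repr = torch.randn(2, 30, 256)        # joint characterization for a batch of 2, T = 30
h, _ = bilstm(joint_repr)
log_probs = classifier(h).log_softmax(dim=-1)        # y: (batch, T, |A|) posterior probability matrix

targets = torch.randint(1, A, (2, 12))               # reference phoneme sequences L
input_lens = torch.full((2,), 30, dtype=torch.long)
target_lens = torch.full((2,), 12, dtype=torch.long)

loss = ctc(log_probs.transpose(0, 1), targets, input_lens, target_lens)  # CTCLoss expects (T, batch, |A|)
loss.backward()                             # gradients of -log p(L|x) flow back into the BiLSTM parameters ω
```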
The phoneme recognition works as follows: for the voice packet-loss region, the lip-trajectory features extracted from the video are input into the phoneme inference model, and each frame yields a phoneme inference vector of size |A| whose entries P(a) give the probability that the frame corresponds to phoneme a.
The domain-session modeling means: the knowledge domain of the semantic context, such as the finance industry, travel activities or everyday chat, is inferred from the historical text, mainly by defining a discrimination measure key for domain keywords; combined with the timing relevance measure EMI(e, w) between text entities, which represents the likelihood of entity w appearing after entity e within a certain text step, the EMI(e, w) values above a certain threshold are concatenated into a domain text vector as output, thereby modeling the domain session. First, when training the domain-session model, an initial text set for each domain is generated from conversation samples of the different domains, and domain entities and text entities are associated many-to-many in the multi-modal knowledge graph. Then, to generate discriminative domain keywords, the frequency f_ij with which each text word j appears in the domain text set i is computed, the maximum frequency max_f is recorded, and the number N_j of the N domains in which each text word appears is counted; from these quantities the discrimination degree of text word j for domain i is calculated [formula given as an image in the original]. Compared with the TF-IDF formula, this formula not only accounts for the length of the domain text set by converting counts into frequencies, but also keeps the original TF part non-negative through a normalization transformation, thereby realizing the discrimination measure for domain keywords; the relevant knowledge domain can then be identified by searching for these keywords in the historical text. Second, within each knowledge domain, the mutual information between entities is obtained by counting how often two text entities occur in succession within a certain text step, giving the timing relevance EMI(entity1, entity2) [formula given as an image in the original], which represents the relevance of entity2 following entity1 in the text sequence: when EMI is greater than zero, a larger value means a higher probability of co-occurrence, and when EMI is less than zero the two entities are mutually exclusive; this defines the relevance measure between text entities in each domain. Finally, while the system is in use, new text entities are associated with the domain entities according to the real conversation data, and if a conversation repeatedly jumps among several domains, the domains are split and regenerated based on an unsupervised clustering method.
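As an illustration only (the invention's exact formulas are given as images in the original), the sketch below computes a TF-IDF-style keyword discrimination score and a PMI-style EMI(e, w) timing relevance from toy domain text sets; both formulas here are assumptions in the spirit of the description.

```python
import math
from collections import Counter

def keyword_discrimination(domain_texts):
    """domain_texts: dict domain -> list of words. Returns {(domain, word): score}.
    Assumed TF-IDF-style score: normalized in-domain frequency times inverse domain frequency."""
    N = len(domain_texts)
    word_domains = Counter()
    for words in domain_texts.values():
        for w in set(words):
            word_domains[w] += 1                            # N_j: number of domains containing word j
    scores = {}
    for dom, words in domain_texts.items():
        freq = Counter(words)
        max_f = max(freq.values())                          # max_f within this domain's text set
        for w, f in freq.items():
            scores[(dom, w)] = (f / max_f) * math.log((N + 1) / (word_domains[w] + 0.5))
    return scores

def emi(texts, e, w, step=3):
    """Assumed PMI-style relevance of entity w following entity e within `step` words."""
    total = joint = ce = cw = 0
    for words in texts:
        total += len(words)
        ce += words.count(e)
        cw += words.count(w)
        joint += sum(1 for i, x in enumerate(words) if x == e and w in words[i + 1:i + 1 + step])
    if total == 0 or joint == 0 or ce == 0 or cw == 0:
        return float('-inf')                                # mutually exclusive / unseen pair
    return math.log((joint / total) / ((ce / total) * (cw / total)))
```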
The candidate text prediction, i.e. candidate text prediction based on a spatio-temporal knowledge graph, specifically means: a spatio-temporal knowledge-graph network is formed from the timing features of the historical text, the joint probability representation P(w) of the current candidate texts is inferred, and the P(w) values above a certain threshold are concatenated into a candidate text vector, thereby pruning the solution space. Specifically, the historical text that has already been formed can be viewed as a path walked between knowledge-graph entities, represented by a combination of several text-entity binary vectors (from, to). Because audio and video calls are real-time, the system cannot accurately predict the future at the current moment, and even when phoneme recognition is correct the semantically appropriate text entity may not be predicted; therefore several alternative paths with smaller weights are kept on the entity nodes preceding the path-end entity node, for semantic backtracking when an inference error occurs, forming the path walk graph G_t at the current moment t. On the basis of the multi-modal knowledge graph, G_t is superimposed in the time dimension with the path walk graphs of all previous moments to form a spatio-temporal knowledge graph G, where the "space" refers to the solution space. To improve inference efficiency, one assumption is made: the path walk graph at moment t depends only on the path walk graphs of the previous s time steps. The spatio-temporal knowledge-graph network is trained to optimize the joint probability distribution of G,

P(G) = ∏_t P(G_t | G_{t-s:t-1})

where P(G_t | G_{t-s:t-1}) can be split into

∏_{(from_t, to_t) ∈ G_t} P(to_t | from_t, G_{t-s:t-1}) · P(from_t | G_{t-s:t-1})

Conditioning on N_t(from_t), the set of all neighbour entity nodes of from_t, has two benefits: the candidate text space is covered, and the capability of frequent-pattern mining is provided. For this formula, a text-oriented spatio-temporal graph neural network is built on a recurrent neural network (RNN), and the formula is parameterized so that P(to_t | from_t, G_{t-s:t-1}) is proportional to an exp score built from e_{from_t}, h_{t-1}(from_t) and ω_{to_t} [formula given as an image in the original], where e_{from_t} is a learnable vector associated with from_t, h_{t-1}(from_t) is the historical semantic vector of from_t, and ω_{to_t} is a classifier parameter; similarly, P(from_t | G_{t-s:t-1}) → exp(H_{t-1}ᵀ · ω_{from_t}), where H_{t-1} is the historical semantic vector of the whole walk path. On this basis,

h_t(from_t) = RNN_1(g(N_t(from_t)), h_{t-1}(from_t)) and H_t = RNN_2(g(G_t), H_{t-1}),

i.e. the historical semantic vectors are updated recursively through the RNNs, where g is an aggregation function with an attention mechanism and the importance weight of each neighbour entity node for the from entity node is learned through an attention matrix.
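The sketch below gives a simplified, assumed instantiation of the recurrent update and candidate scoring described above: entity history vectors are updated with a GRU over aggregated neighbours and candidate "to" entities are scored with a softmax. Mean aggregation stands in for the attention-based aggregation g, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class PathWalkPredictor(nn.Module):
    """Score candidate 'to' entities given a 'from' entity and its recurrent history vector."""
    def __init__(self, n_entities, dim=64):
        super().__init__()
        self.emb = nn.Embedding(n_entities, dim)          # e_from: learnable entity vectors
        self.rnn1 = nn.GRUCell(dim, dim)                  # updates h_t(from) from aggregated neighbours
        self.score = nn.Linear(2 * dim, n_entities)       # classifier over candidate 'to' entities

    def step(self, from_idx, neighbor_idx, h_prev):
        agg = self.emb(neighbor_idx).mean(dim=0, keepdim=True)   # g(N_t(from)): mean aggregation (attention omitted)
        h = self.rnn1(agg, h_prev)                               # h_t(from) = RNN1(g(N_t(from)), h_{t-1}(from))
        logits = self.score(torch.cat([self.emb(from_idx), h], dim=-1))
        return torch.softmax(logits, dim=-1), h                  # P(to_t | from_t, history), updated history

model = PathWalkPredictor(n_entities=1000)
h0 = torch.zeros(1, 64)
p_to, h1 = model.step(torch.tensor([42]), torch.tensor([7, 99, 321]), h0)
print(p_to.shape)                                                # torch.Size([1, 1000])
```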
The matching of the recognized phonemes with the texts in the solution space means: according to the phonemes corresponding to the texts, the Cartesian product of the phoneme inference vector, the domain text vector and the candidate text vector is taken to obtain the text solution space with intersecting phonemes; the probability of each text in the solution space is calculated by the formula P(a)·P(w)·EMI(e, w); the text with the maximum value is used as the completed text for voice completion, while the other top-three-ranked texts with the same phoneme are kept as alternative texts that, together with the chosen text, form alternative paths with smaller weights, to be used for semantic backtracking that may occur during the next round of text inference.
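A small sketch of this matching step: intersect the domain and candidate text vectors, look up each text's phoneme probability, and rank by P(a)·P(w)·EMI(e, w). The dictionary data structures and the single-phoneme-per-text simplification are assumptions for illustration.

```python
def infer_text(phoneme_probs, domain_emi, candidate_probs, text_to_phoneme, top_k=3):
    """phoneme_probs: {phoneme: P(a)} for the lost region,
    domain_emi: {text: EMI(e, text)}, candidate_probs: {text: P(w)},
    text_to_phoneme: {text: phoneme}. Returns the best completion and alternatives."""
    scored = []
    for text in set(domain_emi) & set(candidate_probs):          # intersection of the domain and candidate vectors
        a = text_to_phoneme.get(text)
        if a in phoneme_probs:
            scored.append((phoneme_probs[a] * candidate_probs[text] * domain_emi[text], text))
    scored.sort(reverse=True)
    best = scored[0][1] if scored else None                      # completed text used for speech completion
    alternatives = [t for _, t in scored[1:top_k]]               # kept as lower-weight backtracking paths
    return best, alternatives
```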
The semantics-based sentence generation specifically comprises: a threshold T and a confidence coefficient α are set; the probabilities of the T text entities at the end of each alternative path are summed, the path with the maximum sum is spliced to generate the sentence, and the paths whose sum is smaller than α·T are contracted so that the graph only contains paths whose sum is larger than α·T, thereby further pruning the path walk graph at the current moment and optimizing the efficiency of the system's continuing inference.
The synthesis of the missing speech works as follows: the time-stamped text sequence is converted into a time-stamped phoneme sequence; a TTS model learns the mapping from phonemes to waveform features according to the user's speech waveform features, and a vocoder converts the features back into a waveform.
The voice splicing means: the existing speech waveform and the completed speech waveform are stretch-fitted near the connection point and then spliced, so that the speech transition is smoother and more natural.
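As a simple illustration of joining the existing and completed waveforms smoothly, the sketch below cross-fades the two signals over a short overlap around the connection point; the stretch-fitting step itself is omitted and the overlap length is an assumed value.

```python
import numpy as np

def splice(existing, completed, overlap=160):
    """Concatenate the existing and completed waveforms with a linear cross-fade of
    `overlap` samples around the connection point (about 10 ms at 16 kHz)."""
    fade_out = np.linspace(1.0, 0.0, overlap)
    fade_in = 1.0 - fade_out
    blended = existing[-overlap:] * fade_out + completed[:overlap] * fade_in
    return np.concatenate([existing[:-overlap], blended, completed[overlap:]])
```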
Technical effects
Compared with the technique commonly used for lip feature extraction, namely extracting image features from the lip-motion video and feeding them into a recurrent neural network, the invention uses the lip control points in the image as the focus of feature extraction, which greatly reduces irrelevant variables in the feature-extraction process; the spatio-temporal relations between the lip control points are then modeled with a spatio-temporal graph and used for feature extraction, so the feature-extraction process is more interpretable.
Compared with the technique commonly used for speech completion, namely speech prediction based on historical speech, the invention greatly supplements the information that inference relies on through the interaction and complementation of multi-modal features.
The invention performs text inference with pruning, combining domain-session modeling and candidate-text prediction on the basis of the preset and evolving reference information in the knowledge graph; it extracts the historical semantics of the context and the walk probabilities of the text paths and combines them with domain knowledge, so the semantics of the result are more appropriate and the accuracy is higher.
Drawings
FIG. 1 is a process framework diagram of the present invention;
FIG. 2 is a system block diagram of an embodiment of the present invention;
In the figure: the three dotted-line boxes represent the system input, the system's internal functional modules and the system output respectively; solid connecting lines with arrows represent the direction of data flow in the interaction between modules; solid connecting lines without arrows represent the internal models that the functions depend on; and dotted connecting lines represent the external models that the functional modules depend on.
Detailed Description
As shown in fig. 1, the present embodiment relates to a speech adaptive completion system based on a multi-modal knowledge graph, which comprises: a voice preprocessing module, a voice analysis module, a video preprocessing module, a spatio-temporal image analysis module, a multi-modal data aggregation module, a multi-modal information fusion module, a semantic text reasoning module and a voice completion module, wherein: the voice preprocessing module collects and preprocesses the voice packets at the receiving end, taking as input the low-quality real-time audio affected by packet loss, performs initial processing of the voice-modality data through voice packet detection, voice framing, audio windowing and endpoint detection, and outputs the preprocessed waveform to the voice analysis module; the video preprocessing module collects and preprocesses the video packets at the receiving end, taking continuous video images as input, performs initial processing of the video-modality data through video framing, face lip control-point detection, lip-region scale normalization and time alignment in sequence, and outputs the preprocessed images to the image analysis module; the voice analysis module extracts the historical text, waveform features and waveform timing features from the preprocessed waveform and outputs them as voice-modality data to the multi-modal data aggregation module; the spatio-temporal image analysis module constructs a spatio-temporal graph for each preprocessed frame's lip control-point set, builds a spatio-temporal graph convolutional neural network, extracts the lip motion features of each frame from the preceding and following information of that frame in the spatio-temporal graph, combines the lip motion features into lip-trajectory features, and inputs them as video-modality data to the multi-modal data aggregation module; the multi-modal data aggregation module aggregates, stores and associates the historical text, waveform features and waveform timing features of the voice modality and the lip-trajectory features of the video modality, providing support for the fusion and inference of the subsequent modules; the multi-modal fusion module aligns the waveform timing features and the lip-trajectory features through cross-modal interaction, trains a cross-modal conversion model, uses the hidden-state features produced while translating between lip-trajectory features and waveform timing features as the joint representation of the two modalities, converts the joint-representation information into phoneme information by training a phoneme prediction model so as to enhance the ability of the lip-feature modality to represent phoneme information, performs phoneme recognition on the voice packet-loss region based on the lip-trajectory features, and concatenates the phoneme sequence as the input of the semantic text reasoning module; for the voice packet-loss region, the semantic text reasoning module identifies the relevant knowledge domain from the historical text of the current conversation, predicts candidate texts based on the spatio-temporal knowledge graph so as to prune and optimize the size of the solution space for text inference, matches the recognized phonemes with the texts in the solution space, and thereby infers and generates the completed text, which it outputs to the voice completion module; the voice completion module synthesizes the missing speech from the completed text and the collected waveform features of the user's voice, and fills the completed speech into the original speech through voice splicing to form a complete and natural speech segment.
As shown in fig. 2, the embodiment is divided into a mobile terminal application, a voice adaptive completion system and an infrastructure layer, wherein the voice adaptive completion system is the core content of the whole framework, and includes a voice preprocessing module, a video preprocessing module, a voice analysis module, a spatio-temporal based video analysis module, a multi-modal data aggregation module, a multi-modal information fusion module, a semantic text reasoning module and a voice completion module, and supports the receiving, analysis, reasoning and completion of audio and video data through the interaction and cooperation among the modules.
The voice preprocessing module comprises a voice packet detection unit, a voice framing unit, an audio windowing unit and an endpoint detection unit, wherein: the voice packet detection unit marks the voice packet-loss region according to whether the voice data packet at the current moment has been received; the voice framing unit frames the speech signal using its short-time stationarity, adopting an overlapped-segmentation strategy in which each frame is sampled according to a preset frame length and overlap ratio (frame shift) so that one frame transitions smoothly into the next and sample continuity is preserved; the audio windowing unit applies a Hamming window, multiplying the voice data in each frame by the window function so that the data in the middle is emphasized and the data at the two sides is attenuated; and the endpoint detection unit computes the short-time energy and short-time average zero-crossing rate of each frame and classifies voice-occurrence and voice-silence regions in real time by a dual-threshold comparison, so that the start and end points of the valid voice region are located within the overall voice activity.
The voice analysis module comprises a speech recognition unit, an MFCC feature extraction unit and a BiLSTM-based timing feature recognition unit, wherein: the speech recognition unit uses an external STT model to obtain text from the speech, which serves as context for text inference; the MFCC-based feature extraction unit extracts M-dimensional cepstral feature parameters from each frame of preprocessed voice data in the Mel-scale frequency domain, achieving feature extraction and dimensionality reduction, and concatenates the cepstral feature parameters of the current frame and the previous T-1 frames into a waveform feature map of size (1, T, M); however, a pronunciation is usually composed of many frames of speech data, and analyzing the feature vector of a single frame in isolation produces large errors, so the temporal context of the frame must be incorporated; the BiLSTM-based timing feature recognition unit therefore uses a deep bidirectional LSTM network to further extract the timing features of each frame's context on top of the MFCC waveform feature map, and outputs a waveform timing feature matrix of size (T, N1), where N1 is the number of features per frame.
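A sketch of this waveform feature path: per-frame MFCCs are stacked into a (1, T, M) feature map and passed through a bidirectional LSTM to obtain the (T, N1) waveform timing feature matrix. The library choice, the synthetic test signal and all dimensions are assumptions for illustration.

```python
import numpy as np
import librosa
import torch
import torch.nn as nn

sr = 16000
y = np.sin(2 * np.pi * 220.0 * np.arange(sr) / sr).astype(np.float32)  # stand-in for 1 s of received speech
M = 13
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=M, n_fft=400, hop_length=160)  # (M, T): 25 ms frames, 10 ms shift
feature_map = torch.from_numpy(mfcc.T).float().unsqueeze(0)                   # (1, T, M) waveform feature map

bilstm = nn.LSTM(input_size=M, hidden_size=64, num_layers=2,
                 bidirectional=True, batch_first=True)
timing_features, _ = bilstm(feature_map)            # (1, T, N1) waveform timing features, N1 = 2 * 64
print(timing_features.squeeze(0).shape)             # (T, 128)
```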
The video preprocessing module comprises a video framing unit, a face lip control-point detection unit, a lip-region scale normalization unit and a time alignment unit, wherein: the video framing unit converts the input video data packets into an image sequence using the same sampling frequency as the voice framing; the face lip control-point detection unit detects the control points of the human lips in each frame of the image, one frame at a time, through an external face-recognition engine, the control points including the lip center coordinates, the coordinates of the upper boundary of the upper lip, the coordinates of the left mouth corner, and so on; the lip-region scale normalization unit fits a quadrilateral lip detection box from the left and right mouth corners, the lip center coordinates and the upper and lower lip boundary coordinates, and rotates and scales the lip control points to a uniform size by perspective transformation, normalizing the scale of the lip region while preserving the continuity of the control-point motion trajectory; and the time alignment unit fits the lip control-point set corresponding to each audio frame by Lagrange interpolation, achieving time alignment from image to audio.
The spatio-temporal image analysis module comprises a lip control-point spatio-temporal graph construction unit and a lip motion feature extraction unit, wherein: the lip control-point spatio-temporal graph construction unit collects the lip control points of all input frames, connects the control points within each frame according to the natural connection relations of human lip control points, with each control point also forming a self-loop, so as to build the spatial graph of each frame, and connects the same control point in two adjacent frames to form a temporal edge representing the lip motion-trajectory information between the two moments, so that the spatial information and the temporal information of the lip control points are modeled simultaneously; the lip motion feature extraction unit constructs a spatio-temporal graph convolutional neural network: for the current frame, the input of the spatial graph convolution is a 3-dimensional matrix (C, T, V), where C is the feature dimension of the lip control points (the control-point coordinates are used as features), T covers the current frame and the previous T-1 frames, and V is the number of lip control points. Spatially, a graph-partition strategy decomposes the graph G of each frame into three subgraphs G1, G2 and G3, representing the centripetal, centrifugal and static motion characteristics of a control point respectively: in G1 each control point is connected to the neighbouring control points closer to the lip center than itself, in G2 to the neighbouring control points farther from the lip center than itself, and in G3 to itself; accordingly, three convolution kernels of size (1, V, V) are used, and the local features of adjacent control points are obtained by weighted averaging. Temporally, to superimpose timing features on the spatial features of the current frame, a temporal convolutional network with a kernel of size (T, 1) fuses the features of the current frame and the previous T-1 frames of each lip control point, giving the local feature of how each control point changes over time. Using the spatial and temporal convolutions, the lip motion features are extracted; the output of each frame is (1, V, N2), where N2 is the number of features extracted per control point, and the per-frame features are concatenated and output as a lip-trajectory feature of size (T, V, C2), with C2 = N2.
The multi-modal data aggregation module defines ontology types, attributes and relations, the ontology types including domain, text word, phoneme, waveform feature, waveform timing feature, lip-trajectory feature and so on; it treats the historical text of the voice modality and the waveform features, waveform timing features and lip-trajectory features of the video modality as different entities, aggregates, stores and associates these entities based on the multi-modal knowledge graph, and continuously expands the knowledge while the system runs, providing support for enhancing and verifying the text inference of the subsequent modules. In addition, after the data is aggregated, the waveform timing features and the lip-trajectory features serve as the input of the multi-modal fusion module, the historical text serves as the input of the semantic text reasoning module, and the waveform features serve as the input of the voice completion module.
The multi-modal information fusion module comprises a Seq2Seq-based multi-modal joint characterization unit, a lip-feature modality characterization enhancement unit and a lip-trajectory-based phoneme recognition unit, wherein: the Seq2Seq-based multi-modal joint characterization unit uses a BiLSTM as the encoder and decoder for cross-modal conversion, and obtains the joint characterization of the two modalities by training the translation from lip-trajectory features to waveform timing features and the reverse translation from waveform timing features to lip-trajectory features; the phoneme inference model of the lip-feature modality characterization enhancement unit adopts the same structure as the cross-modal conversion model, takes the joint representation as input, and outputs a temporal phoneme posterior probability matrix y = (y1, y2, ..., yT) of size (T, |A|), where |A| is the size of the phoneme set A to be recognized and each row of y, (yt1, yt2, ..., ytA), gives the probability that the t-th frame is phoneme a. The phonemes to be recognized include all known phonemes plus BLANK, denoted "-", which distinguishes adjacent phonemes with the same pronunciation when the LSTM output is converted into a phoneme sequence. During training, a CTC transcription layer is attached after the BiLSTM of the phoneme prediction model's decoder in order to increase the probability p(L|x) that the BiLSTM outputs the correct result L given an input x. Since one phoneme in L is composed of the prediction results of multiple time slices in y, there may be multiple paths π that map to L, i.e. B(π) = L, where B is the mapping function; then

p(L|x) = Σ_{π ∈ B⁻¹(L)} p(π|x)

The CTC transcription layer adjusts the parameters ω of the LSTM along the gradient ∂p(L|x)/∂ω so that, for input samples with π ∈ B⁻¹(L), p(L|x) is maximized. For the voice packet-loss region, the lip-trajectory-based phoneme recognition unit inputs the lip-trajectory features extracted from the video into the phoneme inference model; each frame yields a phoneme inference vector of size |A| whose entries give the probability that the frame corresponds to each phoneme.
The semantic text reasoning module comprises a domain session modeling unit, a candidate text prediction unit based on a spatio-temporal knowledge graph, a text inference unit and a semantics-based sentence generation unit, wherein: the domain session modeling unit identifies the relevant knowledge domain from the historical text of the current conversation, while the candidate text prediction unit predicts candidate texts based on the spatio-temporal knowledge graph so as to prune and optimize the size of the solution space for text inference; the text inference unit matches the recognized phonemes with the texts in the solution space, and the semantics-based sentence generation unit infers and generates the completed text as the input of the voice completion module.
The domain session modeling unit works as follows: the knowledge domain of the semantic context, such as the finance industry, travel activities or everyday chat, is inferred from the historical text, mainly by defining a discrimination measure key for domain keywords; combined with the timing relevance measure EMI(e, w) between text entities, which represents the likelihood of entity w appearing after entity e within a certain text step, the EMI(e, w) values above a certain threshold are concatenated into a domain text vector as output, thereby modeling the domain session. First, when training the domain-session model, an initial text set for each domain is generated from conversation samples of the different domains, and domain entities and text entities are associated many-to-many in the multi-modal knowledge graph. Then, to generate discriminative domain keywords, the frequency f_ij with which each text word j appears in the domain text set i is computed, the maximum frequency max_f is recorded, and the number N_j of the N domains in which each text word appears is counted; from these quantities the discrimination degree of text word j for domain i is calculated [formula given as an image in the original]. Compared with the TF-IDF formula, this formula not only accounts for the length of the domain text set by converting counts into frequencies, but also keeps the original TF part non-negative through a normalization transformation, thereby realizing the discrimination measure for domain keywords; the relevant knowledge domain can then be identified by searching for these keywords in the historical text. Second, within each knowledge domain, the mutual information between entities is obtained by counting how often two text entities occur in succession within a certain text step, giving the timing relevance EMI(entity1, entity2) [formula given as an image in the original], which represents the relevance of entity2 following entity1 in the text sequence: when EMI is greater than zero, a larger value means a higher probability of co-occurrence, and when EMI is less than zero the two entities are mutually exclusive; this defines the relevance measure between text entities in each domain. Finally, while the system is in use, new text entities are associated with the domain entities according to the real conversation data, and if a conversation repeatedly jumps among several domains, the domains are split and regenerated based on an unsupervised clustering method.
The candidate text prediction unit based on the spatio-temporal knowledge graph specifically works as follows: a spatio-temporal knowledge-graph network is formed from the timing features of the historical text, the joint probability representation P(w) of the current candidate texts is inferred, and the P(w) values above a certain threshold are concatenated into a candidate text vector, thereby pruning the solution space. Specifically, the historical text that has already been formed can be viewed as a path walked between knowledge-graph entities, represented by a combination of several text-entity binary vectors (from, to). Because audio and video calls are real-time, the system cannot accurately predict the future at the current moment, and even when phoneme recognition is correct the semantically appropriate text entity may not be predicted; therefore several alternative paths with smaller weights are kept on the entity nodes preceding the path-end entity node, for semantic backtracking when an inference error occurs, forming the path walk graph G_t at the current moment t. On the basis of the multi-modal knowledge graph, G_t is superimposed in the time dimension with the path walk graphs of all previous moments to form a spatio-temporal knowledge graph G, where the "space" refers to the solution space. To improve inference efficiency, one assumption is made: the path walk graph at moment t depends only on the path walk graphs of the previous s time steps. The spatio-temporal knowledge-graph network is trained to optimize the joint probability distribution of G,

P(G) = ∏_t P(G_t | G_{t-s:t-1})

where P(G_t | G_{t-s:t-1}) can be split into

∏_{(from_t, to_t) ∈ G_t} P(to_t | from_t, G_{t-s:t-1}) · P(from_t | G_{t-s:t-1})

Conditioning on N_t(from_t), the set of all neighbour entity nodes of from_t, has two benefits: the candidate text space is covered, and the capability of frequent-pattern mining is provided. For this formula, a text-oriented spatio-temporal graph neural network is built on a recurrent neural network (RNN), and the formula is parameterized so that P(to_t | from_t, G_{t-s:t-1}) is proportional to an exp score built from e_{from_t}, h_{t-1}(from_t) and ω_{to_t} [formula given as an image in the original], where e_{from_t} is a learnable vector associated with from_t, h_{t-1}(from_t) is the historical semantic vector of from_t, and ω_{to_t} is a classifier parameter; similarly, P(from_t | G_{t-s:t-1}) → exp(H_{t-1}ᵀ · ω_{from_t}), where H_{t-1} is the historical semantic vector of the whole walk path. On this basis,

h_t(from_t) = RNN_1(g(N_t(from_t)), h_{t-1}(from_t)) and H_t = RNN_2(g(G_t), H_{t-1}),

i.e. the historical semantic vectors are updated recursively through the RNNs, where g is an aggregation function with an attention mechanism and the importance weight of each neighbour entity node for the from entity node is learned through an attention matrix.
The text inference unit: according to the phonemes corresponding to the texts, the Cartesian product of the phoneme inference vector, the domain text vector and the candidate text vector is taken to obtain the text solution space with intersecting phonemes; the probability of each text in the solution space is calculated by the formula P(a)·P(w)·EMI(e, w); the text with the maximum value is used as the completed text for voice completion, while the other top-three-ranked texts with the same phoneme are kept as alternative texts that, together with the chosen text, form alternative paths with smaller weights, to be used for semantic backtracking that may occur during the next round of text inference.
The semantics-based sentence generation unit specifically: sets a threshold T and a confidence coefficient α, sums the probabilities of the T text entities at the end of each alternative path, splices the path with the maximum sum to generate the sentence, and contracts the paths whose sum is smaller than α·T so that the graph only contains paths whose sum is larger than α·T, thereby further pruning the path walk graph at the current moment and optimizing the efficiency of the system's continuing inference.
The voice completion module comprises a speech synthesis unit and a speech splicing unit, wherein: the speech synthesis unit converts the time-stamped text sequence into a time-stamped phoneme sequence, learns the mapping from phonemes to waveform characteristics from the user's voice waveform characteristics through a TTS model, and inversely converts the characteristics into a waveform through a vocoder; the speech splicing unit splices the existing voice waveform and the completed voice waveform after stretch fitting near the connection point, so that the voice transition is smooth and natural.
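A minimal sketch of the splicing step, in which a short cross-fade around the junction stands in for the stretch fitting described above; the sample rate and fade length are illustrative assumptions:

import numpy as np

def splice(existing, completed, sr, fade_ms=20):
    # Overlap-add the tail of the existing waveform with the head of the synthesized completion.
    n = int(sr * fade_ms / 1000)
    fade_out = np.linspace(1.0, 0.0, n)
    fade_in = 1.0 - fade_out
    head, tail = existing[:-n], existing[-n:]
    joint = tail * fade_out + completed[:n] * fade_in    # smooth transition at the connection point
    return np.concatenate([head, joint, completed[n:]])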
TABLE 1 comparison of technical characteristics
The system is based on the knowledge graph: a spatio-temporal graph with structural significance is constructed for image feature extraction, and a spatio-temporal knowledge graph with semantic associations is constructed for text reasoning, which makes the reasoning process more interpretable. By adopting cross-modal reasoning with the multi-modal knowledge graph as data support, multi-modal information is aggregated, associated and complemented, giving higher information integrity. Domain sessions and time-series text are modeled, and text reasoning with context is realized based on the preset and evolving information in the knowledge graph, which improves the accuracy of the results.
Compared with the prior art, the system addresses the problems of low voice completion accuracy, poor interpretability and low integrity of information and data in mobile-terminal audio and video communication scenarios; it fully applies each modality of data at the receiving terminal to carry out text reasoning and voice completion, improves the intelligence level of the system, and provides strong technical support for repairing lost voice data packets.
The foregoing embodiments may be modified in many different ways by those skilled in the art without departing from the spirit and scope of the invention, which is defined by the appended claims and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (10)

1. A system for speech adaptive completion based on multi-modal knowledge graphs, comprising: a data receiver, a data analyzer and a data reasoner, wherein: the data receiver preprocesses the received audio and video data and outputs them to the data analyzer; the data analyzer analyzes the voice and the images to extract waveform time sequence characteristics and lip track characteristics, and obtains a phoneme sequence through multi-modal joint characterization; and the data reasoner carries out domain session modeling and candidate text prediction according to the historical text, carries out text reasoning in combination with the phoneme sequence to obtain sentences with semantics, and synthesizes the complete voice according to the waveform characteristics.
2. The multi-modal knowledge-graph based speech adaptive completion system of claim 1, wherein said data receiver comprises: a data receiving module, a voice preprocessing module and a video preprocessing module, wherein: the data receiving module receives and parses the application's audio and video data packets, and outputs the voice packets to the voice preprocessing module and the video packets to the video preprocessing module respectively; the voice preprocessing module collects and preprocesses the voice packets, takes the low-quality real-time audio with lost data packets as input, performs primary processing of the voice modal data through voice data packet detection, voice framing, audio windowing and endpoint detection to obtain preprocessed waveforms, and outputs them to the voice analysis module; the video preprocessing module collects and preprocesses the video packets, takes the continuous video images as input, performs primary processing of the video modal data through video framing, face lip control point detection, lip region scale normalization and time alignment in sequence to obtain preprocessed images, and outputs them to the image analysis module;
the voice data packet detection obtains the active state of the voice data, which comprises: a voice occurrence area, a voice silence area and a voice packet loss area, wherein: distinguishing whether the voice is silent helps reduce unnecessary voice recognition and completion, and identifying the voice packet loss area allows that area to be completed subsequently; the voice data packet detection performs a first classification of voice activity by labeling the voice area at each moment as TRUE or NONE according to whether the voice data packet at that moment is received; the area labeled NONE is completed by the semantic text reasoning module and the voice completion module, and within the area labeled TRUE, the voice occurrence and voice silence areas are further distinguished by endpoint detection;
in the voice framing, the time-varying characteristic of the voice signal makes the overall signal non-stationary, while the MFCC feature extraction in the voice analysis module uses the Fourier transform and requires a stationary input signal; the voice signal is therefore framed by exploiting its short-time stationarity, and an overlapping segmentation strategy is adopted in which each frame is sampled according to a preset frame length and overlap ratio (frame shift), so that each frame transitions smoothly into the next and the continuity of the samples is kept (a framing sketch is given after this claim);
the video framing adopts the same sampling frequency as the voice framing to convert the video into an image sequence;
the face lip control point detection detects the control points of the lips in each frame of image one by one through an external face recognition engine, the control points comprising the lip center coordinates, the upper and lower lip middle boundary coordinates and the left and right mouth corner coordinates;
in the lip region scale normalization, because the subsequent image analysis module only focuses on the relative motion of the lip control points, the influence of the lip size, the face deflection angle and the tilt angle in the image needs to be reduced; a quadrilateral lip detection frame is therefore fitted from the left and right mouth corner points, the upper and lower lip boundary coordinates and the lip center coordinates, and the lip control points are rotated and scaled to a uniform size through perspective transformation, realizing scale normalization of the lip region and keeping the continuity of the control point motion track;
the time alignment makes each frame of audio correspond to one frame of image in order to facilitate cross-modal interaction between the voice modality and the video modality; since the short-time lip control point track can be approximated by a simple curve, the lip control point set corresponding to each frame of audio is fitted by the Lagrange interpolation method, realizing the time alignment from image to audio.
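Illustrating the voice framing step referenced above, the following is a minimal sketch of overlapped framing with windowing; the frame length, frame shift and Hamming window are illustrative assumptions, not values fixed by this claim:

import numpy as np

def frame_signal(x, sr, frame_ms=25, shift_ms=10):
    # Overlapping segmentation: a frame shift smaller than the frame length keeps adjacent frames overlapping.
    frame_len = int(sr * frame_ms / 1000)
    shift = int(sr * shift_ms / 1000)
    frames = [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, shift)]
    window = np.hamming(frame_len)                   # audio windowing for smooth frame transitions
    return np.stack(frames) * window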
3. The multi-modal knowledge-graph based speech adaptive completion system of claim 1, wherein said data analyzer comprises: a voice analysis module, a spatio-temporal image analysis module and a multi-modal information fusion module, wherein: the voice analysis module extracts the historical text, waveform characteristics and waveform time sequence characteristics from the preprocessed waveforms, and outputs them as voice modal data to the multi-modal data aggregation module; the spatio-temporal image analysis module constructs a spatio-temporal graph for the preprocessed lip control point set of each frame, constructs a spatio-temporal graph convolutional neural network, extracts the lip motion characteristics of each frame from the information before and after that frame in the spatio-temporal graph, combines them into lip track characteristics, and inputs them as video modal data to the multi-modal data aggregation module; the multi-modal information fusion module realizes feature alignment of the waveform time sequence characteristics and the lip track characteristics through cross-modal interaction and trains a cross-modal conversion model, the hidden state features produced while the lip track characteristics and the waveform time sequence characteristics are converted into each other are used as the joint representation between the two modalities, the joint representation information is converted into phoneme information by training a phoneme prediction model, thereby enhancing the ability of the lip feature modality to represent phoneme information, phoneme recognition is performed on the voice data packet loss area based on the lip track characteristics, and the resulting phoneme sequence is spliced and used as the input of the semantic text reasoning module;
the lip motion characteristics are extracted as follows (a sketch of one such spatial-temporal block is given after this claim): a spatio-temporal graph convolutional neural network is constructed; for the current frame, the input of the spatial graph convolutional network is represented by a 3-dimensional matrix (C, T, V), where C is the feature dimension of the lip control points (the control point coordinates are taken as features), T covers the current frame and the previous T-1 frames, and V is the number of lip control points; spatially, a graph partitioning strategy decomposes the graph G of each frame into three subgraphs G1, G2 and G3, which respectively represent the centripetal, centrifugal and static motion characteristics of the control points: each control point in G1 is connected with the neighbor control points closer to the lip center than itself, each control point in G2 is connected with the neighbor control points farther from the lip center than itself, and each control point in G3 is connected with itself; the graph convolution therefore uses convolution kernels of size (1, V, V) with 3 kernels, and the local features of adjacent control points are obtained through weighted averaging; temporally, a temporal convolutional neural network superposes time sequence features on the spatial features of the current frame, using convolution kernels of size (T, 1) to fuse the features of the current frame and the previous T-1 frames of each lip control point, obtaining the local features of each control point as it changes over time; lip motion features are thus extracted with the spatial and temporal convolutions, the output of each frame being (1, V, N2), where N2 is the number of features extracted for each control point; the lip motion features of all frames are spliced and output as the (T, V, C2) lip track characteristics;
the data aggregation refers to: defining ontology types, their attributes and relations, including the domain, text words, phonemes, waveform characteristics, waveform time sequence characteristics, lip track characteristics and the like; the input historical text of the voice modality and the waveform characteristics, waveform time sequence characteristics and lip track characteristics of the video modality are different entities, which are gathered, stored and associated on the basis of the multi-modal knowledge graph; knowledge is continuously expanded while the system runs, providing support for the enhancement and verification of text reasoning in subsequent modules; in addition, after data aggregation is completed, the waveform time sequence characteristics and lip track characteristics serve as the input of the multi-modal information fusion module, the historical text serves as the input of the semantic text reasoning module, and the waveform characteristics serve as the input of the voice completion module;
the joint characterization, namely the multi-modal joint characterization based on Seq2Seq, specifically means: the cross-modal interaction is based on a Seq2Seq model, in which the cross-modal conversion model uses a BILSTM as both encoder and decoder, and the joint representation of the two modalities is obtained by training on the translation from lip track characteristics to waveform time sequence characteristics and the reverse translation from waveform time sequence characteristics to lip track characteristics.
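As referenced in the lip motion feature step above, here is a minimal PyTorch sketch of one partitioned spatial-temporal convolution block over the lip control points; the class name, channel sizes and temporal kernel length are assumptions, and A_parts stands for the adjacency matrices of the three sub-graphs G1, G2 and G3 (centripetal, centrifugal and static):

import torch
import torch.nn as nn

class LipSTGCNBlock(nn.Module):
    def __init__(self, c_in, c_out, A_parts, t_kernel=9):
        super().__init__()
        self.register_buffer('A', torch.stack(A_parts))            # (3, V, V) sub-graph adjacency matrices
        self.spatial = nn.Conv2d(c_in, c_out * 3, kernel_size=1)   # one output group per sub-graph
        self.temporal = nn.Conv2d(c_out, c_out, kernel_size=(t_kernel, 1),
                                  padding=(t_kernel // 2, 0))      # (T, 1) convolution over frames

    def forward(self, x):                                          # x: (N, C, T, V)
        n, _, t, v = x.shape
        y = self.spatial(x).view(n, 3, -1, t, v)                   # split features per sub-graph
        y = torch.einsum('nkctv,kvw->nctw', y, self.A)             # weighted average over neighbor control points
        return self.temporal(y)                                    # local spatio-temporal lip motion features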
4. The multi-modal knowledge-graph based speech adaptive completion system of claim 1, wherein said data reasoner comprises: a semantic text reasoning module and a voice completion module, wherein: for the voice packet loss area, the semantic text reasoning module identifies the related knowledge domain from the historical text of the current conversation, predicts candidate texts based on the spatio-temporal knowledge graph so as to prune the size of the solution space and optimize text reasoning, matches the recognized phonemes with the texts in the solution space, and accordingly infers and generates the completed text and outputs it to the voice completion module; the voice completion module synthesizes the missing voice according to the completed text and the collected waveform characteristics of the user's voice, and fills the completed voice into the original voice through voice splicing to form a complete and natural voice segment.
5. The multi-modal knowledge-graph based speech adaptive completion system according to claim 3, wherein the enhancement of the lip feature modality is: a phoneme inference model with the same structure as the cross-modal conversion model receives the joint representation and outputs a time-sequential phoneme posterior probability matrix y = (y1, y2, ..., yt, ..., yT) of size (T, |A|), wherein: |A| is the size of the phoneme set A to be identified, and each yt = (yt1, yt2, ..., yta, ..., yt|A|) represents the probabilities that the t-th frame is each phoneme a;
the phoneme inference model is trained as follows: a CTC transcription layer is attached after the BILSTM of the phoneme prediction model decoder, with the aim of increasing the probability p(L|x) that the BILSTM outputs the correct result given the input x; one phoneme in L consists of the prediction results of several time slices in y, so several paths π may form L, i.e. B(π) = L, where B is a mapping function, and then
p(L|x) = Σ_{π ∈ B^{-1}(L)} p(π|x);
the transcription layer CTC adjusts the parameters ω of the LSTM through the gradient ∂p(L|x)/∂ω, so that for the input samples the paths π ∈ B^{-1}(L) maximize p(L|x) (a training sketch follows this claim);
the phoneme recognition is: for the voice data packet loss area, the lip track characteristics extracted from the video are input into the phoneme inference model, and each frame obtains a phoneme inference vector of size |A|, each value P(a) in the vector representing the probability that the frame corresponds to phoneme a.
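A hedged sketch of the CTC training described above, using PyTorch's CTCLoss; the layer sizes, the 40-phoneme set and the helper name train_step are assumptions rather than the claimed configuration:

import torch
import torch.nn as nn

num_phonemes = 40                                # illustrative size of the phoneme set A
blstm = nn.LSTM(input_size=256, hidden_size=128, bidirectional=True, batch_first=True)
proj = nn.Linear(256, num_phonemes + 1)          # +1 output for the CTC blank symbol
ctc = nn.CTCLoss(blank=0, zero_infinity=True)    # transcription layer

def train_step(joint_repr, targets, input_lens, target_lens, opt):
    out, _ = blstm(joint_repr)                                # (N, T, 256) decoder states
    log_probs = proj(out).log_softmax(-1).transpose(0, 1)     # CTCLoss expects (T, N, |A|+1)
    loss = ctc(log_probs, targets, input_lens, target_lens)   # sums p(pi|x) over all paths in B^-1(L)
    opt.zero_grad()
    loss.backward()                                           # gradient of the CTC objective w.r.t. omega
    opt.step()
    return loss.item()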
6. The multi-modal knowledge-graph based speech adaptive completion system according to claim 1, wherein the domain session modeling is: inferring the knowledge domain in which the semantic context is located, such as the financial industry, travel activities or daily chat, from the historical text; the knowledge domain related to the historical text is mainly inferred by defining a discrimination measure key for domain keywords, combined with a timing relevance measure EMI(e, w) between text entities that represents the possibility of entity w appearing after entity e within a certain text step, and the EMI(e, w) values greater than a certain threshold are spliced into a domain text vector as output, thereby realizing the modeling of the domain session; first, when the domain session model is trained, the domain entities of each domain are generated from session samples of the different domains, and the domain entities and text entities are associated many-to-many in the multi-modal knowledge graph; then, in order to generate domain keywords with discrimination, the frequency fij of each text word j in the domain text set i is counted, the maximum frequency max_f is recorded, and the number nj of the N domains in which each text word appears is calculated; by the keyword discrimination formula
the discrimination degree key of the text word j for the domain i is obtained by calculation; compared with the TF-IDF formula, this calculation not only takes the length of the domain text set into account by converting counts into frequencies, but also ensures through a normalizing transformation that the part originally calculated by TF is always non-negative, thereby realizing the discrimination measure for domain keywords, so that the related knowledge domain can be identified by searching for the keywords in the historical text; secondly, in each knowledge domain, the mutual information between entities is obtained by counting how often two text entities occur successively within a certain text step,
EMI(entity1, entity2) = log( p(entity1, entity2) / (p(entity1) · p(entity2)) )
where p(entity1, entity2) is the frequency with which entity2 follows entity1 within the text step; EMI is used to represent the correlation between entity2 and entity1 in the text sequence: when EMI is greater than zero, the larger the value, the more likely the successive occurrence, and when EMI is less than zero, the two entities are mutually exclusive; the relevance measure between text entities is thus defined within each domain; finally, during the use of the system, new text entities are expanded and associated to the domain entities according to the real conversation data, and if a conversation repeatedly jumps among several domains, the domains are split and regenerated based on an unsupervised clustering method.
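The two measures defined in this claim can be illustrated with the sketch below; the patent's own discrimination formula and EMI definition are given only as images in the source, so a normalized TF-IDF-style score and a PMI-style score stand in here as assumptions:

import math
from collections import Counter

def domain_keywords(domain_texts, top_k=20):
    # domain_texts: dict mapping each domain i to its list of text words.
    N = len(domain_texts)                                                   # number of domains
    df = Counter(w for words in domain_texts.values() for w in set(words))  # n_j: domains containing word j
    keys = {}
    for domain, words in domain_texts.items():
        tf = Counter(words)
        max_f = max(tf.values())
        score = {w: (f / max_f) * math.log(N / df[w]) for w, f in tf.items()}
        keys[domain] = sorted(score, key=score.get, reverse=True)[:top_k]
    return keys

def emi(seq, e1, e2, step=3):
    # PMI-style relevance of e2 appearing within `step` positions after e1 in an entity sequence.
    n = len(seq)
    p1, p2 = seq.count(e1) / n, seq.count(e2) / n
    joint = sum(1 for i, w in enumerate(seq) if w == e1 and e2 in seq[i + 1:i + 1 + step]) / n
    if joint == 0 or p1 == 0 or p2 == 0:
        return float('-inf')                     # the entities never occur successively (mutually exclusive)
    return math.log(joint / (p1 * p2))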
7. The multi-modal knowledge-graph based speech adaptive completion system according to claim 1, wherein the candidate text prediction, namely the candidate text prediction based on the spatio-temporal knowledge graph, specifically comprises: forming a spatio-temporal knowledge graph network from the time sequence characteristics of the historical text, reasoning the joint probability representation P(w) of the current candidate texts, and splicing the P(w) values higher than a certain threshold into a candidate text vector, thereby pruning the solution space, specifically: the historical text that has already been formed can be regarded as a path that walks among the knowledge graph entities, represented by a combination of several text entity binary vectors (from, to); because audio and video calls happen in real time, the system cannot accurately predict the future at the current moment, and even when phoneme recognition is correct it cannot guarantee that the text entity with the truly proper semantics is predicted, so several alternative paths with smaller weights are kept on the entity nodes preceding the path-end entity node for semantic backtracking when an inference error occurs, forming the path walk graph G_t at the current moment t; on the basis of the multi-modal knowledge graph, G_t is superposed in the time dimension with the path walk graphs of all previous moments to form the spatio-temporal knowledge graph G, where the space refers to the solution space; to improve reasoning efficiency, one assumption is made: the path walk graph at moment t depends only on the path walk graphs of the previous s time steps, and the spatio-temporal knowledge graph network is trained to optimize and reason the joint probability distribution of G
P(G) = ∏_t P(G_t | G_{t-s:t-1})
where P(G_t | G_{t-s:t-1}) can be split into
P(G_t | G_{t-s:t-1}) = ∏_{(from_t, to_t) ∈ G_t} P(from_t, to_t | G_{t-s:t-1}) = ∏_{(from_t, to_t) ∈ G_t} P(to_t | from_t, G_{t-s:t-1}) · P(from_t | G_{t-s:t-1})
further, both conditional probabilities are computed from N_{t-s:t-1}(from_t), the set of all neighbor entity nodes of from_t in the preceding path walk graphs;
representing from_t by all of its neighbor entity nodes has two benefits: the candidate text space is covered, and the capability of frequent pattern mining is provided; for the above formulas, a text-oriented spatio-temporal graph neural network is established based on the recurrent neural network RNN, and the formulas are parameterized as follows:
P(to_t | from_t, G_{t-s:t-1}) → exp([e_from_t : h_{t-1}(from_t)]^T · ω_to_t)
where e_from_t is a learnable vector associated with from_t, h_{t-1}(from_t) is the historical semantic vector of from_t, and ω_to_t is the classifier parameter; similarly, P(from_t | G_{t-s:t-1}) → exp(H_{t-1}^T · ω_from_t), where H_{t-1} is the historical semantic vector of the full walk path; on this basis,
h_t(from_t) = RNN_1(g(N_t(from_t)), h_{t-1}(from_t)) and H_t = RNN_2(g(G_t), H_{t-1}), i.e. the historical semantic vectors are recursively updated through the RNNs; g is an aggregation function with an attention mechanism, and the importance weight of each neighbor entity node with respect to the from entity node is learned through an attention matrix.
8. The multi-modal knowledge-graph based speech adaptive completion system according to claim 4, wherein said matching of the recognized phonemes with the texts in the solution space is: according to the phonemes corresponding to the texts, the Cartesian product of the phoneme inference vector, the domain text vector and the candidate text vector is taken to obtain the text solution space with intersecting phonemes; the probability of each text in the solution space is calculated by the formula P = P(a) · P(w) · EMI(e, w); the text with the maximum value is used as the completed text for voice completion, the texts ranked in the top three for the same phoneme are kept as alternative texts, and together they form alternative paths with smaller weights, used for the semantic backtracking that may occur during the next text inference.
9. The multi-modal knowledge-graph based speech adaptive completion system according to claim 1, wherein the sentences with semantics are obtained by: setting a threshold T and a confidence coefficient α; for each alternative path, the probabilities of the T text entities at the tail end of the path are summed, the path with the maximum sum is spliced to generate a sentence, and at the same time the paths whose sum is smaller than α·T are contracted so that the graph only contains paths whose sum is larger than α·T, thereby further pruning the path walk graph at the current moment and improving the efficiency of the system's continuous reasoning.
10. The multi-modal knowledge-graph based speech adaptive completion system according to claim 4, wherein the synthesizing of the missing speech is: converting a text sequence with a timestamp into a phoneme sequence with a timestamp, learning the mapping from phonemes to waveform characteristics according to the waveform characteristics of the voice of a user through a TTS model, and reversely converting the characteristics into waveforms through a vocoder;
the voice splicing means that: the existing voice waveform and the complementary voice waveform are spliced after being subjected to telescopic fitting near the connection point, so that the voice transition is smoother and more natural.
CN202111207821.9A 2021-10-18 2021-10-18 Voice self-adaptive completion system based on multi-mode knowledge graph Pending CN113936637A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111207821.9A CN113936637A (en) 2021-10-18 2021-10-18 Voice self-adaptive completion system based on multi-mode knowledge graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111207821.9A CN113936637A (en) 2021-10-18 2021-10-18 Voice self-adaptive completion system based on multi-mode knowledge graph

Publications (1)

Publication Number Publication Date
CN113936637A true CN113936637A (en) 2022-01-14

Family

ID=79280039

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111207821.9A Pending CN113936637A (en) 2021-10-18 2021-10-18 Voice self-adaptive completion system based on multi-mode knowledge graph

Country Status (1)

Country Link
CN (1) CN113936637A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114817456A (en) * 2022-03-10 2022-07-29 马上消费金融股份有限公司 Keyword detection method and device, computer equipment and storage medium
CN114817456B (en) * 2022-03-10 2023-09-05 马上消费金融股份有限公司 Keyword detection method, keyword detection device, computer equipment and storage medium
CN114666449A (en) * 2022-03-29 2022-06-24 深圳市银服通企业管理咨询有限公司 Voice data processing method of calling system and calling system
CN114880484A (en) * 2022-05-11 2022-08-09 军事科学院系统工程研究院网络信息研究所 Satellite communication frequency-orbit resource map construction method based on vector mapping
CN114880484B (en) * 2022-05-11 2023-06-16 军事科学院系统工程研究院网络信息研究所 Satellite communication frequency track resource map construction method based on vector mapping
CN115860152A (en) * 2023-02-20 2023-03-28 南京星耀智能科技有限公司 Cross-modal joint learning method oriented to character military knowledge discovery
CN115860152B (en) * 2023-02-20 2023-06-27 南京星耀智能科技有限公司 Cross-modal joint learning method for character military knowledge discovery
CN116071740A (en) * 2023-03-06 2023-05-05 深圳前海环融联易信息科技服务有限公司 Invoice identification method, computer equipment and storage medium
CN115953521A (en) * 2023-03-14 2023-04-11 世优(北京)科技有限公司 Remote digital human rendering method, device and system
CN116580701A (en) * 2023-05-19 2023-08-11 国网物资有限公司 Alarm audio frequency identification method, device, electronic equipment and computer medium
CN116580701B (en) * 2023-05-19 2023-11-24 国网物资有限公司 Alarm audio frequency identification method, device, electronic equipment and computer medium

Similar Documents

Publication Publication Date Title
CN113936637A (en) Voice self-adaptive completion system based on multi-mode knowledge graph
Petridis et al. Audio-visual speech recognition with a hybrid ctc/attention architecture
Ferstl et al. Multi-objective adversarial gesture generation
CN113408385B (en) Audio and video multi-mode emotion classification method and system
CN107423398B (en) Interaction method, interaction device, storage medium and computer equipment
CN112053690B (en) Cross-mode multi-feature fusion audio/video voice recognition method and system
Sadoughi et al. Speech-driven expressive talking lips with conditional sequential generative adversarial networks
CN101187990A (en) A session robotic system
US20040056907A1 (en) Prosody based audio/visual co-analysis for co-verbal gesture recognition
CN110634469B (en) Speech signal processing method and device based on artificial intelligence and storage medium
CN110853656B (en) Audio tampering identification method based on improved neural network
CN112151030B (en) Multi-mode-based complex scene voice recognition method and device
CN111967272B (en) Visual dialogue generating system based on semantic alignment
CN104541324A (en) A speech recognition system and a method of using dynamic bayesian network models
Elakkiya et al. Subunit sign modeling framework for continuous sign language recognition
CN111461173A (en) Attention mechanism-based multi-speaker clustering system and method
CN112329438A (en) Automatic lie detection method and system based on domain confrontation training
Zhang Voice keyword retrieval method using attention mechanism and multimodal information fusion
CN114610158A (en) Data processing method and device, electronic equipment and storage medium
CN112749567A (en) Question-answering system based on reality information environment knowledge graph
CN116955699A (en) Video cross-mode search model training method, searching method and device
D’Ulizia Exploring multimodal input fusion strategies
CN116312512A (en) Multi-person scene-oriented audiovisual fusion wake-up word recognition method and device
Sartiukova et al. Remote Voice Control of Computer Based on Convolutional Neural Network
Hussain et al. Deep learning for audio visual emotion recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination