CN118098222B - Voice relation extraction method, device, computer equipment and storage medium - Google Patents

Voice relation extraction method, device, computer equipment and storage medium

Info

Publication number
CN118098222B
Authority
CN
China
Prior art keywords
text
loss
feature
voice
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410524510.2A
Other languages
Chinese (zh)
Other versions
CN118098222A (en)
Inventor
张亮
杨振
孟凡东
苏劲松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202410524510.2A priority Critical patent/CN118098222B/en
Publication of CN118098222A publication Critical patent/CN118098222A/en
Application granted granted Critical
Publication of CN118098222B publication Critical patent/CN118098222B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The disclosure provides a voice relation extraction method and device, computer equipment and a storage medium. The method comprises: obtaining target voice data; performing voice feature extraction on the target voice data based on a first neural network model to obtain voice features; inputting the voice features into a second neural network model for feature modality conversion to obtain text features; and performing feature decoding on the text features based on a third neural network model to obtain a voice relation text of the target voice data. The first neural network model, the second neural network model and the third neural network model are obtained by joint training based on a target loss, where the target loss includes a first loss and a second loss: the first loss is calculated from a first sample text feature and a second sample text feature, and the second loss is calculated from a predicted voice relation text and a voice relation label. The method can improve the accuracy of voice relation extraction.

Description

Voice relation extraction method, device, computer equipment and storage medium
Technical Field
The disclosure relates to the technical field of artificial intelligence, and in particular relates to a voice relation extraction method, a device, computer equipment and a storage medium.
Background
Speech translation is a service that converts the speech content in an audio file into text in another language. This service is of great value for cross-language communication, international conferences, assistive devices for people with hearing impairment, and similar applications.
In the process of speech translation, voice relation extraction is usually needed in order to better mine the knowledge contained in the speech. Voice relation extraction aims to extract relation triples (head entity, relation, tail entity) from speech, which in turn helps to produce more accurate translation results.
At present, the voice relation extraction methods adopted in the related art yield low accuracy in the extracted voice relations.
Disclosure of Invention
The embodiment of the disclosure provides a voice relation extraction method, a device, computer equipment and a storage medium.
The first aspect of the present disclosure provides a voice relation extraction method, the method comprising:
acquiring target voice data to be subjected to voice relation extraction;
Extracting voice characteristics of the target voice data based on a first neural network model to obtain voice characteristics;
inputting the voice characteristics into a second neural network model for characteristic mode conversion to obtain text characteristics;
performing feature decoding on the text features based on a third neural network model to obtain a voice relation text of the target voice data;
Wherein the first neural network model, the second neural network model and the third neural network model are obtained by joint training based on a target loss, the target loss including a first loss and a second loss; the first loss is calculated based on a first sample text feature and a second sample text feature, the first sample text feature being obtained by extracting a voice feature from sample voice through the first neural network model and then performing feature modality conversion through the second neural network model, and the second sample text feature being obtained by extracting a text feature from sample text corresponding to the sample voice; the second loss is calculated based on a predicted voice relation text and a voice relation label, the predicted voice relation text being obtained by performing feature decoding on the first sample text feature through the third neural network model.
A second aspect of the present disclosure provides a voice relationship extraction apparatus, the apparatus comprising:
the acquisition unit is used for acquiring target voice data to be subjected to voice relation extraction;
The extraction unit is used for extracting the voice characteristics of the target voice data based on the first neural network model to obtain the voice characteristics;
The conversion unit is used for inputting the voice characteristics into a second neural network model to perform characteristic mode conversion to obtain text characteristics;
The decoding unit is used for performing feature decoding on the text features based on a third neural network model to obtain a voice relation text of the target voice data;
Wherein the first neural network model, the second neural network model and the third neural network model are obtained by joint training based on a target loss, the target loss including a first loss and a second loss; the first loss is calculated based on a first sample text feature and a second sample text feature, the first sample text feature being obtained by extracting a voice feature from sample voice through the first neural network model and then performing feature modality conversion through the second neural network model, and the second sample text feature being obtained by extracting a text feature from sample text corresponding to the sample voice; the second loss is calculated based on a predicted voice relation text and a voice relation label, the predicted voice relation text being obtained by performing feature decoding on the first sample text feature through the third neural network model.
Optionally, in some embodiments, the joint training process of the first neural network model, the second neural network model, and the third neural network model is implemented by a training unit comprising:
the training system comprises an acquisition subunit, a training unit and a processing unit, wherein the acquisition subunit is used for acquiring training sample data, and the training sample data comprises a plurality of sample voices and voice relation labels corresponding to each sample voice;
the first extraction subunit is used for extracting the voice characteristics of the sample voice based on the first neural network model to obtain sample voice characteristics, and carrying out characteristic modal conversion on the sample voice characteristics based on the second neural network model to obtain the first sample text characteristics;
The second extraction subunit is used for carrying out voice recognition on the sample voice to obtain a corresponding sample text, carrying out text feature extraction on the sample text based on a fourth neural network model to obtain the second sample text feature, and enabling the output feature of the fourth neural network model to be consistent with the input feature size of the third neural network model;
a first calculating subunit, configured to calculate a connectionist temporal classification (CTC) loss according to the first sample text feature and the second sample text feature, to obtain a first loss;
the first decoding subunit is used for carrying out feature decoding on the first sample text feature based on the third neural network model to obtain a first predicted voice relation text, and calculating a second loss according to the first predicted voice relation text and the voice relation label;
And the updating subunit is used for updating the model parameters of the first neural network model, the second neural network model and the third neural network model based on the first loss and the second loss.
Optionally, in some embodiments, the updating subunit includes:
the acquisition module is used for acquiring a first weight coefficient corresponding to the first loss and acquiring a second weight coefficient corresponding to the second loss;
the first calculation module is used for carrying out weighted calculation on the first loss and the second loss based on the first weight coefficient and the second weight coefficient to obtain a first target loss;
And the first updating module is used for updating model parameters of the first neural network model, the second neural network model and the third neural network model according to the first target loss.
Optionally, in some embodiments, the training unit further comprises:
the first recognition subunit is used for recognizing entity texts in the sample texts and recognizing entity features corresponding to the entity texts in the second sample text features;
The conversion subunit is used for carrying out modal conversion on the entity characteristics in the second sample text characteristics based on the first sample text characteristics to obtain mixed modal characteristics;
The second decoding subunit is used for carrying out feature decoding on the mixed mode features based on the third neural network model to obtain a second predicted voice relation text, and carrying out feature decoding on the second sample text features based on the third neural network model to obtain a third predicted voice relation text;
a second calculation subunit configured to calculate a third loss according to the second predicted speech relationship text and the third predicted speech relationship text;
the update subunit is further configured to:
model parameters of the first, second, and third neural network models are updated based on the first, second, and third losses.
Optionally, in some embodiments, the conversion subunit comprises:
the second calculation module is used for carrying out attention calculation on each entity feature and the first sample text feature to obtain an entity voice feature corresponding to each entity;
And the replacing module is used for replacing the entity characteristic corresponding to each entity in the second sample text characteristic with the corresponding entity voice characteristic to obtain the mixed mode characteristic.
Optionally, in some embodiments, the training unit further comprises:
A third calculation subunit, configured to calculate a fourth loss according to the second predicted speech relationship text and the speech relationship tag, and calculate a fifth loss according to the third predicted speech relationship text and the speech relationship tag;
the update subunit includes:
a third calculation module for calculating a second target loss based on the first loss, the second loss, the third loss, the fourth loss, and the fifth loss;
And a second updating module, configured to update model parameters of the first neural network model, the second neural network model, and the third neural network model according to the second target loss.
Optionally, in some embodiments, the training unit further comprises:
The compression subunit is used for inputting the first sample text feature into a fifth neural network model for feature compression to obtain a first sentence feature, and inputting the second sample text feature into the fifth neural network model for feature compression to obtain a second sentence feature;
the projection subunit is used for carrying out semantic projection on the first sentence characteristic based on a sixth neural network model to obtain a third sentence characteristic, and carrying out semantic projection on the second sentence characteristic based on the sixth neural network model to obtain a fourth sentence characteristic;
A fourth calculation subunit, configured to calculate a sixth loss according to the third sentence feature and the fourth sentence feature;
the third computing module is further configured to:
calculating a second target loss from the first loss, the second loss, the third loss, the fourth loss, the fifth loss, and the sixth loss.
Optionally, in some embodiments, the training unit further comprises:
The third decoding subunit is used for carrying out feature decoding on the third sentence feature based on the third neural network model to obtain a fourth predicted voice relation text, and carrying out feature decoding on the fourth sentence feature based on the third neural network model to obtain a fifth predicted voice relation text;
A fifth calculation subunit configured to calculate a seventh loss according to the fourth predicted speech relationship text and the fifth predicted speech relationship text;
the third computing module is further configured to:
Calculating a second target loss from the first loss, the second loss, the third loss, the fourth loss, the fifth loss, the sixth loss, and the seventh loss.
Optionally, in some embodiments, the third computing module includes:
The acquisition sub-module is used for acquiring training rounds;
The first calculation sub-module is used for calculating a third weight coefficient corresponding to each loss based on the training round;
And the second calculation submodule is used for carrying out weighted calculation on the first loss, the second loss, the third loss, the fourth loss, the fifth loss, the sixth loss and the seventh loss based on the third weight coefficient corresponding to each loss to obtain a second target loss.
Optionally, in some embodiments, the training unit further comprises:
The first processing subunit is used for inputting the first sample text feature into a fifth neural network model for feature compression, and inputting the compressed feature into the sixth neural network model for semantic projection to obtain a fifth sentence feature;
The second processing subunit is used for inputting the second sample text feature into the fifth neural network model for feature compression, and inputting the compressed feature into the sixth neural network model for semantic projection to obtain a sixth sentence feature;
a sixth calculation subunit, configured to calculate an eighth loss according to the fifth sentence feature and the sixth sentence feature;
the update subunit is further configured to:
updating model parameters of the first neural network model, the second neural network model, and the third neural network model based on the first loss, the second loss, and the eighth loss.
Optionally, in some embodiments, the voice relation extracting apparatus provided by the present disclosure further includes:
the second recognition subunit is used for carrying out language recognition on the target voice data to obtain a language description text;
The third extraction subunit is used for extracting text features of the language description text to obtain language text features;
the splicing subunit is used for splicing the language text features and the text features to obtain target text features;
The decoding unit is further configured to:
And performing feature decoding on the target text features based on a third neural network model to obtain the voice relation text of the target voice data.
A third aspect of the present disclosure provides a storage medium storing a computer program which, when executed by a processor, implements the speech relation extraction method according to the first aspect.
A fourth aspect of the present disclosure provides a computer device comprising a memory storing a computer program and a processor implementing the speech relation extraction method according to the first aspect when the processor executes the computer program.
A fifth aspect of the present disclosure provides a computer program product comprising a computer program, the computer program being readable and executable by a processor of a computer device to cause the computer device to perform the speech relation extraction method according to the first aspect.
According to the voice relation extraction method provided by the embodiment of the disclosure, target voice data to be subjected to voice relation extraction is obtained; voice feature extraction is performed on the target voice data based on the first neural network model to obtain voice features; the voice features are input into a second neural network model for feature modality conversion to obtain text features; and feature decoding is performed on the text features based on the third neural network model to obtain a voice relation text of the target voice data. The first neural network model, the second neural network model and the third neural network model are obtained by joint training based on a target loss, and the target loss comprises a first loss and a second loss; the first loss is calculated based on a first sample text feature and a second sample text feature, where the first sample text feature is obtained by extracting a voice feature from sample voice through the first neural network model and then performing feature modality conversion through the second neural network model, and the second sample text feature is obtained by extracting a text feature from sample text corresponding to the sample voice; the second loss is calculated based on a predicted voice relation text and a voice relation label, where the predicted voice relation text is obtained by performing feature decoding on the first sample text feature through the third neural network model.
In the embodiment of the disclosure, the first loss, which is calculated from the first sample text features output by the second neural network model and the second sample text features extracted from the sample text corresponding to the sample voice, is used to guide the learning of the second neural network model, so that the features obtained by the second neural network model when converting the voice features become closer to the text features extracted directly from the text corresponding to the voice. In this way, by training a second neural network model that can accurately convert the feature modality of voice data from voice features to text features, the modality difference between the voice features extracted from the voice data and the text features required for voice relation extraction is alleviated. Compared with the related art, which merely converts the voice features into features of the same length as the text features, the second neural network model provided by the present application can effectively alleviate the modality difference during the conversion from voice features to text features, thereby improving the accuracy of the text features obtained by modality conversion and, in turn, the accuracy of voice relation extraction.
Additional features and advantages of the disclosure will be set forth in the description which follows, and in part will be apparent from the description, or may be learned by practice of the disclosure. The objectives and other advantages of the disclosure will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the disclosed embodiments and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain, without limitation, the disclosed embodiments.
FIG. 1 is a system architecture diagram to which a speech relationship extraction method of an embodiment of the present disclosure is applied;
FIG. 2 is a schematic flow chart of a method for extracting a speech relationship provided in the present disclosure;
FIG. 3 is a schematic diagram of speech relationship extraction based on artificial intelligence techniques in the related art;
FIG. 4 is a schematic diagram of a scheme corresponding to the voice relation extraction method provided in the present disclosure;
FIG. 5 is a schematic illustration of the speech relationship extraction model training provided by the present disclosure;
FIG. 6 is another schematic illustration of the speech relationship extraction model training provided by the present disclosure;
FIG. 7 is another flow chart of the voice relationship extraction method provided by the present disclosure;
FIG. 8 is an overall schematic diagram of a method for extracting a speech relationship provided by the present disclosure;
FIG. 9 is a schematic structural diagram of a voice relation extraction apparatus according to an embodiment of the present disclosure;
FIG. 10 is a block diagram of a terminal implementing methods according to one embodiment of the present disclosure;
FIG. 11 is a block diagram of a server implementing methods according to one embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure more apparent, the present disclosure will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present disclosure.
Before proceeding to a further detailed description of the disclosed embodiments, the terms involved in the disclosed embodiments are described; the following explanations apply:
Speech relation extraction (Speech Relation Extraction, SpeechRE): the task of extracting relation triples from speech in order to better mine the knowledge contained in the speech.
Relation triple (Relation Triple, RT): a basic unit for representing the relationship between entities, generally used in fields such as knowledge graphs (Knowledge Graph, KG) and the semantic web (Semantic Web, SW). A relation triple consists of three parts: a head entity (Head Entity), a relation (Relation) and a tail entity (Tail Entity), and its general form is (head entity, relation, tail entity). The head entity represents one entity in the relationship, typically a particular object, concept or event. The relation represents the association between the head entity and the tail entity, and is typically a verb, noun or adjective describing the relationship between the entities. The tail entity represents the other entity in the relationship, corresponding to the head entity.
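For illustration only (the example entities and the relation name below are hypothetical, not taken from the patent), a relation triple can be modeled as a small data structure:

```python
from typing import NamedTuple

class RelationTriple(NamedTuple):
    """A relation triple: (head entity, relation, tail entity)."""
    head: str      # head entity, e.g. a particular object, concept or event
    relation: str  # the relation linking head and tail
    tail: str      # tail entity, corresponding to the head entity

# Example of a triple that speech relation extraction might produce
triple = RelationTriple(head="Tencent", relation="headquartered_in", tail="Shenzhen")
print(triple)
```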
Bidirectional and Auto-Regressive Transformers (BART): the BART model is a pre-trained model based on a sequence-to-sequence (Seq2Seq) structure that combines bidirectional and autoregressive Transformers and can be used for text generation and understanding. The BART model uses a standard Transformer-based neural machine translation architecture; its pre-training involves corrupting text with noising functions and learning a sequence-to-sequence model to reconstruct the original text. The main advantage of these pre-training steps is that the model can flexibly process the original input text and learn to reconstruct it efficiently. The BART model uses a bidirectional Transformer as the encoder and an autoregressive Transformer as the decoder.
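As a hedged sketch of how such a pre-trained BART model is typically obtained (assuming a Hugging Face Transformers-style checkpoint such as facebook/bart-base; the patent does not name a specific implementation or checkpoint):

```python
from transformers import BartTokenizer, BartModel

# Load a pre-trained BART checkpoint; the bidirectional encoder and the
# autoregressive decoder can be used separately, as in the scheme described below.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
bart = BartModel.from_pretrained("facebook/bart-base")

text_encoder = bart.get_encoder()   # bidirectional Transformer encoder
text_decoder = bart.get_decoder()   # autoregressive Transformer decoder

inputs = tokenizer("A sample sentence.", return_tensors="pt")
text_features = text_encoder(**inputs).last_hidden_state  # (1, seq_len, hidden)
```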
Relation extraction generally refers to extracting semantic relationships between entities from text data; however, as the amount of speech data grows rapidly, extracting relationships directly from speech (speech relation extraction) has also become a problem worth exploring. The input of speech relation extraction is the original audio, and the output is one or more relation triples, each representing a relationship between a pair of entities present in the speech. Currently, some methods first use automatic speech recognition (Automatic Speech Recognition, ASR) technology to convert the original audio into text, turning the speech relation extraction task into a text relation extraction task (extracting semantic relationships between entities from text). However, speech carries more than the text that ASR recognizes, such as emotion, volume and speaking-rate information. Converting the speech relation extraction task into a text relation extraction task via ASR therefore fails to capture the complete semantic information in the speech, and accurate relation triples cannot be extracted.
In the related art, methods that extract relations directly from voice data generally use a speech encoder to extract speech features from the voice data, perform length conversion on the extracted speech features through a convolutional neural network (Convolutional Neural Network, CNN) to obtain features whose length is suitable for a text decoder, and then use the text decoder to decode the length-converted features to obtain the relation triples of the voice data. However, because the speech encoder and the text decoder are pre-trained on data of different modalities, there is a significant modality difference (gap) between the speech features output by the speech encoder and the input features required by the text decoder. The CNN adopted in the related art only compresses the length of the speech features so that they can be processed by the text decoder and cannot alleviate this modality gap, so the accuracy of the relation triples extracted by such methods is not high.
In order to solve the problem of low accuracy of voice relation extraction in the above-mentioned scene, the present disclosure provides a voice relation extraction method, which can improve the accuracy of voice relation extraction.
Fig. 1 is a system architecture diagram to which a voice relationship extraction method according to an embodiment of the present disclosure is applied. It includes a terminal 140, the internet 130, a gateway 120, a server 110, etc.
The terminal 140 may take the form of various devices having a display screen, including a desktop computer, a laptop computer, a PDA (personal digital assistant), a mobile phone, a vehicle-mounted terminal, a home theater terminal, a dedicated terminal, an intelligent voice interaction device, an intelligent home appliance, an aircraft, and the like. In addition, it can be a single device or a set of multiple devices. The terminal 140 may communicate with the internet 130 in a wired or wireless manner to exchange data. In the disclosed embodiment, the terminal 140 may be loaded with a client of the distributed database, or with other applications that can access the distributed data.
Server 110 refers to a computer system that can provide certain services to terminal 140. The server 110 is required to have higher stability, security, performance, etc. than the general terminal 140. The server 110 may be one high-performance computer in a network platform, a cluster of multiple high-performance computers, a portion of one high-performance computer (e.g., a virtual machine), a combination of portions of multiple high-performance computers (e.g., virtual machines), etc. In the embodiment of the present disclosure, the server 110 specifically provides a storage function, that is, the server 110 may be a database node of a distributed database, where a plurality of servers 110 form a cluster of the distributed database.
The gateway 120 is also known as an inter-network connector or protocol converter. It implements network interconnection on the transport layer and is a computer system or device that acts as a translator between two systems that use different communication protocols, data formats or languages, or even entirely different architectures. The gateway may also provide filtering and security functions. Messages sent by the terminal 140 to the server 110 pass through the gateway 120 to reach the corresponding server 110, and messages sent by the server 110 to the terminal 140 likewise pass through the gateway 120 to reach the corresponding terminal 140. In the embodiment of the present disclosure, the terminal 140 sends a data access request to the server 110 through the gateway 120, and the server 110 returns the data access result to the terminal 140 through the gateway 120.
The voice relation extraction method provided by the embodiment of the present disclosure may be implemented in the terminal 140 or may be implemented in the server 110; in some embodiments, the voice relationship extraction method may be implemented in part in terminal 140 and in part in server 110.
When the voice relation extraction method provided by the embodiment of the present disclosure is implemented in the terminal 140, the terminal 140 obtains target voice data to be subjected to voice relation extraction; voice feature extraction is then performed on the target voice data based on a first neural network model deployed in the terminal 140 to obtain voice features; the voice features are input into a second neural network model deployed in the terminal 140 for feature modality conversion to obtain text features; and feature decoding is performed on the text features based on a third neural network model deployed in the terminal 140 to obtain the voice relation text of the target voice data. The first neural network model, the second neural network model and the third neural network model are obtained by joint training based on a target loss, and the target loss comprises a first loss and a second loss; the first loss is calculated based on a first sample text feature and a second sample text feature, where the first sample text feature is obtained by extracting a voice feature from sample voice through the first neural network model and then performing feature modality conversion through the second neural network model, and the second sample text feature is obtained by extracting a text feature from sample text corresponding to the sample voice; the second loss is calculated based on a predicted voice relation text and a voice relation label, where the predicted voice relation text is obtained by performing feature decoding on the first sample text feature through the third neural network model.
When the voice relation extraction method provided by the embodiment of the present disclosure is implemented in the server 110, the server 110 acquires target voice data to be subjected to voice relation extraction; voice feature extraction is then performed on the target voice data based on a first neural network model deployed in the server 110 to obtain voice features; the voice features are input into a second neural network model deployed in the server 110 for feature modality conversion to obtain text features; and feature decoding is performed on the text features based on a third neural network model deployed in the server 110 to obtain the voice relation text of the target voice data. The first neural network model, the second neural network model and the third neural network model are obtained by joint training based on a target loss comprising a first loss and a second loss, as described above.
When the voice relation extraction method provided in the embodiment of the present disclosure is implemented partly in the terminal 140 and partly in the server 110, the terminal 140 obtains target voice data to be subjected to voice relation extraction and sends the obtained target voice data to the server 110 for voice relation extraction; the server 110 then performs voice feature extraction on the target voice data based on the first neural network model deployed in the server 110 to obtain voice features, inputs the voice features into the second neural network model deployed in the server 110 for feature modality conversion to obtain text features, and performs feature decoding on the text features based on the third neural network model deployed in the server 110 to obtain the voice relation text of the target voice data. The first neural network model, the second neural network model and the third neural network model are obtained by joint training based on a target loss comprising a first loss and a second loss, as described above. Finally, the server 110 transmits the extracted voice relation text to the terminal 140.
The voice relation extraction method provided by the embodiment of the disclosure can be applied to various scenes such as a voice recognition scene, a voice analysis scene, an intelligent assistant voice interaction scene and the like.
For example, when the voice relation extraction method provided by the application is used in a voice recognition scene, voice relations can be extracted directly from received speech (such as dialogue speech, conference speech or interview speech) to obtain one or more groups of relation triples, and the recognition results produced by ASR technology can then be guided or corrected based on these relation triples, so that more accurate voice recognition results can be obtained, such as more accurate dialogue transcripts, conference records and interview records.
The above examples do not limit the scope of protection of the present disclosure.
According to one embodiment of the present disclosure, a speech relationship extraction method is provided. Fig. 2 is a schematic flow chart of a voice relation extraction method provided in the present disclosure. The method may be applied to a speech relation extraction device, which may be integrated in a computer device, which may in particular be a terminal. The voice relation extraction method may include:
Step 210, obtaining target voice data to be subjected to voice relation extraction.
In the voice relation extraction method provided by the embodiment of the present disclosure, as described above, the task to be processed is extracting relation triples from speech, where a relation triple includes entities and the relations between entities. Thus, the target voice data to be subjected to voice relation extraction referred to in the present application may specifically be human-generated voice data or simulated human voice data, and more specifically spoken voice data. That is, the voice relation extraction method provided by the application can also be called a spoken-language relation extraction method (SpeechRE).
The target voice data may include spoken data related to different fields of different scenes, such as chat spoken data in a daily conversation scene, conference spoken data in an academic conference scene, narrative spoken data in an event narrative scene, spoken data read aloud in an audio book application, spoken data in a debate event, spoken data generated by an intelligent voice assistant, and so on. The above scenario is merely an example, and the object processed by the voice relation extraction method provided by the present application is not limited, and in the present case, the target voice data to be subjected to voice relation extraction may also be spoken data generated in other scenarios.
The target voice data to be extracted in the voice relation extraction method can be obtained through a plurality of possible channels. For example, the target voice data to be subjected to voice relation extraction may be acquired by an active acquisition manner. For example, in academic conference scenes, conference spoken language data can be collected in real time through a voice collection device, and then the collected conference spoken language data is subjected to real-time voice relation extraction. Or the spoken language materials can be obtained from some preset spoken language databases, and then the spoken language materials are subjected to voice relation extraction. For another example, the target voice data to be subjected to voice relation extraction may be acquired by a passive reception manner. For example, an application dedicated to providing a voice relation extraction service may provide a spoken language input interface through which a user object of the application can input target voice data to be subjected to voice relation extraction into the application. After receiving the target voice data input by the application, the voice relation extraction method provided by the application can be adopted to extract the voice relation of the received target voice data.
Step 220, extracting voice features of the target voice data based on the first neural network model to obtain the voice features.
The embodiment of the disclosure provides a voice relation extraction method, in particular a voice relation extraction method based on artificial intelligence. Compared with artificial-intelligence-based voice relation extraction methods in the related art, the present embodiment specifically improves the conversion from speech features to text features, solving the problem in the related art that the finally extracted relation triples are not accurate enough because the modality gap between the two kinds of features is not alleviated during this conversion.
FIG. 3 is a schematic diagram of speech relation extraction based on artificial intelligence technology in the related art. As shown in the figure, in the related art, after the voice data to be subjected to voice relation extraction is obtained, a speech encoder 310 (also called the first neural network model) is used to extract speech features from the voice data; a length adapter 320 (a CNN) is then used to perform length conversion on the extracted speech features to obtain converted features matching the input feature length of a text decoder 330 (also called the third neural network model); the converted features are then input into the text decoder 330 for decoding to obtain the relation triples. Because the speech encoder 310 is typically pre-trained on speech data while the text decoder 330 is typically pre-trained on text data, that is, the two are pre-trained on data of different modalities, there is a significant modality gap between the features the two process. If only a length adapter is used to adapt the length of the features output by the encoder to the features expected by the decoder, this modality gap cannot be alleviated; the converted features obtained after length adaptation then cannot accurately express the relational information in the voice data, and the extracted relation triples are consequently not accurate enough.
To alleviate, to some extent, the modality gap between the speech features output by the speech encoder 310 and the text features required as input by the text decoder 330, embodiments of the present disclosure provide an alignment adapter and a new training method. FIG. 4 is a schematic diagram of the scheme corresponding to the voice relation extraction method provided in the present disclosure. As shown in the figure, the voice relation extraction method provided by the application replaces the length adapter 320 of the related art with an alignment adapter 410 (also called the second neural network model), and additionally introduces a text encoder 420 (also called the fourth neural network model) during model training. The text encoder 420 and the text decoder 330 can be matched models, that is, the modality of the features output by the text encoder 420 is consistent with the modality of the features required as input by the text decoder 330. Specifically, for example, the text encoder 420 in embodiments of the present disclosure may be the bidirectional encoder (Bidirectional Encoder) of the BART model and the text decoder 330 may be the autoregressive decoder (Autoregressive Decoder) of the BART model. During model training, a loss is constructed between the converted features obtained by the alignment adapter 410 from the speech features extracted by the speech encoder 310 and the features output by the text encoder 420, so as to reduce the difference between the features converted by the alignment adapter 410 and those produced by the text encoder. In this way the alignment adapter 410 learns to alleviate the modality gap between the output features of the speech encoder 310 and the input features required by the text decoder 330, which in turn improves the accuracy of voice relation extraction. The speech encoder 310 may specifically be a pre-trained wav2vec 2.0 model.
The specific structure of the alignment adapter 410 provided in the embodiments of the present disclosure may be a CNN structure, or a combined structure including a CNN and other network layers. After the above model structure is constructed and the corresponding training losses are designed, training sample data may be obtained to jointly train the speech encoder 310, the alignment adapter 410, the text decoder 330 and the text encoder 420. Once training is completed and the models are deployed online, they can be used to process voice relation extraction tasks. Specifically, when the trained models are deployed online, after the target voice data to be subjected to voice relation extraction is obtained, the trained first neural network model (i.e., the speech encoder 310) may first be used to extract the voice features of the target voice data, so as to obtain the corresponding voice features.
Step 230, inputting the voice features into the second neural network model for feature modality conversion to obtain text features.
Further, the trained second neural network model (i.e., the alignment adapter 410) may be used to perform feature modality conversion on the speech features extracted by the first neural network model, converting them into the text features required as input by the third neural network model (i.e., the text decoder 330).
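As an illustration of this inference pipeline (steps 220 and 230 above, followed by feature decoding with the third neural network model), the following sketch assumes simplified callable interfaces for the three trained models, a wrapper text_decoder_lm that returns vocabulary logits, and a greedy decoding loop; all names are assumptions, not the patent's:

```python
import torch

@torch.no_grad()
def extract_speech_relations(waveform, speech_encoder, alignment_adapter,
                             text_decoder_lm, tokenizer, max_len=64):
    """waveform: (1, num_samples) float tensor holding the target voice data."""
    speech_feats = speech_encoder(waveform)        # step 220: speech feature extraction
    text_feats = alignment_adapter(speech_feats)   # step 230: feature modality conversion

    # Feature decoding by the third neural network model (greedy, for illustration);
    # text_decoder_lm is assumed to take previously generated tokens plus the converted
    # text features as cross-attention memory and to return vocabulary logits.
    token_ids = [tokenizer.bos_token_id]
    for _ in range(max_len):
        logits = text_decoder_lm(torch.tensor([token_ids]),
                                 encoder_hidden_states=text_feats)
        next_id = int(logits[0, -1].argmax())
        if next_id == tokenizer.eos_token_id:
            break
        token_ids.append(next_id)
    return tokenizer.decode(token_ids[1:])          # linearized voice relation text
```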
In the embodiment of the disclosure, alignment adaptation of the speech features based on the trained second neural network model may be performed at multiple levels. In particular, multi-level alignment adaptation can be achieved through the design of the loss functions used to train the second neural network model. The training process of the model will be described in detail below, taking the overall network structure shown in FIG. 4 as an example.
In the embodiment of the present disclosure, the process of jointly training the speech encoder 310, the alignment adapter 410, the text decoder 330 and the text encoder 420 may specifically include the following steps:
acquiring training sample data, wherein the training sample data comprises a plurality of sample voices and voice relation labels corresponding to each sample voice;
extracting voice features of the sample voice based on the first neural network model to obtain sample voice features, and performing feature mode conversion on the sample voice features based on the second neural network model to obtain first sample text features;
Performing voice recognition on the sample voice to obtain a corresponding sample text, and extracting text features of the sample text based on a fourth neural network model to obtain second sample text features, wherein the output features of the fourth neural network model are consistent with the input feature sizes of the third neural network model;
calculating a connectionist temporal classification (CTC) loss according to the first sample text feature and the second sample text feature to obtain a first loss;
performing feature decoding on the first sample text feature based on the third neural network model to obtain a first predicted voice relation text, and calculating a second loss according to the first predicted voice relation text and the voice relation label;
Model parameters of the first, second, and third neural network models are updated based on the first and second losses.
In the embodiment of the disclosure, the training sample data may be constructed by taking input text and the corresponding label data from known text relation extraction datasets, synthesizing speech from the input text to obtain sample voice, and using the label data corresponding to that input text as the voice relation label of the sample voice. Thus, each set of training sample data includes the sample voice, the corresponding sample text (i.e., the input text) and the corresponding voice relation label.
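A minimal sketch of this data construction, assuming a generic text-relation-extraction dataset, a text-to-speech function, and a simple linearization of the relation triples into a label string (the tag names and the TTS interface are illustrative assumptions, not specified by the patent):

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class TrainingSample:
    sample_speech: Any        # synthesized waveform for the input text
    sample_text: str          # the original input text
    relation_label: str       # linearized voice relation label

def linearize(triples):
    # e.g. [("Tencent", "headquartered_in", "Shenzhen")] ->
    #      "<head> Tencent <rel> headquartered_in <tail> Shenzhen"
    return " ".join(f"<head> {h} <rel> {r} <tail> {t}" for h, r, t in triples)

def build_samples(text_re_dataset, tts):
    """text_re_dataset: iterable of (input_text, relation_triples); tts: text -> waveform."""
    return [
        TrainingSample(sample_speech=tts(text),
                       sample_text=text,
                       relation_label=linearize(triples))
        for text, triples in text_re_dataset
    ]
```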
When training the models on the acquired sample data, the speech encoder first performs speech feature extraction on the sample voice, and the alignment adapter then performs modality conversion on the extracted speech features to obtain the first sample text features. Meanwhile, the text encoder extracts text features from the sample text corresponding to the sample voice, so as to obtain the second sample text features. As described above, the text encoder and the text decoder are matched models, and the features output by the text encoder are consistent in dimension with the features required as input by the text decoder.
Further, a first loss may be calculated using the first sample text features output by the alignment adapter and the second sample text features output by the text encoder. Specifically, the first loss here may be a connectionist temporal classification (Connectionist Temporal Classification, CTC) loss (L_CTC). CTC is a way to avoid manual alignment of inputs and outputs and is well suited to relating speech features to text features. For example, given an input sequence X = [x1, x2, ..., xn] and corresponding label data Y = [y1, y2, ..., ym] (for instance an audio file and its text label in speech recognition), finding a mapping from X to Y is a temporal classification problem. CTC gives the output distribution over all possible Y for a given X, from which the most probable result can be output, or the probability of a particular output can be obtained. For the CTC loss, the goal is to maximize the posterior probability P(Y|X) of Y given the input sequence X. In the embodiment of the present application, for a batch of training samples, the text encoder may first be used to extract the text features corresponding to the sample text of each sample in the batch, where the text features are word (token) features. All token features in the batch are then concatenated, and a dot product is computed between each first sample text feature output by the alignment adapter and the concatenated features, yielding an alignment distribution between the first sample text features and the second sample text features in the batch. The CTC loss can then be calculated based on this alignment distribution, and token-level feature alignment (Token-Level Alignment) is achieved by minimizing the CTC loss.
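A hedged PyTorch-style sketch of this token-level CTC alignment loss. The use of torch.nn.functional.ctc_loss, the handling of the CTC blank via an extra learnable embedding, and all tensor names are assumptions; the patent only specifies a CTC loss computed from the dot-product alignment distribution over the batch's token features:

```python
import torch
import torch.nn.functional as F

def token_level_ctc_loss(adapter_feats, adapter_lens, token_feats_list, blank_feat):
    """
    adapter_feats:    (B, T, H) first sample text features from the alignment adapter
    adapter_lens:     (B,) valid frame counts per sample
    token_feats_list: list of (L_i, H) token features from the text encoder, one per sample
    blank_feat:       (H,) learnable embedding acting as the CTC blank "token" (assumption)
    """
    # Concatenate all token features in the batch; the blank occupies index 0.
    vocab = torch.cat([blank_feat.unsqueeze(0)] + token_feats_list, dim=0)   # (V, H)
    # Dot product -> alignment distribution of every adapter frame over batch tokens.
    log_probs = F.log_softmax(adapter_feats @ vocab.t(), dim=-1)             # (B, T, V)

    # Targets: each sample's tokens occupy a consecutive index range inside `vocab`.
    targets, target_lens, offset = [], [], 1
    for feats in token_feats_list:
        n = feats.size(0)
        targets.append(torch.arange(offset, offset + n))
        target_lens.append(n)
        offset += n
    targets = torch.cat(targets)

    return F.ctc_loss(log_probs.transpose(0, 1), targets,
                      adapter_lens, torch.tensor(target_lens), blank=0)
```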
In addition to minimizing the difference between the first and second sample text features, the difference between the voice relation prediction output by the text decoder and the voice relation label also needs to be minimized during training; that is, a classification loss needs to be considered in addition to the CTC loss. Specifically, the first sample text features may be feature-decoded by the text decoder to obtain a first predicted voice relation text, and the second loss is then calculated from the first predicted voice relation text and the voice relation label. The second loss here may be the cross entropy between the first predicted voice relation text and the voice relation label (L_CE(s)).
For a batch of samples, after the first loss and the second loss are calculated, the model parameters of the speech encoder, the alignment adapter and the text decoder can be updated based on the two losses. Another batch of samples can then be obtained for the next round of iterative training, until a preset iteration termination condition is met and training of the models is completed.
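Putting the two losses together, one training iteration on a batch could look like the following sketch. The optimizer setup, the token_level_ctc_loss helper from the previous sketch, the hypothetical blank_embedding attribute on the adapter, and feeding the linearized relation label to a logits-returning decoder wrapper with teacher forcing are all illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def train_step(batch, speech_encoder, alignment_adapter, text_encoder,
               text_decoder_lm, optimizer):
    # First sample text features: speech encoder -> alignment adapter
    speech_feats = speech_encoder(batch["sample_speech"])          # (B, T, H)
    first_feats = alignment_adapter(speech_feats)                  # (B, T, H)

    # Second sample text features: text encoder on each corresponding sample text
    second_feats = [text_encoder(ids.unsqueeze(0)).squeeze(0)      # list of (L_i, H)
                    for ids in batch["sample_text_ids"]]

    # First loss: token-level CTC alignment (see previous sketch)
    loss_ctc = token_level_ctc_loss(first_feats, batch["speech_lens"], second_feats,
                                    blank_feat=alignment_adapter.blank_embedding)

    # Second loss: cross entropy between the decoded prediction (first predicted
    # voice relation text) and the linearized voice relation label
    logits = text_decoder_lm(batch["label_input_ids"],
                             encoder_hidden_states=first_feats)    # (B, L, vocab)
    loss_ce = F.cross_entropy(logits.flatten(0, 1), batch["label_target_ids"].flatten())

    loss = loss_ctc + loss_ce   # weighting of the two terms is discussed below
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```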
In some embodiments, updating model parameters of the first, second, and third neural network models based on the first and second losses includes:
acquiring a first weight coefficient corresponding to the first loss and acquiring a second weight coefficient corresponding to the second loss;
weighting calculation is carried out on the first loss and the second loss based on the first weight coefficient and the second weight coefficient, so that a first target loss is obtained;
model parameters of the first neural network model, the second neural network model and the third neural network model are updated according to the first target loss.
In embodiments of the present disclosure, when updating the parameters of the speech encoder, the alignment adapter and the text decoder based on the first loss and the second loss, the first target loss may be calculated from the first loss and the second loss, and the parameters of the speech encoder, the alignment adapter and the text decoder may then be updated based on the first target loss. When calculating the first target loss from the first loss and the second loss, different weight coefficients can be set for the two losses to adjust the importance of each loss to the training effect.
In some embodiments, the weight coefficients corresponding to the first loss and the second loss may be adjusted dynamically at different stages of training: in the initial stage, the model focuses more on the difference between the overall predicted voice relation and the voice relation label, so that it learns an accurate voice relation extraction capability; in the middle stage, the weight of the CTC loss between the first sample text features and the second sample text features is gradually increased, so that the model gradually learns a good modality alignment capability. This improves the stability of the overall training process.
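One possible weight schedule in this spirit is sketched below; the warm-up fraction, the ramp shape and all constants are illustrative assumptions rather than values given by the patent:

```python
def loss_weights(epoch, total_epochs, w_ce=1.0, w_ctc_max=1.0, warmup_frac=0.3):
    """Return (first_weight, second_weight), i.e. the CTC and cross-entropy weights.

    Early in training the cross-entropy (relation prediction) term dominates;
    after a warm-up period the CTC weight is ramped up linearly.
    """
    warmup = int(total_epochs * warmup_frac)
    if epoch < warmup:
        w_ctc = 0.1 * w_ctc_max
    else:
        progress = (epoch - warmup) / max(1, total_epochs - warmup)
        w_ctc = w_ctc_max * min(1.0, 0.1 + 0.9 * progress)
    return w_ctc, w_ce

# First target loss = first_weight * first loss + second_weight * second loss
w1, w2 = loss_weights(epoch=5, total_epochs=50)
# target_loss = w1 * loss_ctc + w2 * loss_ce
```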
As described above, the CTC loss achieves token-level alignment between the features output by the alignment adapter and the text features output by the text encoder. However, text relation extraction depends not only on token-level information but also on entity-level and sentence-level information. The present disclosure therefore also designs additional losses during model training so that the models can learn entity-level and sentence-level alignment.
In some embodiments, after performing speech recognition on the sample speech to obtain a corresponding sample text and performing text feature extraction on the sample text based on the fourth neural network model to obtain the second sample text feature, the method further includes:
identifying an entity text in the sample text, and identifying an entity feature corresponding to the entity text in the second sample text feature;
Performing modal transformation on the entity features in the second sample text features based on the first sample text features to obtain mixed modal features;
Performing feature decoding on the mixed mode features based on the third neural network model to obtain a second predicted voice relation text, and performing feature decoding on the second sample text features based on the third neural network model to obtain a third predicted voice relation text;
Calculating a third loss according to the second predicted speech relationship text and the third predicted speech relationship text;
updating model parameters of the first, second, and third neural network models based on the first and second losses, comprising:
Model parameters of the first, second, and third neural network models are updated based on the first, second, and third losses.
In the disclosed embodiments, a third loss (L_KL(t→m)) is designed to constrain the model training process, so that the alignment adapter can alleviate the modality gap between the output of the speech encoder and the input of the text decoder at the entity level. Specifically, during model training, for each training sample in the batch, when the second sample text features output by the text encoder have been calculated, the entity text in the sample text is identified, and the entity features corresponding to the entity text are identified within the second sample text features. There are many existing studies on entity recognition in text; this is not the focus of the present application and is not described in detail here. Text features are feature sequences organized in word units with clear boundaries between words, so once the entity text is determined, the entity features can easily be located within the second sample text features.
Then, based on the identified entity features, the first sample text features output by the alignment adapter and the second sample text features output by the text encoder can be fused at the entity level to obtain the mixed-modality features. Specifically, the fusion performs modality conversion on the entity features in the second sample text features, converting them from the text modality to the speech modality.
Further, a text decoder may be used to perform feature decoding on the mixed mode feature and the second sample text feature, respectively, to obtain a second predicted speech relationship text and a third predicted speech relationship text, respectively. The cross entropy between the two can then be calculated to obtain a third loss.
In this way, by updating the parameters of the speech encoder, the alignment adapter and the text decoder with the first, second and third losses designed in this embodiment, the alignment adapter learns to alleviate the modal gap between speech features and text features at both the word level and the entity level. Performing speech relation extraction with the speech encoder, alignment adapter and text decoder trained in this embodiment can therefore further improve the accuracy of speech relation extraction.
Fig. 5 is a schematic diagram of the training of the speech relation extraction model provided in the present disclosure. As shown, after modal conversion is performed on the entity features in the second sample text features based on the first sample text features to obtain the mixed modality features, the mixed modality features and the second sample text features may be respectively input into the text decoder 330 for feature decoding to obtain a second predicted value (i.e., the second predicted speech relationship text) and a third predicted value (i.e., the third predicted speech relationship text); the cross entropy is then calculated based on the second predicted value and the third predicted value to obtain the third loss.
In some embodiments, performing modal transformation on the entity features in the second sample text features based on the first sample text features to obtain mixed modal features, including:
Performing attention calculation on each entity feature and the first sample text features to obtain an entity speech feature corresponding to each entity;
And replacing the entity characteristic corresponding to each entity in the second sample text characteristic with the corresponding entity voice characteristic to obtain the mixed mode characteristic.
In the embodiment of the disclosure, entity-level alignment is achieved by replacing the entity representation in the text feature sequence (i.e., the second sample text features) with its corresponding speech representation while keeping the output distribution of the decoder unchanged. However, since speech features are continuous and have no word boundaries, the speech feature corresponding to each entity cannot simply be read off from the speech features (i.e., the features output by the alignment adapter). In the disclosed embodiment, an attention calculation is therefore performed between the entity features of each entity and the first sample text features output by the alignment adapter, and the entity speech features of each entity (i.e., the speech representation of each entity) are determined from the attention results.
Further, the entity features corresponding to each entity in the second sample text features can be replaced with the corresponding entity speech features, thereby realizing modal fusion at the entity feature level and obtaining the mixed modality features.
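For illustration, this entity-level attention and replacement can be sketched as follows; the function name, tensor shapes and the scaled dot-product form are assumptions for the sketch and are not taken from the disclosure.

```python
import torch
import torch.nn.functional as F

def build_mixed_modality_features(text_feats, speech_feats, entity_spans):
    """Replace entity positions in the text feature sequence with
    attention-pooled speech representations (illustrative only).

    text_feats:   (T_t, d) second sample text features from the text encoder
    speech_feats: (T_s, d) first sample text features from the alignment adapter
    entity_spans: list of (start, end) index ranges of entities in text_feats
    """
    d = text_feats.size(-1)
    mixed = text_feats.clone()
    for start, end in entity_spans:
        ent = text_feats[start:end]                                # (L_e, d) entity features
        attn = F.softmax(ent @ speech_feats.T / d ** 0.5, dim=-1)  # (L_e, T_s) attention weights
        mixed[start:end] = attn @ speech_feats                     # entity speech features
    return mixed
```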
In some embodiments, after feature decoding the mixed mode feature based on the third neural network model to obtain the second predicted speech relationship text, and feature decoding the second sample text feature based on the third neural network model to obtain the third predicted speech relationship text, the method further includes:
calculating to obtain a fourth loss according to the second predicted voice relation text and the voice relation label, and calculating to obtain a fifth loss according to the third predicted voice relation text and the voice relation label;
Updating model parameters of the first, second, and third neural network models based on the first, second, and third losses, comprising:
Calculating a second target loss based on the first loss, the second loss, the third loss, the fourth loss, and the fifth loss;
and updating model parameters of the first neural network model, the second neural network model and the third neural network model according to the second target loss.
In the embodiment of the disclosure, after the second predicted speech relation text and the third predicted speech relation text are computed for a batch of samples in the model training stage, not only can the cross entropy loss between them be calculated, but a fourth loss and a fifth loss can further be designed based on the cross entropy between each of them and the speech relation label. Minimizing the fourth loss and the fifth loss brings the predicted speech relations obtained by decoding the text-encoder features and the mixed modality features closer to the sample label, which in turn improves the alignment adapter's ability to alleviate the modal gap between speech features and text features and further improves the accuracy of speech relation extraction.
Specifically, the fourth loss (L_CE(m)) may be calculated as the cross entropy between the second predicted speech relationship text and the speech relation label, and the fifth loss (L_CE(t)) as the cross entropy between the third predicted speech relationship text and the speech relation label.
Then, a joint loss may be calculated from the first loss, the second loss and the third loss described above together with the fourth and fifth losses of this embodiment, resulting in the second target loss. The model parameters may then be updated based on the second target loss.
In some embodiments, after speech recognition is performed on the sample speech to obtain a corresponding sample text and text feature extraction is performed on the sample text based on the fourth neural network model to obtain the second sample text features, the method further includes:
inputting the first sample text feature into a fifth neural network model for feature compression to obtain a first sentence feature, and inputting the second sample text feature into the fifth neural network model for feature compression to obtain a second sentence feature;
performing semantic projection on the first sentence feature based on the sixth neural network model to obtain a third sentence feature, and performing semantic projection on the second sentence feature based on the sixth neural network model to obtain a fourth sentence feature;
calculating a sixth loss according to the third statement feature and the fourth statement feature;
Calculating a second target loss based on the first loss, the second loss, the third loss, the fourth loss, and the fifth loss, comprising:
the second target loss is calculated from the first loss, the second loss, the third loss, the fourth loss, the fifth loss, and the sixth loss.
In the disclosed embodiment, a sixth loss is further designed to train the alignment adapter's ability to alleviate the modal gap between speech features and text features at the sentence level. Specifically, a fifth neural network model for feature compression and a sixth neural network model for semantic projection are further introduced. The fifth neural network model may specifically be a semantic compression layer, and the sixth neural network model may specifically be a semantic projection layer. The first sample text features output by the alignment adapter and the second sample text features output by the text encoder are then respectively input into the fifth neural network model for feature compression, yielding the corresponding first sentence feature and second sentence feature. The first sentence feature and the second sentence feature are then respectively input into the sixth neural network model for semantic projection, yielding the correspondingly output third sentence feature and fourth sentence feature.
In some embodiments, as a specific way of performing semantic compression on the first sample text features with the fifth neural network model to obtain the first sentence feature, the first sample text features may first be semantically compressed by the fifth neural network model to obtain a compressed feature, and the compressed feature may then be spliced with the first sample text features to obtain the first sentence feature. Likewise, the second sentence feature may be obtained by splicing the second sample text features with their corresponding semantic compression feature.
In some embodiments, the semantic compression layer and the semantic projection layer may also be included in the alignment adapter. In this case, the CNN in the alignment adapter processes the speech features output by the speech encoder, and the processed speech features are then sent to the semantic compression layer and the semantic projection layer to obtain the third sentence feature.
Further, a sixth loss may be constructed from the third sentence feature and the fourth sentence feature. Specifically, the L2 distance between the feature vector corresponding to the third sentence feature and the feature vector corresponding to the fourth sentence feature may be calculated to obtain the sixth loss (L_L2); minimizing this L2 distance enables the alignment adapter to learn sentence-level alignment.
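A minimal sketch of this sentence-level branch is given below, assuming simple linear layers for the semantic compression and semantic projection layers and mean pooling to obtain a sentence vector; the module shapes and the pooling choice are assumptions, not details from the disclosure.

```python
import torch
import torch.nn as nn

class SentenceAlignHead(nn.Module):
    """Hypothetical semantic compression + semantic projection head."""
    def __init__(self, dim=768):
        super().__init__()
        self.compress = nn.Linear(dim, dim)      # stands in for the semantic compression layer
        self.project = nn.Linear(2 * dim, dim)   # stands in for the semantic projection layer

    def forward(self, feats):                    # feats: (T, dim) token-level features
        pooled = self.compress(feats.mean(dim=0, keepdim=True))        # compressed feature
        spliced = torch.cat([pooled.expand_as(feats), feats], dim=-1)  # splice with original features
        return self.project(spliced).mean(dim=0)                       # sentence feature, shape (dim,)

head = SentenceAlignHead()
third_sentence = head(torch.randn(42, 768))    # from the first sample text features (speech side)
fourth_sentence = head(torch.randn(27, 768))   # from the second sample text features (text side)
loss_l2 = torch.dist(third_sentence, fourth_sentence, p=2)  # sixth loss: L2 distance
```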
Fig. 6 is another schematic diagram of the training of the speech relation extraction model provided by the present disclosure. As shown, after the alignment adapter converts the speech features extracted from the speech data by the speech encoder into the first sample text features, the semantic compression layer 610 and the semantic projection layer 620 may be used to perform semantic compression and projection on the first sample text features to obtain the third sentence feature; similarly, for the second sample text features extracted from the text data by the text encoder, the semantic compression layer 610 and the semantic projection layer 620 may also be used to perform semantic compression and projection to obtain the fourth sentence feature. The L2 distance between the feature vectors corresponding to the third sentence feature and the fourth sentence feature is then calculated to obtain the sixth loss, which is combined with the previous five losses to update the parameters of the speech encoder, the alignment adapter and the text decoder. Training with these six losses achieves feature alignment at the word level, the entity level and the sentence level, which greatly alleviates the modal gap between audio features and text features and thus greatly improves the accuracy of speech relation extraction.
In some embodiments, after performing semantic projection on the first sentence feature based on the sixth neural network model to obtain the third sentence feature, and performing semantic projection on the second sentence feature based on the sixth neural network model to obtain the fourth sentence feature, the method further includes:
Performing feature decoding on the third sentence feature based on the third neural network model to obtain a fourth predicted speech relation text, and performing feature decoding on the fourth sentence feature based on the third neural network model to obtain a fifth predicted speech relation text;
calculating a seventh loss according to the fourth predicted speech relationship text and the fifth predicted speech relationship text;
Calculating a second target loss from the first, second, third, fourth, fifth, and sixth losses, comprising:
Calculating a second target loss according to the first loss, the second loss, the third loss, the fourth loss, the fifth loss, the sixth loss and the seventh loss.
The sixth loss aligns the feature vectors of the speech sentence feature and the text sentence feature in terms of L2 distance. To further improve the alignment between the speech sentence feature and the text sentence feature, a seventh loss can be designed to constrain the alignment between the predicted values obtained when the text decoder decodes the speech sentence feature and the text sentence feature.
Specifically, after the third sentence feature and the fourth sentence feature are obtained, the text decoder may further be used to perform feature decoding on them to obtain a corresponding fourth predicted speech relationship text and fifth predicted speech relationship text, respectively. The seventh loss (L_KL(t→m)) can then be obtained by calculating the cross entropy between the fourth predicted speech relationship text and the fifth predicted speech relationship text. A second target loss is then calculated based on these seven losses, and the parameters of the speech encoder, the alignment adapter and the text decoder are updated based on the second target loss.
In some embodiments, calculating the second target loss from the first loss, the second loss, the third loss, the fourth loss, the fifth loss, the sixth loss, and the seventh loss comprises:
Acquiring training rounds;
calculating a third weight coefficient corresponding to each loss based on the training round;
and weighting and calculating the first loss, the second loss, the third loss, the fourth loss, the fifth loss, the sixth loss and the seventh loss based on the third weight coefficient corresponding to each loss to obtain a second target loss.
In the embodiment of the application, different weights may be designed for the seven losses to adjust their importance at different stages of model training and thereby control the stability of the training process. For example, among the above losses, the fifth loss, i.e., the cross entropy between the third predicted speech relationship text (predicted by the text decoder from the text features extracted by the text encoder) and the speech relation label, has a large influence on the stability of model training. This loss may therefore be given a greater weight at the beginning of model training while the other losses are given smaller weights; as training progresses, the weight of the fifth loss may be gradually reduced and the weights of the other losses increased.
Specifically, the current training round may be acquired at each round of training; the training round indicates the stage of model training. A weight coefficient corresponding to each loss (referred to here as a third weight coefficient) is then calculated according to the acquired training round, and the losses are weighted according to their third weight coefficients to obtain the second target loss.
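A round-dependent weighting of this kind can be sketched as follows; the schedule shape, the loss names and the numeric values are illustrative assumptions rather than values from the disclosure.

```python
def third_weight_coefficients(train_round, total_rounds,
                              w_text_start=1.0, w_text_end=0.3, w_align_max=1.0):
    """Give the text-branch cross entropy a large weight early on and phase in
    the alignment-oriented losses as training proceeds (illustrative only)."""
    progress = min(train_round / max(total_rounds, 1), 1.0)
    w_text = w_text_start + (w_text_end - w_text_start) * progress
    w_align = w_align_max * progress
    return {"ce_text": w_text, "ce_speech": w_align, "ce_mixed": w_align,
            "ctc": w_align, "kl_speech": w_align, "kl_mixed": w_align, "l2": w_align}

weights = third_weight_coefficients(train_round=3, total_rounds=20)
```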
In some embodiments, after speech recognition is performed on the sample speech to obtain a corresponding sample text and text feature extraction is performed on the sample text based on the fourth neural network model to obtain the second sample text features, the method further includes:
inputting the first text feature into a fifth neural network model for feature compression, and inputting the compressed feature into a sixth neural network model for semantic projection to obtain a fifth sentence feature;
Inputting the text features of the second sample into a fifth neural network model for feature compression, and inputting the compressed features into a sixth neural network model for semantic projection to obtain sixth sentence features;
calculating an eighth loss according to the fifth statement feature and the sixth statement feature;
updating model parameters of the first, second, and third neural network models based on the first and second losses, comprising:
Model parameters of the first, second, and third neural network models are updated based on the first, second, and eighth losses.
The foregoing embodiments describe the schemes of token-level alignment, combined token- and entity-level alignment, and combined token-, entity- and sentence-level alignment. The embodiment of the disclosure also provides a scheme that combines token-level and sentence-level alignment.
Specifically, by designing an eighth loss for sentence-level feature alignment alone, without considering entity-level feature alignment, the alignment adapter learns combined token- and sentence-level alignment. After the alignment adapter performs modal conversion on the speech features output by the speech encoder to obtain the first sample text features, the semantic compression layer performs feature compression on the first sample text features and the semantic projection layer performs semantic projection on the compressed features to obtain the fifth sentence feature; the second sample text features output by the text encoder are processed by the semantic compression layer and the semantic projection layer in the same way to obtain the sixth sentence feature. The eighth loss may then be calculated as the L2 distance between the fifth sentence feature and the sixth sentence feature, and the parameters of the speech encoder, the alignment adapter and the text decoder may be updated based on a third target loss constructed by combining the first loss, the second loss and the eighth loss.
In step 240, feature decoding is performed on the text features based on the third neural network model to obtain the speech relation text of the target speech data.
The above embodiments provide a number of loss designs for training the speech relation extraction model of this scheme, so as to obtain an alignment adapter capable of alleviating the modal gap between speech features and text features at different levels, and thereby a number of speech relation extraction models that improve the accuracy of speech relation extraction to different degrees. It is to be understood that the training of the speech relation extraction model may be performed in the speech relation extraction device that executes the speech relation extraction method provided in the present disclosure, or in other devices, which is not limited here.
After any of the above speech relation extraction models is obtained by training, it may be deployed online to perform speech relation extraction tasks. In the model inference stage, that is, when the trained speech relation extraction model is used to extract the speech relation of the target speech data, the text encoder 420 is not required. After the trained speech encoder 310 is used to extract speech features from the target speech data, the trained alignment adapter can be used directly to perform feature modal conversion on the speech features to obtain text features. The trained text decoder 330 can then be used to decode the text features, obtaining an accurate relation triplet corresponding to the target speech data.
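The inference-time pipeline can be sketched as below; the module interfaces (callable encoder and adapter, and a generate method on the decoder) are assumptions made for the sketch.

```python
import torch

@torch.no_grad()
def extract_speech_relations(waveform, speech_encoder, alignment_adapter, text_decoder):
    """Inference sketch: the text encoder is not used at this stage."""
    speech_feats = speech_encoder(waveform)             # speech feature sequence
    text_like_feats = alignment_adapter(speech_feats)   # feature modal conversion
    return text_decoder.generate(text_like_feats)       # autoregressive relation triples
```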
In some embodiments, before feature decoding is performed on the text feature based on the third neural network model to obtain the voice relation text of the target voice data, the method further includes:
performing language identification on the target voice data to obtain a language description text;
extracting text features of the language description text to obtain language text features;
splicing the language text features and the text features to obtain target text features;
feature decoding is carried out on the text features based on a third neural network model to obtain a voice relation text of target voice data, and the method comprises the following steps:
And performing feature decoding on the target text features based on the third neural network model to obtain the voice relation text of the target voice data.
The foregoing embodiments focus on the modal alignment between the speech features extracted from the speech (after modal conversion) and the text features extracted from the corresponding text. However, the text corresponding to the speech data may not express all the information in the speech; the speech data also contains information of other dimensions, such as language information. To further improve the accuracy of speech relation extraction, language recognition can be performed on the target speech data before the text features output by the alignment adapter are input into the text decoder, yielding a language description text. Text feature extraction can then be performed on the language description text based on the trained text encoder to obtain language text features. The language text features can further be spliced with the text features output by the alignment adapter to obtain the target text features, and the text decoder then performs feature decoding on the target text features to output the relation triples.
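The splicing of the language text features with the adapter output can be sketched as a simple concatenation along the sequence dimension; the ordering and interface below are assumptions.

```python
import torch

def build_target_text_features(language_text_feats, adapter_text_feats):
    """Concatenate the language text features with the text features output by
    the alignment adapter before decoding (illustrative only)."""
    return torch.cat([language_text_feats, adapter_text_feats], dim=0)  # (T_lang + T, d)
```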
It can be understood that if language text features are to be extracted from the speech data in the speech relation extraction stage, then in the training stage a language text feature extraction model may be set up and a corresponding loss designed, so that the features obtained by fusing the language text features with the text features output by the alignment adapter approach the speech relation label as closely as possible.
According to the voice relation extraction method provided by the embodiment of the disclosure, target voice data to be subjected to voice relation extraction is obtained; voice feature extraction is performed on the target voice data based on the first neural network model to obtain voice features; the voice features are input into the second neural network model for feature mode conversion to obtain text features; and feature decoding is performed on the text features based on the third neural network model to obtain the voice relation text of the target voice data. The first neural network model, the second neural network model and the third neural network model are obtained based on target loss joint training, and the target loss includes a first loss and a second loss. The first loss is calculated based on a first sample text feature and a second sample text feature, where the first sample text feature is obtained by extracting voice features of sample voice through the first neural network model and then performing feature mode conversion through the second neural network model, and the second sample text feature is obtained by extracting text features of the sample text corresponding to the sample voice. The second loss is calculated based on the predicted voice relation text and the voice relation label, where the predicted voice relation text is obtained by performing feature decoding on the first sample text feature with the third neural network model.
In the embodiment of the disclosure, the first loss, calculated from the first text features output by the second neural network model and the second text features extracted from the sample text corresponding to the sample speech, is used to guide the learning of the second neural network model, so that the features obtained when the second neural network model converts the speech features become closer to the text features extracted directly from the text corresponding to the speech. In this way, by training a second neural network model that can accurately convert the speech features of speech data into text features, the modal gap between the speech features extracted from the speech data and the text features required for speech relation extraction is alleviated. Compared with related art that merely converts the speech features into features of the same length as the text features, the second neural network model provided by the application can effectively alleviate the modal gap in the conversion from speech features to text features, improving the accuracy of the text features obtained by modal conversion and thus the accuracy of speech relation extraction.
Fig. 7 shows another flowchart of the speech relation extraction method provided by the present disclosure; the method is described in detail below with reference to the execution subject of each step. It specifically includes the following steps:
In step 701, a computer device builds a speech relationship extraction model and obtains training sample data.
In the embodiment of the present disclosure, a specific example is used to describe the speech relation extraction method in detail, including constructing the speech relation extraction model and training it. Applying the speech relation extraction model constructed in this method, and trained with the training procedure described here, to the speech relation extraction task can improve the accuracy of speech relation extraction. The computer device in the embodiment of the application can be a terminal or a server.
Fig. 8 is an overall schematic diagram of the speech relation extraction method provided in the present disclosure, showing the overall structure of the speech relation extraction model as well as the data flow and the losses constructed during model training. Specifically, the computer device may first construct a speech relation extraction model of the structure shown in fig. 8, which includes a speech encoder, a text encoder, an alignment adapter and a text decoder. The speech encoder may specifically adopt a pre-trained wav2vec2.0 model and aims to extract a speech feature sequence from the sample speech. The text encoder and the text decoder may specifically adopt the bidirectional encoder and the autoregressive decoder of a BART model; the text encoder aims to extract text features from the sample text, and the text decoder aims to autoregressively generate the relation triples contained in the audio based on the speech features, specifically the features obtained after modal conversion of the speech features. The alignment adapter specifically includes a convolutional neural network (CNN) 810, a semantic compression layer 610 and a semantic projection layer 620. The alignment adapter aims to project the speech features output by the speech encoder into the semantic space of the text decoder, thereby alleviating the modal gap between the speech encoder and the text decoder.
After the speech relation extraction model is constructed, training sample data can be obtained to train it. The training sample data can specifically be obtained by extracting sample texts and the corresponding label data from a text relation extraction data set. Then, for the sample text of each sample, a speech synthesis technique may be employed to synthesize the corresponding sample speech. In the embodiment of the disclosure, a real speaker can also be used to read the sample text into the corresponding speech, which further improves the performance of the trained speech relation extraction model in real scenarios. Table 1 gives a schematic of the data set employed in the present application.
Table 1 Schematic of the training sample data
In step 702, the computer device performs speech feature extraction on the sample speech based on the speech encoder to obtain sample speech features, and performs text feature extraction on the sample text based on the text encoder to obtain sample text features.
After the speech relation extraction model is constructed and training sample data is obtained, a batch of samples can be selected from the training sample data. The computer device can then extract speech features from the sample speech based on the speech encoder, and can extract text features from the sample text based on the text encoder.
In step 703, the computer device performs modal conversion on the sample speech features based on the CNN in the alignment adapter to obtain modal alignment features.
Further, the computer device may perform feature modal conversion, based on the CNN in the alignment adapter, on the sample speech features extracted by the speech encoder; specifically, the CNN may convert the feature length of the sample speech features toward the feature length of the sample text features, obtaining the modal alignment features. A CTC loss can then be constructed based on the modal alignment features and the sample text features to train the token-level alignment capability of the alignment adapter.
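The length-reducing CNN in the alignment adapter can be sketched with strided 1-D convolutions, as below; the kernel sizes, strides, layer count and dimensions are assumptions rather than parameters from the disclosure.

```python
import torch
import torch.nn as nn

class LengthAdapterCNN(nn.Module):
    """Sketch of the convolutional part of the alignment adapter: strided
    convolutions shrink the speech feature length toward the text length."""
    def __init__(self, dim=768, n_layers=2):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1) for _ in range(n_layers)
        )

    def forward(self, speech_feats):        # (T_s, dim) sample speech features
        x = speech_feats.T.unsqueeze(0)     # (1, dim, T_s)
        for conv in self.convs:
            x = torch.relu(conv(x))         # roughly halves the length per layer
        return x.squeeze(0).T               # (~T_s / 2**n_layers, dim) modal alignment features
```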
In step 704, the computer device updates the entity features in the sample text features based on the modality alignment features to obtain fused modality features.
Further, the computer device updates the entity features in the sample text features based on the modal alignment features; specifically, the entity features in the sample text features are converted into their corresponding speech feature representations, thereby obtaining the fused modality features.
Specifically, the speech representation of each word in the sample text may be obtained through an attention computation between its text representation and the speech-side feature sequence, for instance of the form

\tilde{h}^{s}_{i} = \mathrm{softmax}\big( h^{t}_{i} H_{s}^{\top} / \sqrt{d} \big)\, H_{s}

where H_t and H_s denote the feature sequences output by the text encoder and the speech encoder respectively (the speech-side sequence here referring specifically to the modal alignment features), h^{t}_{i} is the text representation of the i-th word, and d is the feature dimension. The representation of each entity in H_t is then replaced by its speech representation to construct the fused modality feature sequence. Based on the predicted values obtained by decoding the fused modality features and the sample text features, the L_KL(t→m) loss can be constructed to train the entity-level alignment capability of the alignment adapter.
In step 705, the computer device performs feature conversion on the modal alignment features, the fused modality features and the sample text features based on the semantic compression layer and the semantic projection layer in the alignment adapter, obtaining the corresponding modal alignment sentence feature, fused modality sentence feature and text sentence feature.
Further, after the modal alignment features, the fused modality features and the sample text features are obtained, they can be input into the semantic compression layer and the semantic projection layer of the alignment adapter for processing. The output of the semantic compression layer may specifically be the concatenation of the compressed result and the original features; the semantic projection layer then performs semantic projection on the spliced features, yielding the modal alignment sentence feature, the fused modality sentence feature and the text sentence feature.
An L_L2 loss can be constructed from the modal alignment sentence feature and the text sentence feature to train the sentence-level alignment capability of the alignment adapter.
In step 706, the computer device performs feature decoding on the modal alignment sentence feature, the fusion modal sentence feature, and the text sentence feature based on the text decoder to obtain a speech predicted value, a hybrid predicted value, and a text predicted value.
Further, after obtaining the modal alignment sentence feature, the fused modality sentence feature and the text sentence feature output by the alignment adapter, the computer device may input them into the text decoder for decoding to obtain a speech predicted value, a mixed predicted value and a text predicted value, respectively. The speech predicted value is the speech relation prediction made by the speech relation extraction model from the original sample speech; the mixed predicted value is the speech relation prediction made from the fused modality features; and the text predicted value is the speech relation prediction made from the sample text features.
Since text-based relation extraction generally achieves good results, the text-based relation extraction branch (which produces the text predicted value) is used as a teacher model to guide the training of the speech-based relation extraction model. That is, a loss is built based on the speech predicted value and the text predicted value to guide the training of the speech-based relation extraction model.
In step 707, the computer device calculates a first penalty based on the modality alignment feature and the sample text feature, a second penalty based on the modality alignment statement feature and the text statement feature, a third penalty based on the speech predictor and the text predictor, and a fourth penalty based on the hybrid predictor and the text predictor.
After the speech predicted value, the mixed predicted value and the text predicted value are calculated, model loss can be constructed according to the intermediate characteristics, the predicted value, the labels and the like.
Specifically, a first loss may first be constructed based on the modal alignment features and the sample text features; this first loss may specifically be a CTC loss. The sample text features of all samples in a batch can be spliced together, the dot product of each modal alignment feature with the spliced features can be computed to obtain an alignment distribution between the modal alignment features and the sample text features, and the CTC loss can then be calculated from this alignment distribution.
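A sketch of this CTC construction is shown below. Treating the batch-spliced text features as the "vocabulary" over which alignment probabilities are computed is the interpretation assumed here, and the blank handling is also an assumption.

```python
import torch
import torch.nn.functional as F

def ctc_alignment_loss(align_feats, spliced_text_feats, target_ids, blank_id=0):
    """Token-level CTC loss between modal alignment features and text features
    (illustrative sketch).

    align_feats:        (T_s, d) modal alignment features of one sample
    spliced_text_feats: (V, d)   spliced text features of the batch (row 0 = blank)
    target_ids:         (T_t,)   row indices of this sample's tokens in spliced_text_feats
    """
    logits = align_feats @ spliced_text_feats.T               # (T_s, V) alignment distribution
    log_probs = F.log_softmax(logits, dim=-1).unsqueeze(1)    # (T_s, 1, V); CTC expects (T, N, C)
    input_lengths = torch.tensor([align_feats.size(0)])
    target_lengths = torch.tensor([target_ids.size(0)])
    return F.ctc_loss(log_probs, target_ids.unsqueeze(0),
                      input_lengths, target_lengths, blank=blank_id)
```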
The computer device may calculate the second loss from the modal alignment sentence feature and the text sentence feature, specifically as the L2 distance between the corresponding feature vectors.
The computer device may also calculate a third loss based on the speech predicted value and the text predicted value, for instance of the form

L_{KL(t \to s)} = \mathrm{KL}\big(\mathrm{sg}(P(y \mid H_t)) \,\|\, P(y \mid H_s)\big)

where sg(·) is a gradient cut-off (stop-gradient) operation, KL(·) denotes the KL divergence, H_s denotes the speech feature sequence and H_t denotes the text feature sequence. This third loss is used to guide the training of the speech-based relation extraction model with the text-based relation extraction model.
The computer device may further calculate a fourth loss from the mixed predicted value and the text predicted value, for instance of the form

L_{KL(t \to m)} = \mathrm{KL}\big(\mathrm{sg}(P(y \mid H_t)) \,\|\, P(y \mid H_m)\big)

where H_m represents the fused modality features and H_t represents the sample text features.
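Both KL terms can be sketched as a standard stop-gradient distillation from the text branch, as below; the temperature-free form is an assumption.

```python
import torch
import torch.nn.functional as F

def kl_distill_loss(text_logits, student_logits):
    """L_KL(t->s) / L_KL(t->m) sketch: the text-branch distribution is detached
    (sg) and used as the teacher for the speech or mixed branch."""
    teacher = F.softmax(text_logits.detach(), dim=-1)   # sg(.): gradient cut-off
    student = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean")

# e.g. loss_kl_ts = kl_distill_loss(text_logits, speech_logits)
#      loss_kl_tm = kl_distill_loss(text_logits, mixed_logits)
```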
At step 708, the computer device calculates a fifth loss based on the speech predictor and the sample tag, calculates a sixth loss based on the hybrid predictor and the sample tag, and calculates a seventh loss based on the text predictor and the sample tag.
In addition to the four losses above, several further losses are constructed in the present application. Specifically, the computer device can calculate the fifth loss from the speech predicted value and the sample label, for instance as

L_{CE(s)} = \sum_{i} \mathrm{CE}\big(P(y_i \mid y_{<i}, H_s),\, y_i\big)

where CE denotes the cross entropy and y_i denotes the i-th token of the sample label.
The computer device may also calculate a sixth loss from the mixed predicted value and the sample label, for instance as

L_{CE(m)} = \sum_{i} \mathrm{CE}\big(P(y_i \mid y_{<i}, H_m),\, y_i\big)

where H_m denotes the fused modality feature sequence.
The computer device may also calculate a seventh loss from the text predicted value and the sample label, for instance as

L_{CE(t)} = \sum_{i} \mathrm{CE}\big(P(y_i \mid y_{<i}, H_t),\, y_i\big).
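The three cross-entropy terms share one form and can be sketched as follows (shapes and the padding convention are assumptions):

```python
import torch.nn.functional as F

def relation_ce_loss(decoder_logits, label_ids, ignore_id=-100):
    """L_CE(s) / L_CE(m) / L_CE(t) sketch: token-level cross entropy between the
    decoder output distribution and the relation-triple label sequence."""
    return F.cross_entropy(decoder_logits.view(-1, decoder_logits.size(-1)),
                           label_ids.view(-1), ignore_index=ignore_id)
```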
In step 709, the computer device calculates the target loss based on the seven losses, and iteratively updates the model parameters of the speech relation extraction model according to the target loss, to obtain a trained speech relation extraction model.
After the seven losses are calculated, the target loss may be further computed as a weighted combination of these seven losses, in which α and β are hyper-parameters controlling the relative weights of the loss terms. To further enhance the stability of model training, a curriculum learning strategy may be employed to dynamically adjust the values of these two hyper-parameters: the schedule depends on the number of training iterations t and a temperature parameter T, and α_max and β_max denote the maximum values that the two hyper-parameters can reach.
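Since the exact schedule is not reproduced in this text, the sketch below only illustrates one temperature-controlled ramp consistent with the description (rising smoothly from zero toward the maximum); the functional form is an assumption.

```python
import math

def curriculum_weight(t, temperature, w_max):
    """Hypothetical curriculum schedule: the weight grows with the number of
    training iterations t, controlled by a temperature, up to w_max."""
    return w_max * (1.0 - math.exp(-t / temperature))

alpha = curriculum_weight(t=1000, temperature=500.0, w_max=1.0)
beta = curriculum_weight(t=1000, temperature=500.0, w_max=1.0)
```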
After the target loss is calculated, gradients can be computed from it and backpropagated to iteratively update the model parameters of the speech relation extraction model until they converge, completing the training of the speech relation extraction model.
In step 710, the computer device deploys the trained speech relationship extraction model on-line.
After training of the speech relation extraction model is completed, the model can be deployed online. It can be deployed directly on the computer device that performed the training or on other terminals.
Specifically, when the speech relation extraction model is deployed online, only the speech encoder, the alignment adapter and the text decoder need to be deployed; the text encoder does not need to be deployed.
In step 711, the computer device obtains the target speech, performs feature extraction on the target speech based on the speech encoder, performs modal conversion on the extracted features using the alignment adapter, and performs feature decoding on the converted features using the text decoder to obtain the relation triplet.
After the trained speech relation extraction model is deployed online, it can perform speech relation extraction tasks as they are received. Specifically, when a speech relation extraction task is received, the target speech can be obtained, speech feature extraction is performed on it based on the speech encoder, the extracted features undergo modal conversion by the alignment adapter (specifically, they are mapped into the semantic space of the text decoder), and the text decoder then decodes the features to obtain the relation triplet corresponding to the target speech.
Table 2 below compares the effect of the speech relation extraction model provided by the present disclosure with related-art speech relation extraction models on the selected data set.
Table 2 Model effect comparison
As shown in the table, on data set 1 the speech relation extraction model provided by the present disclosure performs better than the related-art models on the ER, RP and RTE metrics.
It will be appreciated that, although the steps in the flowcharts described above are shown in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated in the present embodiment, the execution of these steps is not strictly limited in order, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple steps or stages that are not necessarily performed at the same moment but may be performed at different moments; these steps or stages are not necessarily executed sequentially, and may be performed in turn or alternately with at least a portion of the steps or stages in other steps.
In the various embodiments of the present disclosure, when related processing is performed according to data related to characteristics of a target object, such as attribute information or attribute information set of the target object, permission or consent of the target object is obtained first, and the collection, use, processing, etc. of the data complies with relevant laws and regulations and standards of the related region. In addition, when the embodiment of the application needs to acquire the attribute information of the target object, the independent permission or independent consent of the target object is acquired through a popup window or a jump to a confirmation page or the like, and after the independent permission or independent consent of the target object is explicitly acquired, the necessary target object related data for enabling the embodiment of the application to normally operate is acquired.
Fig. 9 is a schematic structural diagram of a voice relationship extraction apparatus 900 according to an embodiment of the disclosure. The device comprises:
an obtaining unit 910, configured to obtain target voice data to be extracted from a voice relationship;
an extracting unit 920, configured to perform voice feature extraction on the target voice data based on the first neural network model, so as to obtain voice features;
the conversion unit 930 is configured to input the speech feature into the second neural network model to perform feature mode conversion, so as to obtain a text feature;
A decoding unit 940, configured to perform feature decoding on the text feature based on the third neural network model, to obtain a speech relationship text of the target speech data;
The first neural network model, the second neural network model and the third neural network model are obtained based on target loss joint training, and the target loss includes a first loss and a second loss; the first loss is calculated based on a first sample text feature and a second sample text feature, where the first sample text feature is obtained by extracting voice features of sample voice through the first neural network model and then performing feature mode conversion through the second neural network model, and the second sample text feature is obtained by extracting text features of the sample text corresponding to the sample voice; the second loss is calculated based on the predicted voice relation text and the voice relation label, where the predicted voice relation text is obtained by performing feature decoding on the first sample text feature with the third neural network model.
Optionally, in some embodiments, the joint training process of the first neural network model, the second neural network model, and the third neural network model is implemented by a training unit, the training unit comprising:
The training system comprises an acquisition subunit, a training unit and a processing unit, wherein the acquisition subunit is used for acquiring training sample data, and the training sample data comprises a plurality of sample voices and voice relation labels corresponding to each sample voice;
the first extraction subunit is used for extracting voice features of the sample voice based on the first neural network model to obtain sample voice features, and carrying out feature mode conversion on the sample voice features based on the second neural network model to obtain first sample text features;
The second extraction subunit is used for carrying out voice recognition on the sample voice to obtain a corresponding sample text, and carrying out text feature extraction on the sample text based on a fourth neural network model to obtain a second sample text feature, wherein the output feature of the fourth neural network model is consistent with the input feature size of the third neural network model;
A first calculating subunit, configured to calculate a connectionist temporal classification (CTC) loss according to the first sample text feature and the second sample text feature, to obtain a first loss;
The first decoding subunit is used for carrying out feature decoding on the first sample text features based on the third neural network model to obtain a first predicted voice relation text, and calculating to obtain a second loss according to the first predicted voice relation text and the voice relation label;
And the updating subunit is used for updating the model parameters of the first neural network model, the second neural network model and the third neural network model based on the first loss and the second loss.
Optionally, in some embodiments, updating the subunit includes:
the acquisition module is used for acquiring a first weight coefficient corresponding to the first loss and acquiring a second weight coefficient corresponding to the second loss;
The first calculation module is used for carrying out weighted calculation on the first loss and the second loss based on the first weight coefficient and the second weight coefficient to obtain a first target loss;
and the first updating module is used for updating the model parameters of the first neural network model, the second neural network model and the third neural network model according to the first target loss.
Optionally, in some embodiments, the training unit further comprises:
the first recognition subunit is used for recognizing the entity text in the sample text and recognizing the entity characteristic corresponding to the entity text in the second sample text characteristic;
The conversion subunit is used for carrying out modal conversion on the entity characteristics in the second sample text characteristics based on the first sample text characteristics to obtain mixed modal characteristics;
The second decoding subunit is used for carrying out feature decoding on the mixed mode features based on the third neural network model to obtain a second predicted voice relation text, and carrying out feature decoding on the second sample text features based on the third neural network model to obtain a third predicted voice relation text;
a second calculation subunit configured to calculate a third loss according to the second predicted speech relationship text and the third predicted speech relationship text;
the update subunit is further configured to:
Model parameters of the first, second, and third neural network models are updated based on the first, second, and third losses.
Optionally, in some embodiments, the conversion subunit includes:
The second calculation module is used for carrying out attention calculation on each entity feature and the first sample text features to obtain an entity voice feature corresponding to each entity;
and the replacing module is used for replacing the entity characteristic corresponding to each entity in the second sample text characteristic with the corresponding entity voice characteristic to obtain the mixed mode characteristic.
Optionally, in some embodiments, the training unit further comprises:
The third calculation subunit is used for calculating fourth loss according to the second predicted voice relation text and the voice relation label and calculating fifth loss according to the third predicted voice relation text and the voice relation label;
an update subunit comprising:
A third calculation module for calculating a second target loss based on the first loss, the second loss, the third loss, the fourth loss, and the fifth loss;
And the second updating module is used for updating the model parameters of the first neural network model, the second neural network model and the third neural network model according to the second target loss.
Optionally, in some embodiments, the training unit further comprises:
The compressing subunit is used for inputting the first sample text characteristic into the fifth neural network model for characteristic compression to obtain a first sentence characteristic, and inputting the second sample text characteristic into the fifth neural network model for characteristic compression to obtain a second sentence characteristic;
The projection subunit is used for carrying out semantic projection on the first sentence characteristic based on the sixth neural network model to obtain a third sentence characteristic, and carrying out semantic projection on the second sentence characteristic based on the sixth neural network model to obtain a fourth sentence characteristic;
A fourth calculation subunit, configured to calculate a sixth loss according to the third sentence feature and the fourth sentence feature;
The third calculation module is further used for:
the second target loss is calculated from the first loss, the second loss, the third loss, the fourth loss, the fifth loss, and the sixth loss.
Optionally, in some embodiments, the training unit further comprises:
The third decoding subunit is used for carrying out feature decoding on the third sentence feature based on the third neural network model to obtain a fourth predicted voice relation text, and carrying out feature decoding on the fourth sentence feature based on the third neural network model to obtain a fifth predicted voice relation text;
A fifth calculation subunit configured to calculate a seventh loss according to the fourth predicted speech relationship text and the fifth predicted speech relationship text;
The third calculation module is further used for:
Calculating a second target loss according to the first loss, the second loss, the third loss, the fourth loss, the fifth loss, the sixth loss and the seventh loss.
Optionally, in some embodiments, the third computing module includes:
The acquisition sub-module is used for acquiring training rounds;
The first calculation sub-module is used for calculating a third weight coefficient corresponding to each loss based on training rounds;
The second calculation sub-module is configured to perform weighted calculation on the first loss, the second loss, the third loss, the fourth loss, the fifth loss, the sixth loss, and the seventh loss based on a third weight coefficient corresponding to each loss, so as to obtain a second target loss.
Optionally, in some embodiments, the training unit further comprises:
the first processing subunit is used for inputting the first text feature into a fifth neural network model for feature compression, inputting the compressed feature into a sixth neural network model for semantic projection to obtain a fifth sentence feature;
the second processing subunit is used for inputting the text features of the second sample into the fifth neural network model for feature compression, inputting the compressed features into the sixth neural network model for semantic projection to obtain sixth sentence features;
a sixth calculation subunit for calculating an eighth loss according to the fifth sentence feature and the sixth sentence feature;
the update subunit is further configured to:
updating model parameters of the first, second, and third neural network models based on the first, second, and eighth losses
Optionally, in some embodiments, the voice relation extracting apparatus provided by the present disclosure further includes:
The second recognition subunit is used for carrying out language recognition on the target voice data to obtain a language description text;
The third extraction subunit is used for extracting text features of the language description text to obtain language text features;
The splicing subunit is used for splicing the language text features and the text features to obtain target text features;
decoding unit, still be used for:
And performing feature decoding on the target text features based on the third neural network model to obtain the voice relation text of the target voice data.
In the present embodiment, the term "module" or "unit" refers to a computer program or a part of a computer program having a predetermined function and working together with other relevant parts to achieve a predetermined object, and may be implemented in whole or in part by using software, hardware (such as a processing circuit or a memory), or a combination thereof. Also, a processor (or multiple processors or memories) may be used to implement one or more modules or units. Furthermore, each module or unit may be part of an overall module or unit that incorporates the functionality of the module or unit.
Referring to fig. 10, fig. 10 is a block diagram of a portion of a terminal 140 implementing a voice relationship extraction method according to an embodiment of the present disclosure, the terminal 140 including: radio Frequency (RF) circuit 1010, memory 1015, input unit 1030, display unit 1040, sensor 1050, voice circuit 1060, wireless fidelity (WIRELESS FIDELITY, wiFi) module 1070, processor 1080, and power source 1090. It will be appreciated by those skilled in the art that the terminal 140 structure shown in fig. 10 is not limiting of a cell phone or computer and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
The RF circuit 1010 may be used for receiving and transmitting signals during messaging or a call; in particular, downlink information received from a base station is passed to the processor 1080 for processing, and uplink data is sent to the base station.
The memory 1015 may be used to store software programs and modules, and the processor 1080 performs various functional applications of the terminal and document editing by executing the software programs and modules stored in the memory 1015.
The input unit 1030 may be used to receive input numeric or character information and generate key signal inputs related to setting and function control of the terminal. Specifically, the input unit 1030 may include a touch panel 1031 and other input devices 1032.
The display unit 1040 may be used to display input information or provided information and various menus of the terminal. The display unit 1040 may include a display panel 1041.
Voice circuit 1060, speaker 1061, microphone 1062 may provide a voice interface.
In this embodiment, the processor 1080 included in the terminal 140 may perform the voice relationship extraction method of the previous embodiment.
The terminal 140 of the embodiments of the present disclosure includes, but is not limited to, a mobile phone, a computer, an intelligent voice interaction device, an intelligent home appliance, a vehicle-mounted terminal, an aircraft, etc.
Fig. 11 is a block diagram of a portion of a server 110 implementing a voice relationship extraction method according to an embodiment of the present disclosure. The server 110 may vary considerably in configuration or performance and may include one or more central processing units (CPUs) 1122 (e.g., one or more processors), storage 1132, and one or more storage media 1130 (e.g., one or more mass storage devices) that store applications 1142 or data 1144. The storage 1132 and the storage media 1130 may be transitory or persistent. The program stored on a storage medium 1130 may include one or more modules (not shown), each of which may include a series of instruction operations for the server 110. Further, the central processor 1122 may communicate with the storage medium 1130 and execute, on the server 110, the series of instruction operations stored in the storage medium 1130.
The server 110 may also include one or more power supplies 1126, one or more wired or wireless network interfaces 1150, one or more input/output interfaces 1158, and/or one or more operating systems 1141, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The central processor 1122 in the server 110 may be used to perform the voice relationship extraction method of the embodiments of the present disclosure.
The embodiments of the present disclosure also provide a storage medium storing a program code for executing the voice relation extracting method of the foregoing embodiments.
The disclosed embodiments also provide a computer program product comprising a computer program. The processor of the computer device reads the computer program and executes it, causing the computer device to execute the voice relationship extraction method described above.
The terms "first," "second," "third," "fourth," and the like in the description of the present disclosure and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein, for example. Furthermore, the terms "comprises," "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in this disclosure, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes the association relationship of associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: only A exists, only B exists, and both A and B exist, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following items" or similar expressions refer to any combination of these items, including any combination of single items or plural items. For example, at least one of a, b, or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may be singular or plural.
It should be understood that in the description of the embodiments of the present disclosure, "a plurality of" (or "multiple") means two or more, and that "greater than", "less than", "exceeding", and the like are understood as excluding the number itself, while "above", "below", "within", and the like are understood as including the number itself.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; the division of units is merely a logical functional division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed between the components may be an indirect coupling or communication connection via some interfaces, devices, or units, and may be in electrical, mechanical, or other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present disclosure may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a storage medium. Based on such understanding, the technical solution of the present disclosure, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the various embodiments of the present disclosure. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It should also be appreciated that the various implementations provided by the embodiments of the present disclosure may be arbitrarily combined to achieve different technical effects.
The above is a specific description of the embodiments of the present disclosure, but the present disclosure is not limited to the above embodiments, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the present disclosure, and are included in the scope of the present disclosure as defined in the claims.

Claims (15)

1. A method for extracting a speech relationship, the method comprising:
acquiring target voice data to be subjected to voice relation extraction;
Extracting voice characteristics of the target voice data based on a first neural network model to obtain voice characteristics;
inputting the voice characteristics into a second neural network model for characteristic mode conversion to obtain text characteristics;
performing feature decoding on the text features based on a third neural network model to obtain a voice relation text of the target voice data;
Wherein the first neural network model, the second neural network model, and the third neural network model are obtained by joint training based on a target loss, the target loss including a first loss and a second loss; the first loss is calculated based on a first sample text feature and a second sample text feature, the first sample text feature is obtained by extracting a voice feature of sample voice through the first neural network model and then performing feature mode conversion through the second neural network model, and the second sample text feature is obtained by extracting a text feature of sample text corresponding to the sample voice; the second loss is calculated based on a predicted voice relation text and a voice relation label, and the predicted voice relation text is obtained by performing feature decoding on the first sample text feature by the third neural network model.
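For readers implementing the pipeline recited in claim 1, the following Python (PyTorch) sketch illustrates one possible way to chain the three jointly trained models at inference time. All class and variable names (SpeechRelationExtractor, speech_encoder, modal_adapter, text_decoder) are hypothetical placeholders introduced here for illustration and are not part of the disclosure:

```python
import torch
import torch.nn as nn

class SpeechRelationExtractor(nn.Module):
    """Hypothetical wrapper chaining the three jointly trained models."""

    def __init__(self, speech_encoder: nn.Module, modal_adapter: nn.Module,
                 text_decoder: nn.Module):
        super().__init__()
        self.speech_encoder = speech_encoder   # first neural network model
        self.modal_adapter = modal_adapter     # second neural network model
        self.text_decoder = text_decoder       # third neural network model

    @torch.no_grad()
    def extract(self, waveform: torch.Tensor) -> torch.Tensor:
        speech_feat = self.speech_encoder(waveform)   # voice feature extraction
        text_feat = self.modal_adapter(speech_feat)   # feature mode conversion
        relation_text = self.text_decoder(text_feat)  # feature decoding to the relation text
        return relation_text
```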
2. The method of claim 1, wherein the joint training process of the first neural network model, the second neural network model, and the third neural network model comprises the steps of:
acquiring training sample data, wherein the training sample data comprises a plurality of sample voices and voice relation labels corresponding to each sample voice;
Extracting voice features of the sample voice based on the first neural network model to obtain sample voice features, and performing feature mode conversion on the sample voice features based on the second neural network model to obtain the first sample text features;
Performing voice recognition on the sample voice to obtain a corresponding sample text, and performing text feature extraction on the sample text based on a fourth neural network model to obtain the second sample text feature, wherein the output feature size of the fourth neural network model is consistent with the input feature size of the third neural network model;
calculating a joint time sequence classification loss according to the first sample text characteristic and the second sample text characteristic to obtain a first loss;
Performing feature decoding on the first sample text feature based on the third neural network model to obtain a first predicted voice relation text, and calculating a second loss according to the first predicted voice relation text and the voice relation label;
model parameters of the first, second, and third neural network models are updated based on the first and second losses.
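As a non-authoritative illustration of the training step recited in claim 2, the sketch below computes the first loss as a CTC (connectionist temporal classification) objective that aligns the speech-derived text features with the transcript tokens, and the second loss as a cross-entropy over the decoded relation text. The helper vocab_proj, the tensor shapes, and the use of transcript token ids as CTC targets are assumptions of this sketch rather than definitions taken from the disclosure:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
ce_loss = nn.CrossEntropyLoss(ignore_index=-100)

def training_step(speech_encoder, modal_adapter, text_decoder, vocab_proj,
                  waveform, transcript_ids, relation_ids):
    # First sample text features: speech encoder followed by the modal adapter.
    speech_feat = speech_encoder(waveform)            # (T, B, D)
    first_text_feat = modal_adapter(speech_feat)      # (T, B, D)

    # First loss: CTC-style alignment of the converted features against the
    # transcript tokens (one reading of the claimed joint time sequence
    # classification loss).
    log_probs = F.log_softmax(vocab_proj(first_text_feat), dim=-1)   # (T, B, V)
    input_lens = torch.full((log_probs.size(1),), log_probs.size(0), dtype=torch.long)
    target_lens = torch.full((transcript_ids.size(0),), transcript_ids.size(1), dtype=torch.long)
    first_loss = ctc_loss(log_probs, transcript_ids, input_lens, target_lens)

    # Second loss: cross-entropy between the decoded relation text and the label.
    logits = text_decoder(first_text_feat)            # (B, L, V)
    second_loss = ce_loss(logits.reshape(-1, logits.size(-1)), relation_ids.reshape(-1))
    return first_loss, second_loss
```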
3. The method of claim 2, wherein updating model parameters of the first, second, and third neural network models based on the first and second losses comprises:
Acquiring a first weight coefficient corresponding to the first loss and a second weight coefficient corresponding to the second loss;
weighting calculation is carried out on the first loss and the second loss based on the first weight coefficient and the second weight coefficient, so that a first target loss is obtained;
Updating model parameters of the first neural network model, the second neural network model and the third neural network model according to the first target loss.
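A minimal sketch of the weighted combination recited in claim 3; the weight values below are illustrative only and not taken from the disclosure:

```python
import torch

def first_target_loss(first_loss: torch.Tensor, second_loss: torch.Tensor,
                      w1: float = 0.5, w2: float = 1.0) -> torch.Tensor:
    # Weighted sum of the two losses using the first and second weight coefficients.
    return w1 * first_loss + w2 * second_loss
```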
4. The method according to claim 2, wherein the performing speech recognition on the sample speech to obtain a corresponding sample text, and performing text feature extraction on the sample text based on a fourth neural network model to obtain the second sample text feature, further comprises:
identifying an entity text in the sample text, and identifying an entity feature corresponding to the entity text in the second sample text feature;
Performing modal transformation on the entity features in the second sample text features based on the first sample text features to obtain mixed modal features;
Performing feature decoding on the mixed modal feature based on the third neural network model to obtain a second predicted speech relationship text, and performing feature decoding on the second sample text feature based on the third neural network model to obtain a third predicted speech relationship text;
Calculating a third loss according to the second predicted speech relationship text and the third predicted speech relationship text;
The updating model parameters of the first, second, and third neural network models based on the first and second losses includes:
model parameters of the first, second, and third neural network models are updated based on the first, second, and third losses.
5. The method of claim 4, wherein performing modal transformation on the entity features in the second sample text feature based on the first sample text feature to obtain a mixed modal feature, comprising:
performing attention calculation on each entity feature and the first sample text feature to obtain an entity voice feature corresponding to each entity;
And replacing the entity feature corresponding to each entity in the second sample text feature with the corresponding entity voice feature to obtain the mixed modal feature.
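The modal transformation of claim 5 can be read as attention-pooling the speech-side features with each entity's text features as the query and splicing the result back into the text-side features. The sketch below assumes batch-first tensors, a fixed embedding size of 512, and precomputed entity token spans; these are assumptions of this sketch, not requirements of the disclosure:

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

def build_mixed_modal_feature(second_text_feat: torch.Tensor,   # (B, L, D) text-side features
                              first_text_feat: torch.Tensor,    # (B, T, D) speech-side features
                              entity_spans: list[tuple[int, int]]) -> torch.Tensor:
    """Replace entity positions in the text features with attention-pooled
    speech-side features (one possible reading of the claimed modal transformation)."""
    mixed = second_text_feat.clone()
    for start, end in entity_spans:                      # token span of one entity
        entity_feat = second_text_feat[:, start:end, :]  # entity features act as the query
        entity_voice_feat, _ = attn(entity_feat, first_text_feat, first_text_feat)
        mixed[:, start:end, :] = entity_voice_feat       # substitute the entity features
    return mixed
```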
6. The method of claim 4, wherein after performing feature decoding on the mixed modal feature based on the third neural network model to obtain a second predicted speech relationship text, and performing feature decoding on the second sample text feature based on the third neural network model to obtain a third predicted speech relationship text, further comprising:
calculating a fourth loss according to the second predicted voice relation text and the voice relation tag, and calculating a fifth loss according to the third predicted voice relation text and the voice relation tag;
The updating model parameters of the first, second, and third neural network models based on the first, second, and third losses includes:
Calculating a second target loss based on the first loss, the second loss, the third loss, the fourth loss, and the fifth loss;
and updating model parameters of the first neural network model, the second neural network model and the third neural network model according to the second target loss.
7. The method of claim 6, wherein the performing speech recognition on the sample speech to obtain a corresponding sample text, and performing text feature extraction on the sample text based on a fourth neural network model to obtain the second sample text feature, further comprises:
inputting the first sample text feature into a fifth neural network model for feature compression to obtain a first sentence feature, and inputting the second sample text feature into the fifth neural network model for feature compression to obtain a second sentence feature;
performing semantic projection on the first sentence feature based on a sixth neural network model to obtain a third sentence feature, and performing semantic projection on the second sentence feature based on the sixth neural network model to obtain a fourth sentence feature;
calculating a sixth loss according to the third statement feature and the fourth statement feature;
the calculating a second target loss based on the first loss, the second loss, the third loss, the fourth loss, and the fifth loss, comprising:
calculating a second target loss from the first loss, the second loss, the third loss, the fourth loss, the fifth loss, and the sixth loss.
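One possible reading of the fifth and sixth neural network models of claim 7 is a sentence-level pooling step followed by a shared semantic projection, with the sixth loss pulling the speech-side and text-side sentence features together. The mean-pooling and cosine-distance choices below are assumptions of this sketch, not the disclosed configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentencePooler(nn.Module):
    """Hypothetical fifth model: compresses a feature sequence into one sentence vector."""
    def forward(self, feats: torch.Tensor) -> torch.Tensor:   # (B, T, D) -> (B, D)
        return feats.mean(dim=1)

semantic_proj = nn.Sequential(nn.Linear(512, 512), nn.Tanh())  # hypothetical sixth model

def sixth_loss(first_text_feat: torch.Tensor, second_text_feat: torch.Tensor,
               pooler: nn.Module = SentencePooler()) -> torch.Tensor:
    third_sent = semantic_proj(pooler(first_text_feat))    # speech-side sentence feature
    fourth_sent = semantic_proj(pooler(second_text_feat))  # text-side sentence feature
    # Pull the two modalities together in the shared semantic space
    # with a cosine-distance objective (an assumed choice).
    return (1.0 - F.cosine_similarity(third_sent, fourth_sent, dim=-1)).mean()
```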
8. The method of claim 7, wherein after the semantic projection of the first sentence feature based on the sixth neural network model to obtain a third sentence feature and the semantic projection of the second sentence feature based on the sixth neural network model to obtain a fourth sentence feature, further comprising:
performing feature decoding on the third sentence feature based on the third neural network model to obtain a fourth predicted voice relation text, and performing feature decoding on the fourth sentence feature based on the third neural network model to obtain a fifth predicted voice relation text;
Calculating a seventh loss according to the fourth predicted speech relationship text and the fifth predicted speech relationship text;
Said calculating a second target loss from said first loss, said second loss, said third loss, said fourth loss, said fifth loss, and said sixth loss, comprising:
Calculating a second target loss from the first loss, the second loss, the third loss, the fourth loss, the fifth loss, the sixth loss, and the seventh loss.
9. The method of claim 8, wherein the calculating a second target loss from the first loss, the second loss, the third loss, the fourth loss, the fifth loss, the sixth loss, and the seventh loss comprises:
Acquiring training rounds;
Calculating a third weight coefficient corresponding to each loss based on the training round;
and carrying out weighted calculation on the first loss, the second loss, the third loss, the fourth loss, the fifth loss, the sixth loss and the seventh loss based on the third weight coefficient corresponding to each loss, so as to obtain a second target loss.
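Claim 9 ties the loss weights to the training round. The schedule below, which ramps the auxiliary losses up over a few warm-up epochs and then holds them constant, is only one illustrative choice; the concrete schedule used in the disclosure is not specified here:

```python
import torch

def schedule_weights(epoch: int, num_losses: int, warmup_epochs: int = 5) -> list[float]:
    # Keep full weight on the first two losses and scale the remaining
    # auxiliary losses by the training round. Illustrative only.
    ramp = min(1.0, epoch / float(warmup_epochs))
    return [1.0, 1.0] + [ramp] * (num_losses - 2)

def second_target_loss(losses: list[torch.Tensor], epoch: int) -> torch.Tensor:
    weights = schedule_weights(epoch, num_losses=len(losses))
    return sum(w * l for w, l in zip(weights, losses))
```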
10. The method according to claim 2, wherein the performing speech recognition on the sample speech to obtain a corresponding sample text, and performing text feature extraction on the sample text based on a fourth neural network model to obtain the second sample text feature, further comprises:
inputting the first sample text feature into a fifth neural network model for feature compression, and inputting the compressed feature into a sixth neural network model for semantic projection to obtain a fifth sentence feature;
inputting the second sample text features into a fifth neural network model for feature compression, and inputting the compressed features into a sixth neural network model for semantic projection to obtain sixth sentence features;
calculating an eighth loss according to the fifth statement feature and the sixth statement feature;
The updating model parameters of the first, second, and third neural network models based on the first and second losses includes:
model parameters of the first, second, and third neural network models are updated based on the first, second, and eighth losses.
11. The method of claim 1, wherein before feature decoding the text feature based on the third neural network model to obtain the speech relationship text of the target speech data, further comprising:
Performing language identification on the target voice data to obtain a language description text;
Extracting text features of the language description text to obtain language text features;
Splicing the language text features and the text features to obtain target text features;
The feature decoding is performed on the text feature based on the third neural network model to obtain the voice relation text of the target voice data, and the method comprises the following steps:
And performing feature decoding on the target text features based on a third neural network model to obtain the voice relation text of the target voice data.
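Claim 11 splices the features of a language-description text with the converted speech features before decoding. A minimal sketch, assuming the language hint is prepended along the sequence dimension (whether it is prepended or appended is an assumption of this sketch):

```python
import torch

def decode_with_language_hint(text_feat: torch.Tensor,       # (B, L, D) converted speech features
                              lang_text_feat: torch.Tensor,  # (B, M, D) language-description features
                              text_decoder) -> torch.Tensor:
    # Concatenate the language text features with the text features to form
    # the target text features, then decode the result.
    target_feat = torch.cat([lang_text_feat, text_feat], dim=1)
    return text_decoder(target_feat)
```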
12. A speech relationship extraction apparatus, the apparatus comprising:
the acquisition unit is used for acquiring target voice data to be subjected to voice relation extraction;
The extraction unit is used for extracting the voice characteristics of the target voice data based on the first neural network model to obtain the voice characteristics;
The conversion unit is used for inputting the voice characteristics into a second neural network model to perform characteristic mode conversion to obtain text characteristics;
The decoding unit is used for performing feature decoding on the text features based on a third neural network model to obtain a voice relation text of the target voice data;
Wherein the first neural network model, the second neural network model, and the third neural network model are obtained by joint training based on a target loss, the target loss including a first loss and a second loss; the first loss is calculated based on a first sample text feature and a second sample text feature, the first sample text feature is obtained by extracting a voice feature of sample voice through the first neural network model and then performing feature mode conversion through the second neural network model, and the second sample text feature is obtained by extracting a text feature of sample text corresponding to the sample voice; the second loss is calculated based on a predicted voice relation text and a voice relation label, and the predicted voice relation text is obtained by performing feature decoding on the first sample text feature by the third neural network model.
13. A storage medium storing a computer program, wherein the computer program when executed by a processor implements the speech relation extraction method of any one of claims 1 to 11.
14. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the speech relation extraction method of any one of claims 1 to 11 when executing the computer program.
15. A computer program product comprising a computer program which is read and executed by a processor of a computer device to cause the computer device to perform the speech relationship extraction method of any one of claims 1 to 11.
CN202410524510.2A 2024-04-29 Voice relation extraction method, device, computer equipment and storage medium Active CN118098222B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410524510.2A CN118098222B (en) 2024-04-29 Voice relation extraction method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410524510.2A CN118098222B (en) 2024-04-29 Voice relation extraction method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN118098222A CN118098222A (en) 2024-05-28
CN118098222B true CN118098222B (en) 2024-07-05


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113822084A (en) * 2021-07-15 2021-12-21 腾讯科技(深圳)有限公司 Statement translation method and device, computer equipment and storage medium
CN115168608A (en) * 2022-07-12 2022-10-11 成都航天科工大数据研究院有限公司 Mud pump fault diagnosis method based on multi-mode and deep semantic mining technology


Similar Documents

Publication Publication Date Title
CN110491382B (en) Speech recognition method and device based on artificial intelligence and speech interaction equipment
CN112185352B (en) Voice recognition method and device and electronic equipment
US11355097B2 (en) Sample-efficient adaptive text-to-speech
US12008336B2 (en) Multimodal translation method, apparatus, electronic device and computer-readable storage medium
WO2020155619A1 (en) Method and apparatus for chatting with machine with sentiment, computer device and storage medium
CN110910903B (en) Speech emotion recognition method, device, equipment and computer readable storage medium
CN112802444B (en) Speech synthesis method, device, equipment and storage medium
CN112543932A (en) Semantic analysis method, device, equipment and storage medium
CN109933773A (en) A kind of multiple semantic sentence analysis system and method
CN111767697B (en) Text processing method and device, computer equipment and storage medium
CN112837669A (en) Voice synthesis method and device and server
CN113674732A (en) Voice confidence detection method and device, electronic equipment and storage medium
CN116306603A (en) Training method of title generation model, title generation method, device and medium
CN114360502A (en) Processing method of voice recognition model, voice recognition method and device
CN114999443A (en) Voice generation method and device, storage medium and electronic equipment
CN117234369B (en) Digital human interaction method and system, computer readable storage medium and digital human equipment
WO2024055752A1 (en) Speech synthesis model training method, speech synthesis method, and related apparatuses
CN118098222B (en) Voice relation extraction method, device, computer equipment and storage medium
CN115169368B (en) Machine reading understanding method and device based on multiple documents
CN116415597A (en) Speech translation and simultaneous interpretation method
CN113763925B (en) Speech recognition method, device, computer equipment and storage medium
CN118098222A (en) Voice relation extraction method, device, computer equipment and storage medium
CN113792537A (en) Action generation method and device
CN115019137A (en) Method and device for predicting multi-scale double-flow attention video language event
CN114333772A (en) Speech recognition method, device, equipment, readable storage medium and product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant