CN115688937A - Model training method and device - Google Patents

Model training method and device

Info

Publication number
CN115688937A
Authority
CN
China
Prior art keywords
data
input data
discrete
representation
mapping
Prior art date
Legal status
Pending
Application number
CN202211350390.6A
Other languages
Chinese (zh)
Inventor
邓利群
陈晓
岳祥虎
李良友
曾幸山
郑念祖
李鹏飞
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202211350390.6A
Publication of CN115688937A
Legal status: Pending


Abstract

A model training method, applied to multi-modal data processing and relating to the field of artificial intelligence, comprises the following steps: acquiring a first feature representation and a second feature representation, wherein the first feature representation is a feature representation of first input data and the second feature representation is a feature representation of second input data; the first input data is data of a first modality; the second input data is data of a second modality; mapping the first feature representation to a first discrete label through a first mapping network; mapping the second feature representation to a second discrete label through a second mapping network, wherein the difference between the second discrete label and the first discrete label is used to update the second mapping network; executing a first target task through a first task network according to the first discrete label to obtain a first result; and executing a second target task through a second task network according to the second discrete label to obtain a second result. By mapping the feature representations of data in different modalities into the same discrete space, the feature representations of multiple modalities can be modeled on the basis of that discrete space, yielding a model compatible with input data of multiple modalities.

Description

Model training method and device
Technical Field
The application relates to the field of artificial intelligence, in particular to a model training method and a device thereof.
Background
Artificial intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
In recent years, with the maturing of deep neural network technology and the growth of data and computing power, large-scale pre-trained models have developed rapidly in fields such as natural language processing (NLP) and computer vision, and a number of representative models have emerged, such as the BERT, BART and GPT series in NLP, ViT (Vision Transformer) and MoCo in computer vision, and Wav2Vec and HuBERT in the speech field. The most important common factor in the success of these models is the use of self-supervised learning (SSL). SSL enables a model to make full use of a large amount of unlabeled data from different sources and to learn highly generalizable feature representations from it.
Further, as contrastive learning and cross-attention Transformer techniques have been extended from single-modality input to multiple modalities, pre-trained models for multi-modal data have also attracted wide attention, such as the speech-text pre-trained models SpeechT5 and SLAM, and the vision-text pre-trained models CLIP and DALL-E. For input data of different modalities, such models aim to learn the semantic features common to them.
However, content feature representations differ greatly between modalities, and a model compatible with multi-modal input data is needed.
Disclosure of Invention
In a first aspect, the present application provides a model training method, including: acquiring a first feature representation and a second feature representation, wherein the first feature representation is obtained by processing first input data through an encoder, and the second feature representation is obtained by processing second input data through the encoder; the first input data is data of a first modality; the second input data is data of a second modality; the first modality and the second modality are different; mapping the first feature representation to a first discrete label through a first mapping network; mapping the second feature representation to a second discrete label through a second mapping network, wherein the difference between the second discrete label and the first discrete label is used to update the second mapping network; executing a first target task through a first task network according to the first discrete label to obtain a first result; and executing a second target task through a second task network according to the second discrete label to obtain a second result.
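As a rough illustration only, the training flow of the first aspect can be sketched with stand-in functions. The mean-pooling encoder and the binning quantizer below are assumptions made purely for illustration, not the concrete design of this application.

```python
def encode(x):
    # Stand-in for the shared encoder: here simply the mean of the input.
    return sum(x) / len(x)

def quantize(feature, step=1.0):
    # Stand-in for a mapping network: bin a continuous feature into a
    # discrete label (an index in a discrete space).
    return int(feature // step)

def training_step(first_input, second_input):
    # Both modalities pass through the encoder and then their own mapping
    # network; the gap between the two discrete labels is the signal that
    # would drive updates to the second mapping network.
    first_label = quantize(encode(first_input))
    second_label = quantize(encode(second_input))
    gap = abs(second_label - first_label)
    return first_label, second_label, gap
```

For semantically paired inputs, the gap should shrink toward zero as the second mapping network learns to agree with the first.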
In one possible implementation, the first feature representation may be mapped to a first discrete space through a first mapping network, resulting in a first discrete label; and mapping the second feature representation to a second discrete space through a second mapping network to obtain a second discrete label.
Wherein a discrete label may be one or more numerical values in a discrete space.
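One conceivable concrete form of such a mapping — offered here purely as an assumption, since the application does not fix the mapping network's design — is nearest-neighbour lookup in a codebook, where the discrete label is the index of the closest codebook entry:

```python
import numpy as np

def map_to_discrete(feature, codebook):
    # feature: (d,) continuous feature vector
    # codebook: (K, d) matrix whose row indices form the discrete space
    distances = np.linalg.norm(codebook - feature, axis=1)
    return int(np.argmin(distances))  # the discrete label

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
label = map_to_discrete(np.array([0.9, 1.1]), codebook)  # nearest row is [1, 1]
```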
In one possible implementation, the first task network and the second task network may be the same or different, and the first target task and the second target task may be the same or different. For example, the first task network and the second task network may both be language models, and the first target task and the second target task may be natural language processing tasks, such as masked text prediction, text reply, and the like. For another example, the first task network and the second task network may perform a generation-type task of the corresponding modality, for example, restoring the input data.
By mapping the feature representations of data of different modalities into the same discrete space, so that features of different modalities with the same semantics are mapped to the same discrete label, different input data can be used uniformly and without distinction for training. The feature representations of multiple modalities can therefore be modeled on the basis of the discrete space, yielding a model compatible with multi-modal input data.
In one possible implementation, the first discrete label includes a first sub-representation and a second sub-representation, and the executing a first target task through a first task network according to the first discrete label to obtain a first result includes: predicting, according to the first sub-representation, the data at the position of the second sub-representation in the first discrete label through a language model to obtain the first result; the difference between the first result and the second sub-representation is used to update the language model.
In one possible implementation, the second discrete label includes a third sub-representation and a fourth sub-representation, and the executing the target task through the language model according to the second discrete label to obtain a second result includes: predicting, according to the third sub-representation, the data at the position of the fourth sub-representation in the second discrete label through the language model to obtain the second result; the difference between the second result and the fourth sub-representation is used to update the language model.
In one possible implementation, a pre-trained model (which may be referred to as a language model below) may be trained based on the first discrete label and the second discrete label; for example, a portion of a discrete label may be masked and then predicted. Since the discrete labels themselves are known, this amounts to self-supervised learning of the pre-trained model.
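A minimal sketch of this masked-prediction setup — the MASK sentinel, the masking ratio, and the function name are all assumptions for illustration:

```python
import random

MASK = -1  # sentinel standing in for a mask token

def mask_labels(labels, ratio=0.5, seed=0):
    # Hide a portion of a discrete-label sequence; because the hidden
    # values are known, they can serve as self-supervised targets for
    # the language model to predict.
    rng = random.Random(seed)
    hidden = set(rng.sample(range(len(labels)), int(len(labels) * ratio)))
    masked = [MASK if i in hidden else v for i, v in enumerate(labels)]
    targets = {i: labels[i] for i in sorted(hidden)}  # known truth labels
    return masked, targets
```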
In one possible implementation, the second discrete label includes a third sub-representation and a fourth sub-representation, and the executing a second target task through a second task network according to the second discrete label to obtain a second result includes: predicting, according to the third sub-representation, the data at the position of the fourth sub-representation in the second discrete label through a language model to obtain the second result; the difference between the second result and the fourth sub-representation is used to update the language model.
In one possible implementation, the method further comprises: obtaining, according to the third sub-representation and the second result, a second prediction result corresponding to the second input data through a second decoder corresponding to the second modality; the difference between the second prediction result and the second input data is used to update the second decoder.
In one possible implementation, the method further comprises: obtaining, according to the first discrete label, a first prediction result corresponding to the first input data through a first decoder corresponding to the first modality; the difference between the first prediction result and the first input data is used to update the first decoder.
In one possible implementation, the method further comprises: obtaining, according to the second discrete label, a second prediction result corresponding to the second input data through a second decoder corresponding to the second modality; the difference between the second prediction result and the second input data is used to update the second decoder.
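The "difference" between a decoder's prediction and the raw input could, for a continuous modality such as speech features, be an ordinary reconstruction loss; mean squared error is used below only as an illustrative assumption:

```python
import numpy as np

def reconstruction_loss(prediction, original):
    # Mean squared error between the decoder's prediction and the raw
    # input; its gradient would drive the decoder update described above.
    prediction = np.asarray(prediction, dtype=float)
    original = np.asarray(original, dtype=float)
    return float(np.mean((prediction - original) ** 2))
```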
In one possible implementation, the first modality and the second modality are each one of speech, text or images; alternatively,
the first result is related to, or generated according to, the semantic features of the first input data; alternatively,
the second result is related to, or generated according to, the semantic features of the second input data.
In one possible implementation, the method further comprises: obtaining style data, wherein the style data is obtained by processing the first input data through a style extraction network, and the style data is information in the first input data that is independent of semantics; the executing the target task through the language model according to the first discrete label comprises: executing the target task through the language model according to the first discrete label and the style data. For text data, the style data may illustratively relate to at least one of: the font, the font size, whether the text is bold, whether the text is italic, the font color, and the like; for image data, the style data may illustratively relate to at least one of: whether the image is a photograph of a real object, whether the image is a cartoon image, whether the image is a black-and-white image, whether the image is a grayscale image, and the like; for speech data, the style data may illustratively relate to at least one of: fundamental frequency (pitch), energy, speaker timbre, prosody, and the like.
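For the speech case, the style data might be bundled and fed to the task network alongside the discrete labels. All field and function names below are hypothetical, chosen only to mirror the examples in the paragraph above:

```python
from dataclasses import dataclass

@dataclass
class SpeechStyle:
    pitch_hz: float   # fundamental frequency (pitch)
    energy: float     # energy
    speaker_id: int   # speaker timbre identity

def task_input(discrete_labels, style):
    # Concatenate semantic discrete labels with semantics-independent
    # style conditioning values for the task network.
    return list(discrete_labels) + [style.pitch_hz, style.energy, float(style.speaker_id)]
```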
In a second aspect, the present application provides a data processing method, including:
acquiring a first feature representation and a second feature representation, wherein the first feature representation is a feature representation of first input data, and the second feature representation is a feature representation of second input data; the first input data is data of a first modality; the second input data is data of a second modality; the first modality and the second modality are different;
mapping the first feature representation to a first discrete label through a first mapping network;
mapping the second feature representation to a second discrete label through a second mapping network; wherein the first mapping network and the second mapping network are used for mapping input data to the same discrete space;
executing a first target task through a first task network according to the first discrete label to obtain a first result;
and executing a second target task through a second task network according to the second discrete label to obtain a second result.
In one possible implementation, when training the first mapping network and the second mapping network, the truth labels of the input data having the same semantics are the same discrete values.
In a possible implementation, in training the second mapping network, a truth label corresponding to input data of the second mapping network is a target value, and the target value is a discrete value obtained when the first mapping network processes data having the same semantic meaning as the input data of the second mapping network.
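Under this scheme, the first mapping network's discrete value acts as the truth label in an ordinary classification loss for the second mapping network; the cross-entropy below is one standard choice, assumed here for illustration only:

```python
import numpy as np

def alignment_loss(second_logits, first_label):
    # second_logits: (K,) scores of the second mapping network over the
    #                K values of the discrete space
    # first_label:   the discrete value produced by the first mapping
    #                network for data with the same semantics (target value)
    shifted = second_logits - second_logits.max()  # numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return float(-np.log(probs[first_label]))
```

The loss is smaller the more the second network's distribution concentrates on the target value, which is what pushes the two networks toward the same discrete space.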
In one possible implementation, the method further comprises:
obtaining style data; the style data is obtained by processing the first input data through a style extraction network; the style data is information in the first input data that is independent of semantics;
the executing the target task through the language model according to the first discrete tag comprises:
and executing a target task through a language model according to the first discrete label and the style data.
In a third aspect, the present application provides a model training apparatus, the apparatus comprising:
an obtaining module, configured to obtain a first feature representation and a second feature representation, where the first feature representation is obtained by processing first input data through an encoder, and the second feature representation is obtained by processing second input data through the encoder; the first input data is data of a first modality; the second input data is data of a second modality; the first modality and the second modality are different;
a mapping module for mapping the first feature representation to a first discrete label over a first mapping network;
mapping the second feature representation to a second discrete label through a second mapping network; the difference between the second discrete label and the first discrete label is used to update the second mapping network;
the task module is used for executing a first target task through a first task network according to the first discrete label to obtain a first result;
and executing a second target task through a second task network according to the second discrete label to obtain a second result.
In one possible implementation, the first discrete label includes a first sub-representation and a second sub-representation; the task module is specifically configured to:
predicting data of the position of the second sub-representation in the first discrete label through a language model according to the first sub-representation to obtain a first result; the difference between the first result and the second sub-representation is used to update the language model.
In one possible implementation, the second discrete label includes a third sub-representation and a fourth sub-representation; the task module is specifically configured to:
predicting data of the position of the fourth sub-representation in the second discrete label through a language model according to the third sub-representation to obtain a second result; the difference between the second result and the fourth sub-representation is used to update the language model.
In one possible implementation, the task module is further configured to:
obtaining a first prediction result corresponding to the first input data through a first decoder corresponding to the first modality according to the first sub-representation and the first result; the difference between the first prediction result and the first input data is used to update the first decoder.
In one possible implementation, the task module is further configured to:
obtaining, according to the third sub-representation and the second result, a second prediction result corresponding to the second input data through a second decoder corresponding to the second modality; the difference between the second prediction result and the second input data is used to update the second decoder.
In one possible implementation, the task module is further configured to: obtain, according to the first discrete label, a first prediction result corresponding to the first input data through a first decoder corresponding to the first modality; the difference between the first prediction result and the first input data is used to update the first decoder.
In one possible implementation, the first modality and the second modality are each one of speech, text or images; alternatively,
the first result is related to, or generated according to, the semantic features of the first input data; alternatively,
the second result is related to, or generated according to, the semantic features of the second input data.
In a possible implementation, the obtaining module is further configured to:
obtaining style data; the style data is obtained by processing the first input data through a style extraction network; the style data is information in the first input data that is independent of semantics;
the task module is specifically configured to:
and executing the target task through a language model according to the first discrete label and the style data.
In a fourth aspect, the present application provides a data processing apparatus, the apparatus comprising:
an obtaining module, configured to obtain a first feature representation and a second feature representation, where the first feature representation is a feature representation of first input data, and the second feature representation is a feature representation of second input data; the first input data is data of a first modality; the second input data is data of a second modality; the first modality and the second modality are different;
a mapping module for mapping the first feature representation to a first discrete label via a first mapping network;
mapping, by a second mapping network, the second feature representation to a second discrete label; wherein the first mapping network and the second mapping network are used for mapping input data to the same discrete space;
the task module is used for executing a first target task through a first task network according to the first discrete label to obtain a first result;
and executing a second target task through a second task network according to the second discrete label to obtain a second result.
In one possible implementation, when training the first mapping network and the second mapping network, the truth labels of the input data with the same semantics are the same discrete values.
In a possible implementation, in training the second mapping network, a truth label corresponding to input data of the second mapping network is a target value, and the target value is a discrete value obtained when the first mapping network processes data having the same semantic meaning as the input data of the second mapping network.
In one possible implementation, the obtaining module is further configured to:
obtaining style data; the style data is obtained by processing the first input data through a style extraction network; the style data is information in the first input data that is independent of semantics;
the task module is specifically configured to:
and executing the target task through a language model according to the first discrete label and the style data.
In a fifth aspect, embodiments of the present application provide a training apparatus, which may include a memory, a processor, and a bus system, wherein the memory is used for storing programs, and the processor is used for executing the programs in the memory to perform the method according to the first aspect and any optional method thereof.
In a sixth aspect, an embodiment of the present application provides an execution apparatus, which may include a memory, a processor, and a bus system, where the memory is used for storing a program, and the processor is used for executing the program in the memory to execute the second aspect and any optional method thereof.
In a seventh aspect, this application provides a computer-readable storage medium, in which a computer program is stored, and when the computer program runs on a computer, the computer is caused to execute the first aspect and any optional method thereof, and the second aspect and any optional method thereof.
In an eighth aspect, embodiments of the present application provide a computer program which, when run on a computer, causes the computer to perform the first aspect and any optional method thereof, and the second aspect and any optional method thereof.
In a ninth aspect, the present application provides a chip system, which comprises a processor configured to support an execution device or a data processing apparatus in implementing the functions referred to in the above aspects, for example, transmitting or processing the data or information involved in the above methods. In one possible design, the chip system further includes a memory for storing program instructions and data necessary for the execution device or the training device. The chip system may consist of a chip, or may include a chip and other discrete devices.
Drawings
FIG. 1A is a schematic diagram of an artificial intelligence framework;
FIGS. 1B and 1C are schematic diagrams of an application system framework according to an embodiment of the present application;
fig. 1D is a schematic diagram of an alternative hardware structure of the terminal;
FIG. 2 is a schematic diagram of a server;
FIG. 3 is a schematic diagram of a system architecture of the present application;
FIG. 4 is a flow diagram of a cloud service;
fig. 5 is a flowchart illustrating a model training method according to an embodiment of the present application;
fig. 6 to fig. 13 are schematic processing diagrams of a model training method provided in an embodiment of the present application;
fig. 14 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
fig. 15 is a schematic structural diagram of an execution device according to an embodiment of the present application;
FIG. 16 is a schematic structural diagram of a training apparatus according to an embodiment of the present application;
fig. 17 is a schematic structural diagram of a chip according to an embodiment of the present disclosure.
Detailed Description
The embodiments of the present application will be described below with reference to the drawings. The terminology used in the description of the embodiments section of the present application is for the purpose of describing particular embodiments of the present application only and is not intended to be limiting of the present application.
Embodiments of the present application are described below with reference to the accompanying drawings. As can be known to those skilled in the art, with the development of technology and the emergence of new scenarios, the technical solution provided in the embodiments of the present application is also applicable to similar technical problems.
The terms "first," "second," and the like in the description and claims of this application and in the foregoing drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and are merely descriptive of the various embodiments of the application and how objects of the same nature can be distinguished. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The terms "substantially", "about" and the like are used herein as terms of approximation and not as terms of degree, and are intended to take into account the inherent deviations in measured or calculated values that would be known to one of ordinary skill in the art. Furthermore, the use of "may" in describing embodiments of the present application refers to "one or more embodiments possible". As used herein, the terms "use," "using," and "used" may be considered synonymous with the terms "utilizing," "utilizing," and "utilized," respectively. Additionally, the term "exemplary" is intended to refer to an instance or illustration.
Referring to fig. 1A, fig. 1A is a schematic structural diagram of an artificial intelligence subject framework, which is explained below from two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects a series of processes from the acquisition of data, for example the general processes of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision making, and intelligent execution and output. In this process, the data undergoes a "data-information-knowledge-wisdom" refinement process. The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure and information (technologies for providing and processing data) up to the industrial ecology of the system.
(1) Infrastructure
The infrastructure provides computing power support for the artificial intelligence system, realizes communication with the outside world, and is supported by the base platform. Communication with the outside is achieved through sensors; computing power is provided by intelligent chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs and FPGAs); the base platform includes related platform guarantees and support such as distributed computing frameworks and networks, and may include cloud storage and computing, interconnection networks, and the like. For example, sensors communicate with the outside to acquire data, and the data is provided to intelligent chips in a distributed computing system provided by the base platform for computation.
(2) Data
The data at the level above the infrastructure represents the data sources of the field of artificial intelligence. The data relates to graphics, images, speech and text, as well as Internet-of-Things data from traditional devices, including service data of existing systems and sensing data such as force, displacement, liquid level, temperature and humidity.
(3) Data processing
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
The machine learning and the deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Inference means a process of simulating an intelligent human inference mode in a computer or an intelligent system, using formalized information to think about and solve a problem by a machine according to an inference control strategy, and a typical function is searching and matching.
Decision-making refers to a process of making a decision after reasoning intelligent information, and generally provides functions of classification, sorting, prediction and the like.
(4) General capabilities
After the above-mentioned data processing, further based on the result of the data processing, some general capabilities may be formed, such as algorithms or a general system, e.g. translation, analysis of text, computer vision processing, speech recognition, recognition of images, etc.
(5) Intelligent product and industrial application
Intelligent products and industry applications refer to the products and applications of the artificial intelligence system in various fields; they are the encapsulation of the overall artificial intelligence solution, commercializing intelligent information decision-making and realizing practical applications. The application fields mainly include: intelligent terminals, intelligent transportation, intelligent medical care, autonomous driving, smart cities, and the like.
The method and the device can be applied to the field of natural language processing within the field of artificial intelligence. Taking natural language processing as an example, several product application scenarios are introduced below.
First, application scenarios of the present application are described. The present application can be applied to, but is not limited to, applications having a multi-modal processing function for images, text, audio and the like (hereinafter referred to simply as multi-modal processing applications), or a cloud service provided by a cloud-side server, which are described separately below:
1. multimodal processing class application
The product form of the embodiments of the present application may be a multi-modal processing application. The multi-modal processing application can run on a terminal device or on a cloud-side server.
In one possible implementation, the multi-modal processing application can perform the task of processing multi-modal data to obtain a processing result; that is, the same processing model can process input data of multiple modalities.
For example, for a speech translation (e.g., Chinese-to-English) task: during pre-training, a large amount of unlabeled speech and text data can be used to train the language model, and in the task customization stage only a small amount of paired Chinese-English corpora is needed to obtain the final translation model.
For another example, for an OCR reading task: the language model is pre-trained using data of different modalities such as OCR images, text and speech; in the task customization stage, a small number of paired parallel corpora of OCR images and text, or of OCR images and speech, can be used to train a model that directly recognizes and reads aloud the content of OCR images.
Similar scenarios include, but are not limited to, content generation and recognition tasks across modalities or within a single modality.
It should be understood that the examples herein are only for facilitating understanding of the application scenarios of the embodiments of the present application, and are not exhaustive.
In a possible implementation, a user may open a multi-modal processing application installed on a terminal device and input multi-modal data such as images, text, or audio; the multi-modal processing application may process the input data through a multi-modal model trained by the method provided in the embodiments of the present application, and present the processing result to the user (the presentation manner may be, but is not limited to, displaying, saving, uploading to the cloud side, and the like).
In a possible implementation, a user may open a multi-modal processing application installed on a terminal device and input multi-modal data such as images, text, or audio; the multi-modal processing application may send the multi-modal data to a cloud-side server, the cloud-side server processes the data through a multi-modal model trained by the method provided in the embodiments of the present application and returns the processing result to the terminal device, and the terminal device may present the processing result to the user (the presentation manner may be, but is not limited to, displaying, saving, uploading to the cloud side, and the like).
Next, the multi-modal processing class application in the embodiment of the present application is described from a functional architecture and a product architecture for realizing the function, respectively.
Referring to fig. 1B, fig. 1B is a schematic diagram of a functional architecture of a multi-modal processing application in an embodiment of the present application:
in one possible implementation, as shown in FIG. 1B, a multi-modal processing application 102 may receive input parameters 101 (e.g., including images) and generate processing results 103. The multi-modal processing application 102 may be executed on, for example, at least one computer system, and comprises computer code that, when executed by one or more computers, causes the computers to run a multi-modal model trained by the method provided in the embodiments of the present application.
Referring to fig. 1C, fig. 1C is a schematic diagram of an entity architecture for running a multimodal processing application in an embodiment of the present application:
referring to FIG. 1C, FIG. 1C illustrates a system architecture diagram. The system may include a terminal 100, and a server 200. Where the server 200 can include one or more servers (illustrated in fig. 1C as including one server as an example), the server 200 can provide multimodal processing functionality for one or more terminals.
The terminal 100 may be installed with a multi-modal processing application program, or open a web page related to a cross-modal language processing function, where the application program and the web page may provide an interface, the terminal 100 may receive a related parameter input by a user on the cross-modal language processing function interface, and send the parameter to the server 200, and the server 200 may obtain a processing result based on the received parameter, and return the processing result to the terminal 100.
It should be understood that, in some alternative implementations, the terminal 100 may also perform an action for obtaining a processing result based on the received parameters by itself, and the implementation does not need to be implemented in cooperation with a server, and the embodiment of the present application is not limited.
The product form of the terminal 100 in fig. 1C is described next;
the terminal 100 in this embodiment of the application may be a mobile phone, a tablet computer, a wearable device, an in-vehicle device, an Augmented Reality (AR)/Virtual Reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a Personal Digital Assistant (PDA), and the like, which is not limited in this embodiment of the application.
Fig. 1D shows an alternative hardware structure diagram of the terminal 100.
Referring to fig. 1D, the terminal 100 may include a radio frequency unit 110, a memory 120, an input unit 130, a display unit 140, a camera 150 (optional), an audio circuit 160 (optional), a speaker 161 (optional), a microphone 162 (optional), a processor 170, an external interface 180, a power supply 190, and the like. Those skilled in the art will appreciate that fig. 1D is merely an example of a terminal or multi-function device and is not meant to be limiting, and may include more or fewer components than those shown, or some components may be combined, or different components.
The input unit 130 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the portable multifunction device. In particular, the input unit 130 may include a touch screen 131 (optional) and/or other input devices 132. The touch screen 131 may collect touch operations of a user (e.g., operations of the user on or near the touch screen using any suitable object such as a finger, a joint, a stylus, etc.) and drive the corresponding connection device according to a preset program. The touch screen can detect the touch action of the user on the touch screen, convert the touch action into a touch signal and send the touch signal to the processor 170, and can receive and execute a command sent by the processor 170; the touch signal includes at least contact point coordinate information. The touch screen 131 may provide an input interface and an output interface between the terminal 100 and a user. In addition, the touch screen may be implemented using various types, such as resistive, capacitive, infrared, and surface acoustic wave. The input unit 130 may include other input devices in addition to the touch screen 131. In particular, other input devices 132 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The input device 132 may receive multi-modal data such as input images, text, or audio.
The display unit 140 may be used to display information input by or provided to a user, various menus, interactive interfaces, file displays of the terminal 100, and/or playback of any one of multimedia files. In the embodiment of the present application, the display unit 140 may be configured to display an interface, a processing result, and the like of the multimodal processing application.
The memory 120 may be used to store instructions and data. The memory 120 may mainly include an instruction storage area and a data storage area; the data storage area may store various data such as multimedia files and text, and the instruction storage area may store software elements such as an operating system, applications, and instructions required for at least one function, or subsets and extended sets thereof. The memory 120 may also include a non-volatile random access memory, and may provide the processor 170 with the hardware, software, and data resources for managing the computing and processing device, supporting control software and applications. The memory 120 may also be used to store multimedia files as well as running programs and applications.
The processor 170 is a control center of the terminal 100, connects various parts of the entire terminal 100 using various interfaces and lines, performs various functions of the terminal 100 and processes data by operating or executing instructions stored in the memory 120 and calling data stored in the memory 120, thereby performing overall control of the terminal device. Alternatively, processor 170 may include one or more processing units; preferably, the processor 170 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 170. In some embodiments, the processor, memory, and/or the like may be implemented on a single chip, or in some embodiments, they may be implemented separately on separate chips. The processor 170 may also be used for generating corresponding operation control signals, sending the corresponding operation control signals to the corresponding components of the computing and processing device, reading and processing data in software, and particularly reading and processing data and programs in the memory 120, so as to enable the respective functional modules therein to execute corresponding functions, thereby controlling the corresponding components to perform actions according to the instructions.
The memory 120 may be configured to store software codes related to a data processing method, and the processor 170 may execute steps of the data processing method of the chip, and may schedule other units (e.g., the input unit 130 and the display unit 140) to implement corresponding functions.
The radio frequency unit 110 (optional) may be used for receiving and transmitting signals during information transmission and reception or during a call, for example, receiving downlink information from a base station and delivering it to the processor 170 for processing, and transmitting uplink data to the base station. Typically, the RF circuit includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, the radio frequency unit 110 may also communicate with network devices and other devices through wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to global system for mobile communication (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), long term evolution (LTE), email, short messaging service (SMS), etc.
In the embodiment of the present application, the radio frequency unit 110 may send multi-modal data such as images, texts, or audios to the server 200, and receive a processing result sent by the server 200.
It should be understood that the rf unit 110 is optional and may be replaced by other communication interfaces, such as a network port.
The terminal 100 further includes a power supply 190 (e.g., a battery) for supplying power to the various components, which may preferably be logically connected to the processor 170 via a power management system, such that functions of managing charging, discharging, and power consumption are performed via the power management system.
The terminal 100 further includes an external interface 180, which may be a standard Micro USB interface, or a multi-pin connector, which may be used to connect the terminal 100 to communicate with other devices, or a charger to charge the terminal 100.
Although not shown, the terminal 100 may further include a flash, a wireless fidelity (WiFi) module, a bluetooth module, a sensor with different functions, and the like, which are not described in detail herein. Some or all of the methods described below may be applied in the terminal 100 as shown in fig. 1D.
The product form of the server 200 in fig. 1C is described next;
fig. 2 provides a schematic diagram of a server 200, and as shown in fig. 2, the server 200 includes a bus 201, a processor 202, a communication interface 203, and a memory 204. The processor 202, memory 204, and communication interface 203 communicate via a bus 201.
The bus 201 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 2, but that does not indicate only one bus or one type of bus.
The processor 202 may be any one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Micro Processor (MP), a Digital Signal Processor (DSP), and the like.
Memory 204 may include volatile memory (volatile memory), such as Random Access Memory (RAM). The memory 204 may also include a non-volatile memory (non-volatile memory), such as a read-only memory (ROM), a flash memory, a hard drive (HDD) or a Solid State Drive (SSD).
The memory 204 may be configured to store software codes related to the data processing method, and the processor 202 may execute steps of the data processing method of the chip, and may also schedule other units to implement corresponding functions.
It should be understood that the terminal 100 and the server 200 may be centralized or distributed devices, and the processors (e.g., the processor 170 and the processor 202) in the terminal 100 and the server 200 may be hardware circuits (e.g., an Application Specific Integrated Circuit (ASIC), a field-programmable gate array (FPGA), a general-purpose processor, a Digital Signal Processor (DSP), a microprocessor, a microcontroller, etc.), or a combination of these hardware circuits, for example, the processor may be a hardware system having a function of executing instructions, such as a CPU, a DSP, etc., or a hardware system having no function of executing instructions, such as an ASIC, an FPGA, etc., or a combination of the above hardware system having no function of executing instructions and a hardware system having function of executing instructions.
It should be understood that the steps related to the model inference process in the embodiment of the present application relate to AI-related operations, and the instruction execution architecture of the terminal device and the server when executing the AI operations is not limited to the architecture of the processor and the memory described above. The system architecture provided by the embodiment of the present application is described in detail below with reference to fig. 3.
Fig. 3 is a schematic diagram of a system architecture according to an embodiment of the present application. As shown in FIG. 3, the system architecture 500 includes an execution device 510, a training device 520, a database 530, a client device 540, a data storage system 550, and a data collection system 560.
The execution device 510 includes a computation module 511, an I/O interface 512, a pre-processing module 513, and a pre-processing module 514. The goal model/rules 501 may be included in the calculation module 511, with the pre-processing module 513 and the pre-processing module 514 being optional.
The execution device 510 may be the terminal device or the server running the multimodal processing application.
The data acquisition device 560 is used to acquire training samples. The training samples can be multi-modal data such as images, text or audio. After the training samples are collected, the data collection device 560 stores the training samples in the database 530.
The training device 520 can obtain the target model/rule 501 based on training samples maintained in the database 530 for a neural network to be trained (e.g., a multi-modal model (e.g., including an encoder, a mapping network, a decoder, etc.) in the embodiments of the present application).
It should be understood that the training device 520 may perform a pre-training process on the neural network to be trained based on the training samples maintained in the database 530, or perform a fine-tuning of the model based on the pre-training process.
It should be noted that, in practical applications, the training samples maintained in the database 530 are not necessarily all collected from the data collection device 560, and may be received from other devices. It should be noted that, the training device 520 does not necessarily perform the training of the target model/rule 501 based on the training samples maintained by the database 530, and may also obtain the training samples from the cloud or other places for performing the model training, and the above description should not be taken as a limitation on the embodiment of the present application.
The target model/rule 501 obtained by training according to the training device 520 may be applied to different systems or devices, for example, the executing device 510 shown in fig. 3, where the executing device 510 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an Augmented Reality (AR)/Virtual Reality (VR) device, a vehicle-mounted terminal, or a server.
Specifically, the training device 520 may pass the trained model to the execution device 510.
In fig. 3, the execution device 510 configures an input/output (I/O) interface 512 for data interaction with an external device, and a user can input data (e.g., multimodal data such as image, text or audio in the embodiment of the present application) to the I/O interface 512 through a client device 540.
The pre-processing module 513 and the pre-processing module 514 are used to pre-process the input data received by the I/O interface 512. It should be understood that the pre-processing module 513 and the pre-processing module 514 may be absent, or there may be only one pre-processing module. When the pre-processing module 513 and the pre-processing module 514 are absent, the computing module 511 may be used directly to process the input data.
During the process of preprocessing the input data by the execution device 510 or performing the calculation and other related processes by the calculation module 511 of the execution device 510, the execution device 510 may call the data, the code and the like in the data storage system 550 for corresponding processes, or store the data, the instruction and the like obtained by corresponding processes in the data storage system 550.
Finally, the I/O interface 512 provides the processing results to the client device 540 and thus to the user.
In the case shown in fig. 3, the user can manually give input data, and this "manually give input data" can be operated through an interface provided by the I/O interface 512. Alternatively, the client device 540 may automatically send the input data to the I/O interface 512, and if the client device 540 is required to automatically send the input data to obtain authorization from the user, the user may set the corresponding permissions in the client device 540. The user can view the result output by the execution device 510 at the client device 540, and the specific presentation form can be display, sound, action, and the like. The client device 540 may also serve as a data collection terminal, collecting input data of the input I/O interface 512 and output results of the output I/O interface 512 as new sample data, as shown, and storing the new sample data in the database 530. Of course, the input data of the input I/O interface 512 and the output result of the output I/O interface 512 may be directly stored in the database 530 as new sample data by the I/O interface 512 without being collected by the client device 540.
It should be noted that fig. 3 is only a schematic diagram of a system architecture provided in the embodiment of the present application, and the position relationship between the devices, modules, and the like shown in the diagram does not constitute any limitation, for example, in fig. 3, the data storage system 550 is an external memory with respect to the execution device 510, and in other cases, the data storage system 550 may be disposed in the execution device 510. It is to be appreciated that the execution device 510 described above can be deployed in the client device 540.
From the inference side of the model:
in this embodiment, the computing module 511 of the execution device 510 may obtain the code stored in the data storage system 550 to implement the steps related to the model inference process in this embodiment.
In this embodiment, the computing module 511 of the execution device 510 may include a hardware circuit (e.g., an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a general-purpose processor, a digital signal processor (DSP), a microprocessor, a microcontroller, or the like) or a combination of these hardware circuits; for example, the computing module 511 may be a hardware system with an instruction execution function, such as a CPU or a DSP, or a hardware system without an instruction execution function, such as an ASIC or an FPGA, or a combination of a hardware system without an instruction execution function and a hardware system with an instruction execution function.
Specifically, the computing module 511 of the execution device 510 may be a hardware system with an instruction execution function; the steps related to the model inference process provided in the embodiments of the present application may be software code stored in a memory, and the computing module 511 of the execution device 510 may acquire the software code from the memory and execute it to implement the steps related to the model inference process provided in the embodiments of the present application.
It should be understood that the computing module 511 of the execution device 510 may also be a combination of a hardware system without an instruction execution function and a hardware system with an instruction execution function, in which case some of the steps related to the model inference process provided in the embodiments of the present application may also be implemented by the hardware system without an instruction execution function in the computing module 511, which is not limited here.
From the training side of the model:
in this embodiment, the training device 520 may obtain codes stored in a memory (not shown in fig. 3, and may be integrated with the training device 520 or separately deployed from the training device 520) to implement steps related to model training in this embodiment.
In this embodiment, the training device 520 may include a hardware circuit (e.g., an Application Specific Integrated Circuit (ASIC), a field-programmable gate array (FPGA), a general-purpose processor, a Digital Signal Processor (DSP), a microprocessor or a microcontroller, etc.), or a combination of these hardware circuits, for example, the training device 520 may be a hardware system with an instruction execution function, such as a CPU, a DSP, etc., or a hardware system without an instruction execution function, such as an ASIC, an FPGA, etc., or a combination of the above hardware systems without an instruction execution function and a hardware system with an instruction execution function.
It should be understood that the training device 520 may be a combination of a hardware system without a function of executing instructions and a hardware system with a function of executing instructions, and some steps related to model training provided in the embodiments of the present application may also be implemented by a hardware system without a function of executing instructions in the training device 520, which is not limited herein.
2. The server provides a multi-modal processing function cloud service:
in one possible implementation, the server may provide the services of cross-modal language processing functionality to the end-side through an Application Programming Interface (API).
The terminal device may send related parameters (e.g., multimodal data such as images, texts, audio, etc.) to the server through an API provided by the cloud, and the server may obtain a processing result based on the received parameters, and return the processing result to the terminal.
The description of the terminal and the server may be the description of the above embodiments, and will not be repeated here.
Fig. 4 shows a flow of a multi-modal processing function-based cloud service provided using a cloud platform.
1. Open and purchase the multi-modal processing function cloud service.
2. A user may download a software development kit (SDK) corresponding to the multi-modal processing function cloud service. Generally, the cloud platform provides SDKs in multiple development versions for the user to select according to the requirements of the development environment, for example, a JAVA-version SDK, a python-version SDK, a PHP-version SDK, an Android-version SDK, and the like.
3. After downloading the SDK with the corresponding version to the local according to the requirement, the user imports the SDK project into the local development environment, carries out configuration and debugging in the local development environment, and can also carry out development of other functions in the local development environment, so that an application integrating multi-mode processing function capability is formed.
4. When the multi-modal processing function application needs to perform a cross-modal language processing function while in use, the application may trigger an API call of the cross-modal language processing function, sending an API request to the running instance of the multi-modal processing function service in the cloud environment, where the API request carries the input data (for example, an image), and the running instance in the cloud environment processes the data to obtain a processing result.
5. And the cloud environment returns the processing result to the application, so that one-time multi-modal processing function calling is completed.
Since the embodiments of the present application relate to the application of a large number of neural networks, for the convenience of understanding, the related terms and related concepts such as neural networks related to the embodiments of the present application will be described below.
(1) Neural network
The neural network may be composed of neural units. A neural unit may be an operation unit that takes xs (i.e., input data) and an intercept of 1 as inputs, and the output of the operation unit may be:
h_{W,b}(x) = f(W^T x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)
where s = 1, 2, …, n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit. f is the activation function of the neural unit, which introduces a nonlinear characteristic into the neural network to convert the input signal of the neural unit into an output signal. The output signal of the activation function may be used as the input of the next convolutional layer, and the activation function may be, for example, a sigmoid function. A neural network is a network formed by connecting many such single neural units together, i.e., the output of one neural unit may be the input of another neural unit. The input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field; the local receptive field may be a region composed of several neural units.
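As an illustration, the computation of a single neural unit described above can be sketched in Python (a minimal sketch: the input values, weights, and bias below are arbitrary, and sigmoid is chosen as the activation function f):

```python
import math

def neural_unit(xs, ws, b):
    """Single neural unit: weighted sum of the inputs xs with weights ws,
    plus the bias b, passed through a sigmoid activation f."""
    s = sum(w * x for w, x in zip(ws, xs)) + b
    return 1.0 / (1.0 + math.exp(-s))  # sigmoid activation function

# Example: 3 inputs with arbitrary weights and bias
out = neural_unit(xs=[0.5, -1.0, 2.0], ws=[0.1, 0.2, 0.3], b=0.05)
```

The weighted sum here is 0.1·0.5 − 0.2·1.0 + 0.3·2.0 + 0.05 = 0.5, so the output is sigmoid(0.5) ≈ 0.622.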
(2) transformer layer
The neural network comprises an embedding layer and at least one transformer layer; the at least one transformer layer may be N transformer layers (N is an integer greater than 0), where each transformer layer comprises, in sequence, an attention layer, an add and normalize (add & norm) layer, a feed forward layer, and another add & norm layer. In the embedding layer, the current input is embedded to obtain a plurality of embedding vectors. In the attention layer, P input vectors are acquired from the layer above the transformer layer; taking any first input vector of the P input vectors as a center, an intermediate vector corresponding to the first input vector is obtained based on the degree of association between the first input vector and each input vector within a preset attention window, so as to determine the P intermediate vectors corresponding to the P input vectors. In the pooling layer, the P intermediate vectors are merged into Q output vectors, and the plurality of output vectors obtained from the last transformer layer are used as the feature representation of the current input.
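The sublayer ordering described above (attention, add & norm, feed forward, add & norm) can be sketched schematically as follows. This is a toy illustration operating on a single token vector, with the attention and feed-forward sublayers passed in as placeholder functions rather than real parameterized networks:

```python
def add_and_norm(x, sublayer_out, eps=1e-5):
    """Residual connection (add) followed by layer normalization (norm)."""
    v = [a + b for a, b in zip(x, sublayer_out)]
    m = sum(v) / len(v)
    var = sum((u - m) ** 2 for u in v) / len(v)
    return [(u - m) / (var + eps) ** 0.5 for u in v]

def transformer_layer(x, attention, feed_forward):
    """One transformer layer: attention -> add & norm -> feed forward
    -> add & norm, applied to a single token vector for illustration."""
    x = add_and_norm(x, attention(x))
    x = add_and_norm(x, feed_forward(x))
    return x

# Toy sublayers standing in for the real parameterized ones
out = transformer_layer(
    x=[1.0, 2.0, 3.0, 4.0],
    attention=lambda v: v,                       # identity "attention"
    feed_forward=lambda v: [u * 0.5 for u in v]  # toy feed-forward
)
```

Because of the final normalization, the output vector has (approximately) zero mean regardless of the toy sublayers chosen.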
(3) Attention mechanism (attention mechanism)
The attention mechanism simulates the internal process of biological observation behavior, i.e., a mechanism that aligns internal experience with external sensation to increase the fineness of observation in certain regions; it can rapidly screen out high-value information from a large amount of information using limited attention resources. The attention mechanism can quickly extract important features from sparse data and is therefore widely used in natural language processing tasks, particularly machine translation. The self-attention mechanism (self-attention mechanism) is an improvement of the attention mechanism that reduces dependence on external information and is better at capturing the internal correlations of data or features. The essential idea of the attention mechanism can be expressed by the following formula:

Attention(Query, Source) = \sum_{i=1}^{L_x} Similarity(Query, Key_i) \cdot Value_i

where L_x denotes the length of the Source.
The formula means that the constituent elements in the Source are imagined to be composed of a series of <Key, Value> data pairs. Given an element Query in the Target, the weight coefficient of the Value corresponding to each Key is obtained by calculating the similarity or correlation between the Query and that Key, and the Values are then weighted and summed to obtain the final Attention value. So essentially the Attention mechanism performs a weighted summation over the Values of the elements in the Source, with the Query and the Keys used to calculate the weight coefficients of the corresponding Values. Conceptually, Attention can be understood as selectively screening out a small amount of important information from a large amount of information and focusing on it, ignoring most of the unimportant information. The focusing process is embodied in the calculation of the weight coefficients: the greater the weight, the more focus falls on the corresponding Value, i.e., the weight represents the importance of the information, and the Value is the corresponding information. The self-attention mechanism may be understood as internal Attention, i.e., an Attention mechanism occurring between the Target element Query and all elements in the Source, or between elements inside the Source or inside the Target, or as the Attention computation in the special case Target = Source; the specific calculation process is the same, only the calculation objects change.
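A minimal pure-Python sketch of the Query/Key/Value computation described above, using the dot product as the similarity function and softmax to normalize the weight coefficients (all vectors are arbitrary toy values):

```python
import math

def attention(query, keys, values):
    """Weighted sum of the Values, with weights from Query-Key similarity."""
    # similarity: dot product between the Query and each Key
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # softmax turns the similarities into weight coefficients summing to 1
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # weighted summation of the Value vectors
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

out = attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],     # the first Key matches the Query
    values=[[10.0, 0.0], [0.0, 10.0]],
)
```

Since the Query is most similar to the first Key, the first Value receives the larger weight coefficient and dominates the output.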
(4) Natural Language Processing (NLP)
Natural language (natural language) is human language, and natural language processing (NLP) is the processing of human language. Natural language processing is the process of systematically analyzing, understanding, and extracting information from text data in an intelligent and efficient manner. By using NLP and its components, we can manage very large blocks of text data, or perform a large number of automated tasks, and solve a wide variety of problems, such as automatic summarization (automatic summarization), machine translation (MT), named entity recognition (NER), relation extraction (RE), information extraction (IE), sentiment analysis, speech recognition (speech recognition), question answering systems (question answering), and topic segmentation.
(5) Pre-trained language model (pre-trained language model)
The pre-trained language model is a natural language sequence encoder, which encodes each word in a natural language sequence into a vector representation in order to perform a prediction task. Its training comprises two phases. In the pre-training (pre-training) phase, the model trains on language model tasks over large-scale unsupervised text, learning word representations. In the fine-tuning (fine-tuning) phase, the model is initialized with the parameters learned in the pre-training phase and trained for a few steps on downstream tasks such as text classification (text classification) and sequence labeling (sequence labeling), so that the semantic information obtained by pre-training can be successfully transferred to the downstream tasks.
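The two training phases described above can be sketched as the following control flow. This is purely schematic: the model, the losses, and the update rule are placeholders, not a real language model:

```python
def train_step(model):
    # placeholder gradient step: in reality this would compute a loss
    # (a language-model loss when pre-training, a task loss when
    # fine-tuning) and update the parameters by back propagation
    model["updates"] += 1
    return model

def pretrain(model, unlabeled_texts):
    """Phase 1: many training steps over large-scale unsupervised text."""
    for _ in unlabeled_texts:
        model = train_step(model)
    model["pretrained"] = True
    return model

def finetune(model, labeled_examples):
    """Phase 2: initialize from the pre-trained parameters, then train a
    few steps on a downstream task (e.g. text classification)."""
    assert model["pretrained"], "fine-tuning starts from pre-trained weights"
    for _ in labeled_examples:
        model = train_step(model)
    return model

model = {"updates": 0, "pretrained": False}
model = pretrain(model, unlabeled_texts=["..."] * 1000)  # large unlabeled corpus
model = finetune(model, labeled_examples=["..."] * 10)   # few labeled examples
```

The point of the sketch is the asymmetry: the bulk of the updates happen on cheap unlabeled data, and only a few happen on scarce labeled task data.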
(6) Back propagation algorithm
A convolutional neural network can adopt a back propagation (BP) algorithm to correct the values of the parameters in the initial super-resolution model during training, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller. Specifically, an error loss is produced as the input signal is propagated forward until output, and the parameters in the initial super-resolution model are updated by back-propagating the error loss information, so that the error loss converges. The back propagation algorithm is a backward propagation dominated by the error loss, aiming at obtaining the optimal parameters of the super-resolution model, such as the weight matrices.
(7) Loss function
In the process of training a deep neural network, because the output of the network is expected to be as close as possible to the value that is really desired to be predicted, the weight vectors of each layer of the network can be updated by comparing the current predicted value with the really desired target value, according to the difference between them (of course, an initialization process is usually carried out before the first update, i.e., parameters are pre-configured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted so that it predicts lower, and the adjustment continues until the deep neural network can predict the really desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the role of loss functions or objective functions, which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a greater difference, so the training of the deep neural network becomes a process of reducing this loss as much as possible.
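As a concrete sketch of "an equation for measuring the difference between the predicted value and the target value", the following uses mean squared error; the choice of MSE is an illustrative assumption (this application does not fix a particular loss):

```python
def mse_loss(predicted, target):
    """Mean squared error: a larger output value (loss) means a greater
    difference between the predicted values and the target values."""
    assert len(predicted) == len(target)
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)
```

Training then amounts to adjusting the weights so that this output value decreases toward zero.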
(8) Encoder/decoder
The encoder and the decoder typically exist in pairs, as in a sequence-to-sequence (seq2seq) model, which consists of at least one encoder and at least one decoder. The core of their operation is that the encoder encodes the raw input data into some kind of intermediate feature, and the decoder decodes the intermediate feature into the target result.
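The encode-to-intermediate-feature / decode-to-target pairing can be sketched with a deliberately tiny example; the run-length scheme below is an arbitrary illustration of the pattern, not any encoder used in this application:

```python
def encode(text):
    """Encode a string into (char, count) pairs -- the 'intermediate feature'."""
    feature = []
    for ch in text:
        if feature and feature[-1][0] == ch:
            feature[-1] = (ch, feature[-1][1] + 1)
        else:
            feature.append((ch, 1))
    return feature

def decode(feature):
    """Decode the intermediate feature back into the target result."""
    return "".join(ch * n for ch, n in feature)
```

A neural seq2seq model follows the same shape, with learned networks in place of these two functions.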
In recent years, with the maturing of deep neural network technology and the growth of data and computing power, large-scale pre-training models have developed greatly in fields such as natural language processing and computer vision, and a number of representative models have emerged, such as the BERT, BART and GPT series models in natural language processing (NLP), the ViT (Vision Transformer) and MOCO models in computer vision, and the Wav2Vec and HuBERT models in the speech field. A key factor in the success of these models is the use of self-supervised learning (SSL) technology. SSL enables a model to make full use of a large amount of unlabelled data from different sources, learning highly generalizable feature characterizations from it.
Further, as contrastive learning and cross-attention Transformer techniques have been extended from single-modality input to multiple modalities, pre-training models for multi-modal data have also received wide attention, such as the speech-text pre-training models SpeechT5 and SLAM, and the visual-text pre-training models CLIP and DALL-E. For input data of different modalities, such models aim to learn the semantic features common among them.
However, content feature characterizations differ greatly between modalities, so a model compatible with multi-modal input data is needed.
In order to solve the above problem, an embodiment of the present application provides a data processing method. The data processing method according to the embodiment of the present application is described in detail below with reference to the drawings.
Referring to fig. 5 and fig. 5 are schematic flowcharts of a model training method provided in the embodiment of the present application, and as shown in fig. 5, the model training method provided in the embodiment of the present application may include steps 501 to 505, which are described in detail below.
501. Acquiring a first feature representation and a second feature representation, wherein the first feature representation is obtained by processing first input data through an encoder, and the second feature representation is obtained by processing second input data through the encoder; the first input data is data of a first mode; the second input data is data of a second mode; the first modality and the second modality are different;
in a possible implementation, steps 501 and 505 may be performed during a pre-training process of the network, or may be performed during a fine-tuning process of the pre-training model.
In a possible implementation, data of a certain modality, for example, first input data or second input data in the embodiment of the present application, may be acquired, where the first input data and the second input data may be training samples in training a model, and the first input data is data of the first modality; the second input data is data of a second mode; the first modality and the second modality are different.
In one possible implementation, the first modality and the second modality are one of speech, text, or images; alternatively, the first and second electrodes may be,
for example, the first modality may be speech and the second modality may be text;
for example, the first modality may be speech, and the second modality may be an image;
for example, the first modality may be text and the second modality may be speech;
for example, the first modality may be text and the second modality may be an image;
for example, the first modality may be an image and the second modality may be speech;
for example, the first modality may be an image and the second modality may be text.
In one possible implementation, corresponding encoders may be deployed for input data of different modalities. That is, for each modality, a corresponding feature encoder of that modality is constructed (such as a Speech Encoder, Text Encoder, Image Encoder, etc.); optionally, the encoder may extract at least two sets of features from the original input data, namely content features (Content) and style features (Style, i.e., non-content features).
In one possible implementation, for text data, the discrete tokens in the text data may be encoded; for image data, the image may be divided into patches, and each patch is then modeled as a discrete token for encoding; for speech data, the speech signal may be represented frame by frame as continuous frequency-domain features, which are then used for model modeling and downstream task processing. New research, represented by Meta's recent Textless NLP work, suggests that speech frames can also be represented as discrete labels (i.e., tokens) and that these tokens can be used, like text, for pre-training model modeling and training.
In one possible implementation, style data may also be obtained; the style data is obtained by processing the first input data through a style extraction network; the style data is information unrelated to the semantics of the first input data. For text data, the style data may exemplarily relate to at least one of: the font, the font size, whether the text is bold, whether the text is italic, the font color, and the like; for image data, the style data may illustratively relate to at least one of: whether the image is a photograph of a real object, whether the image is a cartoon image, whether the image is a black-and-white image, whether the image is a grayscale image, and the like; for speech data, the style data may illustratively relate to at least one of: fundamental frequency (pitch), energy, speaker timbre (speaker embedding), prosody, and the like.
When the sounding body sounds due to vibration, the sound can be decomposed into a plurality of pure sine waves, that is, all natural sounds are basically composed of a plurality of sine waves with different frequencies, wherein the sine wave with the lowest frequency is a fundamental tone (i.e., the fundamental frequency, which can be represented by F0), and the sine waves with higher frequencies are overtones.
Prosody generally refers to features that control the functions of intonation, pitch, accent emphasis, pause, and rhythm. Prosody may reflect the emotional state of the speaker or the form of speech, etc.
Each modality needs to train its codec in a self-supervised manner. Taking the speech modality as an example, the training process of its encoder (SpeechEncoder) and decoder (SpeechDecoder) is as follows: the SpeechEncoder encodes the original input speech data into content features and style features, and the SpeechDecoder reconstructs the original speech data from the content and style features. The training process of SpeechEncoder and SpeechDecoder is not limited to an end-to-end joint mode, an Encoder-before-Decoder staged mode, and the like. Their learning objectives depend at least on the difference between the reconstruction result and the original input data (i.e., the reconstruction loss) and on the minimization of the mutual information between Content and Style. Similarly, the encoders and decoders of other modalities can be trained in this way, so as to obtain the {Encoder, Decoder} of each modality.
502. Mapping the first feature representation to a first discrete label through a first mapping network;
503. mapping, by a second mapping network, the second feature representation to a second discrete label; the difference between the second discrete label and the first discrete label is used to update the second mapping network.
In the embodiment of the present application, corresponding mapping networks may be deployed for data of different modalities, and a mapping network may map feature representations to a discrete space, for example, map a feature representation to one discrete class label. Assuming that the dimension of the original content features is N x D (where N is the number of features and D is the size of each feature vector), they are discretized and converted into N x 1, i.e., a discrete sequence of length N. The label dictionary formed by all the class labels is the discrete space u, and its size is the total number of distinct class labels, for example K; the construction of the discrete space is not limited to clustering (i.e., clustering the content features into K clusters using a clustering algorithm such as K-means, so that each content feature can be mapped to the label of the cluster closest to it), VQ-VAE, and the like.
In order to ensure that the subsequent task network can be compatible with multi-modal data at the same time, that is, data in different modalities are input equally and indiscriminately, feature expressions of data in different modalities can be mapped to the same discrete space, for example, data in different modalities having the same semantic feature can be mapped to the same label in the discrete space, and data in different modalities having similar semantic features can be mapped to similar labels in the discrete space.
Specifically, one of the plurality of modalities may be used as a reference modality and the other modalities as auxiliary modalities; the corresponding mapping network is trained on data of the reference modality, and the mapping networks of the auxiliary modalities are aligned to the discrete space determined by the reference modality.
Next, a schematic of converting a feature representation (which may also be referred to as a content representation in the present application) of different modalities into a unified discrete representation space u is introduced:
referring to FIG. 7, assuming that there are N discrete Labels in the dictionary set (alternatively referred to as discrete space) of u, the content of each sample from a different modality will be represented by a sequence consisting of Labels in u; in the following description, a Speech2Unit, a Text2Unit and an Image2Unit are used to respectively represent a model in which Speech content is mapped to u, a model in which Text content is mapped to u and a model in which Image content is mapped to u;
pre-training a language model: the base language model (denoted as unit LM, unit based language model) is pre-trained using discrete content sequences of input data from different modalities. The model can be constructed by referring to Bert or Bart models in natural language processing, the input and output of the model are discrete content sequences in training, and the model learns abundant semantic information among the sequences in an automatic supervision training mode;
the training data required in this stage is a large amount of unsupervised data of each modality, and a small amount of parallel corpora (i.e. labeled paired data) between each auxiliary modality and the reference modality.
In one possible implementation, a reference modality may be determined; for data of a plurality of different modalities (e.g., speech, text, image), one of the modalities (e.g., speech) is first selected as the reference modality, and the remaining modalities are treated as auxiliary modalities. In the subsequent training steps, the discrete space is determined mainly by the data of the reference modality, and the auxiliary modalities are aligned to the discrete space determined by the reference modality;
in one possible implementation, the discrete space u may be constructed; specifically, the discrete space may be constructed from the content features output by the Encoder of the reference modality. This step maps each content feature (originally a vector representation) to a discrete class label. Assuming that the dimension of the original content features is N x D (where N is the number of features and D is the size of each feature vector), they are discretized and converted into N x 1, i.e., a discrete sequence of length N. The label dictionary formed by all the class labels is the discrete space u, and its size is the total number of distinct class labels, for example K; the construction of the discrete space is not limited to clustering (i.e., clustering the content features into K clusters using a clustering algorithm such as K-means, so that each content feature can be mapped to the label of the cluster closest to it), VQ-VAE, and the like;
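The N x D to N x 1 discretization step can be sketched as nearest-cluster assignment; the hand-made two-entry codebook below is an illustrative assumption (in the text the codebook would come from K-means clustering or VQ-VAE on the reference modality's content features):

```python
import numpy as np

def discretize(features, codebook):
    """Map each D-dim content feature to the label of the nearest codebook
    entry, turning an (N, D) feature matrix into a length-N discrete sequence."""
    # Squared distance from each of the N features to each of the K centroids.
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    return d.argmin(axis=1)  # (N,) -- the N x 1 discrete sequence
```

Each label in the output indexes the label dictionary (discrete space u) of size K.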
in one possible implementation, referring to FIG. 6, each auxiliary modality may be mapped to u, assuming the reference modality is Speech, S3 has constructed a discrete space u and a Speech2Unit. For other auxiliary modalities, a conversion model mapped to the discrete space u needs to be newly constructed. Taking Text as an example, a Text2Unit model needs to be constructed and trained in this step. The Text2Unit needs a small amount of supervision data (namely labeled paired Text voice data) to perform modeling, content characteristics of a Text are used as model input, and a discrete Unit sequence of a corresponding voice is used as target output. Similarly, for x2Unit models of other modalities, the similar method can be adopted to construct.
504. And executing a first target task through a first task network according to the first discrete label to obtain a first result.
505. And executing the target task through the language model according to the second discrete label to obtain a second result.
In one possible implementation, a pre-trained model (which may subsequently be referred to as a language model) may be trained based on the first discrete label and the second discrete label; for example, part of the discrete labels may be masked and predicted. Since the discrete labels themselves are known, this amounts to performing self-supervised learning of the pre-trained model.
In one possible implementation, the first discrete label includes a first sub-representation and a second sub-representation; predicting data of the position of the second sub-representation in the first discrete label through a language model according to the first sub-representation to obtain a first result; the difference between the first result and the second sub-representation is used to update the language model.
In one possible implementation, the second discrete label includes a third sub-representation and a fourth sub-representation; predicting data of the position of the fourth sub-representation in the second discrete label through a language model according to the third sub-representation to obtain a second result; the difference between the second result and the fourth sub-representation is used to update the language model.
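The mask-and-predict data flow of the two implementations above can be sketched as follows; the majority-vote "predictor" is a deliberately trivial stand-in, an assumption for illustration only, in place of the language model, meant solely to make the masking mechanics concrete:

```python
from collections import Counter

MASK = -1  # sentinel standing in for a [MASK] token

def mask_sequence(labels, positions):
    """Hide the labels at the given positions (the second/fourth
    sub-representation), leaving the rest (the first/third) visible."""
    return [MASK if i in positions else l for i, l in enumerate(labels)]

def predict_masked(masked):
    """Toy predictor: fill each masked position with the most frequent
    visible label. A real language model learns this prediction instead."""
    visible = [l for l in masked if l != MASK]
    fill = Counter(visible).most_common(1)[0][0]
    return [fill if l == MASK else l for l in masked]
```

The difference between the prediction at the masked positions and the original labels there is the self-supervised training signal.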
It should be understood that the first result and the second result described above may also belong to discrete labels within a discrete space.
For example, taking the language model as unitLM as an example, the unlabeled data of different modalities can be mapped to the same discrete space u in the above manner, so as to obtain a large number of Unit sequences derived from data of different modalities. LM training is carried out by utilizing the Unit sequences, and a network model of LM can refer to a pre-training model structure in natural language processing, such as a BERT model, a BART model and the like.
In one possible implementation, for data of different modalities, corresponding decoders may be deployed to perform data generation tasks of the corresponding modalities.
After obtaining the discrete label, the execution of the generating task may be performed by a decoder, for example, a first prediction result corresponding to the first input data may be obtained by a first decoder corresponding to the first modality according to the first sub-representation and the first result (or according to the first discrete label); the difference between the first prediction result and the first input data is used to update the first decoder. According to the second sub-representation and the second result (or according to a second discrete tag), obtaining a second prediction result corresponding to the second input data through a second decoder corresponding to the second modality; the difference between the second prediction result and the second input data is used to update the second decoder.
In a possible implementation, the style data obtained as described above may also be used as input to a decoder.
Illustratively, a reconstruction model (i.e., decoder) such as a SpeechDecoder, TextDecoder, ImageDecoder, etc. may be constructed for each of the different modalities. The reconstruction model reconstructs the original input of its modality based on the discrete content sequences and the corresponding Style features.
In the fine-tuning (Finetune) stage: according to the specific task scenario, the supervision data in that scenario can be used to finetune the language model (such as unitLM) and the Decoder, so that optimal task performance can be obtained.
Illustratively, the xEncoder trained in the previous stage can be used to process the source input and the target output in the supervised data to extract their Content and Style features, denoted {Contentsrc, Stylesrc} and {Contenttgt, Styletgt} respectively; the content features Contentsrc and Contenttgt are mapped to the u space using the corresponding x2Unit models, and the results are denoted Unitsrc and Unittgt, respectively. Supervised training is then performed with Unitsrc and Unittgt on the basis of the unitLM trained in the previous stage. Finally, the Decoder model of the corresponding modality is finetuned using the Style features specified by the user and the unit sequences generated in the current task scenario.
The embodiment of the application provides a model training method, which comprises the following steps: acquiring a first feature representation and a second feature representation, wherein the first feature representation is obtained by processing first input data through an encoder, and the second feature representation is obtained by processing second input data through the encoder; the first input data is data of a first mode; the second input data is data of a second modality; the first modality and the second modality are different; mapping the first feature representation to a first discrete label through a first mapping network; mapping, by a second mapping network, the second feature representation to a second discrete label; the difference between the second discrete label and the first discrete label is used to update the second mapping network; executing a first target task through a first task network according to the first discrete label to obtain a first result; and executing the target task through the language model according to the second discrete label to obtain a second result. By mapping the feature representations of the data of different modalities into the same discrete space, different input data can be uniformly and indiscriminately trained, and thus, the multi-modal feature representation can be modeled based on the u-space.
The method in the embodiments of the present application will be described with reference to several specific scenarios:
scene 1: translation between speech texts
In this embodiment, an implementation process of the embodiment of the present application is described by taking an arbitrary translation scenario between voice texts (that is, a user may input a text or a voice and requires a system to output a corresponding translated target text or voice) as an example.
The scenario involves data processing in two different modalities, speech and text. Specifically, without loss of generality, the implementation is described in terms of english-to-chinese, and in terms of varying degrees of task output requirements.
As shown in fig. 8, the task involves the encoding and decoding of at least two modalities. Since there is no particular requirement on the style of the output, the description of style features is omitted in fig. 8 (i.e., the SpeechEncoder is considered to output only content features). The present embodiment highlights the discrete representation and joint modeling of multi-modal content. Specifically, the following steps S1 to S8 may be included; it should be understood that there is no strict timing constraint between steps S1 to S8.
S1: selecting the voice as a reference mode (determining the reference mode), wherein the text mode is an auxiliary mode;
s2: for voice modes, in order to obtain accurate content characteristics and reconstruct, the pre-training model SPIRAL base model is used as a SpeechEncoder, and the HifigAN model is used as a SpeechDecoder in the embodiment of the application; for the Text modality, the textencor can be simply designed to convert the original Text into a Phoneme sequence representation, and the textdecor is the logic for converting phonemes into Text, so that they can be modeled using a graph-to-phone tool (e.g., g2 pM) and CTC network, respectively. In the training process, the SpeechEncoder can use a large amount of unmarked Chinese voice corpora and English voice corpora to train according to the SPIRAL method; the SpeechDecode can use the phonetic corpus of a single speaker (named SpeakerA), the input of which is the characteristic of the voice after passing through SpeechEncoder and the output is the Waveform of the voice, and train the SpeechDecode model corresponding to the tone color of SpeakerA. The TextEncoder uses the off-the-shelf g2pM, while the TextDecoder may not be trained for this step.
S3: (constructing the discrete space u) the features output by the SpeechEncoder in S2 are clustered using the K-means clustering algorithm (the number of target clusters is set to K = 1024 in this embodiment). After the cluster training is completed, 1024 clusters are generated, and the IDs (i.e., class labels) of the clusters together with their center vectors form the target Codebook. Each SpeechEncoder output feature is mapped to the Codebook by nearest-neighbor matching, and thereby to its class label (i.e., discretization, Speech2Unit). The resulting Codebook is also the target discrete space u.
S4: and (mapping each auxiliary modality to u) constructing a Text2Unit model by using a small amount of Text voice parallel corpora (from a voice synthesis training corpus or a voice recognition corpus). Assuming that the corpus is labeled as { X, Y }, X is converted into a phoneme sequence through a TextEncoder (namely g2 pM), Y is converted into a Unit sequence through a SpeechEncoder and a discretization operation, and then the Text2Unit is constructed and trained by using an encoder-decoder model based on a Transformer. In this example, 6-layer transformers were used for the Encoder and Decoder of Text2Unit, respectively. Finally, the Text corpora of all Text modes are converted into a Unit sequence form through a Text2Unit model.
Although not shown in fig. 7, the following steps may be further included:
s5: (construction and training unitLM) the BART model is used in this example as the language model here. Namely, all the unit sequences obtained from S3 and S4 are used as a training set, and self-pre-training is performed completely according to the training mode of BART.
S6: (Finetune unit LM for constructing translation model)
Translation parallel corpora in all English-Chinese directions are collected (which may be in the forms text->speech, speech->text, and the like), and all English source data and target Chinese data are converted into unit sequences according to their modalities, denoted Usrc and Utgt respectively. Finetune training is performed with {Usrc, Utgt} on the basis of the BART model pre-trained in the previous step, and the finally obtained model is the target translation model.
S7: (Finetune Decoder model)
Finetune SpeechDecoder: for the speaker training corpus, all unit sequences of text and speech are obtained, and finetune training is then performed on the basis of the SpeechDecoder using these unit sequences as input and the original Waveform as output.
TextDecoder training: and collecting corpora of which the target mode in the training set is Text, outputting the Text as a target, and taking the unit in the middle as an input to train the CTC model. And finally obtaining the CTC model which is the TextDecoder.
After the steps, no matter English text or English voice is input, the system can translate the input into corresponding Chinese content (which can be presented in a text form or in a SpeakerA tone).
To ensure style preservation in end-to-end speech translation, the objective of this problem additionally includes consistency of speech timbre and style compared with the above embodiment, and thus involves the control and utilization of speech style features. This embodiment highlights the decoupling of content and style, and thus the controlled generation of the target. Taking the following as an example, when different parts of a sentence are stressed, different meanings can be expressed:
in the text "I love studying English", if "I" is stressed, the emphasis is that I myself like to learn English, not others, so the corresponding word in the translated Chinese also needs to be stressed, i.e., the "I" in "I like learning English";
similarly, in the text "I love studying English", if "love" is stressed, the emphasis is that I like learning English very much, and the "like" in the translated "I like learning English" should likewise be stressed;
in the text "I love studying English", if "English" is stressed, the emphasis is that I like learning English rather than other languages, and "English" needs to be stressed after translation.
This example describes a specific implementation of the preservation of Style features (including timbre and stress), taking speech translation as an example.
In order to make full use of the training data in the speech and text modalities and ensure the content correctness of the speech translation, the specific implementation of this embodiment is substantially the same as that of embodiment A; the main change lies in the extraction and utilization of Style, i.e., the SpeechEncoder and SpeechDecoder in this embodiment are different (as shown in fig. 9).
In this embodiment, the SpeechEncoder is implemented using a ContentEncoder model and a StyleEncoder model, whose structure is shown in fig. 10. The ContentEncoder is still implemented with the pre-trained SPIRAL base model, and its training can be completely consistent with the steps of embodiment A. The StyleEncoder is composed of three sub-encoders, {Ep, Ee, Es}, which are used to encode three kinds of features: fundamental frequency (pitch), energy, and speaker timbre (speaker embedding), respectively. Following the practice in speech-resynthesis work, both Ep and Ee adopt a pre-trained VQ-VAE model; a discrete codebook of size 20 is constructed for the pitch and energy features respectively, so that pitch and energy each output one discrete feature every 80 ms. Es extracts a fixed-size (265-dimensional) speaker embedding vector from each input speech sample; this embodiment adopts as Es a d-vector extraction model pre-trained on a multi-speaker voiceprint speech data set.
In the pre-training stage, the extracted unit sequence of Content and the pitch, energy, speaker embedding and other features are up-sampled and spliced according to each feature's sampling rate to serve as the input of a HiFi-GAN model, and the HiFi-GAN speech generation model is trained. However, in the Finetune stage, the input content unit sequence is the unit sequence of the translated target language, which mismatches the original pitch and energy. To solve this problem, two additional conversion models are trained in the Finetune stage, namely 1) a target energy prediction model and 2) a target pitch prediction model;
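The "up-sample and splice according to each feature's sampling rate" step can be sketched as below; the particular frame rates and the 2x up-sampling factor are illustrative assumptions (e.g., one pitch/energy code every 80 ms vs. a finer content-unit rate), not values fixed by this embodiment:

```python
def upsample(seq, factor):
    """Repeat each element `factor` times to match a finer frame rate."""
    return [x for x in seq for _ in range(factor)]

def splice(content_units, pitch, energy, factor):
    """Align the coarser pitch/energy codes to the content-unit rate,
    then concatenate the streams frame by frame as the vocoder input."""
    p = upsample(pitch, factor)
    e = upsample(energy, factor)
    return [(c, pi, ei) for c, pi, ei in zip(content_units, p, e)]
```

Each spliced frame then carries content plus style information at a single common rate.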
for the energy prediction model, given the unit sequence (Unitsrc) and energy sequence (Energysrc) of the input speech and the translated target unit sequence (Unittgt), it can be constructed using a 3-layer Transformer model with a cross-attention mechanism, as shown in fig. 11.
For the pitch prediction model, reference can be made to the network design of the pitch predictor in FastSpeech 2 (i.e., two 1D CNN layers plus a Linear layer); in this embodiment, a unit sequence is input, and the output is the discrete features extracted from the target speech by the pitch VQ-VAE model.
Scene 2: emotional voice conversation
Scene description: the user A and the robot B carry out voice interactive conversation, and the robot B can sense the voice and the content emotion of the user and automatically generate corresponding emotion voice during reply.
The realization path is as follows: emotion is also a type of Style feature, and guides the extraction of Content features and emotion features, and the speech generation of the Content features and emotion features, respectively, in a SpeechEncoder/SpeechDecoder implementation.
Referring to FIG. 12, as in scenario 1, the pre-training of unitLM can still be carried out using all available text and speech data: the data is discretized and the BART model is then pre-trained uniformly. In the Finetune stage, all available dialogue parallel corpora are used to finetune the unitLM, producing the final unitLM dialogue generation model. For emotion features, a pre-trained speech emotion recognition model can be used (in this embodiment, the feature vectors output by the last fully-connected layer of that model are used as emotion features). For emotion prediction of the target dialogue, an emotion conversion model can be constructed. The inputs of the emotion conversion model are the emotion features of the source speech and the target dialogue result (in unit sequence form) generated by the dialogue model, and the learning target is the target emotion features. The emotion conversion model used in this embodiment has the same network structure as the speech emotion recognition model in the SpeechEncoder, and during training the unit sequence spliced with the source speech emotion features is used as input. In this way, emotion conversion of a voice conversation can be realized.
Scene 3: automatic reading of image and character
Task description: the text content in an image is automatically recognized and directly read out as speech.
As shown in fig. 13, the present embodiment includes the following steps:
in the pre-training phase:
a) For Speech and Text, the Encoder and Decoder are designed and implemented as in the previous two scenario embodiments;
b) For the OCR image, the Encoder implementation comprises the following steps:
i) Scaling the image to a certain resolution (e.g., 448 x 448);
ii) obtaining the location (i.e., the extent region) of the line of text using an OCR tool;
iii) Extracting visual representations of text lines by using MaskRCNN;
iv) Constructing an Image2Unit model that maps the visual features to the discrete space U (Image2Unit can be trained on parallel OCR-text corpora, i.e., the visual features are used as input and the units obtained from the text are used as the target output);
v) pre-training the language model;
In the task Finetune phase: training is performed on the basis of the language model obtained in the previous step using paired OCR and speech corpora to obtain the final Image-to-speech_unit model. The resulting speech units can then be synthesized into the final target speech by a speech decoder.
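The end-to-end pipeline for this scenario can be summarized, purely schematically, as the composition below; every function is a hypothetical stub standing in for the component named in the corresponding step (OCR tool, MaskRCNN, Image2Unit, speech decoder), not the embodiment's real code:

```python
# Every function below is a hypothetical stub; the returned values are toy data.
def detect_text_lines(image):
    # step ii: OCR tool returns text-line locations (toy bounding box)
    return [(0, 0, 100, 20)]

def extract_visual_features(image, boxes):
    # step iii: MaskRCNN-style visual features (toy integer features)
    return [[1, 2, 0, 1] for _ in boxes]

def image2unit(features):
    # step iv: map visual features into the discrete space U (toy quantizer)
    return [sum(f) % 8 for f in features]

def speech_decoder(units):
    # final synthesis: speech units -> target speech
    return f"waveform({units})"

def image_to_speech(image):
    boxes = detect_text_lines(image)
    feats = extract_visual_features(image, boxes)
    units = image2unit(feats)
    return speech_decoder(units)

print(image_to_speech("scaled_448x448_image"))  # waveform([4])
```

The point of the sketch is only the data flow: image → text-line boxes → visual features → discrete units → synthesized speech.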
In addition, an embodiment of the present application further provides a data processing method, which may be applied to an inference stage of a model, where the model may be a model that is pre-trained or fine-tuned in the foregoing embodiment, and the method includes:
acquiring a first feature representation and a second feature representation, wherein the first feature representation is a feature representation of first input data, and the second feature representation is a feature representation of second input data; the first input data is data of a first modality; the second input data is data of a second modality; the first modality and the second modality are different;
mapping the first feature representation to a first discrete label through a first mapping network;
mapping the second feature representation to a second discrete label through a second mapping network; wherein the first mapping network and the second mapping network are used for mapping input data to the same discrete space;
executing a first target task through a first task network according to the first discrete label to obtain a first result;
and executing a second target task through a second task network according to the second discrete label to obtain a second result.
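To illustrate how two mapping networks can land feature representations of different modalities in the same discrete space, here is a toy vector-quantization sketch; the codebook and feature values are invented for illustration and do not come from the embodiment:

```python
import numpy as np

def map_to_discrete(features, codebook):
    # Assign each feature vector the index of its nearest codebook entry,
    # i.e. a discrete label in the space shared by both mapping networks.
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return d.argmin(axis=1)

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])  # toy discrete space
speech_feats = np.array([[0.1, -0.1], [1.9, 2.1]])  # first-modality features
text_feats = np.array([[0.05, 0.0], [2.0, 1.9]])    # second-modality features
print(map_to_discrete(speech_feats, codebook))  # [0 2]
print(map_to_discrete(text_feats, codebook))    # [0 2] -- same labels for same semantics
```

When both modalities' features for the same content fall near the same codebook entries, downstream task networks can operate on a single shared label vocabulary.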
In one possible implementation, when training the first mapping network and the second mapping network, the truth labels of the input data having the same semantics are the same discrete values.
In a possible implementation, in training the second mapping network, a truth label corresponding to input data of the second mapping network is a target value, and the target value is a discrete value obtained when the first mapping network processes data having the same semantic meaning as the input data of the second mapping network.
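The training signal described here — pushing the second mapping network's output toward the discrete value produced by the first network — can be sketched as a standard cross-entropy against that target value; the logits below are made-up numbers for illustration:

```python
import numpy as np

def cross_entropy(logits, target_idx):
    # Softmax cross-entropy of the second network's logits over the discrete
    # space against the discrete value (target value) produced by the first
    # mapping network for semantically identical input.
    z = logits - logits.max()               # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target_idx]

teacher_label = 2                                  # from the first mapping network
student_logits = np.array([0.1, 0.2, 3.0, -1.0])   # second network's scores (toy)
loss = float(cross_entropy(student_logits, teacher_label))
print(loss > 0.0)  # True; small, since the target logit already dominates
```

Minimizing this loss over paired data drives the second mapping network to emit the same discrete labels as the first for inputs with the same semantics.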
In one possible implementation, style data may be obtained; the style data is obtained by processing the first input data through a style extraction network; the style data is information which is irrelevant to semantics in the first input data;
and executing the target task through a language model according to the first discrete label and the style data.
Referring to fig. 14, which is a schematic structural diagram of a model training apparatus provided in an embodiment of the present application; as shown in fig. 14, the apparatus 1400 includes:
an obtaining module 1401, configured to obtain a first feature representation and a second feature representation, where the first feature representation is obtained by processing first input data through an encoder, and the second feature representation is obtained by processing second input data through the encoder; the first input data is data of a first modality; the second input data is data of a second modality; the first modality and the second modality are different;
for specific description of the obtaining module 1401, reference may be made to the description of step 501 in the foregoing embodiment, and details are not described here.
A mapping module 1402, configured to map the first feature representation to a first discrete label via a first mapping network;
mapping, by a second mapping network, the second feature representation to a second discrete label; the difference between the second discrete label and the first discrete label is used to update the second mapping network;
for a detailed description of the mapping module 1402, reference may be made to the description of step 502 and step 503 in the foregoing embodiment, which is not described herein again.
A task module 1403, configured to execute a first target task through a first task network according to the first discrete tag, so as to obtain a first result;
and executing a second target task through a second task network according to the second discrete label to obtain a second result.
For a detailed description of the task module 1403, reference may be made to the description of step 504 and step 505 in the foregoing embodiment, which is not described herein again.
In one possible implementation, the first discrete label includes a first sub-representation and a second sub-representation; the task module is specifically configured to:
predicting the data of the position of the second sub-representation in the first discrete label through a language model according to the first sub-representation to obtain a first result; the difference between the first result and the second sub-representation is used to update the language model.
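The masked-position objective just described can be sketched with a toy stand-in for the language model: the visible first sub-representation conditions a lookup of scores over the vocabulary, and the mismatch with the true second sub-representation is what would drive the update. The lookup table and token values below are hypothetical:

```python
import numpy as np

# Toy "language model": a lookup table of logits keyed by the visible tokens.
toy_lm_logits = {
    (3,): np.array([0.1, 0.2, 2.5, 0.0]),  # visible token 3 -> vocab scores
}

def predict_masked(visible):
    # Predict the token at the masked position given the visible tokens.
    return int(toy_lm_logits[tuple(visible)].argmax())

first_sub = [3]   # visible part of the first discrete label
second_sub = 2    # ground truth at the masked position
print(predict_masked(first_sub) == second_sub)  # True
```

A real language model would compute these scores from learned parameters; when the prediction disagrees with the ground-truth sub-representation, that difference supplies the training loss.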
In one possible implementation, the second discrete label includes a third sub-representation and a fourth sub-representation; the task module is specifically configured to:
predicting data of the position of the fourth sub-representation in the second discrete label through a language model according to the third sub-representation to obtain a second result; the difference between the second result and the fourth sub-representation is used to update the language model.
In one possible implementation, the task module is further configured to:
obtaining a first prediction result corresponding to the first input data through a first decoder corresponding to the first modality according to the first sub-representation and the first result; the difference between the first prediction result and the first input data is used to update the first decoder.
In one possible implementation, the task module is further configured to:
according to the second sub-representation and the second result, obtain a second prediction result corresponding to the second input data through a second decoder corresponding to the second modality; the difference between the second prediction result and the second input data is used to update the second decoder.
In one possible implementation, the task module is further configured to: according to the first discrete label, obtain a first prediction result corresponding to the first input data through a first decoder corresponding to the first modality; the difference between the first prediction result and the first input data is used to update the first decoder.
In one possible implementation, the first modality and the second modality are one of speech, text, or images; alternatively, the first and second liquid crystal display panels may be,
in one possible implementation, the obtaining module is further configured to:
obtaining style data; the style data is obtained by processing the first input data through a style extraction network; the style data is information which is irrelevant to semantics in the first input data;
the task module is specifically configured to:
and executing a target task through a language model according to the first discrete label and the style data.
An embodiment of the present application further provides a data processing apparatus, where the apparatus includes:
an obtaining module, configured to obtain a first feature representation and a second feature representation, where the first feature representation is a feature representation of first input data, and the second feature representation is a feature representation of second input data; the first input data is data of a first modality; the second input data is data of a second modality; the first modality and the second modality are different;
a mapping module for mapping the first feature representation to a first discrete label over a first mapping network;
mapping, by a second mapping network, the second feature representation to a second discrete label; wherein the first mapping network and the second mapping network are used for mapping input data to the same discrete space;
the task module is used for executing a first target task through a first task network according to the first discrete label to obtain a first result;
and executing a second target task through a second task network according to the second discrete label to obtain a second result.
In one possible implementation, when training the first mapping network and the second mapping network, the truth labels of the input data having the same semantics are the same discrete values.
In a possible implementation, when the second mapping network is trained, the truth label corresponding to the input data of the second mapping network is a target value, and the target value is the discrete value obtained when the first mapping network processes data having the same semantics as the input data of the second mapping network.
In one possible implementation, the obtaining module is further configured to:
obtaining style data; the style data is obtained by processing the first input data through a style extraction network; the style data is information which is irrelevant to semantics in the first input data;
the task module is specifically configured to:
and executing a target task through a language model according to the first discrete label and the style data.
Referring to fig. 15, which is a schematic structural diagram of an execution device provided in an embodiment of the present application. The execution device 1500 may be embodied as a virtual reality (VR) device, a mobile phone, a tablet, a notebook computer, an intelligent wearable device, a monitoring data processing device, or a server, which is not limited herein. Specifically, the execution device 1500 includes: a receiver 1501, a transmitter 1502, a processor 1503, and a memory 1504 (the number of processors 1503 in the execution device 1500 may be one or more; one processor is taken as an example in fig. 15), where the processor 1503 may comprise an application processor 15031 and a communication processor 15032. In some embodiments of the application, the receiver 1501, the transmitter 1502, the processor 1503, and the memory 1504 may be connected by a bus or other means.
The memory 1504 may include read-only memory and random access memory, and provides instructions and data to the processor 1503. A portion of the memory 1504 may also include non-volatile random access memory (NVRAM). The memory 1504 stores processor-executable operating instructions, executable modules or data structures, a subset thereof, or an extended set thereof, where the operating instructions may include various operating instructions for performing various operations.
The processor 1503 controls the operation of the execution device. In a particular application, the various components of the execution device are coupled together by a bus system that may include a power bus, a control bus, a status signal bus, etc., in addition to a data bus. For clarity of illustration, the various buses are referred to in the figures as a bus system.
The method disclosed in the above embodiments of the present application may be applied to the processor 1503 or implemented by the processor 1503. The processor 1503 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be completed by hardware integrated logic circuits or software instructions in the processor 1503. The processor 1503 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor 1503 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory 1504; the processor 1503 reads the information in the memory 1504 and, in conjunction with its hardware, performs the steps of the model inference process in the above-described method.
The receiver 1501 may be used to receive input numeric or character information and generate signal inputs related to performing relevant settings and function control of the device. The transmitter 1502 may be configured to output numeric or character information via the first interface; the transmitter 1502 may also be configured to send instructions to the disk pack via the first interface to modify data in the disk pack; the transmitter 1502 may also include a display device such as a display screen.
Referring to fig. 16, which is a schematic structural diagram of a training device provided in an embodiment of the present application. Specifically, the training device 1600 is implemented by one or more servers and may vary considerably in configuration or performance; it may include one or more central processing units (CPUs) 1616 (e.g., one or more processors), a memory 1632, and one or more storage media 1630 (e.g., one or more mass storage devices) storing application programs 1642 or data 1644. The memory 1632 and the storage medium 1630 may provide transient or persistent storage. The program stored on the storage medium 1630 may include one or more modules (not shown), each of which may include a series of instruction operations on the training device. Still further, the central processor 1616 may be configured to communicate with the storage medium 1630 to execute the series of instruction operations in the storage medium 1630 on the training device 1600.
The training device 1600 may also include one or more power supplies 1626, one or more wired or wireless network interfaces 1650, one or more input/output interfaces 1658, and/or one or more operating systems 1641, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
In this embodiment, the central processor 1616 is configured to execute actions related to model training in the above embodiments.
Embodiments of the present application also provide a computer program product, which when executed on a computer causes the computer to perform the steps performed by the aforementioned execution device, or causes the computer to perform the steps performed by the aforementioned training device.
Also provided in an embodiment of the present application is a computer-readable storage medium, in which a program for signal processing is stored, and when the program is run on a computer, the program causes the computer to execute the steps executed by the aforementioned execution device, or causes the computer to execute the steps executed by the aforementioned training device.
The execution device, the training device, or the terminal device provided in the embodiment of the present application may specifically be a chip, where the chip includes: a processing unit, which may be, for example, a processor, and a communication unit, which may be, for example, an input/output interface, a pin or a circuit, etc. The processing unit may execute the computer execution instructions stored in the storage unit to enable the chip in the execution device to execute the data processing method described in the above embodiment, or to enable the chip in the training device to execute the data processing method described in the above embodiment. Optionally, the storage unit is a storage unit in the chip, such as a register, a cache, and the like, and the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a Random Access Memory (RAM), and the like.
Specifically, referring to fig. 17, which is a schematic structural diagram of a chip provided in an embodiment of the present application. The chip may be embodied as a neural network processor (NPU) 1700. The NPU 1700 is mounted on a host CPU as a coprocessor, and the host CPU allocates tasks. The core portion of the NPU is the arithmetic circuit 1703; the controller 1704 controls the arithmetic circuit 1703 to extract matrix data from memory and perform multiplication.
In some implementations, the arithmetic circuit 1703 internally includes multiple processing elements (PEs). In some implementations, the arithmetic circuit 1703 is a two-dimensional systolic array; it may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 1703 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the corresponding data of the matrix B from the weight memory 1702 and buffers it in each PE in the arithmetic circuit. The arithmetic circuit fetches the matrix a data from the input memory 1701, performs matrix arithmetic on the matrix a data and the matrix B data, and stores a partial result or a final result of the matrix in an accumulator (accumulator) 1708.
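The accumulate step described above can be modeled in a few lines: on each "beat" of the systolic pass, a rank-1 partial product of a column of A against the buffered row of B is summed into the accumulator. The matrices below are toy values chosen only to show that the accumulated sum equals the full matrix product:

```python
import numpy as np

def matmul_accumulate(a, b):
    # Accumulate rank-1 partial products, as the accumulator 1708 would
    # hold partial results over successive beats of the systolic pass.
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    acc = np.zeros((m, n))  # plays the role of the accumulator
    for t in range(k):
        acc += np.outer(a[:, t], b[t, :])
    return acc

a = np.array([[1.0, 2.0], [3.0, 4.0]])  # input matrix A
b = np.array([[5.0, 6.0], [7.0, 8.0]])  # weight matrix B
print(matmul_accumulate(a, b))  # same result as a @ b
```

This is only a functional model of the data flow, not a cycle-accurate description of the hardware.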
The unified memory 1706 is used for storing input data and output data. Weight data is transferred directly to the weight memory 1702 via a direct memory access controller (DMAC) 1705. Input data is also carried into the unified memory 1706 through the DMAC.
The BIU is the bus interface unit 1710, which is used for interaction between the AXI bus and the DMAC and the instruction fetch buffer (IFB) 1709.
The bus interface unit (BIU) 1710 is used for the instruction fetch memory 1709 to obtain instructions from the external memory, and for the memory access controller 1705 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 1706, or transfer weight data to the weight memory 1702, or transfer input data to the input memory 1701.
The vector calculation unit 1707 includes a plurality of operation processing units and, if necessary, further processes the output of the arithmetic circuit 1703, e.g., vector multiplication, vector addition, exponential operation, logarithmic operation, and magnitude comparison. It is mainly used for non-convolutional/fully-connected layer computation in the neural network, such as batch normalization, pixel-level summation, and up-sampling of feature planes.
In some implementations, the vector calculation unit 1707 can store the processed output vectors to the unified memory 1706. For example, the vector calculation unit 1707 may apply a linear or non-linear function to the output of the arithmetic circuit 1703, such as linearly interpolating the feature planes extracted by the convolutional layers, or applying a non-linear function to a vector of accumulated values to generate activation values. In some implementations, the vector calculation unit 1707 generates normalized values, pixel-level summed values, or both. In some implementations, the processed output vectors can be used as activation inputs to the arithmetic circuit 1703, e.g., for use in subsequent layers in a neural network.
An instruction fetch buffer 1709 connected to the controller 1704 is used for storing instructions used by the controller 1704.
The unified memory 1706, the input memory 1701, the weight memory 1702, and the instruction fetch memory 1709 are all on-chip memories. The external memory is independent of the NPU hardware architecture.
The processor mentioned in any of the above may be a general purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above programs.
It should be noted that the above-described embodiments of the apparatus are merely schematic, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiments of the apparatus provided in the present application, the connection relationship between the modules indicates that there is a communication connection therebetween, which may be specifically implemented as one or more communication buses or signal lines.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus the necessary general-purpose hardware, or by dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. Generally, any function performed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structures used to implement the same function may vary, e.g., analog circuits, digital circuits, or dedicated circuits. For the present application, however, a software implementation is preferable in most cases. Based on such an understanding, the technical solutions of the present application may be embodied substantially in the form of a software product, which is stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk, and includes several instructions for enabling a computer device (which may be a personal computer, a training device, or a network device) to execute the methods according to the embodiments of the present application.
In the above embodiments, all or part of the implementation may be realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a training device or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), etc.

Claims (27)

1. A method of model training, the method comprising:
acquiring a first feature representation and a second feature representation, wherein the first feature representation is obtained by processing first input data through an encoder, and the second feature representation is obtained by processing second input data through the encoder; the first input data is data of a first modality; the second input data is data of a second modality; the first modality and the second modality are different; the first input data and the second input data are data with the same semantic meaning;
mapping the first feature representation to a first discrete label through a first mapping network;
mapping the second feature representation to a second discrete label through a second mapping network; the difference between the second discrete label and the first discrete label is used to update the second mapping network;
executing a first target task through a first task network according to the first discrete label to obtain a first result;
and executing a second target task through a second task network according to the second discrete label to obtain a second result.
2. The method of claim 1, wherein the first discrete tag comprises a first sub-representation and a second sub-representation; the executing a first target task through a first task network according to the first discrete tag to obtain a first result, including:
predicting data of the position of the second sub-representation in the first discrete label through a language model according to the first sub-representation to obtain a first result; the difference between the first result and the second sub-representation is used to update the language model.
3. The method of claim 1 or 2, wherein the second discrete label comprises a third sub-representation and a fourth sub-representation; the executing a second target task through a second task network according to the second discrete tag to obtain a second result, including:
predicting data of the position of the fourth sub-representation in the second discrete label through a language model according to the third sub-representation to obtain a second result; the difference between the second result and the fourth sub-representation is used to update the language model.
4. The method of claim 2 or 3, further comprising:
obtaining a first prediction result corresponding to the first input data through a first decoder corresponding to the first modality according to the first sub-representation and the first result; the difference between the first prediction result and the first input data is used to update the first decoder.
5. The method of any of claims 2 to 4, further comprising:
according to the second sub-representation and the second result, a second prediction result corresponding to the second input data is obtained through a second decoder corresponding to the second modality; the difference between the second prediction result and the second input data is used to update the second decoder.
6. The method of any of claims 1 to 5, further comprising: according to the first discrete label, obtaining a first prediction result corresponding to the first input data through a first decoder corresponding to the first modality; the difference between the first prediction result and the first input data is used to update the first decoder.
7. The method of any of claims 1 to 6, wherein the first modality and the second modality are one of speech, text, or images; alternatively,
the first result is related to or generated according to the semantic features of the first input data; alternatively, the first and second electrodes may be,
the second result is related to or generated from the semantic features of the second input data.
8. The method of any of claims 1 to 7, further comprising:
obtaining style data; the style data is obtained by processing the first input data through a style extraction network; the style data is information irrelevant to semantics in the first input data;
the executing the target task through the language model according to the first discrete tag comprises:
and executing a target task through a language model according to the first discrete label and the style data.
9. A method of data processing, the method comprising:
acquiring a first feature representation and a second feature representation, wherein the first feature representation is a feature representation of first input data, and the second feature representation is a feature representation of second input data; the first input data is data of a first modality; the second input data is data of a second modality; the first modality and the second modality are different;
mapping the first feature representation to a first discrete label through a first mapping network;
mapping the second feature representation to a second discrete label through a second mapping network; wherein the first mapping network and the second mapping network are used for mapping input data to the same discrete space;
executing a first target task through a first task network according to the first discrete label to obtain a first result;
and executing a second target task through a second task network according to the second discrete label to obtain a second result.
10. The method of claim 9, wherein the truth labels of input data having the same semantics are the same discrete values when training the first mapping network and the second mapping network.
11. The method according to claim 9 or 10, wherein, when training the second mapping network, the truth label corresponding to the input data of the second mapping network is a target value, and the target value is a discrete value obtained when the first mapping network processes data having the same semantic meaning as the input data of the second mapping network.
12. The method according to any one of claims 9 to 11, further comprising:
obtaining style data; the style data is obtained by processing the first input data through a style extraction network; the style data is information irrelevant to semantics in the first input data;
the executing the target task through the language model according to the first discrete tag comprises:
and executing the target task through a language model according to the first discrete label and the style data.
13. A model training apparatus, the apparatus comprising:
an obtaining module, configured to obtain a first feature representation obtained by processing first input data through an encoder and a second feature representation obtained by processing second input data through the encoder; the first input data is data of a first modality; the second input data is data of a second modality; the first modality and the second modality are different; the first input data and the second input data are data with the same semantic meaning;
a mapping module for mapping the first feature representation to a first discrete label over a first mapping network;
mapping, by a second mapping network, the second feature representation to a second discrete label; the difference between the second discrete label and the first discrete label is used to update the second mapping network;
the task module is used for executing a first target task through a first task network according to the first discrete label to obtain a first result;
and executing a second target task through a second task network according to the second discrete label to obtain a second result.
14. The apparatus of claim 13, wherein the first discrete label comprises a first sub-representation and a second sub-representation; the executing a first target task through a first task network according to the first discrete label to obtain a first result comprises:
predicting data of the position of the second sub-representation in the first discrete label through a language model according to the first sub-representation to obtain a first result; the difference between the first result and the second sub-representation is used to update the language model.
15. The apparatus of claim 13 or 14, wherein the second discrete label comprises a third sub-representation and a fourth sub-representation; the executing a second target task through a second task network according to the second discrete label to obtain a second result comprises:
predicting data of the position of the fourth sub-representation in the second discrete label through a language model according to the third sub-representation to obtain a second result; the difference between the second result and the fourth sub-representation is used to update the language model.
16. The apparatus of claim 14 or 15, wherein the task module is further configured to:
obtaining a first prediction result corresponding to the first input data through a first decoder corresponding to the first modality according to the first sub-representation and the first result; the difference between the first prediction result and the first input data is used to update the first decoder.
17. The apparatus of any of claims 14 to 16, wherein the task module is further configured to:
obtaining, according to the second sub-representation and the second result, a second prediction result corresponding to the second input data through a second decoder corresponding to the second modality; the difference between the second prediction result and the second input data is used to update the second decoder.
18. The apparatus of any of claims 13 to 17, wherein the task module is further configured to: obtain, according to the first discrete label, a first prediction result corresponding to the first input data through a first decoder corresponding to the first modality; the difference between the first prediction result and the first input data is used to update the first decoder.
19. The apparatus of any of claims 13 to 18, wherein the first modality and the second modality are each one of speech, text, or images; or,
the first result is related to, or generated according to, the semantic features of the first input data; or,
the second result is related to, or generated according to, the semantic features of the second input data.
20. The apparatus according to any one of claims 13 to 19, wherein the obtaining module is further configured to:
obtaining style data; the style data is obtained by processing the first input data through a style extraction network; the style data is information irrelevant to semantics in the first input data;
the task module is specifically configured to:
and executing the target task through a language model according to the first discrete label and the style data.
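The sub-representation prediction of claims 14–15 can be sketched as follows. The prefix/suffix split of the discrete label and the toy "language model" (a most-frequent-token predictor) are stand-ins for the autoregressive model the claims leave unspecified.

```python
import numpy as np

# A discrete label produced by a mapping network, split per claim 14 into a
# first sub-representation (the known prefix) and a second sub-representation
# (the positions the language model must predict). Values are illustrative.
discrete_label = np.array([3, 1, 4, 1, 5])
first_sub, second_sub = discrete_label[:3], discrete_label[3:]

def predict(prefix, n):
    """Toy stand-in for the language model: fills the unknown positions with
    the most frequent token of the prefix (illustrative, not the patent's model)."""
    vals, counts = np.unique(prefix, return_counts=True)
    return np.full(n, vals[counts.argmax()])

first_result = predict(first_sub, len(second_sub))
# Per claims 14/16: the difference between the prediction and the held-out
# sub-representation is the signal used to update the language model.
loss = (first_result != second_sub).mean()
```

The same shape applies symmetrically to the third/fourth sub-representations of the second discrete label in claim 15.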
21. A data processing apparatus, characterized in that the apparatus comprises:
an obtaining module, configured to obtain a first feature representation and a second feature representation, where the first feature representation is a feature representation of first input data, and the second feature representation is a feature representation of second input data; the first input data is data of a first modality; the second input data is data of a second modality; the first modality and the second modality are different;
a mapping module for mapping the first feature representation to a first discrete label via a first mapping network;
mapping, by a second mapping network, the second feature representation to a second discrete label; wherein the first mapping network and the second mapping network are used for mapping input data to the same discrete space;
the task module is used for executing a first target task through a first task network according to the first discrete label to obtain a first result;
and executing a second target task through a second task network according to the second discrete label to obtain a second result.
22. The apparatus of claim 21, wherein truth labels of input data having the same semantics are the same discrete values when training the first mapping network and the second mapping network.
23. The apparatus according to claim 21 or 22, wherein when training the second mapping network, the truth label corresponding to the input data of the second mapping network is a target value, and the target value is a discrete value obtained when the first mapping network processes data having the same semantic meaning as the input data of the second mapping network.
24. The apparatus according to any one of claims 21 to 23, wherein the obtaining module is further configured to:
obtaining style data; the style data is obtained by processing the first input data through a style extraction network; the style data is information irrelevant to semantics in the first input data;
the task module is specifically configured to:
and executing the target task through a language model according to the first discrete label and the style data.
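The style-data path of claims 12, 20 and 24 can be sketched as follows. The statistics chosen as "style" and the concatenation into the language-model input are illustrative assumptions; the claims only require that the style data be semantics-independent information extracted from the first input data.

```python
import numpy as np

rng = np.random.default_rng(1)

def extract_style(x):
    """Stand-in style extraction network: keeps coarse signal statistics that
    carry no semantic content (an illustrative choice, not the patent's network)."""
    return np.array([x.mean(), x.std()])

first_input = rng.normal(size=16)        # first-modality input data
discrete_label = np.array([2, 7, 7, 1])  # semantic content in the discrete space
style = extract_style(first_input)

# The task network receives both the semantic discrete label and the
# semantics-independent style data when executing the target task.
lm_input = np.concatenate([discrete_label.astype(float), style])
```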
25. A computer storage medium storing one or more instructions that, when executed by one or more computers, cause the one or more computers to perform the operations of the method of any one of claims 1 to 12.
26. A computer program product comprising computer readable instructions which, when run on a computer device, cause the computer device to perform the method of any one of claims 1 to 12.
27. A system comprising at least one processor and at least one memory; the at least one processor and the at least one memory are connected via a communication bus and communicate with each other;
the at least one memory is for storing code;
the at least one processor is configured to execute the code to perform the method of any of claims 1 to 12.
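The decoder branch of claims 16–18 can be sketched as follows. The linear read-out decoder and the squared-error difference are illustrative assumptions, not details taken from the claims; the claims only require a per-modality decoder whose prediction is compared against the original input.

```python
import numpy as np

rng = np.random.default_rng(2)

# Shared codebook of the discrete space (same illustrative device as above).
codebook = rng.normal(size=(8, 4))

# Hypothetical first-modality decoder: a linear read-out from codebook entries
# back toward the input data (purely for illustration).
W = rng.normal(size=(4, 1))

def decode(labels):
    return (codebook[labels] @ W).ravel()

first_input = rng.normal(size=3)      # the first input data
first_discrete = np.array([0, 5, 2])  # its discrete label
first_prediction = decode(first_discrete)

# Per claim 18: the difference between the first prediction result and the
# first input data is the signal used to update the first decoder.
recon_loss = ((first_prediction - first_input) ** 2).mean()
```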
CN202211350390.6A 2022-10-31 2022-10-31 Model training method and device Pending CN115688937A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211350390.6A CN115688937A (en) 2022-10-31 2022-10-31 Model training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211350390.6A CN115688937A (en) 2022-10-31 2022-10-31 Model training method and device

Publications (1)

Publication Number Publication Date
CN115688937A true CN115688937A (en) 2023-02-03

Family

ID=85045715

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211350390.6A Pending CN115688937A (en) 2022-10-31 2022-10-31 Model training method and device

Country Status (1)

Country Link
CN (1) CN115688937A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116434763A (en) * 2023-06-12 2023-07-14 清华大学 Autoregressive audio generation method, device, equipment and storage medium based on audio quantization
CN116502882A (en) * 2023-06-30 2023-07-28 杭州新中大科技股份有限公司 Engineering progress determining method and device based on multi-mode time sequence information fusion
CN116502882B (en) * 2023-06-30 2023-10-20 杭州新中大科技股份有限公司 Engineering progress determining method and device based on multi-mode time sequence information fusion

Similar Documents

Publication Publication Date Title
WO2021051544A1 (en) Voice recognition method and device
CN112288075B (en) Data processing method and related equipment
CN111312245B (en) Voice response method, device and storage medium
CN112257858A (en) Model compression method and device
CN111951805A (en) Text data processing method and device
CN114676234A (en) Model training method and related equipment
CN115688937A (en) Model training method and device
CN113421547B (en) Voice processing method and related equipment
CN112863529B (en) Speaker voice conversion method based on countermeasure learning and related equipment
WO2023207541A1 (en) Speech processing method and related device
CN115512005A (en) Data processing method and device
CN113505193A (en) Data processing method and related equipment
CN114707513A (en) Text semantic recognition method and device, electronic equipment and storage medium
CN116432019A (en) Data processing method and related equipment
CN115221846A (en) Data processing method and related equipment
CN116541492A (en) Data processing method and related equipment
CN113656563A (en) Neural network searching method and related equipment
CN116737895A (en) Data processing method and related equipment
CN116052714A (en) Data processing method and device
CN116910202A (en) Data processing method and related equipment
CN115757692A (en) Data processing method and device
CN115866291A (en) Data processing method and device
CN114333772A (en) Speech recognition method, device, equipment, readable storage medium and product
CN113948060A (en) Network training method, data processing method and related equipment
CN113792537A (en) Action generation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination