WO2023040545A1 - Data processing method, apparatus, device, storage medium and program product - Google Patents

Data processing method, apparatus, device, storage medium and program product

Info

Publication number
WO2023040545A1
WO2023040545A1 (PCT/CN2022/112643)
Authority
WO
WIPO (PCT)
Prior art keywords
model
target
modeling unit
output
fully connected
Prior art date
Application number
PCT/CN2022/112643
Other languages
English (en)
French (fr)
Inventor
凡子威
占吉清
余健
王砚峰
朱运
赵昂
Original Assignee
北京搜狗科技发展有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京搜狗科技发展有限公司
Publication of WO2023040545A1 publication Critical patent/WO2023040545A1/zh

Classifications

    • G06N 20/00 Machine learning
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/08 Learning methods
    • G06N 5/02 Knowledge representation; Symbolic representation
    • G10L 15/26 Speech to text systems

Definitions

  • This application relates to the field of machine learning and in particular to data processing.
  • Pre-training obtains a pre-trained model from large-scale data through self-supervised learning. The pre-trained model can then transfer the knowledge learned from that large-scale data to smaller models, so as to perform other tasks that are unrelated to the pre-trained model's own specific task.
  • That is, a business model can be optimized with the help of the pre-trained model, so that the business model serves other tasks better.
  • In some scenarios, however, the pre-trained model cannot be used to optimize the business model, so the performance of the business model cannot be improved.
  • In view of this, the embodiments of the present application provide a data processing method, apparatus, device, storage medium, and program product, which expand the applicable scope of knowledge distillation and effectively improve the business performance of the business model.
  • In one aspect, an embodiment of the present application provides a data processing method, the method comprising:
  • acquiring business data;
  • inputting the business data into a pre-training model and an initial business model to obtain the output of a first fully connected layer of the pre-training model and the output of a second fully connected layer of the initial business model, wherein the first modeling unit of the pre-training model is constructed at a first granularity, the second modeling unit of the initial business model is constructed at a second granularity, and the first granularity and the second granularity are different granularities;
  • matching the first modeling units with the second modeling units, and determining, from the first modeling units, target first modeling units that respectively match the second modeling units, where the fully connected layer output of a second modeling unit has the same sequence length as the fully connected layer output of the matched target first modeling unit;
  • performing knowledge distillation on the initial business model according to the second fully connected layer output and the fully connected layer outputs respectively corresponding to the target first modeling units in the first fully connected layer output, to obtain a target business model.
  • In another aspect, an embodiment of the present application provides a data processing apparatus, the apparatus comprising:
  • a first acquisition unit configured to acquire business data;
  • an input unit configured to input the business data into a pre-training model and an initial business model, and obtain the output of a first fully connected layer of the pre-training model and the output of a second fully connected layer of the initial business model, wherein the first modeling unit of the pre-training model is constructed at a first granularity, the second modeling unit of the initial business model is constructed at a second granularity, and the first granularity and the second granularity are different granularities;
  • a matching unit configured to match the first modeling units with the second modeling units and determine, from the first modeling units, target first modeling units that respectively match the second modeling units, where the fully connected layer output of a second modeling unit has the same sequence length as the fully connected layer output of the matched target first modeling unit;
  • a determination unit configured to perform knowledge distillation on the initial business model according to the second fully connected layer output and the fully connected layer outputs respectively corresponding to the target first modeling units in the first fully connected layer output, to obtain a target business model.
  • In yet another aspect, an embodiment of the present application provides a computer device, including a processor, a communication interface, a memory and a communication bus;
  • the processor, the communication interface and the memory communicate with one another through the communication bus, and the communication interface is an interface of a communication module;
  • the memory is used to store program code and transmit the program code to the processor; the processor is used to invoke the instructions of the program code in the memory to execute the method described in the above aspect.
  • an embodiment of the present application provides a storage medium, where the storage medium is used to store a computer program, and the computer program is used to execute the method described in the above aspect.
  • the embodiments of the present application provide a computer program product including instructions, which, when run on a computer, cause the computer to execute the method described in the above aspect.
  • Based on the acquired business data, the business data is input into a pre-training model and an initial business model, and the output of a first fully connected layer of the pre-training model and the output of a second fully connected layer of the initial business model are obtained. Since the pre-training model and the initial business model construct their modeling units at different granularities, the sequence length of the first fully connected layer output differs from that of the second fully connected layer output.
  • In this case, the first modeling units of the pre-training model are matched with the second modeling units of the initial business model, so that the fully connected layer output of a second modeling unit has the same sequence length as the fully connected layer output of the matched target first modeling unit. This establishes the basis for knowledge distillation, so that, with the assistance of the pre-trained model, the target business model can be obtained from the initial business model, realizing knowledge distillation between models whose modeling units have different granularities.
  • It follows that even when the pre-training model and the initial business model have modeling units of different granularities, the pre-training model can be used to optimize the initial business model to obtain the target business model, thereby expanding the scope of application of knowledge distillation and effectively improving the business performance of the business model.
  • FIG. 1 is a schematic flow diagram of a data processing method provided in an embodiment of the present application
  • FIG. 2 is a schematic structural diagram of a data processing device provided in an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of a client provided by an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • the inventors of the present application found through research that the business model can be optimized by using the pre-trained model through knowledge distillation.
  • However, there is a prerequisite for using knowledge distillation: the sequence length output by the fully connected layer of the pre-training model must equal the sequence length output by the fully connected layer of the business model. This is because, during knowledge distillation, the relative entropy loss function (Kullback–Leibler divergence loss, KLD loss) is computed from the fully connected layer output of the pre-training model and the fully connected layer output of the business model.
  • For example, if the fully connected layer output of the pre-training model consists of 3 vectors w1, w2 and w3, and the fully connected layer output of the business model consists of 3 vectors L1, L2 and L3, then KLD loss = w1*L1 + w2*L2 + w3*L3 (a rough code sketch of such a distillation loss is given below).
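  • As a rough illustration only (not part of the original disclosure), such a distillation loss can be sketched in PyTorch. The function name, the temperature, the explicit KL-divergence formulation and the random tensors are all assumptions; the formula above is a simplified weighted-sum notation for the same idea.

```python
import torch
import torch.nn.functional as F

def kld_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Hypothetical sketch of the KLD distillation loss.

    Both tensors must have the same shape (sequence_length, num_labels);
    this is exactly the prerequisite described above.
    """
    assert teacher_logits.shape == student_logits.shape
    teacher_prob = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_prob = F.log_softmax(student_logits / temperature, dim=-1)
    # KL(teacher || student), averaged over the sequence positions
    return F.kl_div(student_log_prob, teacher_prob, reduction="batchmean")

# Example with 3 modeling units and 5 labels (w1..w3 versus L1..L3 above)
teacher = torch.randn(3, 5)
student = torch.randn(3, 5)
loss = kld_distillation_loss(teacher, student)
```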
  • In some scenarios, for example when punctuation marks are added to text automatically, the modeling unit of the business model is the word segment, while the modeling unit of the pre-trained model is the single character.
  • Each modeling unit corresponds to one fully connected layer output, and the fully connected layer outputs of the individual modeling units all have the same dimension (for example, the aforementioned w1, w2, w3, L1, L2 and L3 have the same dimension).
  • However, because the modeling units of the pre-training model and the business model are constructed at different granularities (the pre-training model uses characters as its granularity, the business model uses word segments), the two models have different numbers of modeling units. As a result, the sequence length output by the fully connected layer of the business model is inconsistent with that output by the fully connected layer of the pre-trained model, so knowledge distillation cannot be performed. For example:
  • For the text "我是中国人" ("I am Chinese"), the input of the pre-training model includes 5 modeling units, namely the characters "我", "是", "中", "国" and "人", while the input of the business model includes 3 modeling units, namely the word segments "我", "是" and "中国人".
  • Assuming the fully connected layer output corresponding to each modeling unit is a 5-dimensional vector, the fully connected layer output of the pre-training model consists of five 5-dimensional vectors, whereas the fully connected layer output of the business model consists of three 5-dimensional vectors. That is, the sequence length output by the fully connected layer of the business model is inconsistent with the sequence length output by the fully connected layer of the pre-trained model.
  • With regard to the knowledge distillation, pre-trained model, business model, modeling unit and fully connected layer output mentioned here, note the following.
  • Knowledge distillation refers to transferring the knowledge of the pre-trained model to the business model, so as to optimize the business model.
  • Pre-trained models include, but are not limited to, Bert, GPT and ELECTRA.
  • Business models include, but are not limited to, the Bi-directional Long Short-Term Memory (BiLSTM) model.
  • A modeling unit refers to a unit established according to the granularity of the model input, generally a single character or a word segment; a word segment can include one or more characters.
  • The fully connected layer output is a vector of unnormalized label probability values. For example, in the scenario of adding punctuation marks to text, if 4 kinds of punctuation marks can be selected, then the fully connected layer output corresponding to each modeling unit is a 5-dimensional vector whose values indicate the probabilities of the 4 punctuation marks and the probability of no punctuation. The fully connected layer output may also be called the Logits output (see the illustrative example below).
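  • For instance, under the assumed label set below (the concrete punctuation marks and their order are not specified in the original), one modeling unit's Logits output could be interpreted as follows; this is an illustrative sketch only.

```python
import torch
import torch.nn.functional as F

# Assumed label order: comma, period, question mark, exclamation mark, no punctuation
labels = ["，", "。", "？", "！", "none"]

logits = torch.tensor([2.1, 0.3, -1.0, -0.5, 1.7])  # unnormalized fully connected layer (Logits) output
probs = F.softmax(logits, dim=-1)                    # normalized label probabilities
predicted = labels[int(torch.argmax(probs))]         # punctuation decision for this modeling unit
```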
  • To solve the above problem, the embodiments of the present application provide a data processing method that realizes knowledge distillation between models whose modeling units have different granularities, thereby expanding the scope of application of knowledge distillation and effectively improving the business performance of the business model.
  • The data processing method can be implemented by a computer device, which may be a terminal device or a server; the server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing cloud computing services.
  • Terminal devices include but are not limited to mobile phones, computers, intelligent voice interaction devices, smart home appliances, vehicle terminals, aircraft, etc.
  • the terminal device and the server may be connected directly or indirectly through wired or wireless communication, which is not limited in this application.
  • The embodiments of the present application also relate to artificial intelligence (AI). Artificial intelligence is the theory, method, technology and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technique of computer science that attempts to understand the nature of intelligence and produce a new kind of intelligent machine that can respond in a similar way to human intelligence.
  • Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive subject that involves a wide range of fields, including both hardware-level technology and software-level technology.
  • Artificial intelligence basic technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes several major directions such as computer vision technology, speech processing technology, natural language processing technology, machine learning/deep learning, automatic driving, and intelligent transportation.
  • the embodiments of the present application mainly relate to natural language processing technology and machine learning.
  • Natural Language Processing is an important direction in the field of computer science and artificial intelligence. It studies various theories and methods that can realize effective communication between humans and computers using natural language. Natural language processing is a science that combines linguistics, computer science, and mathematics. Therefore, research in this field will involve natural language, that is, the language that people use every day, so it is closely related to the study of linguistics. Natural language processing technologies usually include text processing, semantic understanding, machine translation, robot question answering, knowledge graph and other technologies.
  • Machine learning (Machine Learning, ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It specializes in studying how computers simulate or implement human learning behaviors to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their own performance.
  • Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and its application pervades all fields of artificial intelligence.
  • Machine learning and deep learning usually include techniques such as artificial neural network, belief network, reinforcement learning, transfer learning, inductive learning, and teaching learning.
  • For example, in the embodiments of the present application, natural language processing technology can be used to segment text into word segments, characters and so on, and feature processing can be performed on the segmentation result to obtain the fully connected layer output.
  • Moreover, performing knowledge distillation on the initial business model through the pre-training model is also an effective way of implementing transfer learning.
  • Refer to FIG. 1, which is a schematic flowchart of a data processing method provided by an embodiment of the present application.
  • In this embodiment, a server is used as the aforementioned computer device for illustration, and the method may be implemented, for example, through the following S101-S104.
  • S101: Acquire business data.
  • the business data mentioned in the embodiments of the present application refers to data related to a specific business.
  • The embodiments of this application do not specifically limit the business data.
  • In one possible implementation, if the initial business model and the target business model are used to add punctuation marks to input text, the business data is text without punctuation marks, so that, through knowledge distillation, the obtained target business model has the function of labeling punctuation marks for text without punctuation marks.
  • the text without punctuation marks can be obtained in different ways, which is not limited in this application.
  • the text may be obtained by recognizing speech through a speech recognition technology.
  • S102: Input the business data into the pre-training model and the initial business model, and obtain the output of the first fully connected layer of the pre-training model and the output of the second fully connected layer of the initial business model.
  • The first modeling unit of the pre-training model is constructed at a first granularity, the second modeling unit of the initial business model is constructed at a second granularity, and the first granularity and the second granularity are different granularities.
  • In the embodiments of the present application, the pre-training model and the initial business model may be trained in advance.
  • In one example, the pre-training model may be trained from pre-training data and business data. Specifically, an initial pre-training model can be obtained by training on the pre-training data, and the initial pre-training model can then be fine-tuned (Finetune) using the business data, yielding a pre-training model oriented to the business task corresponding to the business data (a minimal fine-tuning sketch is given below).
  • In other words, the pre-training model mentioned in S101 may be a pre-training model based on that business. The pre-training data mentioned here may be training data unrelated to the business task.
  • In general, the data volume of the pre-training model is relatively large, and using it directly would affect business processing efficiency.
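  • The Finetune step mentioned above can be pictured with the following minimal sketch; the model interface, optimizer, learning rate and loss function are assumptions and not part of the original disclosure.

```python
import torch

def finetune(pretrained_model, business_dataloader, epochs=3, lr=2e-5):
    """Hypothetical sketch: further train the initial pre-training model
    on the (much smaller) business data for the business task."""
    optimizer = torch.optim.AdamW(pretrained_model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    pretrained_model.train()
    for _ in range(epochs):
        for features, labels in business_dataloader:
            optimizer.zero_grad()
            logits = pretrained_model(features)  # assumed shape: (batch, seq_len, num_labels)
            loss = loss_fn(logits.view(-1, logits.size(-1)), labels.view(-1))
            loss.backward()
            optimizer.step()
    return pretrained_model
```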
  • the business tasks corresponding to the pre-training data and the business tasks corresponding to the business data may be different.
  • Moreover, in one possible implementation, the data volume of the pre-training data is greater than the data volume of the business data required to train the target business model, so that the pre-training model can be trained on relatively sufficient pre-training data; the resulting pre-trained model therefore has high accuracy and a large model parameter scale.
  • In the embodiments of the present application, the initial business model may be obtained by training on the business data.
  • It should be understood that the target business model is mainly applied to business tasks in relatively new fields, so the amount of business data that can be collected for training is generally much smaller than the amount of pre-training data; for some emerging businesses in particular, the amount of available business data is very limited. The accuracy of the initial business model is therefore often not particularly high, and its model parameter scale is smaller than that of the pre-trained model, so it cannot satisfy the corresponding business task on its own.
  • Knowledge distillation is therefore used to optimize the initial business model with the pre-training model: the pre-training model serves as the teacher model in the knowledge distillation, and the initial business model serves as the student model. By effectively transferring the knowledge of the pre-training model into the target business model, the accuracy of the final target business model can be effectively improved without adding extra business data for model training.
  • In the embodiments of the present application, the first modeling unit of the pre-training model is constructed at the first granularity and the second modeling unit of the initial business model is constructed at the second granularity, where the two granularities are different. Consequently, the number of modeling units of the pre-training model is not consistent with the number of modeling units of the initial business model, and the sequence length output by the first fully connected layer of the pre-training model is generally greater than the sequence length output by the second fully connected layer of the initial business model. It is therefore difficult to transfer the knowledge of the pre-trained model to the initial business model using the knowledge distillation methods of the related art; the approach provided by the embodiments of this application, for example S103-S104, is needed to realize knowledge distillation between two models whose modeling units have different granularities.
  • In one possible implementation, when the business data is text, the first granularity is the single character and the second granularity is the word segment. In this implementation, since a word segment may include one or more characters, the first granularity is finer than the second granularity.
  • Correspondingly, the resulting target business model can perform business tasks on the text at word-segment granularity, such as adding punctuation marks.
  • As mentioned above, for both the pre-training model and the initial business model, one modeling unit corresponds to one fully connected layer output, and the per-unit fully connected layer outputs of the two models have the same length. Therefore, the sequence length output by the fully connected layer of the pre-training model is greater than or equal to that output by the fully connected layer of the initial business model. Moreover, since it is unlikely that every modeling unit of the initial business model contains only one character, in most cases the sequence length output by the fully connected layer of the pre-training model is greater than that output by the fully connected layer of the initial business model.
  • S103: Match the first modeling units with the second modeling units, and determine, from the first modeling units, target first modeling units that respectively match the second modeling units.
  • The fully connected layer output of a second modeling unit has the same sequence length as the fully connected layer output of the matched target first modeling unit.
  • S104: Perform knowledge distillation on the initial business model according to the second fully connected layer output of the initial business model and the fully connected layer outputs respectively corresponding to the target first modeling units in the first fully connected layer output, to obtain the target business model.
  • Regarding S102 and S103, it should be noted that the embodiments of the present application do not specifically limit their execution order, as long as both are executed between S101 and S104. S102 may be performed before, simultaneously with, or after S103.
  • Regarding S103 and S104, note that precisely because, in most cases, the sequence length output by the fully connected layer of the pre-training model is greater than that output by the fully connected layer of the initial business model, knowledge distillation cannot be performed directly on the two fully connected layer outputs.
  • In view of this, target first modeling units that respectively match the second modeling units of the initial business model can be determined from among the first modeling units of the pre-training model. For a second modeling unit and its corresponding target first modeling unit, the two fully connected layer outputs have the same sequence length; and because the target first modeling unit is determined by matching, it is not only identical to the corresponding second modeling unit in the sequence length of its fully connected layer output but also related to it in the processing of the business task. This provides the basis for performing knowledge distillation on the initial business model through the pre-training model.
  • Then, knowledge distillation is performed on the initial business model using the second fully connected layer output and the fully connected layer outputs corresponding to the target first modeling units in the first fully connected layer output. In other words, the target first modeling units whose fully connected layer outputs have the same sequence length as those of the second modeling units are selected from the fully connected layer output of the pre-training model, and knowledge distillation is then performed to obtain the target business model.
  • For example, suppose the business data is the text "我是中国人" ("I am Chinese"), the first granularity is the single character, and the second granularity is the word segment.
  • As shown in FIG. 2, the initial business model includes 3 second modeling units, namely "我", "是" and "中国人", whose corresponding fully connected layer outputs are a1, a2 and a3. The pre-training model includes 5 first modeling units, namely "我", "是", "中", "国" and "人", whose corresponding fully connected layer outputs are b1, b2, b3, b4 and b5. From the first modeling units of the pre-training model, the target first modeling units corresponding to the second modeling units of the initial business model are determined by matching.
  • For example, it is determined that the target first modeling unit "我" of the pre-training model corresponds to the second modeling unit "我" of the initial business model; the target first modeling unit "是" corresponds to the second modeling unit "是"; and the target first modeling unit "人" corresponds to the second modeling unit "中国人". Knowledge distillation is then carried out using a1, a2, a3 and b1, b2, b5 to obtain the target business model.
  • When performing the knowledge distillation, the KLD loss can be calculated using the formula a1*b1 + a2*b2 + a3*b5 (an illustrative sketch of this alignment step follows).
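  • A minimal sketch of this alignment step (illustrative only; the logits here are random placeholders rather than real model outputs) gathers the teacher outputs at the matched positions so that the two sequences have the same length before the loss is computed.

```python
import torch
import torch.nn.functional as F

num_labels = 5
teacher_logits = torch.randn(5, num_labels)  # b1..b5 for "我", "是", "中", "国", "人"
student_logits = torch.randn(3, num_labels)  # a1..a3 for "我", "是", "中国人"

# Positions of the matched target first modeling units (here: last character of each segment)
matched_indices = torch.tensor([0, 1, 4])    # "我" -> b1, "是" -> b2, "中国人" -> b5

aligned_teacher = teacher_logits[matched_indices]  # now the same sequence length as the student output

loss = F.kl_div(
    F.log_softmax(student_logits, dim=-1),
    F.softmax(aligned_teacher, dim=-1),
    reduction="batchmean",
)
```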
  • As another example, suppose again that the business data is the text "我是中国人", the first granularity is the single character, and the second granularity is the word segment.
  • The initial business model includes 3 second modeling units "我", "是" and "中国人" with fully connected layer outputs a1, a2 and a3, and the pre-training model includes 5 first modeling units "我", "是", "中", "国" and "人" with fully connected layer outputs b1, b2, b3, b4 and b5. From the first modeling units of the pre-training model, the target first modeling units corresponding to the second modeling units of the initial business model are determined by matching.
  • In this example, it is determined that the target first modeling unit "我" corresponds to the second modeling unit "我", the target first modeling unit "是" corresponds to the second modeling unit "是", and the target first modeling unit "中" corresponds to the second modeling unit "中国人". Knowledge distillation is then carried out using a1, a2, a3 and b1, b2, b3 to obtain the target business model, and the KLD loss can be calculated using the formula a1*b1 + a2*b2 + a3*b3.
  • the initial business model and the target business model in the embodiment of the present application can be used to process business related to business data.
  • the initial business model and the target business model trained based on the initial business model may be used to add punctuation marks to text.
  • Considering that text obtained by automatic speech recognition does not include punctuation marks, in some embodiments both the business data mentioned here and the text mentioned later may be text automatically recognized from speech.
  • In the embodiments of the present application, where the initial business model and the target business model are used to add punctuation marks to text, it is taken into account that, for a given word segment, no punctuation mark can be added inside the segment; a punctuation mark can only be added after the last character of the segment.
  • For example, for the word segment "中国人", no punctuation mark can be added between the characters "中" and "国", nor between the characters "国" and "人", whereas a punctuation mark may be added after the character "人".
  • Therefore, in one possible implementation, if the first granularity is the single character and the second granularity is the word segment, S103 may include:
  • S1031: According to the target word segment corresponding to a target second modeling unit, match out, from the first modeling units, the target character set that constitutes the target word segment, the target character set including a plurality of characters;
  • S1032: Based on the word order, determine the first modeling unit corresponding to the last character in the target character set as the target first modeling unit matching the target second modeling unit.
  • For ease of description, if the target word segment corresponding to a second modeling unit has multiple characters, that second modeling unit is denoted the target second modeling unit; the multiple characters of the target word segment constitute the target character set, and each of these characters has a corresponding first modeling unit in the pre-training model.
  • Based on the above rule for placing punctuation marks, the first modeling unit corresponding to the last character of the target character set, in the order of the text in the business data, can be determined as the target first modeling unit.
  • For example, if the target character set is {"中", "国", "人"} and the target word segment is "中国人", the first modeling unit corresponding to the last character "人" of the target character set is determined as the target first modeling unit matching the target word segment (a small sketch of this index computation follows).
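  • The index of each segment's last character in the character-level input can be computed from the segmentation alone; the following is a hedged sketch in which the function and variable names are assumptions.

```python
def last_char_indices(segments):
    """For each word segment, return the index of its last character
    in the character-level (pre-training model) input sequence."""
    indices, position = [], 0
    for segment in segments:
        position += len(segment)
        indices.append(position - 1)
    return indices

# "我是中国人" segmented as ["我", "是", "中国人"] yields positions [0, 1, 4],
# i.e. the characters "我", "是" and "人" of the pre-training model input.
print(last_char_indices(["我", "是", "中国人"]))
```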
  • It can be seen from the above description that even though the sequence length of the first fully connected layer output of the pre-training model is greater than that of the second fully connected layer output of the initial business model, the first modeling units of the pre-trained model can be matched with the second modeling units of the initial business model such that the fully connected layer output of a second modeling unit has the same sequence length as the fully connected layer output of the matched target first modeling unit. This establishes the basis for knowledge distillation, so that, with the assistance of the pre-trained model, the target business model can be obtained from the initial business model, realizing knowledge distillation between models whose modeling units have different granularities.
  • Thus, even when the pre-training model and the initial business model have modeling units of different granularities, the pre-training model can be used to optimize the initial business model to obtain the target business model, thereby expanding the scope of application of knowledge distillation and effectively improving the business performance of the business model.
  • In one implementation of the embodiments of the present application, after the target business model is obtained, it may be used to process related business.
  • In one example, if the target business model is used to add punctuation marks to text, the method may further comprise the following steps A-D.
  • Step A: Acquire voice data.
  • The voice data may be data entered by the user in real time through a microphone, or data entered and stored by the user in advance; this is not limited here.
  • Step B: Recognize the voice data to obtain the target text corresponding to the voice data.
  • In the embodiments of the present application, speech recognition technology may be used to identify the specific content of the voice data, so as to obtain the text corresponding to the voice data. The speech recognition technology itself is not described in detail here.
  • Step C: Using the target business model, add punctuation marks to the target text.
  • Step D: Output the target text with the punctuation marks added.
  • Since the target text obtained by speech recognition does not include punctuation marks, once the target text corresponding to the voice data is obtained, the target business model can be used to add punctuation marks to it, and the target text with the punctuation marks added can then be output.
  • Outputting the target text with the punctuation marks added may, for example, mean displaying that text in a text input area. Through steps A-D, target text including punctuation marks can be obtained automatically from the voice data.
  • Regarding steps A-D, consider a specific scenario: in an instant messaging scenario where it is inconvenient for the user to type, the microphone can be invoked to record voice data; the device on which the instant messaging software is installed then receives the voice data entered by the user, recognizes it, and uses the target business model to add punctuation marks to the target text corresponding to the voice data. The target text with the punctuation marks added is then entered in the input area of the instant messaging page (a minimal pipeline sketch follows below).
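  • The overall flow of steps A-D can be summarised with the sketch below. The model objects and their method names (transcribe, add_punctuation) are hypothetical; the original does not prescribe any particular interface.

```python
def transcribe_with_punctuation(audio, asr_model, target_business_model):
    """Illustrative pipeline only, under assumed interfaces.

    Step B: speech recognition produces unpunctuated target text.
    Step C: the target business model adds punctuation marks.
    Step D: the punctuated target text is returned for display, e.g. in an input area.
    """
    target_text = asr_model.transcribe(audio)                # e.g. "我是中国人"
    punctuated_text = target_business_model.add_punctuation(target_text)
    return punctuated_text
```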
  • The data processing apparatus 200 may specifically include: a first acquisition unit 201, an input unit 202, a matching unit 203 and a determination unit 204.
  • The first acquisition unit 201 is configured to acquire business data.
  • The input unit 202 is configured to input the business data into the pre-training model and the initial business model, and obtain the output of the first fully connected layer of the pre-training model and the output of the second fully connected layer of the initial business model, wherein the first modeling unit of the pre-training model is constructed at a first granularity, the second modeling unit of the initial business model is constructed at a second granularity, and the first granularity and the second granularity are different granularities.
  • The matching unit 203 is configured to match the first modeling units with the second modeling units and determine, from the first modeling units, target first modeling units that respectively match the second modeling units, where the fully connected layer output of a second modeling unit has the same sequence length as the fully connected layer output of the matched target first modeling unit.
  • The determination unit 204 is configured to perform knowledge distillation on the initial business model according to the second fully connected layer output and the fully connected layer outputs respectively corresponding to the target first modeling units in the first fully connected layer output, to obtain the target business model.
  • Optionally, when the business data is text, the first granularity is the single character and the second granularity is the word segment.
  • Optionally, the initial business model and the target business model are used to add punctuation marks to input text.
  • Optionally, if the first granularity is the single character and the second granularity is the word segment, then for a target second modeling unit among the second modeling units, the matching unit 203 is configured to:
  • match out, from the first modeling units, the target character set constituting the target word segment according to the target word segment corresponding to the target second modeling unit, the target character set including a plurality of characters; and, based on the word order, determine the first modeling unit corresponding to the last character in the target character set as the target first modeling unit matching the target second modeling unit.
  • the device also includes:
  • the second acquisition unit is used to acquire voice data
  • a recognition unit configured to recognize the speech data to obtain the target text corresponding to the speech data
  • a processing unit configured to use the target business model to add punctuation marks to the target text
  • An output unit configured to output the target text with punctuation added.
  • the business data is text without punctuation marks.
  • Optionally, the model parameter scale of the pre-training model is larger than the model parameter scale of the initial business model;
  • the pre-training model is obtained by training on pre-training data, and the data volume of the pre-training data is larger than the data volume of the business data required to train the target business model.
  • Since the apparatus 200 corresponds to the method provided by the above method embodiments, and the specific implementation of each unit of the apparatus 200 follows the same concept as those embodiments, reference may be made to the description of the above method embodiments for the specific implementation of each unit of the apparatus 200; details are not repeated here.
  • The embodiments of the present application also provide a computer device, which is the computer device described above and may be a terminal device or a server; the aforementioned data processing apparatus may be configured in this computer device.
  • the computer equipment will be introduced below in conjunction with the accompanying drawings.
  • FIG. 4 shows a block diagram of a terminal device 300 .
  • the terminal device 300 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, and the like.
  • The terminal device 300 may include one or more of the following components: a processing component 302, a memory 304, a power supply component 306, a multimedia component 308, an audio component 310, an input/output (I/O) interface 33, a sensor component 314, and a communication component 316.
  • the processing component 302 generally controls the overall operations of the terminal device 300, such as operations associated with display, telephone calls, data communication, camera operations, and recording operations.
  • the processing element 302 may include one or more processors 320 to execute instructions to complete all or part of the steps of the above method.
  • processing component 302 may include one or more modules that facilitate interaction between processing component 302 and other components.
  • processing component 302 may include a multimedia module to facilitate interaction between multimedia component 308 and processing component 302 .
  • the memory 304 is configured to store various types of data to support operations at the terminal device 300 .
  • the power supply component 306 provides power to various components of the terminal device 300 .
  • the multimedia component 308 includes a screen providing an output interface between the terminal device 300 and the user.
  • the audio component 310 is configured to output and/or input audio signals.
  • the I/O interface provides an interface between the processing component 302 and the peripheral interface modules.
  • the sensor component 314 includes one or more sensors for providing various aspects of status assessment for the terminal device 300 .
  • the communication component 316 is configured to facilitate wired or wireless communications between the terminal device 300 and other devices.
  • In an exemplary embodiment, the terminal device 300 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field-Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components, for performing the following method:
  • acquiring business data;
  • inputting the business data into a pre-training model and an initial business model to obtain the output of a first fully connected layer of the pre-training model and the output of a second fully connected layer of the initial business model, wherein the first modeling unit of the pre-training model is constructed at a first granularity, the second modeling unit of the initial business model is constructed at a second granularity, and the first granularity and the second granularity are different granularities;
  • matching the first modeling units with the second modeling units, and determining, from the first modeling units, target first modeling units that respectively match the second modeling units, where the fully connected layer output of a second modeling unit has the same sequence length as the fully connected layer output of the matched target first modeling unit;
  • performing knowledge distillation on the initial business model according to the second fully connected layer output and the fully connected layer outputs respectively corresponding to the target first modeling units in the first fully connected layer output, to obtain a target business model.
  • the embodiment of the present application further provides a server, as shown in FIG. 5 , which is a schematic structural diagram of the server in the embodiment of the present application.
  • The server 400 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 422 (for example, one or more processors), memory 432, and one or more storage media 430 (for example, one or more mass storage devices) storing application programs 442 or data 444.
  • the memory 432 and the storage medium 430 may be temporary storage or persistent storage.
  • the program stored in the storage medium 430 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server.
  • the central processing unit 422 may be configured to communicate with the storage medium 430 , and execute a series of instruction operations in the storage medium 430 on the server 400 .
  • The server 400 may also include one or more power supplies 426, one or more wired or wireless network interfaces 450, one or more input/output interfaces 456, one or more keyboards 456, and/or one or more operating systems 441, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ and so on.
  • the steps performed by the server in the foregoing embodiments may be based on the server structure shown in FIG. 5 .
  • an embodiment of the present application further provides a storage medium, where the storage medium is used to store a computer program, and the computer program is used to execute the method provided in the foregoing embodiments.
  • the embodiment of the present application also provides a computer program product including instructions, which, when run on a computer, causes the computer to execute the method provided in the foregoing embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The present application discloses a data processing method, applied in fields such as artificial intelligence and natural language processing. On the basis of acquired business data, the business data is input into a pre-training model and an initial business model to obtain the output of a first fully connected layer of the pre-training model and the output of a second fully connected layer of the initial business model. Since the pre-training model and the initial business model construct their modeling units at different granularities, the first modeling units of the pre-training model are matched with the second modeling units of the initial business model, such that the fully connected layer output of a second modeling unit has the same sequence length as the fully connected layer output of the matched target first modeling unit. This establishes the basis for knowledge distillation, so that, with the assistance of the pre-training model, a target business model can be obtained from the initial business model, realizing knowledge distillation between models whose modeling units have different granularities, thereby expanding the scope of application of knowledge distillation and effectively improving the business performance of the business model.

Description

一种数据处理方法、装置、设备、存储介质和程序产品
本申请要求于2021年09月17日提交中国专利局、申请号为202111094328.0、申请名称为“一种数据处理方法及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及机器学习领域,特别是涉及数据处理。
背景技术
目前,预训练模型发展突飞猛进。预训练可以通过自监督学习从大规模数据中获得预训练模型。并且,预训练模型可以将从大规模数据中学习到的知识迁移到其他小规模模型中,以实现与预训练模型的具体任务无关的其他任务。
也就是说,通过预训练模型可以优化业务模型,使得业务模型可以更好的在其他任务进行服务。
但是,在一些场景下,并不能使用预训练模型对业务模型进行优化,从而使得业务模型的效果无法得到提升。
发明内容
有鉴于此,本申请实施例提供一种数据处理方法、装置、设备、存储介质和程序产品,扩展了知识蒸馏的适用范围,有效提升了业务模型的业务性能。
为实现上述目的,本申请实施例提供如下技术方案:
一方面,本申请实施例提供了一种数据处理方法,所述方法包括:
获取业务数据;
将业务数据输入预训练模型和初始业务模型,得到所述预训练模型的第一全连接层输出以及所述初始业务模型的第二全连接层输出;其中,所述预训练模型的第一建模单元通过第一粒度构建,所述初始业务模型的第二建模单元通过第二粒度构建,所述第一粒度和所述第二粒度为不同粒度;
将所述第一建模单元与所述第二建模单元进行匹配,从所述第一建模单元中确定出与所述第二建模单元分别匹配的目标第一建模单元,所述第二建模单元的全连接层输出与所匹配目标第一建模单元的全连接层输出具有相同的序列长度;
根据所述初始业务模型的第二全连接层输出、以及所述第一全连接层输出 中所述目标第一建模单元分别对应的全连接层输出,对所述初始业务模型进行知识蒸馏得到目标业务模型。
另一方面,本申请实施例提供了一种数据处理装置,所述装置包括:
第一获取单元,用于获取业务数据;
输入单元,用于将业务数据输入预训练模型和初始业务模型,得到所述预训练模型的第一全连接层输出以及所述初始业务模型的第二全连接层输出;其中,所述预训练模型的第一建模单元通过第一粒度构建,所述初始业务模型的第二建模单元通过第二粒度构建,所述第一粒度和所述第二粒度为不同粒度;
匹配单元,用于将所述第一建模单元与所述第二建模单元进行匹配,从所述第一建模单元中确定出与所述第二建模单元分别匹配的目标第一建模单元,所述第二建模单元的全连接层输出与所匹配目标第一建模单元的全连接层输出具有相同的序列长度;
确定单元,用于根据所述第二全连接层输出、以及所述第一全连接层输出中所述目标第一建模单元分别对应的全连接层输出,对所述初始业务模型进行知识蒸馏得到目标业务模型。
再一方面,本申请实施例提供一种计算机设备,包括:
处理器、通信接口、存储器和通信总线;
其中,所述处理器、所述通信接口和所述存储器通过所述通信总线完成相互间的通信;所述通信接口为通信模块的接口;
所述存储器,用于存储程序代码,并将所述程序代码传输给所述处理器;处理器,用于调用存储器中程序代码的指令执行以上方面所述的方法。
又一方面,本申请实施例提供一种存储介质,所述存储介质用于存储计算机程序,所述计算机程序用于执行以上方面所述的方法。
又一方面,本申请实施例提供了一种包括指令的计算机程序产品,当其在计算机上运行时,使得所述计算机执行以上方面所述的方法。
与相关技术相比,本申请实施例具有以下优点:
基于获取的业务数据,输入预训练模型和初始业务模型,得到所述预训练模型的第一全连接层输出以及所述初始业务模型的第二全连接层输出。由于预训练模型和初始业务模型通过不同粒度构建了建模单元,导致第一全连接层输 出的序列长度、与第二全连接层输出的序列长度不同。在这种情况下,为了实现通过预训练模型对初始业务模型进行知识蒸馏,将所述预训练模型的第一建模单元与所述初始业务模型的第二建模单元进行匹配,第二建模单元的全连接层输出与所匹配目标第一建模单元的全连接层输出具有相同的序列长度,由此达成进行知识蒸馏的基础,从而可以在预训练模型的协助下,通过初始业务模型得到目标业务模型,实现了对包括不同粒度的建模单元的模型进行知识蒸馏。由此可见,即使预训练模型和初始业务模型具有不同粒度的建模单元,也能够使用预训练模型对初始业务模型进行优化,得到目标业务模型,从而扩展了知识蒸馏的适用范围,有效提升了业务模型的业务性能。
附图说明
图1为本申请实施例提供的一种数据处理方法的流程示意图;
图2为本申请实施例提供的一种数据处理装置的结构示意图;
图3为本申请实施例提供的客户端的结构示意图;
图4为本申请实施例提供的服务器的结构示意图。
具体实施方式
为了使本技术领域的人员更好地理解本申请方案,下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
本申请的发明人经过研究发现,可以通过知识蒸馏的方式,利用预训练模型对业务模型进行优化。但是,使用知识蒸馏有一个前提,就是预训练模型的全连接层输出的序列长度和业务模型的全连接层输出的序列长度一致。这是因为在进行知识蒸馏时,要根据预训练模型的全连接层输出和业务模型的全连接层输出计算相对熵散度损失函数(Kullback–Leibler divergence loss,KLD loss)。举例说明:预训练模型的全连接层输出包括3个向量,分别是w1、w2和w3,业务模型的全连接层输出包括3个向量,分别是L1、L2和L3,则KLD loss=w1*L1+w2*L2+w3*L3。
在一些场景中,例如,在给文本自动添加标点符号的场景中,其对应的业 务模型的建模单元为分词,而预训练模型的建模单元为单个字符。而一个建模单元对应一个全连接层输出,各个建模单元分别对应的全连接层输出的序列长度相同(例如前述w1、w2、w3、L1、L2和L3的维度相同)。然而由于预训练模型和业务模型的建模单元构建粒度不同(预训练模型是以字符为粒度,业务模型是以分词为粒度),即预训练模型和业务模型的建模单元在数量上是有区别的,这就导致该业务模型的全连接层输出的序列长度与预训练模型的全连接层输出的序列长度不一致,从而无法进行知识蒸馏。举例说明:
对于文本“我是中国人”而言,预训练模型的输入包括5个建模单元,分别为:“我”、“是”、“中”、“国”、“人”。而业务模型的输入包括3个建模单元,分别是“我”、“是”、“中国人”。假设各个建模单元对应的全连接层输出是一个5维的向量,则对于预训练模型而言,其全连接层输出为5个5维向量,而对于业务模型而言,其全连接层输出为3个5维向量。即:业务模型的全连接层输出的序列长度与预训练模型的全连接层输出的序列长度不一致。
针对此处提及的知识蒸馏、预训练模型、业务模型、建模单元和全连接层输出,需要说明的是:
知识蒸馏指的是:将预训练模型的知识迁移到业务模型,从而对业务模型进行优化。
预训练模型,包括但不限于Bert、GPT和ELECTRA。
业务模型,包括但不限于双向长短记忆网络(Bi-directional Long-Short Term Memory,BiLSTM)模型。
建模单元,指的是以模型输入的粒度为依据建立的单元,一般为单个字符或者分词,分词可以包括一个或者多个字符。
全连接层输出为未经归一化的标签概率值向量。例如,在给文本添加标点符号的场景中,可选的标点符号有4种,则各个建模单元对应的全连接层输出为一个5维向量,该5维向量的具体数值用于指示该4种标点符号的概率以及无标点符号的概率。全连接层输出又可以被称为Logits输出。
可以理解的是,若预训练模型和业务模型无法进行知识蒸馏,则无法利用大规模的预训练模型的知识对小规模的业务模型进行优化,从而使得业务模型的效果无法得到提升。
为了解决上述问题,本申请实施例提供了一种数据处理方法,实现了对包括不同粒度的建模单元的模型进行知识蒸馏,从而扩展了知识蒸馏的适用范围,有效提升了业务模型的业务性能。
该数据处理方法可以通过计算机设备实施,该计算机设备可以是终端设备或服务器,其中,服务器可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或者分布式系统,还可以是提供云计算服务的云服务器。终端设备包括但不限于手机、电脑、智能语音交互设备、智能家电、车载终端、飞行器等。终端设备以及服务器可以通过有线或无线通信方式进行直接或间接地连接,本申请在此不做限制。
本申请实施例还涉及人工智能(Artificial Intelligence,AI),人工智能是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个综合技术,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式做出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。
人工智能技术是一门综合学科,涉及领域广泛,既有硬件层面的技术也有软件层面的技术。人工智能基础技术一般包括如传感器、专用人工智能芯片、云计算、分布式存储、大数据处理技术、操作/交互系统、机电一体化等技术。人工智能软件技术主要包括计算机视觉技术、语音处理技术、自然语言处理技术以及机器学习/深度学习、自动驾驶、智慧交通等几大方向。本申请实施例主要涉及自然语言处理技术以及机器学习。
自然语言处理(Nature Language processing,NLP)是计算机科学领域与人工智能领域中的一个重要方向。它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。自然语言处理是一门融语言学、计算机科学、数学于一体的科学。因此,这一领域的研究将涉及自然语言,即人们日常使用的语言,所以它与语言学的研究有着密切的联系。自然语言处理技术通常包括文本处理、语义理解、机器翻译、机器人问答、知识图谱等技术。
机器学习(Machine Learning,ML)是一门多领域交叉学科,涉及概率论、统计学、逼近论、凸分析、算法复杂度理论等多门学科。专门研究计算机怎样模 拟或实现人类的学习行为,以获取新的知识或技能,重新组织已有的知识结构使之不断改善自身的性能。机器学习是人工智能的核心,是使计算机具有智能的根本途径,其应用遍及人工智能的各个领域。机器学习和深度学习通常包括人工神经网络、置信网络、强化学习、迁移学习、归纳学习、式教学习等技术。
例如,本申请实施例可以通过自然语言技术对文本进行切分得到分词、字符等,并对切分结果进行特征处理得到全连接层输出。而且,通过预训练模型对初始业务模型的知识蒸馏也属于实现迁移学习的一种有效实现形式。
参见图1,该图为本申请实施例提供的一种数据处理方法的流程示意图。在本实施例中,以服务器作为前述计算机设备进行示例性说明,所述方法例如可以通过以下S101-S104实现。
S101:获取业务数据。
本申请实施例中提及的业务数据,指的是与具体业务相关的数据。本申请实施例不具体限定所述业务数据。
在一种可能的实现方式中,若所述初始业务模型和所述目标业务模型用于为输入的文本添加标点符号,则该业务数据为不具有标点符号的文本,从而通过知识蒸馏,可以让得到的目标业务模型具备为不具有标点符号的文本进行标点符号的标注功能。
该不具有标点符号的文本可以基于不同方式得到,本申请对此不做限定。例如可以通过语音识别技术对语音进行识别得到的文本。
S102:将业务数据输入预训练模型和初始业务模型,得到所述预训练模型的第一全连接层输出以及所述初始业务模型的第二全连接层输出。
其中,所述预训练模型的第一建模单元通过第一粒度构建,所述初始业务模型的第二建模单元通过第二粒度构建,所述第一粒度和所述第二粒度为不同粒度。
在本申请实施例中,所述预训练模型和所述初始业务模型可以是预先训练得到的。在一个示例中,所述预训练模型可以是根据预训练数据和业务数据训练得到的。具体地:可以利用预训练数据训练得到初始预训练模型,而后,利用业务数据对所述初始预训练模型进行微调(Finetune),得到基于业务数据所对应业务任务的预训练模型。换言之,S101中提及的预训练模型,可以是基于 该业务的预训练模型。此处提及的预训练数据,可以是与该业务任务无关的训练数据。一般情况下,所述预训练模型的数据量比较大,直接使用则会影响业务处理效率。
需要说明的是,预训练数据所对应的业务任务和业务数据所对应的业务任务可以不同。而且,在一种可能的实现方式中,预训练数据的数据量大于训练得到所述目标业务模型所需的业务数据的数据量,使得预训练数据可以基于比较充分的预训练数据进行训练,所得到的预训练模型具备较高的精度和模型参数规模。
在本申请实施例中,所述初始业务模型可以是利用业务数据训练得到的。可以理解的是,目标业务模型主要应用于较新领域中的业务任务,使得能够收集用于训练的业务数据的数据量一般远小于预训练数据的数据量,尤其是对于一些新兴业务,所能获取的业务数据的数据量更是十分有限。因此,初始业务模型的准确度往往不是特别高,其模型参数规模也会小于预训练模型,并不能满足对应的业务任务。而采用知识蒸馏的方式利用预训练模型对初始业务模型进行优化,以预训练模型作为知识蒸馏中的老师模型,初始业务模型作为知识蒸馏中的学生模型,将预训练模型的知识有效的迁移到目标业务模型中,则可以在不额外增加用于模型训练的业务数据的情况下,有效提升最终得到的目标业务模型的准确度。
在本申请实施例中,所述预训练模型的第一建模单元通过第一粒度构建,所述初始业务模型的第二建模单元通过第二粒度构建,第一粒度和第二粒度为不同粒度,由此导致预训练模型的建模单元数量与初始业务模型的建模单元数量并不一致。因此,所述预训练模型的第一全连接层输出的序列长度一般是大于所述初始业务模型的第二全连接层输出的序列长度的。由此难以使用相关技术中的知识蒸馏方式将预训练模型的知识迁移到初始业务模型中,需要通过本申请实施例提供的方式,例如通过S103-S104的方式才可以实现包括不同粒度的建模单元的两个模型间的知识蒸馏。
在一种可能的实现方式中,当所述业务数据为文本时,所述第一粒度为单个字符,所述第二粒度为分词。也就是说,在此实现方式下,由于一个分词可以包括一个或多个字符,故第一粒度要比第二粒度更细。相应的,最终得到的 目标业务模型可以实现对文本中以分词为粒度的业务任务,例如添加标点符号等。
如前所述,对于预训练模型和初始业务模型而言,均是一个建模单元对应一个全连接层输出。并且,预训练模型和初始业务模型的建模单元对应的全连接层输出的序列长度相同。因此,所述预训练模型的全连接层输出的序列长度,大于或者等于所述初始业务模型的全连接层输出的序列长度。并且,由于所述初始业务模型的每个建模单元只包括一个字符的可能性较小,因此,在大多数情况下,所述预训练模型的全连接层输出的序列长度,大于所述初始业务模型的全连接层输出的序列长度。
S103:将所述第一建模单元与所述第二建模单元进行匹配,从所述第一建模单元中确定出与所述第二建模单元分别匹配的目标第一建模单元。
所述第二建模单元的全连接层输出与所匹配目标第一建模单元的全连接层输出具有相同的序列长度。
S104:根据所述初始业务模型的第二全连接层输出、以及所述第一全连接层输出中所述目标第一建模单元分别对应的全连接层输出,对所述初始业务模型进行知识蒸馏得到目标业务模型。
关于S102和S103,需要说明的是,本申请实施例不具体限定S102和S103的执行顺序,只要S102和S103在S101与S104之间执行即可。S102可以在S103之前执行,S102也可以和S103同时执行,S102还可以在S103之后执行。
关于S103和S104,需要说明的是,正是由于在大多数情况下,所述预训练模型的全连接层输出的序列长度,大于所述初始业务模型的全连接层输出的序列长度。因此,若直接利用所述预训练模型的全连接层输出和所述初始业务模型的全连接层输出,无法进行知识蒸馏。
鉴于此,在本申请实施例中,可以从所述预训练模型的第一建模单元中,确定出与所述初始业务模型的第二建模单元分别匹配的目标第一建模单元,针对一个第二建模单元和对应的目标第一建模单元,该两者的全连接层输出具有相同的序列长度,且由于是通过匹配的方式确定出的第二建模单元对应的目标第一建模单元,故目标第一建模单元不仅在其全连接层输出的序列长度上与对应的第二建模单元相同,而且在业务任务的处理上具有关联性,从而达到了通 过预训练模型对初始业务模型进行知识蒸馏的实现基础。
而后,利用所述初始业务模型的第二全连接层输出、以及所述第一全连接层输出中所述目标第一建模单元对应的全连接层输出对初始业务模型进行知识蒸馏达到目标业务模型。换言之,在本方案中,可以从所述预训练模型的全连接层输出中,筛选出与第二建模单元的全连接层输出的序列长度相同的目标第一建模单元,从而进行知识蒸馏,得到目标业务模型。
举例说明:
业务数据为:“我是中国人”。第一粒度为单个字符,第二粒度为分词。
如图2所示,初始业务模型包括3个第二建模单元,分别为:“我”、“是”、“中国人”,其分别对应的全连接层输出分别为:a1、a2和a3。预训练模型包括5个第一建模单元,分别为:“我”、“是”、“中”、“国”、“人”,其分别对应的全连接层输出分别为:b1、b2、b3、b4和b5。从预训练模型的第一建模单元“我”、“是”、“中”、“国”、“人”中,通过匹配确定出与初始业务模型的各个第二建模单元分别对应的目标第一建模单元。例如,确定预训练模型中目标第一建模单元“我”对应初始业务模型的第二建模单元“我”;确定预训练模型中目标第一建模单元“是”对应初始业务模型的第二建模单元“是”;确定预训练模型中目标第一建模单元“人”对应初始业务模型的第二建模单元“中国人”。而后,利用a1、a2、a3和b1、b2、b5进行知识蒸馏,得到目标业务模型。在进行知识蒸馏时,可以利用公式a1*b1+a2*b2+a3*b5计算KLD loss。
再举例说明:
业务数据为:“我是中国人”。第一粒度为单个字符,第二粒度为分词。
初始业务模型包括3个第二建模单元,分别为:“我”、“是”、“中国人”,其分别对应的全连接层输出分别为:a1、a2和a3。预训练模型包括5个第一建模单元,分别为:“我”、“是”、“中”、“国”、“人”,其分别对应的全连接层输出分别为:b1、b2、b3、b4和b5。从预训练模型的第一建模单元“我”、“是”、“中”、“国”、“人”中,通过匹配确定出与初始业务模型的各个第二建模单元分别对应的目标第一建模单元。例如,确定预训练模型中目标第一建模单元“我”对应初始业务模型的第二建模单元“我”;确定预训练模型中目标第一建模单元“是”对应初始业务模型的第二建模单元“是”;确定预训练模型中目标第 一建模单元“中”对应初始业务模型的第二建模单元“中国人”。而后,利用a1、a2、a3和b1、b2、b3进行知识蒸馏,得到目标业务模型。在进行知识蒸馏时,可以利用公式a1*b1+a2*b2+a3*b3计算KLD loss。
需要说明的是,本申请实施例中的初始业务模型和目标业务模型,可以用于处理与业务数据相关的业务。在一个示例中,所述初始业务模型和基于初始业务模型训练得到的所述目标业务模型可以用于为文本添加标点符号。考虑到根据语音自动识别到的文本不包括标点符号。因此,在一些实施例中,此处提及的业务数据和后续提及的文本,均可以是通过语音自动识别的文本。
在本申请实施例中,对于所述初始业务模型和所述目标业务模型可以用于为文本添加标点符号的情况,考虑到对于一个分词而言,其分词内部不可能会被添加标点符号,标点符号可能被添加在该分词最后一个字符之后。举例说明,对于分词“中国人”而言,字符“中”和字符“国”之间,不可能被添加标点符号;字符“国”和字符“人”之间,也不可能被添加标点符号。而字符“人”之后有可能被添加标点符号。
因此,在一种可能的实现方式中,若所述第一粒度为单个字符,所述第二粒度为分词,S103可以包括:
S1031:根据所述目标第二建模单元对应的目标分词,从所述第一建模单元中匹配出构成所述目标分词的目标字符集合,所述目标字符集合包括多个字符;
S1032:基于语序,从所述目标字符集合中将最后一个字符对应的第一建模单元确定为与所述目标第二建模单元匹配的目标第一建模单元。
为方便描述,若一个第二建模单元对应的目标分词具有多个字符,将该第二建模单元记为目标第二建模单元,该目标分词对应的多个字符构成目标字符集合,该多个字符在预训练模型中具有分别对应的第一建模单元。
基于前述针对标点符号的标注规则,可以基于业务数据中文本的语序,将目标字符集合中在语义方向下的最后一个字符对应的第一建模单元确定为目标第一建模单元。
举例说明:目标字符集合为{“中”、“国”、“人”},目标分词为“中国人”,则将所述目标字符集合的最后一个字符“人”对应的第一建模单元确定为与目 标分词匹配的目标第一建模单元。
通过以上描述可知,即使预训练模型的第一全连接层输出的序列长度大于所述初始业务模型的第二全连接层输出的序列长度,故可以将所述预训练模型的第一建模单元与所述初始业务模型的第二建模单元进行匹配,第二建模单元的全连接层输出与所匹配目标第一建模单元的全连接层输出具有相同的序列长度,由此达成进行知识蒸馏的基础,从而可以在预训练模型的协助下,通过初始业务模型得到目标业务模型,实现了对包括不同粒度的建模单元的模型进行知识蒸馏。由此可见,即使预训练模型和初始业务模型具有不同粒度的建模单元,也能够使用预训练模型对初始业务模型进行优化,得到目标业务模型,从而扩展了知识蒸馏的适用范围,有效提升了业务模型的业务性能。
在本申请实施例的一种实现方式中,得到目标业务模型之后,可以利用该目标业务模型处理相关的业务。在一个示例中,若目标业务模型用于为文本添加标点符号。则所述方法还可以包括以下步骤A-D。
步骤A:获取语音数据。
所述语音数据可以是用户通过麦克风实时录入的数据,也可以是用户提前录入并存储的数据,此处不做限定。
步骤B:对所述语音数据进行识别,得到所述语音数据对应的目标文本。
在本申请实施例中,可以利用语音识别技术,识别所述语音数据的具体内容,从而得到所述语音数据对应的文本。关于所述语音识别技术,此处不做详细介绍。
步骤C:利用所述目标业务模型,为所述目标文本添加标点符号。
步骤D:输出添加了标点符号的所述目标文本。
由于根据语音识别得到的目标文本不包括标点符号,因此,得到语音数据对应的目标文本之后,可以利用所述目标业务模型,为所述目标文本添加标点符号,并进一步输出添加了所述标点符号的所述目标文本。
此处提及的输出添加了所述标点符号的目标文本,例如可以是在文本输入区显示所述添加了所述标点符号的目标文本。通过步骤A-D,可以自动根据语音数据得到包括标点符号的目标文本。
关于步骤A和步骤D,现结合具体场景举例说明:
在即时通信场景中,用户不便手动输入文本,则可以调用麦克风录入语音数据,而后,安装所述即时通信软件的设备则可以接收用户录入的语音数据,进一步对该语音数据进行识别,并利用所述目标业务模型,为所述语音数据对应的目标文本添加标点符号。而后,在即时通信页面的输入区,输入添加了所述标点符号的目标文本。
需要说明的是,以上举例只是本申请的一种应用场景,本申请实施例所提供的方案所适用的场景不限于以上所述。
示例性设备
基于以上实施例提供的方法,本申请实施例还提供了一种装置,以下结合附图介绍该装置。
参见图3,该图为本申请实施例提供的一种数据处理装置的结构示意图。所述数据处理装置200例如可以具体包括:第一获取单元201、输入单元202、匹配单元203和确定单元204。
第一获取单元201,用于获取业务数据;
输入单元202,用于将业务数据输入预训练模型和初始业务模型,得到所述预训练模型的第一全连接层输出以及所述初始业务模型的第二全连接层输出;其中,所述预训练模型的第一建模单元通过第一粒度构建,所述初始业务模型的第二建模单元通过第二粒度构建,所述第一粒度和所述第二粒度为不同粒度;
匹配单元203,用于将所述第一建模单元与所述第二建模单元进行匹配,从所述第一建模单元中确定出与所述第二建模单元分别匹配的目标第一建模单元,所述第二建模单元的全连接层输出与所匹配目标第一建模单元的全连接层输出具有相同的序列长度;
确定单元204,用于根据所述第二全连接层输出、以及所述第一全连接层输出中所述目标第一建模单元分别对应的全连接层输出,对所述初始业务模型进行知识蒸馏得到目标业务模型。
可选的,当所述业务数据为文本时,所述第一粒度为单个字符,所述第二粒度为分词。
可选的,所述初始业务模型和所述目标业务模型用于为输入的文本添加标 点符号。
可选的,若所述第一粒度为单个字符,所述第二粒度为分词,针对所述第二建模单元中的目标第二建模单元,所述匹配单元203,用于:
根据所述目标第二建模单元对应的目标分词,从所述第一建模单元中匹配出构成所述目标分词的目标字符集合,所述目标字符集合包括多个字符;
基于语序,从所述目标字符集合中将最后一个字符对应的第一建模单元确定为与所述目标第二建模单元匹配的目标第一建模单元。
可选的,所述装置还包括:
第二获取单元,用于获取语音数据;
识别单元,用于对所述语音数据进行识别,得到所述语音数据对应的目标文本;
处理单元,用于利用所述目标业务模型,为所述目标文本添加标点符号;
输出单元,用于输出添加了标点符号的所述目标文本。
可选的,所述业务数据为不具有标点符号的文本。
可选的,所述预训练模型的模型参数规模大于所述初始业务模型的模型参数规模;
所述预训练模型是基于预训练数据训练得到的,所述预训练数据的数据量大于训练得到所述目标业务模型所需的业务数据的数据量。
由于所述装置200是与以上方法实施例提供的方法对应的装置,所述装置200的各个单元的具体实现,均与以上方法实施例为同一构思,因此,关于所述装置200的各个单元的具体实现,可以参考以上方法实施例的描述部分,此处不再赘述。
本申请实施例还提供了一种计算机设备,该计算机设备为前述介绍的计算机设备,可以包括终端设备或服务器,前述的数据处理装置可以配置在该计算机设备中。下面结合附图对该计算机设备进行介绍。
图4示出了一种终端设备300的框图。例如,终端设备300可以是移动电话,计算机,数字广播终端,消息收发设备,游戏控制台,平板设备,医疗设备,健身设备,个人数字助理等。
参照图4,终端设备300可以包括以下一个或多个组件:处理组件302,存 储器304,电源组件306,多媒体组件308,音频组件310,输入/输出(I/O)的接口33,传感器组件314,以及通信组件316。
处理组件302通常控制终端设备300的整体操作,诸如与显示,电话呼叫,数据通信,相机操作和记录操作相关联的操作。处理元件302可以包括一个或多个处理器320来执行指令,以完成上述的方法的全部或部分步骤。此外,处理组件302可以包括一个或多个模块,便于处理组件302和其他组件之间的交互。例如,处理部件302可以包括多媒体模块,以方便多媒体组件308和处理组件302之间的交互。
存储器304被配置为存储各种类型的数据以支持在终端设备300的操作。
电源组件306为终端设备300的各种组件提供电力。
多媒体组件308包括在所述终端设备300和用户之间的提供一个输出接口的屏幕。
音频组件310被配置为输出和/或输入音频信号。
I/O接口为处理组件302和外围接口模块之间提供接口。
传感器组件314包括一个或多个传感器,用于为终端设备300提供各个方面的状态评估。
通信组件316被配置为便于终端设备300和其他设备之间有线或无线方式的通信。
在示例性实施例中,终端设备300可以被一个或多个应用专用集成电路(ASIC)、数字信号处理器(DSP)、数字信号处理设备(DSPD)、可编程逻辑器件(PLD)、现场可编程门阵列(FPGA)、控制器、微控制器、微处理器或其他电子元件实现,用于执行下述方法:
获取业务数据;
将业务数据输入预训练模型和初始业务模型,得到所述预训练模型的第一全连接层输出以及所述初始业务模型的第二全连接层输出;其中,所述预训练模型的第一建模单元通过第一粒度构建,所述初始业务模型的第二建模单元通过第二粒度构建,所述第一粒度和所述第二粒度为不同粒度;
将所述第一建模单元与所述第二建模单元进行匹配,从所述第一建模单元中确定出与所述第二建模单元分别匹配的目标第一建模单元,所述第二建模单 元的全连接层输出与所匹配目标第一建模单元的全连接层输出具有相同的序列长度;
根据所述第二全连接层输出、以及所述第一全连接层输出中所述目标第一建模单元分别对应的全连接层输出,对所述初始业务模型进行知识蒸馏得到目标业务模型。
若计算机设备为服务器,本申请实施例还提供一种服务器,请参见图5所示,图5是本申请实施例中服务器的结构示意图。该服务器400可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上中央处理器(central processing units,CPU)422(例如,一个或一个以上处理器)和存储器432,一个或一个以上存储应用程序442或数据444的存储介质430(例如一个或一个以上海量存储设备)。其中,存储器432和存储介质430可以是短暂存储或持久存储。存储在存储介质430的程序可以包括一个或一个以上模块(图示没标出),每个模块可以包括对服务器中的一系列指令操作。更进一步地,中央处理器422可以设置为与存储介质430通信,在服务器400上执行存储介质430中的一系列指令操作。
服务器400还可以包括一个或一个以上电源426,一个或一个以上有线或无线网络接口450,一个或一个以上输入输出接口456,一个或一个以上键盘456,和/或,一个或一个以上操作系统441,例如Windows Server TM,Mac OS X TM,Unix TM,Linux TM,FreeBSD TM等等。
上述实施例中由服务器所执行的步骤可以基于图5所示的服务器结构。
另外,本申请实施例还提供了一种存储介质,所述存储介质用于存储计算机程序,所述计算机程序用于执行上述实施例提供的方法。
本申请实施例还提供了一种包括指令的计算机程序产品,当其在计算机上运行时,使得计算机执行上述实施例提供的方法。
本领域技术人员在考虑说明书及实践这里公开的发明后,将容易想到本申请的其它实施方案。本申请旨在涵盖本申请的任何变型、用途或者适应性变化,这些变型、用途或者适应性变化遵循本申请的一般性原理并包括本公开未公开的本技术领域中的公知常识或惯用技术手段。说明书和实施例仅被视为示例性的,本申请的真正范围和精神由下面的权利要求指出。
应当理解的是,本申请并不局限于上面已经描述并在附图中示出的精确结构,并且可以在不脱离其范围进行各种修改和改变。本申请的范围仅由所附的权利要求来限制
以上所述仅为本申请的较佳实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (11)

  1. 一种数据处理方法,所述方法由计算机设备执行,所述方法包括:
    获取业务数据;
    将业务数据输入预训练模型和初始业务模型,得到所述预训练模型的第一全连接层输出以及所述初始业务模型的第二全连接层输出;其中,所述预训练模型的第一建模单元通过第一粒度构建,所述初始业务模型的第二建模单元通过第二粒度构建,所述第一粒度和所述第二粒度为不同粒度;
    将所述第一建模单元与所述第二建模单元进行匹配,从所述第一建模单元中确定出与所述第二建模单元分别匹配的目标第一建模单元,所述第二建模单元的全连接层输出与所匹配目标第一建模单元的全连接层输出具有相同的序列长度;
    根据所述第二全连接层输出、以及所述第一全连接层输出中所述目标第一建模单元分别对应的全连接层输出,对所述初始业务模型进行知识蒸馏得到目标业务模型。
  2. 根据权利要求1所述的方法,当所述业务数据为文本时,所述第一粒度为单个字符,所述第二粒度为分词。
  3. 根据权利要求1所述的方法,所述初始业务模型和所述目标业务模型用于为输入的文本添加标点符号。
  4. 根据权利要求3所述的方法,若所述第一粒度为单个字符,所述第二粒度为分词,针对所述第二建模单元中的目标第二建模单元,所述将所述第一建模单元与所述第二建模单元进行匹配,从所述第一建模单元中确定出与所述第二建模单元分别匹配的目标第一建模单元,包括:
    根据所述目标第二建模单元对应的目标分词,从所述第一建模单元中匹配出构成所述目标分词的目标字符集合,所述目标字符集合包括多个字符;
    基于语序,从所述目标字符集合中将最后一个字符对应的第一建模单元确定为与所述目标第二建模单元匹配的目标第一建模单元。
  5. 根据权利要求3所述的方法,所述方法还包括:
    获取语音数据;
    对所述语音数据进行识别,得到所述语音数据对应的目标文本;
    利用所述目标业务模型,为所述目标文本添加标点符号;
    输出添加了标点符号的所述目标文本。
  6. 根据权利要求3所述的方法,所述业务数据为不具有标点符号的文本。
  7. 根据权利要求1所述的方法,所述预训练模型的模型参数规模大于所述初始业务模型的模型参数规模;
    所述预训练模型是基于预训练数据训练得到的,所述预训练数据的数据量大于训练得到所述目标业务模型所需的业务数据的数据量。
  8. 一种数据处理装置,所述装置包括:
    第一获取单元,用于获取业务数据;
    输入单元,用于将业务数据输入预训练模型和初始业务模型,得到所述预训练模型的第一全连接层输出以及所述初始业务模型的第二全连接层输出;其中,所述预训练模型的第一建模单元通过第一粒度构建,所述初始业务模型的第二建模单元通过第二粒度构建,所述第一粒度和所述第二粒度为不同粒度;
    匹配单元,用于将所述第一建模单元与所述第二建模单元进行匹配,从所述第一建模单元中确定出与所述第二建模单元分别匹配的目标第一建模单元,所述第二建模单元的全连接层输出与所匹配目标第一建模单元的全连接层输出具有相同的序列长度;
    确定单元,用于根据所述第二全连接层输出、以及所述第一全连接层输出中所述目标第一建模单元分别对应的全连接层输出,对所述初始业务模型进行知识蒸馏得到目标业务模型。
  9. 一种服务器,所述服务器包括:
    处理器、通信接口、存储器和通信总线;
    其中,所述处理器、所述通信接口和所述存储器通过所述通信总线完成相互间的通信;所述通信接口为通信模块的接口;
    所述存储器,用于存储程序代码,并将所述程序代码传输给所述处理器;
    所述处理器,用于调用存储器中程序代码的指令执行权利要求1-8任意一项所述的方法。
  10. 一种计算机可读介质,其上存储有计算机程序,所述计算机程序在被处理器执行时实现如权利要求1-8任意一项所述的方法。
  11. 一种包括指令的计算机程序产品,当其在计算机上运行时,使得所述计算机执行权利要求1-8任意一项所述的方法。
PCT/CN2022/112643 2021-09-17 2022-08-16 一种数据处理方法、装置、设备、存储介质和程序产品 WO2023040545A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111094328.0 2021-09-17
CN202111094328.0A CN113807540A (zh) 2021-09-17 2021-09-17 一种数据处理方法及装置

Publications (1)

Publication Number Publication Date
WO2023040545A1 true WO2023040545A1 (zh) 2023-03-23

Family

ID=78939748

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/112643 WO2023040545A1 (zh) 2021-09-17 2022-08-16 一种数据处理方法、装置、设备、存储介质和程序产品

Country Status (2)

Country Link
CN (1) CN113807540A (zh)
WO (1) WO2023040545A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113807540A (zh) * 2021-09-17 2021-12-17 北京搜狗科技发展有限公司 一种数据处理方法及装置

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111832701A (zh) * 2020-06-09 2020-10-27 北京百度网讯科技有限公司 模型的蒸馏方法、装置、电子设备及存储介质
US20200401929A1 (en) * 2019-06-19 2020-12-24 Google Llc Systems and Methods for Performing Knowledge Distillation
CN112487182A (zh) * 2019-09-12 2021-03-12 华为技术有限公司 文本处理模型的训练方法、文本处理方法及装置
CN112686046A (zh) * 2021-01-06 2021-04-20 上海明略人工智能(集团)有限公司 模型训练方法、装置、设备及计算机可读介质
CN113807540A (zh) * 2021-09-17 2021-12-17 北京搜狗科技发展有限公司 一种数据处理方法及装置
CN114154395A (zh) * 2021-11-04 2022-03-08 北京搜狗科技发展有限公司 一种模型处理方法、装置和用于模型处理的装置


Also Published As

Publication number Publication date
CN113807540A (zh) 2021-12-17

Similar Documents

Publication Publication Date Title
WO2021047286A1 (zh) 文本处理模型的训练方法、文本处理方法及装置
CN108334487B (zh) 缺失语意信息补全方法、装置、计算机设备和存储介质
CN110349572B (zh) 一种语音关键词识别方法、装置、终端及服务器
CN112131366B (zh) 训练文本分类模型及文本分类的方法、装置及存储介质
WO2022095380A1 (zh) 基于ai的虚拟交互模型生成方法、装置、计算机设备及存储介质
WO2021121198A1 (zh) 基于语义相似度的实体关系抽取方法、装置、设备及介质
CN109145213B (zh) 基于历史信息的查询推荐方法及装置
WO2021135455A1 (zh) 语义召回方法、装置、计算机设备及存储介质
CN110472002B (zh) 一种文本相似度获取方法和装置
EP3566151A1 (en) Generating responses in automated chatting
WO2020073533A1 (zh) 自动问答方法及装置
CN116561538A (zh) 问答评分方法、问答评分装置、电子设备及存储介质
CN113672708A (zh) 语言模型训练方法、问答对生成方法、装置及设备
CN111563158A (zh) 文本排序方法、排序装置、服务器和计算机可读存储介质
EP4123516A1 (en) Method and apparatus for acquiring pre-trained model, electronic device and storage medium
WO2023040545A1 (zh) 一种数据处理方法、装置、设备、存储介质和程序产品
WO2023134069A1 (zh) 实体关系的识别方法、设备及可读存储介质
CN115438149A (zh) 一种端到端模型训练方法、装置、计算机设备及存储介质
CN112232066A (zh) 一种教学纲要生成方法、装置、存储介质及电子设备
JP2023002690A (ja) セマンティックス認識方法、装置、電子機器及び記憶媒体
WO2023029354A1 (zh) 文本信息提取方法、装置、存储介质及计算机设备
CN111444321B (zh) 问答方法、装置、电子设备和存储介质
CN113342944B (zh) 一种语料泛化方法、装置、设备及存储介质
CN111767720B (zh) 一种标题生成方法、计算机及可读存储介质
CN111931503A (zh) 信息抽取方法及装置、设备、计算机可读存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22868921

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE