CN111104514A - Method and device for training a document label model

Method and device for training a document label model

Info

Publication number: CN111104514A
Application number: CN201911338269.XA
Authority: CN (China)
Prior art keywords: submodel, label, recall, document, model
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN111104514B (granted publication)
Inventors: 刘呈祥, 何伯磊, 肖欣延
Assignee (current and original): Beijing Baidu Netcom Science and Technology Co Ltd
Priority and filing date: 2019-12-23
Publication dates: CN111104514A, 2020-05-05; CN111104514B, 2023-04-25


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35: Clustering; Classification


Abstract

The application discloses a method and device for training a document label model, and relates to the technical field of document label prediction. The implementation scheme is as follows: obtain a pre-trained document label model, wherein the document label model is pre-trained with general training data from all application scenarios; obtain scene training data of a target application scenario, the scene training data including a plurality of documents and corresponding label information under that scenario; obtain the submodels of the document label model related to the target application scenario; and train the submodels with the scene training data to obtain a trained document label model. In this way, the training data required to train the document label model for the target application scenario can be reduced, and training cost is reduced while the accuracy of the document label model is ensured.

Description

Method and device for training a document label model
Technical Field
The application relates to the technical field of data processing, in particular to the technical field of document label prediction, and more particularly to a method and apparatus for training a document label model.
Background
Label prediction for documents is currently an important part of document content understanding. For a new document label prediction scenario, there are two main approaches. The first is to train a general document label model: the model is trained without regard to differences between scenarios, and the same general model is used in all scenarios. The second is to train a document label model separately: training data is prepared specifically for the new scenario.
With the first approach, the trained model lacks scenario or domain specificity, so its prediction accuracy in any single scenario is low. With the second approach, a large amount of training data must be prepared, so the training cost is high.
Disclosure of Invention
The application provides a method and apparatus for training a document label model, which train the submodels of a pre-trained document label model that are related to a target application scenario using scene training data of that scenario, thereby reducing the cost of training the document label model for the target application scenario while ensuring the accuracy of the document label model.
An embodiment of one aspect of the present application provides a method for training a document label model, including:
obtaining a pre-trained document label model, wherein the document label model is pre-trained with general training data from all application scenarios;
obtaining scene training data of a target application scenario, wherein the scene training data includes a plurality of documents and corresponding label information under the target application scenario;
obtaining the submodels of the document label model related to the target application scenario; and
training the submodels with the scene training data to obtain a trained document label model.
In one embodiment of the present application, the document label model includes a preprocessing layer, a candidate recall layer, a coarse ranking layer, and a fine ranking layer;
the candidate recall layer includes a keyword recall submodel, a multi-label classification recall submodel, an explicit recall submodel, and an implicit recall submodel connected in parallel;
the coarse ranking layer includes a rule submodel and a semantic matching submodel connected in parallel; and
the submodels related to the target application scenario include the semantic matching submodel and any one or more of the following submodels: the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel.
In one embodiment of the present application, when the submodels related to the target application scenario include the semantic matching submodel, the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel, training the submodels with the scene training data to obtain a trained document label model includes:
for each document in the scene training data, inputting the document into the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel respectively, and merging the output results to obtain a candidate label result;
inputting the document and the candidate label result into the semantic matching submodel to obtain the relevance between the document and each candidate label in the candidate label result; and
adjusting the coefficients of the semantic matching submodel, the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel according to the relevance between the document and each candidate label in the candidate label result and the label information corresponding to the document, to obtain a trained document label model.
In an embodiment of the present application, the scene training data further includes a label set, the label set including the labels that the document label model can predict, so that the document label model performs label prediction on the documents in the scene training data in combination with the label set.
In an embodiment of the present application, before training the submodels with the scene training data to obtain the trained document label model, the method further includes:
initializing the coefficients of the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel in the document label model.
According to the method for training a document label model of the embodiments of the present application, a pre-trained document label model is obtained, the model having been pre-trained with general training data from all application scenarios; scene training data of a target application scenario is obtained, the scene training data including a plurality of documents and corresponding label information under that scenario; the submodels of the document label model related to the target application scenario are obtained; and the submodels are trained with the scene training data to obtain a trained document label model. In this way, the training data required to train the document label model for the target application scenario can be reduced, and training cost is reduced while the accuracy of the document label model is ensured.
Another embodiment of the present application provides an apparatus for training a document label model, including:
an obtaining module, configured to obtain a pre-trained document label model, wherein the document label model is pre-trained with general training data from all application scenarios;
the obtaining module being further configured to obtain scene training data of a target application scenario, wherein the scene training data includes a plurality of documents and corresponding label information under the target application scenario;
the obtaining module being further configured to obtain the submodels of the document label model related to the target application scenario; and
a training module, configured to train the submodels with the scene training data to obtain a trained document label model.
In one embodiment of the present application, the document label model includes a preprocessing layer, a candidate recall layer, a coarse ranking layer, and a fine ranking layer;
the candidate recall layer includes a keyword recall submodel, a multi-label classification recall submodel, an explicit recall submodel, and an implicit recall submodel connected in parallel;
the coarse ranking layer includes a rule submodel and a semantic matching submodel connected in parallel; and
the submodels related to the target application scenario include the semantic matching submodel and any one or more of the following submodels: the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel.
In one embodiment of the present application, when the submodels related to the target application scenario include the semantic matching submodel, the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel, the training module is specifically configured to:
for each document in the scene training data, input the document into the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel respectively, and merge the output results to obtain a candidate label result;
input the document and the candidate label result into the semantic matching submodel to obtain the relevance between the document and each candidate label in the candidate label result; and
adjust the coefficients of the semantic matching submodel, the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel according to that relevance and the label information corresponding to the document, to obtain a trained document label model.
In an embodiment of the present application, the scene training data further includes a label set, the label set including the labels that the document label model can predict, so that the document label model performs label prediction on the documents in the scene training data in combination with the label set.
In one embodiment of the present application, the apparatus further includes an initialization module, configured to initialize the coefficients of the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel in the document label model.
With the apparatus for training a document label model of the embodiments of the present application, a pre-trained document label model is obtained, the model having been pre-trained with general training data from all application scenarios; scene training data of a target application scenario is obtained, including a plurality of documents and corresponding label information under that scenario; the submodels of the document label model related to the target application scenario are obtained; and the submodels are trained with the scene training data to obtain a trained document label model. This reduces the training data required to train the document label model for the target application scenario and reduces training cost while ensuring the accuracy of the document label model.
An embodiment of another aspect of the present application provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method for training a document label model of the embodiments of the present application.
An embodiment of yet another aspect of the present application provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method for training a document label model of the embodiments of the present application.
Other effects of the above-described alternatives will be described below with reference to specific embodiments.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. In the drawings:
FIG. 1 is a schematic diagram according to a first embodiment of the present application;
FIG. 2 is a schematic diagram of the structure of the document label model;
FIG. 3 is a schematic diagram according to a second embodiment of the present application;
FIG. 4 is a schematic illustration according to a third embodiment of the present application;
FIG. 5 is a block diagram of an electronic device for implementing the method for training a document label model according to an embodiment of the present application.
Detailed Description
The following describes exemplary embodiments of the present application with reference to the accompanying drawings, including various details of the embodiments to aid understanding; these details should be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
The following describes the method and apparatus for training a document label model according to embodiments of the present application with reference to the drawings.
Fig. 1 is a schematic diagram according to a first embodiment of the present application. It should be noted that the execution subject of the method for training a document label model provided in this embodiment is an apparatus for training a document label model; the apparatus may be implemented in software and/or hardware, and may be configured in a terminal device or a server, which is not specifically limited in this embodiment.
As shown in fig. 1, the method for training the document label model may include:
Step 101, obtain a pre-trained document label model, wherein the document label model is pre-trained with general training data from all application scenarios.
In the present application, the structure of the document label model may be as shown in fig. 2. In fig. 2, the document label model includes: a preprocessing layer, a candidate recall layer, a coarse ranking layer, and a fine ranking layer. The preprocessing layer performs paragraph segmentation, sentence segmentation, word segmentation, part-of-speech (POS) tagging, named entity recognition (NER), and other processing on the document to obtain a preprocessing result. The preprocessing result includes: the paragraph segmentation result, the sentence segmentation result, the word segmentation result, the POS tagging result, and the named entity recognition result.
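For illustration only, the four-layer structure described above might be organized as follows; this is a minimal Python sketch, all class and function names are hypothetical, and the patent does not prescribe any particular implementation:

    from dataclasses import dataclass, field

    @dataclass
    class PreprocessResult:
        paragraphs: list = field(default_factory=list)  # paragraph segmentation result
        sentences: list = field(default_factory=list)   # sentence segmentation result
        tokens: list = field(default_factory=list)      # word segmentation result
        pos_tags: list = field(default_factory=list)    # part-of-speech tagging result
        entities: list = field(default_factory=list)    # named entity recognition result

    class DocumentLabelModel:
        """Preprocessing -> candidate recall -> coarse ranking -> fine ranking."""

        def __init__(self, preprocess, recall_layer, coarse_layer, fine_layer):
            self.preprocess = preprocess        # preprocessing layer
            self.recall_layer = recall_layer    # candidate recall layer
            self.coarse_layer = coarse_layer    # coarse ranking layer
            self.fine_layer = fine_layer        # fine ranking layer

        def predict(self, document: str) -> list:
            prep = self.preprocess(document)
            candidates = self.recall_layer(document, prep)      # candidate label result
            filtered = self.coarse_layer(document, candidates)  # filtered candidate label result
            return self.fine_layer(document, filtered)          # predicted label information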
The candidate recall layer includes: a keyword recall submodel, a multi-label classification recall submodel, an explicit recall submodel, and an implicit recall submodel connected in parallel. The input of each of the four recall submodels is a document and the preprocessing result corresponding to the document; the output is a number of candidate labels. The output results of the four recall submodels are merged to obtain a candidate label result. The keyword recall submodel determines candidate labels by analyzing the semantic structure of the document and statistical features. The multi-label classification recall submodel determines candidate labels based on neural network (NN) multi-label classification. The explicit recall submodel determines candidate labels based on literal matching and frequency screening. The implicit recall submodel determines candidate labels based on primary and secondary component analysis.
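A hypothetical Python sketch of the candidate recall layer follows, assuming each of the four parallel submodels is a callable that returns a list of candidate labels, and that merging means a de-duplicated union (the patent does not specify the merge operation):

    class CandidateRecallLayer:
        def __init__(self, keyword, multilabel, explicit, implicit):
            # The four parallel recall submodels described above.
            self.submodels = [keyword, multilabel, explicit, implicit]

        def __call__(self, document, prep):
            merged, seen = [], set()
            for submodel in self.submodels:
                for label in submodel(document, prep):  # each submodel outputs candidate labels
                    if label not in seen:               # merge by de-duplicating across submodels
                        seen.add(label)
                        merged.append(label)
            return merged  # the candidate label result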
The coarse ranking layer includes: a rule submodel and a semantic matching submodel connected in parallel. The rule submodel determines, according to preset rules, the candidate labels in the candidate label result that are to be filtered out. The semantic matching submodel determines the text relevance between the document and each candidate label in the candidate label result, and determines the candidate labels to be filtered out according to the text relevance. The candidate labels to be filtered are removed from the candidate label result to obtain a filtered candidate label result. Text relevance refers to the semantic-level similarity between the text and a candidate label.
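A hypothetical sketch of the coarse ranking layer follows; the relevance threshold is an assumption introduced for illustration, since the patent states only that candidates to be filtered are determined from the text relevance:

    class CoarseRankingLayer:
        def __init__(self, rule_submodel, semantic_matcher, min_relevance=0.5):
            self.rule_submodel = rule_submodel        # rule-based filtering (preset rules)
            self.semantic_matcher = semantic_matcher  # document-label text relevance in [0, 1]
            self.min_relevance = min_relevance        # assumed filtering threshold

        def __call__(self, document, candidates):
            to_filter = set(self.rule_submodel(document, candidates))
            for label in candidates:
                if self.semantic_matcher(document, label) < self.min_relevance:
                    to_filter.add(label)
            # Remove the candidates to be filtered to obtain the filtered result.
            return [label for label in candidates if label not in to_filter]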
The fine ranking layer ranks the candidate labels in the filtered candidate label result according to their text relevance, label popularity, and label granularity, and predicts the label information corresponding to the document from the ranking result. Label popularity refers to the degree of user interest in a candidate label, for example its search popularity. Label granularity is computed from the word types and length of the candidate label's components: the more specific the content of a candidate label, the smaller its granularity. For example, sorted from coarse to fine granularity: Baidu -> Baidu Union Summit; entertainment -> entertainment star.
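A hypothetical sketch of the fine ranking layer follows; the weighted-sum score and its weights are illustrative assumptions, as the patent states only that ranking uses text relevance, label popularity, and label granularity:

    def fine_rank(candidates, relevance, popularity, granularity,
                  top_k=5, w_rel=0.6, w_pop=0.3, w_gran=0.1):
        """Rank filtered candidates; relevance/popularity/granularity map label -> value."""
        def score(label):
            # Smaller granularity means more specific content, so specificity is
            # rewarded by subtracting the granularity term (an assumed design choice).
            return (w_rel * relevance[label]
                    + w_pop * popularity[label]
                    - w_gran * granularity[label])

        return sorted(candidates, key=score, reverse=True)[:top_k]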
In the present application, application scenarios include, for example: recall-oriented label prediction for long documents, precision-oriented label prediction for questions and answers, recall-oriented label prediction for users' original content, and so on. Prediction objects may include: long documents, questions and answers, users' original content, and the like. Prediction requirements include, for example: recall-oriented, precision-oriented, entity-oriented, classification-oriented, high commercial value, and so on.
In the present application, the general training data of all application scenarios may refer to, for example, training data obtained by combining the training data of the individual application scenarios. Before the target application scenario is determined, a large amount of general training data from all application scenarios can be used to pre-train the initial document label model, so that the amount of training data needed is reduced once the target application scenario is determined.
Step 102, obtain scene training data of a target application scenario, wherein the scene training data includes: a plurality of documents and corresponding label information under the target application scenario.
Step 103, obtain the submodels of the document label model related to the target application scenario.
In the present application, the submodels related to the target application scenario include: the semantic matching submodel, and any one or more of the following submodels: the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel. Submodels can be selected from these for retraining or fine-tuning according to the specific target application scenario, for example as sketched below.
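Purely as an illustration of how such a selection might be configured (the patent mandates no particular mapping), the scenario-to-submodel choice could be expressed as a lookup table; the scenario names below are hypothetical:

    # The semantic matching submodel is always selected; the recall submodels
    # are chosen per target application scenario (hypothetical configuration).
    SCENARIO_SUBMODELS = {
        "long_document_recall": ["semantic_matching", "multilabel_recall", "implicit_recall"],
        "qa_precision":         ["semantic_matching", "explicit_recall"],
        "ugc_recall":           ["semantic_matching", "multilabel_recall",
                                 "explicit_recall", "implicit_recall"],
    }

    def submodels_for_scenario(scenario: str) -> list:
        return SCENARIO_SUBMODELS.get(scenario, ["semantic_matching"])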
Step 104, train the submodels with the scene training data to obtain a trained document label model.
In the present application, when the submodels related to the target application scenario include the semantic matching submodel, the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel, the apparatus for training the document label model may perform step 104 as follows: for each document in the scene training data, input the document into the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel respectively, and merge the output results to obtain a candidate label result; input the document and the candidate label result into the semantic matching submodel to obtain the relevance between the document and each candidate label in the candidate label result; and adjust the coefficients of the semantic matching submodel, the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel according to that relevance and the label information corresponding to the document, to obtain a trained document label model.
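A minimal training sketch for step 104 follows, assuming PyTorch and a binary cross-entropy loss, neither of which is mandated by the patent. Gradients here reach the semantic matching submodel directly; a fuller implementation would also update each recall submodel with its own loss, since gradients do not flow through the discrete merged candidate set:

    import torch

    def dedup(labels):
        seen, merged = set(), []
        for label in labels:
            if label not in seen:
                seen.add(label)
                merged.append(label)
        return merged

    def train_scene_submodels(model, scene_data, epochs=3, lr=1e-4):
        optimizer = torch.optim.Adam(model.semantic_matching.parameters(), lr=lr)
        loss_fn = torch.nn.BCELoss()
        for _ in range(epochs):
            for document, true_labels in scene_data:
                # Merge the outputs of the three recall submodels into a candidate label result.
                candidates = dedup(model.multilabel(document)
                                   + model.explicit(document)
                                   + model.implicit(document))
                # Relevance of the document to each candidate label, assumed to be a
                # tensor of values in [0, 1] produced by the semantic matching submodel.
                relevance = model.semantic_matching(document, candidates)
                target = torch.tensor([1.0 if c in true_labels else 0.0
                                       for c in candidates])
                loss = loss_fn(relevance, target)  # compare against ground-truth label info
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return model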
In the present application, to improve the accuracy of the trained document label model, the scene training data may further include: a label set, the label set including the labels that the document label model can predict, so that the document label model performs label prediction on the documents in the scene training data in combination with the label set.
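One simple way such a label set might be applied (an illustrative assumption, not a requirement of the patent) is to restrict the candidate labels to the label set before ranking:

    def restrict_to_label_set(candidates, label_set):
        """Keep only candidates that the scenario's label set allows the model to predict."""
        allowed = set(label_set)
        return [label for label in candidates if label in allowed]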
In the present application, before step 104, the method may further include: initializing the coefficients of the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel in the document label model, so that when the submodels are trained for the target application scenario, the coefficients carried over from the pre-trained document label model do not interfere, further improving the accuracy of the document label model in the target application scenario.
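A hypothetical sketch of this initialization step follows, assuming the three recall submodels are PyTorch modules; Xavier initialization is an assumption, as the patent requires only that the coefficients be initialized before scene training:

    import torch.nn as nn

    def reinitialize_recall_submodels(model):
        # Reset the coefficients of the three recall submodels so that values
        # carried over from pre-training do not interfere with scene training.
        for submodel in (model.multilabel, model.explicit, model.implicit):
            for module in submodel.modules():
                if isinstance(module, nn.Linear):
                    nn.init.xavier_uniform_(module.weight)
                    if module.bias is not None:
                        nn.init.zeros_(module.bias)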
According to the method for training a document label model of the embodiments of the present application, a pre-trained document label model is obtained, the model having been pre-trained with general training data from all application scenarios; scene training data of a target application scenario is obtained, the scene training data including a plurality of documents and corresponding label information under that scenario; the submodels of the document label model related to the target application scenario are obtained; and the submodels are trained with the scene training data to obtain a trained document label model. In this way, the training data required to train the document label model for the target application scenario can be reduced, and training cost is reduced while the accuracy of the document label model is ensured.
To implement the above embodiments, an embodiment of the present application further provides an apparatus for training a document label model.
Fig. 3 is a schematic diagram according to a second embodiment of the present application. As shown in fig. 3, the apparatus 100 for training a document label model includes:
an obtaining module 110, configured to obtain a pre-trained document label model, wherein the document label model is pre-trained with general training data from all application scenarios;
the obtaining module 110 being further configured to obtain scene training data of a target application scenario, wherein the scene training data includes a plurality of documents and corresponding label information under the target application scenario;
the obtaining module 110 being further configured to obtain the submodels of the document label model related to the target application scenario; and
a training module 120, configured to train the submodels with the scene training data to obtain a trained document label model.
In one embodiment of the present application, the document label model includes a preprocessing layer, a candidate recall layer, a coarse ranking layer, and a fine ranking layer;
the candidate recall layer includes a keyword recall submodel, a multi-label classification recall submodel, an explicit recall submodel, and an implicit recall submodel connected in parallel;
the coarse ranking layer includes a rule submodel and a semantic matching submodel connected in parallel; and
the submodels related to the target application scenario include the semantic matching submodel and any one or more of the following submodels: the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel.
In one embodiment of the present application, when the submodels related to the target application scenario include the semantic matching submodel, the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel, the training module 120 is specifically configured to:
for each document in the scene training data, input the document into the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel respectively, and merge the output results to obtain a candidate label result;
input the document and the candidate label result into the semantic matching submodel to obtain the relevance between the document and each candidate label in the candidate label result; and
adjust the coefficients of the semantic matching submodel, the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel according to that relevance and the label information corresponding to the document, to obtain a trained document label model.
In an embodiment of the present application, the scene training data further includes a label set, the label set including the labels that the document label model can predict, so that the document label model performs label prediction on the documents in the scene training data in combination with the label set.
In an embodiment of the present application, as shown in fig. 4, the apparatus further includes: an initialization module 130, configured to initialize the coefficients of the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel in the document label model.
It should be noted that the foregoing explanation of the method for training a document label model also applies to the apparatus for training a document label model of this embodiment, and is not repeated here.
With the apparatus for training a document label model of the embodiments of the present application, a pre-trained document label model is obtained, the model having been pre-trained with general training data from all application scenarios; scene training data of a target application scenario is obtained, including a plurality of documents and corresponding label information under that scenario; the submodels of the document label model related to the target application scenario are obtained; and the submodels are trained with the scene training data to obtain a trained document label model. This reduces the training data required to train the document label model for the target application scenario and reduces training cost while ensuring the accuracy of the document label model.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 5 is a block diagram of an electronic device for the method for training a document label model according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant as examples only and are not meant to limit the implementations of the present application described and/or claimed herein.
As shown in fig. 5, the electronic device includes: one or more processors 301, a memory 302, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The components are interconnected by different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to an interface). In other implementations, multiple processors and/or multiple buses may be used together with multiple memories, if desired. Likewise, multiple electronic devices may be connected, each providing part of the necessary operations (e.g., as a server array, a group of blade servers, or a multiprocessor system). In fig. 5, one processor 301 is taken as an example.
The memory 302 is a non-transitory computer-readable storage medium provided herein, storing instructions executable by at least one processor to cause the at least one processor to perform the method for training a document label model provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the method for training a document label model provided herein.
The memory 302, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the method for training a document label model in the embodiments of the present application (e.g., the obtaining module 110 and the training module 120 shown in fig. 3, and the initialization module 130 shown in fig. 4). By running the non-transitory software programs, instructions, and modules stored in the memory 302, the processor 301 executes the various functional applications and data processing of the server, that is, implements the method for training a document label model in the above method embodiments.
The memory 302 may include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required by at least one function, and the data storage area may store data created from use of the electronic device for training the document label model, and the like. Further, the memory 302 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 302 optionally includes memories located remotely from the processor 301, and these remote memories may be connected over a network to the electronic device for training the document label model. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device for the method for training a document label model may further include: an input device 303 and an output device 304. The processor 301, the memory 302, the input device 303, and the output device 304 may be connected by a bus or in other manners; connection by a bus is taken as an example in fig. 5.
The input device 303 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device for training the document label model, and may be, for example, a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or another input device. The output device 304 may include a display device, auxiliary lighting devices (e.g., LEDs), haptic feedback devices (e.g., vibration motors), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in a different order, and no limitation is imposed herein as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (12)

1. A method for training a document label model, comprising:
obtaining a pre-trained document label model, wherein the document label model is pre-trained with general training data from all application scenarios;
obtaining scene training data of a target application scenario, wherein the scene training data comprises a plurality of documents and corresponding label information under the target application scenario;
obtaining submodels of the document label model related to the target application scenario; and
training the submodels with the scene training data to obtain a trained document label model.
2. The method of claim 1, wherein the document label model comprises a preprocessing layer, a candidate recall layer, a coarse ranking layer, and a fine ranking layer;
the candidate recall layer comprises a keyword recall submodel, a multi-label classification recall submodel, an explicit recall submodel, and an implicit recall submodel connected in parallel;
the coarse ranking layer comprises a rule submodel and a semantic matching submodel connected in parallel; and
the submodels related to the target application scenario comprise the semantic matching submodel and any one or more of the following submodels: the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel.
3. The method of claim 2, wherein, when the submodels related to the target application scenario comprise the semantic matching submodel, the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel, training the submodels with the scene training data to obtain a trained document label model comprises:
for each document in the scene training data, inputting the document into the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel respectively, and merging the output results to obtain a candidate label result;
inputting the document and the candidate label result into the semantic matching submodel to obtain the relevance between the document and each candidate label in the candidate label result; and
adjusting coefficients of the semantic matching submodel, the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel according to the relevance between the document and each candidate label in the candidate label result and the label information corresponding to the document, to obtain a trained document label model.
4. The method of claim 1, wherein the scene training data further comprises a label set, the label set comprising labels that the document label model can predict, so that the document label model performs label prediction on the documents in the scene training data in combination with the label set.
5. The method of claim 3, wherein, before training the submodels with the scene training data to obtain the trained document label model, the method further comprises:
initializing the coefficients of the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel in the document label model.
6. An apparatus for training a document label model, comprising:
an obtaining module, configured to obtain a pre-trained document label model, wherein the document label model is pre-trained with general training data from all application scenarios;
the obtaining module being further configured to obtain scene training data of a target application scenario, wherein the scene training data comprises a plurality of documents and corresponding label information under the target application scenario;
the obtaining module being further configured to obtain submodels of the document label model related to the target application scenario; and
a training module, configured to train the submodels with the scene training data to obtain a trained document label model.
7. The apparatus of claim 6, wherein the document label model comprises a preprocessing layer, a candidate recall layer, a coarse ranking layer, and a fine ranking layer;
the candidate recall layer comprises a keyword recall submodel, a multi-label classification recall submodel, an explicit recall submodel, and an implicit recall submodel connected in parallel;
the coarse ranking layer comprises a rule submodel and a semantic matching submodel connected in parallel; and
the submodels related to the target application scenario comprise the semantic matching submodel and any one or more of the following submodels: the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel.
8. The apparatus of claim 7, wherein the submodels related to the target application scenario comprise the semantic matching submodel, the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel, and the training module is specifically configured to:
for each document in the scene training data, input the document into the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel respectively, and merge the output results to obtain a candidate label result;
input the document and the candidate label result into the semantic matching submodel to obtain the relevance between the document and each candidate label in the candidate label result; and
adjust coefficients of the semantic matching submodel, the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel according to the relevance between the document and each candidate label in the candidate label result and the label information corresponding to the document, to obtain a trained document label model.
9. The apparatus of claim 6, wherein the scene training data further comprises a label set, the label set comprising labels that the document label model can predict, so that the document label model performs label prediction on the documents in the scene training data in combination with the label set.
10. The apparatus of claim 8, further comprising an initialization module, configured to initialize the coefficients of the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel in the document label model.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.
12. A non-transitory computer-readable storage medium storing computer instructions, the computer instructions being configured to cause a computer to perform the method of any one of claims 1-5.
CN201911338269.XA 2019-12-23 2019-12-23 Training method and device for document tag model Active CN111104514B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201911338269.XA (granted as CN111104514B) | 2019-12-23 | 2019-12-23 | Training method and device for document tag model


Publications (2)

Publication Number | Publication Date
CN111104514A | 2020-05-05
CN111104514B | 2023-04-25

Family

ID=70423892

Family Applications (1)

Application Number | Priority Date | Filing Date | Title | Status
CN201911338269.XA (granted as CN111104514B) | 2019-12-23 | 2019-12-23 | Training method and device for document tag model | Active

Country Status (1)

Country | Document
CN | CN111104514B



Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015187155A1 (en) * 2014-06-04 2015-12-10 Waterline Data Science, Inc. Systems and methods for management of data platforms
US20160162468A1 (en) * 2014-12-09 2016-06-09 Idibon, Inc. Methods and systems for providing universal portability in machine learning
US20160162456A1 (en) * 2014-12-09 2016-06-09 Idibon, Inc. Methods for generating natural language processing systems
CN108304439A (en) * 2017-10-30 2018-07-20 腾讯科技(深圳)有限公司 A kind of semantic model optimization method, device and smart machine, storage medium
CN108153856A (en) * 2017-12-22 2018-06-12 北京百度网讯科技有限公司 For the method and apparatus of output information
CN108733779A (en) * 2018-05-04 2018-11-02 百度在线网络技术(北京)有限公司 The method and apparatus of text figure
CN109376222A (en) * 2018-09-27 2019-02-22 国信优易数据有限公司 Question and answer matching degree calculation method, question and answer automatic matching method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
谢晨阳 (Xie Chenyang): "Research on multi-label document classification based on hierarchical supervision" *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111581545B (en) * 2020-05-12 2023-09-19 腾讯科技(深圳)有限公司 Method for sorting recall documents and related equipment
CN111581545A (en) * 2020-05-12 2020-08-25 腾讯科技(深圳)有限公司 Method for sorting recalled documents and related equipment
CN111783448A (en) * 2020-06-23 2020-10-16 北京百度网讯科技有限公司 Document dynamic adjustment method, device, equipment and readable storage medium
CN111783448B (en) * 2020-06-23 2024-03-15 北京百度网讯科技有限公司 Document dynamic adjustment method, device, equipment and readable storage medium
CN111782949A (en) * 2020-06-30 2020-10-16 北京百度网讯科技有限公司 Method and apparatus for generating information
CN111858895A (en) * 2020-07-30 2020-10-30 阳光保险集团股份有限公司 Sequencing model determining method, sequencing device and electronic equipment
CN111858895B (en) * 2020-07-30 2024-04-05 阳光保险集团股份有限公司 Sequencing model determining method, sequencing device and electronic equipment
CN112149733A (en) * 2020-09-23 2020-12-29 北京金山云网络技术有限公司 Model training method, model training device, quality determining method, quality determining device, electronic equipment and storage medium
CN112149733B (en) * 2020-09-23 2024-04-05 北京金山云网络技术有限公司 Model training method, model quality determining method, model training device, model quality determining device, electronic equipment and storage medium
CN112580706A (en) * 2020-12-11 2021-03-30 北京地平线机器人技术研发有限公司 Training data processing method and device applied to data management platform and electronic equipment
CN112580706B (en) * 2020-12-11 2024-05-17 北京地平线机器人技术研发有限公司 Training data processing method and device applied to data management platform and electronic equipment
CN112560402A (en) * 2020-12-28 2021-03-26 北京百度网讯科技有限公司 Model training method and device and electronic equipment
CN112784033B (en) * 2021-01-29 2023-11-03 北京百度网讯科技有限公司 Aging grade identification model training and application method and electronic equipment
CN113011490A (en) * 2021-03-16 2021-06-22 北京百度网讯科技有限公司 Model training method and device and electronic equipment
CN113011490B (en) * 2021-03-16 2024-03-08 北京百度网讯科技有限公司 Model training method and device and electronic equipment
CN113239128A (en) * 2021-06-01 2021-08-10 平安科技(深圳)有限公司 Data pair classification method, device, equipment and storage medium based on implicit characteristics
CN113239128B (en) * 2021-06-01 2022-03-18 平安科技(深圳)有限公司 Data pair classification method, device, equipment and storage medium based on implicit characteristics
CN117456416A (en) * 2023-11-03 2024-01-26 北京饼干科技有限公司 Method and system for intelligently generating material labels
CN117456416B (en) * 2023-11-03 2024-06-07 北京饼干科技有限公司 Method and system for intelligently generating material labels

Also Published As

Publication Number | Publication Date
CN111104514B | 2023-04-25

Similar Documents

Publication Publication Date Title
CN111104514B (en) Training method and device for document tag model
CN111428008B (en) Method, apparatus, device and storage medium for training a model
CN111507104B (en) Method and device for establishing label labeling model, electronic equipment and readable storage medium
CN111625635A (en) Question-answer processing method, language model training method, device, equipment and storage medium
CN110674314B (en) Sentence recognition method and device
CN111125435B (en) Video tag determination method and device and computer equipment
CN112487814B (en) Entity classification model training method, entity classification device and electronic equipment
CN111709247A (en) Data set processing method and device, electronic equipment and storage medium
CN111859982B (en) Language model training method and device, electronic equipment and readable storage medium
CN111737994A (en) Method, device and equipment for obtaining word vector based on language model and storage medium
CN112036509A (en) Method and apparatus for training image recognition models
CN110674260B (en) Training method and device of semantic similarity model, electronic equipment and storage medium
CN110705460A (en) Image category identification method and device
CN111078878B (en) Text processing method, device, equipment and computer readable storage medium
CN111259671A (en) Semantic description processing method, device and equipment for text entity
CN111680517A (en) Method, apparatus, device and storage medium for training a model
CN111522944A (en) Method, apparatus, device and storage medium for outputting information
CN111539209A (en) Method and apparatus for entity classification
CN111582477A (en) Training method and device of neural network model
CN111310058B (en) Information theme recommendation method, device, terminal and storage medium
CN111090991A (en) Scene error correction method and device, electronic equipment and storage medium
CN111127191A (en) Risk assessment method and device
CN112541362A (en) Generalization processing method, device, equipment and computer storage medium
CN111984775A (en) Question and answer quality determination method, device, equipment and storage medium
CN111666771A (en) Semantic label extraction device, electronic equipment and readable storage medium of document

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant