CN111104514A - Method and device for training a document label model

Method and device for training a document label model

Info

Publication number: CN111104514A
Application number: CN201911338269.XA
Authority: CN (China)
Prior art keywords: submodel, label, recall, document, model
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN111104514B (granted publication)
Inventors: 刘呈祥, 何伯磊, 肖欣延
Assignee (current and original): Beijing Baidu Netcom Science and Technology Co Ltd
Priority and filing date: 2019-12-23
Publication dates: CN111104514A, 2020-05-05; CN111104514B, 2023-04-25


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35: Clustering; Classification


Abstract

The application discloses a method and device for training a document label model, and relates to the technical field of document label prediction. The implementation scheme is as follows: obtain a pre-trained document label model, wherein the document label model is pre-trained with general training data from all application scenarios; obtain scene training data of a target application scenario, the scene training data including a plurality of documents and corresponding label information under that scenario; obtain the submodels of the document label model related to the target application scenario; and train the submodels with the scene training data to obtain a trained document label model. In this way, the training data required to train the document label model for the target application scenario can be reduced, and training cost is reduced while the accuracy of the document label model is ensured.

Description

Method and device for training a document label model
Technical Field
The application relates to the technical field of data processing, in particular to the technical field of document label prediction, and more particularly to a method and apparatus for training a document label model.
Background
Label prediction for documents is currently an important part of document content understanding. For a new document label prediction scenario, there are two main approaches. The first is to train a general document label model: the model is trained without regard to differences between scenarios, and the same general model is used in all scenarios. The second is to train a document label model separately: training data is prepared specifically for the new scenario.
With the first approach, the trained model lacks scenario or domain specificity, so its prediction accuracy in any single scenario is low. With the second approach, a large amount of training data must be prepared, so the training cost is high.
Disclosure of Invention
The application provides a method and apparatus for training a document label model, which train the submodels of a pre-trained document label model that are related to a target application scenario using scene training data of that scenario, thereby reducing the cost of training the document label model for the target application scenario while ensuring the accuracy of the document label model.
An embodiment of one aspect of the present application provides a method for training a document label model, including:
obtaining a pre-trained document label model, wherein the document label model is pre-trained with general training data from all application scenarios;
obtaining scene training data of a target application scenario, wherein the scene training data includes a plurality of documents and corresponding label information under the target application scenario;
obtaining the submodels of the document label model related to the target application scenario; and
training the submodels with the scene training data to obtain a trained document label model.
In one embodiment of the present application, the document label model includes a preprocessing layer, a candidate recall layer, a coarse ranking layer, and a fine ranking layer;
the candidate recall layer includes a keyword recall submodel, a multi-label classification recall submodel, an explicit recall submodel, and an implicit recall submodel connected in parallel;
the coarse ranking layer includes a rule submodel and a semantic matching submodel connected in parallel; and
the submodels related to the target application scenario include the semantic matching submodel and any one or more of the following submodels: the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel.
In one embodiment of the present application, when the submodels related to the target application scenario include the semantic matching submodel, the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel, training the submodels with the scene training data to obtain a trained document label model includes:
for each document in the scene training data, inputting the document into the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel respectively, and merging the output results to obtain a candidate label result;
inputting the document and the candidate label result into the semantic matching submodel to obtain the relevance between the document and each candidate label in the candidate label result; and
adjusting the coefficients of the semantic matching submodel, the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel according to the relevance between the document and each candidate label in the candidate label result and the label information corresponding to the document, to obtain a trained document label model.
In an embodiment of the present application, the scene training data further includes a label set, the label set including the labels that the document label model can predict, so that the document label model performs label prediction on the documents in the scene training data in combination with the label set.
In an embodiment of the present application, before training the submodels with the scene training data to obtain the trained document label model, the method further includes:
initializing the coefficients of the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel in the document label model.
According to the method for training a document label model of the embodiments of the present application, a pre-trained document label model is obtained, the model having been pre-trained with general training data from all application scenarios; scene training data of a target application scenario is obtained, the scene training data including a plurality of documents and corresponding label information under that scenario; the submodels of the document label model related to the target application scenario are obtained; and the submodels are trained with the scene training data to obtain a trained document label model. In this way, the training data required to train the document label model for the target application scenario can be reduced, and training cost is reduced while the accuracy of the document label model is ensured.
Another embodiment of the present application provides an apparatus for training a document label model, including:
an obtaining module, configured to obtain a pre-trained document label model, wherein the document label model is pre-trained with general training data from all application scenarios;
the obtaining module being further configured to obtain scene training data of a target application scenario, wherein the scene training data includes a plurality of documents and corresponding label information under the target application scenario;
the obtaining module being further configured to obtain the submodels of the document label model related to the target application scenario; and
a training module, configured to train the submodels with the scene training data to obtain a trained document label model.
In one embodiment of the present application, the document label model includes a preprocessing layer, a candidate recall layer, a coarse ranking layer, and a fine ranking layer;
the candidate recall layer includes a keyword recall submodel, a multi-label classification recall submodel, an explicit recall submodel, and an implicit recall submodel connected in parallel;
the coarse ranking layer includes a rule submodel and a semantic matching submodel connected in parallel; and
the submodels related to the target application scenario include the semantic matching submodel and any one or more of the following submodels: the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel.
In one embodiment of the present application, when the submodels related to the target application scenario include the semantic matching submodel, the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel, the training module is specifically configured to:
for each document in the scene training data, input the document into the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel respectively, and merge the output results to obtain a candidate label result;
input the document and the candidate label result into the semantic matching submodel to obtain the relevance between the document and each candidate label in the candidate label result; and
adjust the coefficients of the semantic matching submodel, the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel according to that relevance and the label information corresponding to the document, to obtain a trained document label model.
In an embodiment of the present application, the scene training data further includes a label set, the label set including the labels that the document label model can predict, so that the document label model performs label prediction on the documents in the scene training data in combination with the label set.
In one embodiment of the present application, the apparatus further includes an initialization module, configured to initialize the coefficients of the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel in the document label model.
With the apparatus for training a document label model of the embodiments of the present application, a pre-trained document label model is obtained, the model having been pre-trained with general training data from all application scenarios; scene training data of a target application scenario is obtained, including a plurality of documents and corresponding label information under that scenario; the submodels of the document label model related to the target application scenario are obtained; and the submodels are trained with the scene training data to obtain a trained document label model. This reduces the training data required to train the document label model for the target application scenario and reduces training cost while ensuring the accuracy of the document label model.
An embodiment of another aspect of the present application provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method for training a document label model of the embodiments of the present application.
An embodiment of yet another aspect of the present application provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method for training a document label model of the embodiments of the present application.
Other effects of the above-described alternatives will be described below with reference to specific embodiments.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. In the drawings:
FIG. 1 is a schematic diagram according to a first embodiment of the present application;
FIG. 2 is a schematic diagram of the structure of the document label model;
FIG. 3 is a schematic diagram according to a second embodiment of the present application;
FIG. 4 is a schematic illustration according to a third embodiment of the present application;
FIG. 5 is a block diagram of an electronic device for implementing the method for training a document label model according to an embodiment of the present application.
Detailed Description
The following describes exemplary embodiments of the present application with reference to the accompanying drawings, including various details of the embodiments to aid understanding; these details should be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
The following describes the method and apparatus for training a document label model according to embodiments of the present application with reference to the drawings.
Fig. 1 is a schematic diagram according to a first embodiment of the present application. It should be noted that the execution subject of the method for training a document label model provided in this embodiment is an apparatus for training a document label model; the apparatus may be implemented in software and/or hardware, and may be configured in a terminal device or a server, which is not specifically limited in this embodiment.
As shown in fig. 1, the method for training the document label model may include:
Step 101, obtain a pre-trained document label model, wherein the document label model is pre-trained with general training data from all application scenarios.
In the present application, the structure of the document label model may be as shown in fig. 2. In fig. 2, the document label model includes: a preprocessing layer, a candidate recall layer, a coarse ranking layer, and a fine ranking layer. The preprocessing layer performs paragraph segmentation, sentence segmentation, word segmentation, part-of-speech (POS) tagging, named entity recognition (NER), and other processing on the document to obtain a preprocessing result. The preprocessing result includes: the paragraph segmentation result, the sentence segmentation result, the word segmentation result, the POS tagging result, and the named entity recognition result.
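For illustration only, the four-layer structure described above might be organized as follows; this is a minimal Python sketch, all class and function names are hypothetical, and the patent does not prescribe any particular implementation:

    from dataclasses import dataclass, field

    @dataclass
    class PreprocessResult:
        paragraphs: list = field(default_factory=list)  # paragraph segmentation result
        sentences: list = field(default_factory=list)   # sentence segmentation result
        tokens: list = field(default_factory=list)      # word segmentation result
        pos_tags: list = field(default_factory=list)    # part-of-speech tagging result
        entities: list = field(default_factory=list)    # named entity recognition result

    class DocumentLabelModel:
        """Preprocessing -> candidate recall -> coarse ranking -> fine ranking."""

        def __init__(self, preprocess, recall_layer, coarse_layer, fine_layer):
            self.preprocess = preprocess        # preprocessing layer
            self.recall_layer = recall_layer    # candidate recall layer
            self.coarse_layer = coarse_layer    # coarse ranking layer
            self.fine_layer = fine_layer        # fine ranking layer

        def predict(self, document: str) -> list:
            prep = self.preprocess(document)
            candidates = self.recall_layer(document, prep)      # candidate label result
            filtered = self.coarse_layer(document, candidates)  # filtered candidate label result
            return self.fine_layer(document, filtered)          # predicted label information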
The candidate recall layer includes: a keyword recall submodel, a multi-label classification recall submodel, an explicit recall submodel, and an implicit recall submodel connected in parallel. The input of each of the four recall submodels is a document and the preprocessing result corresponding to the document; the output is a number of candidate labels. The output results of the four recall submodels are merged to obtain a candidate label result. The keyword recall submodel determines candidate labels by analyzing the semantic structure of the document and statistical features. The multi-label classification recall submodel determines candidate labels based on neural network (NN) multi-label classification. The explicit recall submodel determines candidate labels based on literal matching and frequency screening. The implicit recall submodel determines candidate labels based on primary and secondary component analysis.
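A hypothetical Python sketch of the candidate recall layer follows, assuming each of the four parallel submodels is a callable that returns a list of candidate labels, and that merging means a de-duplicated union (the patent does not specify the merge operation):

    class CandidateRecallLayer:
        def __init__(self, keyword, multilabel, explicit, implicit):
            # The four parallel recall submodels described above.
            self.submodels = [keyword, multilabel, explicit, implicit]

        def __call__(self, document, prep):
            merged, seen = [], set()
            for submodel in self.submodels:
                for label in submodel(document, prep):  # each submodel outputs candidate labels
                    if label not in seen:               # merge by de-duplicating across submodels
                        seen.add(label)
                        merged.append(label)
            return merged  # the candidate label result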
The coarse ranking layer includes: a rule submodel and a semantic matching submodel connected in parallel. The rule submodel determines, according to preset rules, the candidate labels in the candidate label result that are to be filtered out. The semantic matching submodel determines the text relevance between the document and each candidate label in the candidate label result, and determines the candidate labels to be filtered out according to the text relevance. The candidate labels to be filtered are removed from the candidate label result to obtain a filtered candidate label result. Text relevance refers to the semantic-level similarity between the text and a candidate label.
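A hypothetical sketch of the coarse ranking layer follows; the relevance threshold is an assumption introduced for illustration, since the patent states only that candidates to be filtered are determined from the text relevance:

    class CoarseRankingLayer:
        def __init__(self, rule_submodel, semantic_matcher, min_relevance=0.5):
            self.rule_submodel = rule_submodel        # rule-based filtering (preset rules)
            self.semantic_matcher = semantic_matcher  # document-label text relevance in [0, 1]
            self.min_relevance = min_relevance        # assumed filtering threshold

        def __call__(self, document, candidates):
            to_filter = set(self.rule_submodel(document, candidates))
            for label in candidates:
                if self.semantic_matcher(document, label) < self.min_relevance:
                    to_filter.add(label)
            # Remove the candidates to be filtered to obtain the filtered result.
            return [label for label in candidates if label not in to_filter]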
The fine ranking layer ranks the candidate labels in the filtered candidate label result according to their text relevance, label popularity, and label granularity, and predicts the label information corresponding to the document from the ranking result. Label popularity refers to the degree of user interest in a candidate label, for example its search popularity. Label granularity is computed from the word types and length of the candidate label's components: the more specific the content of a candidate label, the smaller its granularity. For example, sorted from coarse to fine granularity: Baidu -> Baidu Union Summit; entertainment -> entertainment star.
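A hypothetical sketch of the fine ranking layer follows; the weighted-sum score and its weights are illustrative assumptions, as the patent states only that ranking uses text relevance, label popularity, and label granularity:

    def fine_rank(candidates, relevance, popularity, granularity,
                  top_k=5, w_rel=0.6, w_pop=0.3, w_gran=0.1):
        """Rank filtered candidates; relevance/popularity/granularity map label -> value."""
        def score(label):
            # Smaller granularity means more specific content, so specificity is
            # rewarded by subtracting the granularity term (an assumed design choice).
            return (w_rel * relevance[label]
                    + w_pop * popularity[label]
                    - w_gran * granularity[label])

        return sorted(candidates, key=score, reverse=True)[:top_k]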
In the present application, application scenarios include, for example: recall-oriented label prediction for long documents, precision-oriented label prediction for questions and answers, recall-oriented label prediction for users' original content, and so on. Prediction objects may include: long documents, questions and answers, users' original content, and the like. Prediction requirements include, for example: recall-oriented, precision-oriented, entity-oriented, classification-oriented, high commercial value, and so on.
In the present application, the general training data of all application scenarios may refer to, for example, training data obtained by combining the training data of the individual application scenarios. Before the target application scenario is determined, a large amount of general training data from all application scenarios can be used to pre-train the initial document label model, so that the amount of training data needed is reduced once the target application scenario is determined.
Step 102, obtain scene training data of a target application scenario, wherein the scene training data includes: a plurality of documents and corresponding label information under the target application scenario.
Step 103, obtain the submodels of the document label model related to the target application scenario.
In the present application, the submodels related to the target application scenario include: the semantic matching submodel, and any one or more of the following submodels: the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel. Submodels can be selected from these for retraining or fine-tuning according to the specific target application scenario, for example as sketched below.
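Purely as an illustration of how such a selection might be configured (the patent mandates no particular mapping), the scenario-to-submodel choice could be expressed as a lookup table; the scenario names below are hypothetical:

    # The semantic matching submodel is always selected; the recall submodels
    # are chosen per target application scenario (hypothetical configuration).
    SCENARIO_SUBMODELS = {
        "long_document_recall": ["semantic_matching", "multilabel_recall", "implicit_recall"],
        "qa_precision":         ["semantic_matching", "explicit_recall"],
        "ugc_recall":           ["semantic_matching", "multilabel_recall",
                                 "explicit_recall", "implicit_recall"],
    }

    def submodels_for_scenario(scenario: str) -> list:
        return SCENARIO_SUBMODELS.get(scenario, ["semantic_matching"])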
Step 104, train the submodels with the scene training data to obtain a trained document label model.
In the present application, when the submodels related to the target application scenario include the semantic matching submodel, the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel, the apparatus for training the document label model may perform step 104 as follows: for each document in the scene training data, input the document into the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel respectively, and merge the output results to obtain a candidate label result; input the document and the candidate label result into the semantic matching submodel to obtain the relevance between the document and each candidate label in the candidate label result; and adjust the coefficients of the semantic matching submodel, the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel according to that relevance and the label information corresponding to the document, to obtain a trained document label model.
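A minimal training sketch for step 104 follows, assuming PyTorch and a binary cross-entropy loss, neither of which is mandated by the patent. Gradients here reach the semantic matching submodel directly; a fuller implementation would also update each recall submodel with its own loss, since gradients do not flow through the discrete merged candidate set:

    import torch

    def dedup(labels):
        seen, merged = set(), []
        for label in labels:
            if label not in seen:
                seen.add(label)
                merged.append(label)
        return merged

    def train_scene_submodels(model, scene_data, epochs=3, lr=1e-4):
        optimizer = torch.optim.Adam(model.semantic_matching.parameters(), lr=lr)
        loss_fn = torch.nn.BCELoss()
        for _ in range(epochs):
            for document, true_labels in scene_data:
                # Merge the outputs of the three recall submodels into a candidate label result.
                candidates = dedup(model.multilabel(document)
                                   + model.explicit(document)
                                   + model.implicit(document))
                # Relevance of the document to each candidate label, assumed to be a
                # tensor of values in [0, 1] produced by the semantic matching submodel.
                relevance = model.semantic_matching(document, candidates)
                target = torch.tensor([1.0 if c in true_labels else 0.0
                                       for c in candidates])
                loss = loss_fn(relevance, target)  # compare against ground-truth label info
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return model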
In the present application, to improve the accuracy of the trained document label model, the scene training data may further include: a label set, the label set including the labels that the document label model can predict, so that the document label model performs label prediction on the documents in the scene training data in combination with the label set.
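One simple way such a label set might be applied (an illustrative assumption, not a requirement of the patent) is to restrict the candidate labels to the label set before ranking:

    def restrict_to_label_set(candidates, label_set):
        """Keep only candidates that the scenario's label set allows the model to predict."""
        allowed = set(label_set)
        return [label for label in candidates if label in allowed]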
In the present application, before step 104, the method may further include: initializing the coefficients of the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel in the document label model, so that when the submodels are trained for the target application scenario, the coefficients carried over from the pre-trained document label model do not interfere, further improving the accuracy of the document label model in the target application scenario.
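A hypothetical sketch of this initialization step follows, assuming the three recall submodels are PyTorch modules; Xavier initialization is an assumption, as the patent requires only that the coefficients be initialized before scene training:

    import torch.nn as nn

    def reinitialize_recall_submodels(model):
        # Reset the coefficients of the three recall submodels so that values
        # carried over from pre-training do not interfere with scene training.
        for submodel in (model.multilabel, model.explicit, model.implicit):
            for module in submodel.modules():
                if isinstance(module, nn.Linear):
                    nn.init.xavier_uniform_(module.weight)
                    if module.bias is not None:
                        nn.init.zeros_(module.bias)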
According to the method for training a document label model of the embodiments of the present application, a pre-trained document label model is obtained, the model having been pre-trained with general training data from all application scenarios; scene training data of a target application scenario is obtained, the scene training data including a plurality of documents and corresponding label information under that scenario; the submodels of the document label model related to the target application scenario are obtained; and the submodels are trained with the scene training data to obtain a trained document label model. In this way, the training data required to train the document label model for the target application scenario can be reduced, and training cost is reduced while the accuracy of the document label model is ensured.
To implement the above embodiments, an embodiment of the present application further provides an apparatus for training a document label model.
Fig. 3 is a schematic diagram according to a second embodiment of the present application. As shown in fig. 3, the apparatus 100 for training a document label model includes:
an obtaining module 110, configured to obtain a pre-trained document label model, wherein the document label model is pre-trained with general training data from all application scenarios;
the obtaining module 110 being further configured to obtain scene training data of a target application scenario, wherein the scene training data includes a plurality of documents and corresponding label information under the target application scenario;
the obtaining module 110 being further configured to obtain the submodels of the document label model related to the target application scenario; and
a training module 120, configured to train the submodels with the scene training data to obtain a trained document label model.
In one embodiment of the present application, the document label model includes a preprocessing layer, a candidate recall layer, a coarse ranking layer, and a fine ranking layer;
the candidate recall layer includes a keyword recall submodel, a multi-label classification recall submodel, an explicit recall submodel, and an implicit recall submodel connected in parallel;
the coarse ranking layer includes a rule submodel and a semantic matching submodel connected in parallel; and
the submodels related to the target application scenario include the semantic matching submodel and any one or more of the following submodels: the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel.
In one embodiment of the present application, when the submodels related to the target application scenario include the semantic matching submodel, the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel, the training module 120 is specifically configured to:
for each document in the scene training data, input the document into the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel respectively, and merge the output results to obtain a candidate label result;
input the document and the candidate label result into the semantic matching submodel to obtain the relevance between the document and each candidate label in the candidate label result; and
adjust the coefficients of the semantic matching submodel, the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel according to that relevance and the label information corresponding to the document, to obtain a trained document label model.
In an embodiment of the present application, the scene training data further includes a label set, the label set including the labels that the document label model can predict, so that the document label model performs label prediction on the documents in the scene training data in combination with the label set.
In an embodiment of the present application, as shown in fig. 4, the apparatus further includes: an initialization module 130, configured to initialize the coefficients of the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel in the document label model.
It should be noted that the foregoing explanation of the method for training a document label model also applies to the apparatus for training a document label model of this embodiment, and is not repeated here.
With the apparatus for training a document label model of the embodiments of the present application, a pre-trained document label model is obtained, the model having been pre-trained with general training data from all application scenarios; scene training data of a target application scenario is obtained, including a plurality of documents and corresponding label information under that scenario; the submodels of the document label model related to the target application scenario are obtained; and the submodels are trained with the scene training data to obtain a trained document label model. This reduces the training data required to train the document label model for the target application scenario and reduces training cost while ensuring the accuracy of the document label model.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 5 is a block diagram of an electronic device for the method for training a document label model according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant as examples only and are not meant to limit the implementations of the present application described and/or claimed herein.
As shown in fig. 5, the electronic device includes: one or more processors 301, a memory 302, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The components are interconnected by different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to an interface). In other implementations, multiple processors and/or multiple buses may be used together with multiple memories, if desired. Likewise, multiple electronic devices may be connected, each providing part of the necessary operations (e.g., as a server array, a group of blade servers, or a multiprocessor system). In fig. 5, one processor 301 is taken as an example.
The memory 302 is a non-transitory computer-readable storage medium provided herein, storing instructions executable by at least one processor to cause the at least one processor to perform the method for training a document label model provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the method for training a document label model provided herein.
The memory 302, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the method for training a document label model in the embodiments of the present application (e.g., the obtaining module 110 and the training module 120 shown in fig. 3, and the initialization module 130 shown in fig. 4). By running the non-transitory software programs, instructions, and modules stored in the memory 302, the processor 301 executes the various functional applications and data processing of the server, that is, implements the method for training a document label model in the above method embodiments.
The memory 302 may include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required by at least one function, and the data storage area may store data created from use of the electronic device for training the document label model, and the like. Further, the memory 302 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 302 optionally includes memories located remotely from the processor 301, and these remote memories may be connected over a network to the electronic device for training the document label model. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device for the method for training a document label model may further include: an input device 303 and an output device 304. The processor 301, the memory 302, the input device 303, and the output device 304 may be connected by a bus or in other manners; connection by a bus is taken as an example in fig. 5.
The input device 303 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device for training the document label model, and may be, for example, a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or another input device. The output device 304 may include a display device, auxiliary lighting devices (e.g., LEDs), haptic feedback devices (e.g., vibration motors), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in a different order, and no limitation is imposed herein as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (12)

1. A method for training a document label model, comprising:
obtaining a pre-trained document label model, wherein the document label model is pre-trained with general training data from all application scenarios;
obtaining scene training data of a target application scenario, wherein the scene training data comprises a plurality of documents and corresponding label information under the target application scenario;
obtaining submodels of the document label model related to the target application scenario; and
training the submodels with the scene training data to obtain a trained document label model.
2. The method of claim 1, wherein the document label model comprises a preprocessing layer, a candidate recall layer, a coarse ranking layer, and a fine ranking layer;
the candidate recall layer comprises a keyword recall submodel, a multi-label classification recall submodel, an explicit recall submodel, and an implicit recall submodel connected in parallel;
the coarse ranking layer comprises a rule submodel and a semantic matching submodel connected in parallel; and
the submodels related to the target application scenario comprise the semantic matching submodel and any one or more of the following submodels: the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel.
3. The method of claim 2, wherein, when the submodels related to the target application scenario comprise the semantic matching submodel, the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel, training the submodels with the scene training data to obtain a trained document label model comprises:
for each document in the scene training data, inputting the document into the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel respectively, and merging the output results to obtain a candidate label result;
inputting the document and the candidate label result into the semantic matching submodel to obtain the relevance between the document and each candidate label in the candidate label result; and
adjusting coefficients of the semantic matching submodel, the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel according to the relevance between the document and each candidate label in the candidate label result and the label information corresponding to the document, to obtain a trained document label model.
4. The method of claim 1, wherein the scene training data further comprises a label set, the label set comprising labels that the document label model can predict, so that the document label model performs label prediction on the documents in the scene training data in combination with the label set.
5. The method of claim 3, wherein, before training the submodels with the scene training data to obtain the trained document label model, the method further comprises:
initializing the coefficients of the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel in the document label model.
6. An apparatus for training a document label model, comprising:
an obtaining module, configured to obtain a pre-trained document label model, wherein the document label model is pre-trained with general training data from all application scenarios;
the obtaining module being further configured to obtain scene training data of a target application scenario, wherein the scene training data comprises a plurality of documents and corresponding label information under the target application scenario;
the obtaining module being further configured to obtain submodels of the document label model related to the target application scenario; and
a training module, configured to train the submodels with the scene training data to obtain a trained document label model.
7. The apparatus of claim 6, wherein the document label model comprises a preprocessing layer, a candidate recall layer, a coarse ranking layer, and a fine ranking layer;
the candidate recall layer comprises a keyword recall submodel, a multi-label classification recall submodel, an explicit recall submodel, and an implicit recall submodel connected in parallel;
the coarse ranking layer comprises a rule submodel and a semantic matching submodel connected in parallel; and
the submodels related to the target application scenario comprise the semantic matching submodel and any one or more of the following submodels: the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel.
8. The apparatus of claim 7, wherein the submodels related to the target application scenario comprise the semantic matching submodel, the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel, and the training module is specifically configured to:
for each document in the scene training data, input the document into the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel respectively, and merge the output results to obtain a candidate label result;
input the document and the candidate label result into the semantic matching submodel to obtain the relevance between the document and each candidate label in the candidate label result; and
adjust coefficients of the semantic matching submodel, the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel according to the relevance between the document and each candidate label in the candidate label result and the label information corresponding to the document, to obtain a trained document label model.
9. The apparatus of claim 6, wherein the scene training data further comprises a label set, the label set comprising labels that the document label model can predict, so that the document label model performs label prediction on the documents in the scene training data in combination with the label set.
10. The apparatus of claim 8, further comprising an initialization module, configured to initialize the coefficients of the multi-label classification recall submodel, the explicit recall submodel, and the implicit recall submodel in the document label model.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.
12. A non-transitory computer-readable storage medium storing computer instructions, the computer instructions being configured to cause a computer to perform the method of any one of claims 1-5.
CN201911338269.XA 2019-12-23 2019-12-23 Training method and device for document tag model Active CN111104514B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201911338269.XA (granted as CN111104514B) | 2019-12-23 | 2019-12-23 | Training method and device for document tag model


Publications (2)

Publication Number | Publication Date
CN111104514A | 2020-05-05
CN111104514B | 2023-04-25

Family

ID=70423892

Family Applications (1)

Application Number | Priority Date | Filing Date | Title | Status
CN201911338269.XA (granted as CN111104514B) | 2019-12-23 | 2019-12-23 | Training method and device for document tag model | Active

Country Status (1)

Country | Document
CN | CN111104514B



Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015187155A1 (en) * 2014-06-04 2015-12-10 Waterline Data Science, Inc. Systems and methods for management of data platforms
US20160162468A1 (en) * 2014-12-09 2016-06-09 Idibon, Inc. Methods and systems for providing universal portability in machine learning
US20160162456A1 (en) * 2014-12-09 2016-06-09 Idibon, Inc. Methods for generating natural language processing systems
CN108304439A (en) * 2017-10-30 2018-07-20 腾讯科技(深圳)有限公司 A kind of semantic model optimization method, device and smart machine, storage medium
CN108153856A (en) * 2017-12-22 2018-06-12 北京百度网讯科技有限公司 For the method and apparatus of output information
CN108733779A (en) * 2018-05-04 2018-11-02 百度在线网络技术(北京)有限公司 The method and apparatus of text figure
CN109376222A (en) * 2018-09-27 2019-02-22 国信优易数据有限公司 Question and answer matching degree calculation method, question and answer automatic matching method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
谢晨阳 (Xie Chenyang): "Research on multi-label document classification based on hierarchical supervision" *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111581545B (en) * 2020-05-12 2023-09-19 腾讯科技(深圳)有限公司 Method for sorting recall documents and related equipment
CN111581545A (en) * 2020-05-12 2020-08-25 腾讯科技(深圳)有限公司 Method for sorting recalled documents and related equipment
CN111783448A (en) * 2020-06-23 2020-10-16 北京百度网讯科技有限公司 Document dynamic adjustment method, device, equipment and readable storage medium
CN111783448B (en) * 2020-06-23 2024-03-15 北京百度网讯科技有限公司 Document dynamic adjustment method, device, equipment and readable storage medium
CN111782949A (en) * 2020-06-30 2020-10-16 北京百度网讯科技有限公司 Method and apparatus for generating information
CN111858895A (en) * 2020-07-30 2020-10-30 阳光保险集团股份有限公司 Sequencing model determining method, sequencing device and electronic equipment
CN111858895B (en) * 2020-07-30 2024-04-05 阳光保险集团股份有限公司 Sequencing model determining method, sequencing device and electronic equipment
CN112149733A (en) * 2020-09-23 2020-12-29 北京金山云网络技术有限公司 Model training method, model training device, quality determining method, quality determining device, electronic equipment and storage medium
CN112149733B (en) * 2020-09-23 2024-04-05 北京金山云网络技术有限公司 Model training method, model quality determining method, model training device, model quality determining device, electronic equipment and storage medium
CN112580706A (en) * 2020-12-11 2021-03-30 北京地平线机器人技术研发有限公司 Training data processing method and device applied to data management platform and electronic equipment
CN112580706B (en) * 2020-12-11 2024-05-17 北京地平线机器人技术研发有限公司 Training data processing method and device applied to data management platform and electronic equipment
CN112560402A (en) * 2020-12-28 2021-03-26 北京百度网讯科技有限公司 Model training method and device and electronic equipment
CN112784033B (en) * 2021-01-29 2023-11-03 北京百度网讯科技有限公司 Aging grade identification model training and application method and electronic equipment
CN113011490A (en) * 2021-03-16 2021-06-22 北京百度网讯科技有限公司 Model training method and device and electronic equipment
CN113011490B (en) * 2021-03-16 2024-03-08 北京百度网讯科技有限公司 Model training method and device and electronic equipment
CN113239128A (en) * 2021-06-01 2021-08-10 平安科技(深圳)有限公司 Data pair classification method, device, equipment and storage medium based on implicit characteristics
CN113239128B (en) * 2021-06-01 2022-03-18 平安科技(深圳)有限公司 Data pair classification method, device, equipment and storage medium based on implicit characteristics
CN117456416A (en) * 2023-11-03 2024-01-26 北京饼干科技有限公司 Method and system for intelligently generating material labels
CN117456416B (en) * 2023-11-03 2024-06-07 北京饼干科技有限公司 Method and system for intelligently generating material labels

Also Published As

Publication Number | Publication Date
CN111104514B | 2023-04-25

Similar Documents

Publication Publication Date Title
CN111104514B (en) Training method and device for document tag model
CN111428008B (en) Method, apparatus, device and storage medium for training a model
CN111507104B (en) Method and device for establishing label labeling model, electronic equipment and readable storage medium
CN111625635A (en) Question-answer processing method, language model training method, device, equipment and storage medium
CN110674314B (en) Sentence recognition method and device
CN111125435B (en) Video tag determination method and device and computer equipment
CN112487814B (en) Entity classification model training method, entity classification device and electronic equipment
CN111709247A (en) Data set processing method and device, electronic equipment and storage medium
CN111859982B (en) Language model training method and device, electronic equipment and readable storage medium
CN111737994A (en) Method, device and equipment for obtaining word vector based on language model and storage medium
CN112036509A (en) Method and apparatus for training image recognition models
CN110674260B (en) Training method and device of semantic similarity model, electronic equipment and storage medium
CN110705460A (en) Image category identification method and device
CN111078878B (en) Text processing method, device, equipment and computer readable storage medium
CN111259671A (en) Semantic description processing method, device and equipment for text entity
CN111680517A (en) Method, apparatus, device and storage medium for training a model
CN111522944A (en) Method, apparatus, device and storage medium for outputting information
CN111539209A (en) Method and apparatus for entity classification
CN111582477A (en) Training method and device of neural network model
CN111310058B (en) Information theme recommendation method, device, terminal and storage medium
CN111090991A (en) Scene error correction method and device, electronic equipment and storage medium
CN111127191A (en) Risk assessment method and device
CN112541362A (en) Generalization processing method, device, equipment and computer storage medium
CN111984775A (en) Question and answer quality determination method, device, equipment and storage medium
CN111666771A (en) Semantic label extraction device, electronic equipment and readable storage medium of document

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant