CN117574878A - Component syntactic analysis method, device and medium for mixed field - Google Patents

Component syntactic analysis method, device and medium for mixed field

Info

Publication number
CN117574878A
CN117574878A
Authority
CN
China
Prior art keywords
component
training
task
field
text sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410049989.9A
Other languages
Chinese (zh)
Other versions
CN117574878B (en)
Inventor
白雪峰
张岳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Westlake University
Original Assignee
Westlake University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Westlake University filed Critical Westlake University
Priority to CN202410049989.9A priority Critical patent/CN117574878B/en
Publication of CN117574878A publication Critical patent/CN117574878A/en
Application granted granted Critical
Publication of CN117574878B publication Critical patent/CN117574878B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a component syntactic analysis method, device and medium for mixed domains. The component syntactic analysis method comprises the following steps: training a pre-trained language model, based on at least one text sequence processing task associated with a component syntactic analysis task, using a first training dataset comprising training data of at least a first domain and a second domain, to obtain a first language model, wherein the training data in the first training dataset carry truth labels for each of the text sequence processing tasks; performing supplemental training on the trained first language model based on the component syntactic analysis task, using at least the component syntax annotation data of the first domain, to obtain a component syntax analyzer; and performing component syntactic analysis on the text sequences of the first domain and the second domain. Even when annotation data for some domains are scarce or absent, the component syntax analyzer retains good domain generalization capability and high component syntactic analysis accuracy.

Description

Component syntactic analysis method, device and medium for mixed field
Technical Field
The application belongs to the field of natural language processing, and particularly relates to a component syntactic analysis method, device and medium for mixed domains.
Background
Component syntactic analysis is an important task in natural language processing. Its goal is to decompose a sentence into constituents (e.g., subject, predicate, object) and describe the syntactic relationships between them. Component syntactic analysis helps computers better understand human language input and plays an important role in many natural language processing applications, such as machine translation, text summarization, and question-answering systems.
An existing component syntax analyzer is typically built by training a pre-trained language model (Pre-trained Language Model) on text data of a specific domain with the component syntactic analysis task as the objective, and then fine-tuning (Fine-tune) the model with manually annotated data of that domain to obtain the final parsing model. However, when little annotation data is available in other domains for fine-tuning, the component syntax analyzer obtained from such fine-tuning usually generalizes poorly across domains.
Therefore, the prior art has not solved the problem of obtaining a trained component syntax analyzer that retains good domain generalization capability and high component syntactic analysis accuracy when manually annotated data for a specific domain are scarce or even absent.
Disclosure of Invention
The present application has been made to solve the above-mentioned problems in the prior art.
The aim of the invention is to provide a component syntactic analysis method, device and medium for mixed domains, so that the trained component syntax analyzer retains good domain generalization capability and high component syntactic analysis accuracy even when manually annotated data for a specific domain are scarce or even absent.
According to a first aspect of the present disclosure, there is provided a component syntactic analysis method for mixed domains, comprising: training a pre-trained language model, based on at least one text sequence processing task associated with a component parsing task, using a first training dataset comprising at least training data of a first domain and training data of a second domain, to obtain a trained first language model, wherein the training data in the first training dataset carry truth labels for each of the at least one text sequence processing task; performing supplemental training on the trained first language model based on the component parsing task, using at least the component syntax annotation data of the first domain, to obtain a component syntax analyzer; and performing component syntactic analysis on the text sequences of the first domain and the text sequences of the second domain using the component syntax analyzer.
According to a second aspect of the present application, there is provided an apparatus for component syntactic analysis in mixed domains, the apparatus comprising a processor configured to perform the steps of the component syntactic analysis method for mixed domains according to various embodiments of the present application.
According to a third aspect of the present application, there is provided a non-transitory computer-readable medium having instructions stored thereon, wherein the instructions, when executed by a processor, perform the steps of the component syntactic analysis method for mixed domains according to various embodiments of the present application.
According to the present application, the training data of the first domain and the training data of the second domain are used to jointly train the pre-trained language model across multiple domains and fuse the knowledge of those domains, so that a domain-generalized textual context representation is obtained through joint learning; consequently, even when manually annotated data for a specific domain are scarce or even absent, the trained component syntax analyzer retains good domain generalization capability and high component syntactic analysis accuracy.
Drawings
In the drawings, which are not necessarily drawn to scale, like numerals may describe similar components in different views. The same reference numerals with letter suffixes, or with different letter suffixes, may represent different instances of similar components. The accompanying drawings illustrate various embodiments by way of example and not by way of limitation, and together with the description and claims serve to explain the disclosed embodiments. Wherever possible, the same reference numbers are used throughout the drawings to refer to the same or like parts. Such embodiments are illustrative and are not intended to be an exhaustive or exclusive description of the present apparatus or method.
FIG. 1 illustrates a flow chart of a component syntax analysis method for a hybrid domain according to an embodiment of the present application;
FIG. 2 illustrates a training process schematic of a constituent syntax analyzer according to an embodiment of the present application;
FIG. 3 illustrates a schematic diagram of knowledge correlation between different domains in accordance with an embodiment of the present application;
FIG. 4 illustrates a schematic diagram of the relationship between different text sequence processing tasks and component syntactic analysis according to an embodiment of the present application;
FIG. 5 shows a unified training data diagram based on serialized truth labels in accordance with an embodiment of the present application;
FIG. 6 illustrates a text encoding process schematic diagram according to an embodiment of the present application; and
FIG. 7 shows a schematic diagram of component syntactic analysis principles according to an embodiment of the present application.
Detailed Description
To better understand the technical solutions of the present disclosure, a detailed description is provided below with reference to the accompanying drawings and specific embodiments. Embodiments of the present disclosure are described in further detail with reference to the drawings and specific embodiments, but they do not limit the present disclosure. Where no necessary dependency exists between steps, the order in which the steps are described by way of example should not be construed as limiting; those skilled in the art will understand that such steps may be reordered without breaking the logic between them, so that the overall process remains realizable.
In some embodiments of the present application, a component syntactic analysis method for mixed domains is provided. FIG. 1 shows a flowchart of a component syntactic analysis method for mixed domains according to an embodiment of the present application. As shown in FIG. 1, the component syntactic analysis method for mixed domains includes steps S101 to S103.
In step S101, the pre-trained language model is trained, based on at least one text sequence processing task associated with a component parsing task, using a first training dataset comprising at least training data of a first domain and training data of a second domain, to obtain a trained first language model. The training data in the first training dataset carry truth labels for each of the at least one text sequence processing task; that is, each text sequence processing task has its own truth labels.
For example, truth annotation allows the task outputs of a plurality of different text sequence processing tasks to be unified into a sequence-labeling form. Task outputs in this sequence-labeling form make joint training across the different tasks easy and also facilitate information interaction among the tasks.
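For illustration, a single training sample in this unified form might look as follows; the tag names are assumptions, and the tokens mirror the example sentence of FIG. 5:

```python
# One training sample in the unified sequence-labeling form: every
# task's truth labels are a tag sequence aligned one-to-one with the
# tokens, so all tasks share a single data format.
sample = {
    "tokens": ["Wang Xiaoming", "in", "A place", "acquired", "three items"],
    "masked": ["Wang Xiaoming", "in", "[MASK]", "acquired", "three items"],
    "pos":    ["NOUN", "ADP", "NOUN", "VERB", "NOUN"],   # part-of-speech tags
    "ner":    ["PER", "O", "LOC", "O", "O"],             # entity tags, "O" = blank
    "srl":    ["AGENT", "O", "O", "ACTION", "PATIENT"],  # semantic role labels
}

# Every label sequence is aligned with the token sequence, which is
# what makes joint training over the tasks straightforward.
assert all(len(seq) == len(sample["tokens"]) for seq in sample.values())
```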
In step S102, the trained first language model undergoes supplemental training based on the component syntactic analysis task, using at least the component syntax annotation data of the first domain, to obtain a component syntax analyzer.
Supplemental training is targeted training of the first language model for a specific analysis task, such as the component syntactic analysis task, so as to obtain an analysis model better adapted to that task.
For example, the supplemental training may be fine-tuning. To use multi-domain, multi-task knowledge efficiently to enhance component syntactic analysis, a two-stage "pre-training followed by fine-tuning" framework is used to train the component syntax analyzer, as shown in FIG. 2. The component syntax analyzer is trained by fine-tuning the model, taking the first language model obtained in step S101 as the initialization model and using component syntax annotation data, that is, truth annotations for the component syntax task.
In step S103, the component syntax analyzer performs component syntactic analysis on the text sequences of the first domain and the text sequences of the second domain.
According to the present application, the training data of the first domain and the training data of the second domain are used to jointly train the pre-trained language model across multiple domains and fuse the knowledge of those domains, so that a domain-generalized textual context representation is obtained through joint learning; consequently, even when manually annotated data for a specific domain are scarce or even absent, the trained component syntax analyzer retains good domain generalization capability and high component syntactic analysis accuracy. The specific domain here may be the second domain, but is not limited thereto; it may be any domain that shares common knowledge with the training data in the training dataset.
In some embodiments, the second domain has more training data than the first domain.
Specifically, when the pre-trained language model is trained, there is more training text of the second domain than training data of the first domain, so the first language model can learn second-domain knowledge better; in the subsequent component syntactic analysis task, higher parsing accuracy can then be obtained with less second-domain annotation data.
In some embodiments, step S102 of the component syntactic analysis method, performing supplemental training on the trained first language model based on the component parsing task using at least the component syntax annotation data of the first domain to obtain the component syntax analyzer, further comprises performing the supplemental training using the component syntax annotation data of the first domain together with the component syntax annotation data of the second domain to obtain the component syntax analyzer, wherein there is less component syntax annotation data of the second domain than of the first domain.
Specifically, the first language model and the component syntax analyzer may also be trained with multi-domain training data. FIG. 3 shows a schematic diagram of knowledge correlation between different domains according to an embodiment of the present application. As shown in FIG. 3, text data from different domains share common knowledge, and training with text data from multiple domains helps the neural network of the pre-trained language model learn general semantic representations, reducing domain bias in those representations and thereby improving the accuracy of the outputs of the first language model and the component syntax analyzer. Moreover, using less second-domain data than first-domain data still realizes multi-domain training that improves component syntactic analysis accuracy, while easing the training burden of the first language model and the component syntax analyzer and improving training efficiency.
In some embodiments, the at least one text sequence processing task associated with the component parsing task in the component parsing method includes one or a combination of a target text sequence prediction task, a part-of-speech tag prediction task, a named entity tag prediction task, and a semantic role tag prediction task.
These associated tasks aim to enhance the semantic modeling capability of the language model by incorporating, during the pre-training stage, the different kinds of semantic knowledge involved in the various tasks. Specifically, from among the many text sequence processing tasks, such as sentiment classification prediction, semantic prediction and discourse structure prediction, four tasks that favor the learning of syntactic analysis are selected for pre-training: target text sequence prediction, part-of-speech tag prediction, named entity tag prediction, and semantic role label prediction.
As shown in FIGS. 4 and 5, for the part-of-speech tag prediction task, each word is assigned a part-of-speech tag, e.g., "Wang Xiaoming" is a noun. For the named entity tag prediction task, entity words in the text (such as "A place") are assigned corresponding entity tags (such as "place name"), and non-entity words are assigned blank tags. For the semantic role label prediction task, the core verb (such as "acquisition") is given an "action" label, the agent of the event (such as "Wang Xiaoming") is given an "agent" label, and the patient of the event (such as "three items") is given a "patient" label.
Also as shown in FIGS. 4 and 5, for the target text sequence prediction task, the input text is randomly masked (i.e., words are randomly replaced with a special character), and the neural network is trained to predict the words at the corresponding positions. The other three natural language processing tasks provide information beneficial to component syntactic analysis: part-of-speech tags help determine phrase types (e.g., "acquisition", being a verb, can form a basic constituent unit); named entity tags help handle special phrase constituents (e.g., recognizing the place name "A place" helps process text containing proper nouns); and semantic role labels help predict the structure between constituents (e.g., identifying the semantic relation between "acquisition" and "A place" helps predict the verb-phrase relation between "acquisition" and "three items").
When there are multiple text sequence processing tasks associated with the component parsing task, the pre-trained language model is jointly trained using a weighted sum of the optimization objectives of the individual tasks as the overall optimization objective.
By fusing multi-task and multi-domain knowledge into the neural network during the pre-training stage, the network acquires a stronger capability to model natural language syntax, which improves the model's accuracy and generalization on the component syntactic analysis task. To fuse knowledge from multiple domains, the invention performs joint pre-training over multiple tasks and obtains a domain-generalized textual context representation through joint training.
Illustratively, the above four tasks are optimized simultaneously using a joint training approach. The overall optimization objective $\mathcal{L}$ is:

$$\mathcal{L} = \mathcal{L}_{\mathrm{text}} + \lambda_1 \mathcal{L}_{\mathrm{pos}} + \lambda_2 \mathcal{L}_{\mathrm{ner}} + \lambda_3 \mathcal{L}_{\mathrm{srl}} \qquad (1)$$

where $\mathcal{L}_{\mathrm{text}}$ is the optimization objective of the target text sequence prediction task, $\mathcal{L}_{\mathrm{pos}}$ is that of the part-of-speech tag prediction task, $\mathcal{L}_{\mathrm{ner}}$ is that of the named entity tag prediction task, $\mathcal{L}_{\mathrm{srl}}$ is that of the semantic role label prediction task, and $\lambda_1$, $\lambda_2$ and $\lambda_3$ are hyperparameters. The hyperparameters control the relative importance of the different tasks. During training, a gradient descent method may be used to optimize the neural network.
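As an illustration only, the weighted sum of Eq. (1) could be computed as follows; PyTorch and the specific weight values are assumptions, since the application does not prescribe a framework or weights:

```python
import torch

# Hypothetical task weights (the lambda hyperparameters of Eq. (1));
# the values are illustrative assumptions.
lambda_pos, lambda_ner, lambda_srl = 0.5, 0.5, 0.5

def overall_loss(loss_text: torch.Tensor, loss_pos: torch.Tensor,
                 loss_ner: torch.Tensor, loss_srl: torch.Tensor) -> torch.Tensor:
    # Each argument is one task's loss (a negative log-likelihood);
    # minimizing this weighted sum jointly optimizes all four tasks.
    return (loss_text + lambda_pos * loss_pos
            + lambda_ner * loss_ner + lambda_srl * loss_srl)
```

A single backward pass over this combined loss updates the shared encoder with gradients from all four tasks at once.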
In some embodiments, the target text sequence prediction task generates a target text sequence that meets the requirements based on the training text sequence. Specifically, in the target text sequence prediction task, given the input text $X = \{x_1, x_2, \ldots, x_n\}$, where $x_i$ denotes the $i$-th word and $n$ denotes the total length of the text, the $i$-th word $x_i$ is first randomly replaced with the special character "[MASK]" (the masking process), and the target text sequence prediction task trains the neural network to recover the original text from the masked text.

Suppose the masked text is denoted $\hat{X}$. The optimization objective $\mathcal{L}_{\mathrm{text}}$ of the target text sequence prediction task is to maximize the following likelihood:

$$\mathcal{L}_{\mathrm{text}} = \sum_{(X, \hat{X}) \in D_{\mathrm{text}}} \log P(X \mid \hat{X}) \qquad (2)$$

where $D_{\mathrm{text}}$ denotes all target text sequence prediction training data, and $P(X \mid \hat{X})$ denotes the conditional probability of predicting the original text $X$ from the masked text $\hat{X}$.
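A minimal sketch of the masking step is given below; the 15% masking rate is an assumption borrowed from common masked-language-model practice, not a value specified by the present application, and the example tokens mirror the sentence of FIG. 5:

```python
import random

def mask_tokens(tokens: list[str], mask_rate: float = 0.15) -> list[str]:
    # Randomly replace words with the special "[MASK]" character; the
    # network is then trained to recover the original text X from the
    # masked text, maximizing log P(X | masked X) as in Eq. (2).
    return [t if random.random() > mask_rate else "[MASK]" for t in tokens]

masked = mask_tokens(["Wang Xiaoming", "in", "A place", "acquired", "three items"])
```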
As shown in FIG. 5, the part-of-speech tag prediction task generates a corresponding part-of-speech tag sequence based on the training text sequence. Specifically, in the part-of-speech tag prediction task, given the input text $X$, the task aims to predict the part-of-speech tag sequence $Y^{\mathrm{pos}} = \{y^{\mathrm{pos}}_1, y^{\mathrm{pos}}_2, \ldots, y^{\mathrm{pos}}_n\}$ corresponding to the input text sequence, where $y^{\mathrm{pos}}_i$ denotes the part-of-speech tag corresponding to the $i$-th word in the text. The optimization objective $\mathcal{L}_{\mathrm{pos}}$ of the part-of-speech tag prediction task is to maximize the following likelihood:

$$\mathcal{L}_{\mathrm{pos}} = \sum_{(X, Y^{\mathrm{pos}}) \in D_{\mathrm{pos}}} \log P(Y^{\mathrm{pos}} \mid X) \qquad (3)$$

where $D_{\mathrm{pos}}$ denotes all part-of-speech tag prediction data, and $P(Y^{\mathrm{pos}} \mid X)$ denotes the conditional probability of predicting the part-of-speech tag sequence from the input text $X$.
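Concretely, the sequence-labeling objective of Eq. (3), and likewise Eqs. (4) and (5) below, reduces to per-token classification: maximizing the log-likelihood of the truth tag sequence is equivalent to minimizing a per-token cross-entropy loss. A minimal sketch under assumed tensor shapes (the batch size, sentence length and tag-set size are illustrative assumptions):

```python
import torch
import torch.nn as nn

batch_size, seq_len, num_tags = 2, 8, 32
logits = torch.randn(batch_size, seq_len, num_tags)            # encoder tag scores
gold_tags = torch.randint(0, num_tags, (batch_size, seq_len))  # truth tag ids

# Per-token cross-entropy = negative log-likelihood of the tag sequence
# under a token-wise factorization of P(Y | X).
loss_pos = nn.CrossEntropyLoss()(logits.reshape(-1, num_tags),
                                 gold_tags.reshape(-1))
```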
As shown in FIG. 5, the named entity tag prediction task generates a corresponding named entity tag sequence based on the training text sequence. Specifically, in the named entity tag prediction task, given the input text $X$, the task aims to predict the named entity tag sequence $Y^{\mathrm{ner}} = \{y^{\mathrm{ner}}_1, y^{\mathrm{ner}}_2, \ldots, y^{\mathrm{ner}}_n\}$ corresponding to the input text sequence, where $y^{\mathrm{ner}}_i$ denotes the named entity tag corresponding to the $i$-th word in the text. The optimization objective $\mathcal{L}_{\mathrm{ner}}$ of the named entity tag prediction task is to maximize the following likelihood:

$$\mathcal{L}_{\mathrm{ner}} = \sum_{(X, Y^{\mathrm{ner}}) \in D_{\mathrm{ner}}} \log P(Y^{\mathrm{ner}} \mid X) \qquad (4)$$

where $D_{\mathrm{ner}}$ denotes all named entity tag prediction data, and $P(Y^{\mathrm{ner}} \mid X)$ denotes the conditional probability of predicting the named entity tag sequence $Y^{\mathrm{ner}}$ from the input text $X$.
As shown in FIG. 5, the semantic role label prediction task generates a corresponding semantic role label sequence based on the training text sequence. Specifically, in the semantic role label prediction task, given the input text $X$, the task aims to predict the semantic role label sequence $Y^{\mathrm{srl}} = \{y^{\mathrm{srl}}_1, y^{\mathrm{srl}}_2, \ldots, y^{\mathrm{srl}}_n\}$ corresponding to the input text sequence, where $y^{\mathrm{srl}}_i$ denotes the semantic role label corresponding to the $i$-th word in the text. The optimization objective $\mathcal{L}_{\mathrm{srl}}$ of the semantic role label prediction task is to maximize the following likelihood:

$$\mathcal{L}_{\mathrm{srl}} = \sum_{(X, Y^{\mathrm{srl}}) \in D_{\mathrm{srl}}} \log P(Y^{\mathrm{srl}} \mid X) \qquad (5)$$

where $D_{\mathrm{srl}}$ denotes all semantic role label prediction data, and $P(Y^{\mathrm{srl}} \mid X)$ denotes the conditional probability of predicting the semantic role label sequence $Y^{\mathrm{srl}}$ from the input text $X$.
This unified data processing unifies the outputs of the four different tasks, namely target text sequence prediction, part-of-speech tag prediction, named entity tag prediction and semantic role label prediction, into one sequence-labeling form, so that joint training across the different text sequence processing tasks is easy and information interaction among the tasks is facilitated. Joint training can then effectively use the knowledge contained in the related tasks to strengthen the syntactic analysis task, and texts from multiple domains can participate in the syntax-related tasks, so that general cross-domain knowledge is better transferred to the component syntax analyzer and the model's cross-domain generalization capability improves.
In some embodiments, the component syntactic analysis method further comprises preprocessing the training data in the first training dataset so that, for each training sample, the truth labels of the various text sequence processing tasks have the same sequence length. Unifying the task outputs into a sequence-labeling form with a common sequence length makes joint training across the different tasks easier still and further facilitates information interaction among the tasks.
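A minimal padding sketch is given below; the choice of a blank tag "O" as the padding symbol and the fixed length are illustrative assumptions:

```python
def pad_labels(tags: list[str], length: int, pad: str = "O") -> list[str]:
    # Pad (or truncate) a truth-label sequence to a fixed length so that
    # the labels of every task share one common sequence length.
    return (tags + [pad] * (length - len(tags)))[:length]

pos_tags = pad_labels(["NOUN", "VERB", "NOUN"], length=8)
ner_tags = pad_labels(["PER", "O", "LOC"], length=8)
assert len(pos_tags) == len(ner_tags) == 8
```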
In some embodiments, as shown in FIG. 6, in the component syntactic analysis method, the training data in the first training dataset are obtained by text-encoding the pre-training data with a Transformer encoder model. The pre-training data comprise at least pre-training data of the first domain. The Transformer-based encoder model serves as the text encoder: it takes multi-domain text as input and outputs the encoded text representation for the subsequent training of the component parsing task.
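For illustration, such an encoding step could look as follows with the Hugging Face transformers library; the library, the checkpoint name and the example sentence are assumptions, as the present application only requires a Transformer encoder model:

```python
from transformers import AutoTokenizer, AutoModel

# Load an assumed Transformer encoder checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
encoder = AutoModel.from_pretrained("bert-base-chinese")

# Encode one sentence of multi-domain input text into contextual
# representations consumed by the downstream task heads.
inputs = tokenizer("Wang Xiaoming acquired three items in A place",
                   return_tensors="pt")
hidden_states = encoder(**inputs).last_hidden_state  # (1, seq_len, hidden_size)
```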
In some embodiments, the component syntactic analysis task generates a corresponding component syntax tag sequence based on the training text sequence.
Specifically, in the pre-training stage of training the pre-trained language model, an initialized neural network model is given; training data whose output truth labels are unified into the sequence-labeling form are then constructed, and on this basis a pre-training framework that fuses multi-domain, multi-task knowledge guides the training of the neural network. Finally, the model obtained in the pre-training stage is used for the supplemental training on the component parsing task.
In the supplemental training stage, e.g., the fine-tuning stage, suppose the input text sequence is $X$. The corresponding constituency tree can be represented as a set of triples $T = \{(l_k, r_k, c_k)\}$, where $l_k$ denotes the left boundary of the $k$-th span, $r_k$ denotes its right boundary, and $c_k$ denotes its label. The component syntactic analysis task aims to learn a mapping $f: X \rightarrow T$, that is, to predict the underlying component-syntactic knowledge from the input text.
To stay consistent with the input and output forms of the first language model, the input and output of the component parsing task are converted into sequence-labeling form. Specifically, as shown in FIG. 7, each word in the input text sequence $X$ is assigned a 2-tuple tag, yielding the component syntax tag sequence $Y^{\mathrm{con}}$ of the input text sequence. The first element of the tuple is the number of common ancestors of the current word $x_i$ and the next word $x_{i+1}$ in the syntax tree; the second element is the nearest (lowest) common ancestor of $x_i$ and $x_{i+1}$ in the syntax tree. For example, when the current word $x_i$ is "A place" and the next word $x_{i+1}$ is "acquisition", the number of common ancestors of "A place" and "acquisition" in the constituency tree is 2 and their nearest common ancestor is VP, so the tuple tag of "A place" is (2, VP).
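The tree-to-tags conversion can be sketched as follows; the nested-tuple tree encoding and the toy labels are illustrative assumptions, and the final word, which has no next word, is simply given no pair tag here (in practice it may receive a special end tag):

```python
# A node is (label, children); a leaf is a plain string.
def ancestor_paths(tree, path=()):
    # Yield (word, path of ancestor labels) for every leaf, left to right.
    label, children = tree
    for child in children:
        if isinstance(child, str):
            yield child, path + (label,)
        else:
            yield from ancestor_paths(child, path + (label,))

def tuple_tags(tree):
    leaves = list(ancestor_paths(tree))
    tags = []
    for (_, p1), (_, p2) in zip(leaves, leaves[1:]):
        # Count the ancestors shared by the current word and the next
        # word, and record the lowest (most recent) shared one.
        n = 0
        while n < min(len(p1), len(p2)) and p1[n] == p2[n]:
            n += 1
        tags.append((n, p1[n - 1]))
    return tags

# Toy tree: (S w1 (VP w2 (VP w3 w4)))
toy = ("S", ["w1", ("VP", ["w2", ("VP", ["w3", "w4"])])])
print(tuple_tags(toy))  # [(1, 'S'), (2, 'VP'), (3, 'VP')]
```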
Taking the first language model obtained in pre-training as the initialization model, suppose the component syntax tag sequence corresponding to the input is $Y^{\mathrm{con}}$. The component syntactic analysis task aims to optimize the following likelihood:

$$\mathcal{L}_{\mathrm{con}} = \sum_{(X, Y^{\mathrm{con}}) \in D_{\mathrm{con}}} \log P(Y^{\mathrm{con}} \mid X) \qquad (6)$$

where $\mathcal{L}_{\mathrm{con}}$ denotes the optimization objective of the component syntactic analysis task, $D_{\mathrm{con}}$ denotes all component parsing data, and $P(Y^{\mathrm{con}} \mid X)$ denotes the conditional probability of predicting the component syntax tag sequence $Y^{\mathrm{con}}$ from the input text $X$.
During inference, given a predicted component syntax tag sequence $Y^{\mathrm{con}}$, the invention uses a rule-based method to recover the constituency tree from the number of common ancestors and the nearest common ancestor of each current word and its next word.
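A rule-based decoding sketch consistent with the tagging scheme above is given below, under the assumption that every constituent spans at least two words; single-word constituents and unary chains would require an enriched tag scheme that the present application does not detail:

```python
def decode(words, tags):
    # Rebuild a constituency tree from per-word (n, label) tags, where n
    # is the number of ancestors shared with the next word and label is
    # their lowest shared ancestor. Nodes are mutable [label, children].
    root = [None, []]
    stack = [root]                        # open constituents, root first
    for word, (n, label) in zip(words, tags):
        while len(stack) < n:             # open constituents down to depth n
            node = [None, []]
            stack[-1][1].append(node)
            stack.append(node)
        stack[-1][1].append(word)         # attach word to the deepest open node
        stack[n - 1][0] = label           # depth-n node is the shared ancestor
        del stack[n:]                     # close nodes the next word does not share
    stack[-1][1].append(words[-1])        # the last word joins the last open node
    return root

words = ["w1", "w2", "w3", "w4"]
tags = [(1, "S"), (2, "VP"), (3, "VP")]
print(decode(words, tags))  # ['S', ['w1', ['VP', ['w2', ['VP', ['w3', 'w4']]]]]]
```

Applied to the toy tags produced by the conversion sketch above, this decoding recovers the original toy tree.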
In some embodiments of the present application, an apparatus for component syntax analysis of a hybrid domain is provided, the apparatus comprising a processor configured to perform the steps of the component syntax analysis method for a hybrid domain according to various embodiments of the present application.
The processor may be a processing device that includes one or more general-purpose processing devices, such as a microprocessor, a central processing unit (CPU), a graphics processing unit (GPU), and the like. More specifically, the processor may be a complex instruction set computing (CISC) microprocessor, a reduced instruction set computing (RISC) microprocessor, a very long instruction word (VLIW) microprocessor, a processor running other instruction sets, or a processor running a combination of instruction sets. The processor may also be one or more special-purpose processing devices such as an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), a system on a chip (SoC), or the like.
In some embodiments of the present application, a non-transitory computer readable medium is provided having instructions stored thereon, wherein the instructions, when executed by a processor, perform the steps of a composition syntax analysis method for a hybrid domain according to various embodiments of the present application.
In particular, the processor may be communicatively coupled to a computer and configured to execute computer-executable instructions stored in a non-transitory computer-readable medium. The non-transitory computer-readable medium is, for example, a memory, which may include read-only memory (ROM), random-access memory (RAM), phase-change random-access memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), electrically erasable programmable read-only memory (EEPROM), other types of random-access memory, flash disk or other forms of flash memory, cache, registers, static memory, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes or other magnetic storage devices, and the like. In some embodiments, the memory may store computer-executable instructions and the data used or generated when those instructions are executed. The processor may execute the computer-executable instructions to implement the component syntactic analysis method for mixed domains according to various embodiments of the present application.
In the component syntactic analysis method, device and medium for mixed domains, the pre-training of the pre-trained language model fuses multi-task, multi-domain knowledge into the neural network of the first language model, giving it a stronger capability to model natural language syntax and improving the first language model's accuracy and generalization capability on the component syntactic analysis task. The sequence-labeling data processing method unifies the task outputs of the different text sequence processing tasks into one sequence-labeling form, resolving the differences in output form among the tasks so that they can be pre-trained jointly, which benefits the learning of component syntactic analysis. Moreover, multi-domain joint training fuses the knowledge of multiple domains to obtain a domain-generalized textual context representation.
Furthermore, although exemplary embodiments have been described herein, the scope thereof includes any and all embodiments having equivalent elements, modifications, omissions, combinations (e.g., of the various embodiments across schemes), adaptations or alterations based on the present disclosure. Elements in the claims are to be construed broadly based on the language employed in the claims and are not limited to examples described in the present specification or during the practice of the present application, which examples are to be construed as non-exclusive. It is intended, therefore, that the specification and examples be considered as exemplary only, with a true scope and spirit being indicated by the following claims and their full scope of equivalents.
The above description is intended to be illustrative and not restrictive. For example, the above-described examples (or one or more aspects thereof) may be used in combination with each other, and other embodiments may be devised by those of ordinary skill in the art upon reading the above description. In addition, in the above detailed description, various features may be grouped together to streamline the disclosure. This should not be interpreted as intending that an unclaimed disclosed feature is essential to any claim; rather, inventive subject matter may lie in less than all features of a particular disclosed embodiment. Thus, the following claims are hereby incorporated into the detailed description as examples or embodiments, with each claim standing on its own as a separate embodiment, and it is contemplated that these embodiments may be combined with one another in various combinations or permutations. The scope of the invention should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

Claims (10)

1. A component syntactic analysis method for mixed domains, comprising:
training the pre-training language model based on at least one text sequence processing task associated with a component parsing task using a first training dataset comprising at least training data of a first domain and training data of a second domain to obtain a trained first language model, wherein the training data in the first training dataset has truth labels for various text sequence processing tasks of the at least one text sequence processing task;
performing supplemental training on the trained first language model based on the component parsing task by using at least the component syntax annotation data of the first domain to obtain a component syntax analyzer;
and performing component syntactic analysis on the text sequences of the first domain and the text sequences of the second domain by using the component syntax analyzer.
2. The component parsing method of claim 1 wherein the second domain has more training data than the first domain.
3. The component parsing method of claim 1, wherein the performing supplemental training on the trained first language model based on the component parsing task using at least the component syntax annotation data of the first domain to obtain the component syntax analyzer further comprises:
and performing supplemental training on the trained first language model based on the component parsing task by using the component syntax annotation data of the first domain and the component syntax annotation data of the second domain to obtain the component syntax analyzer, wherein there is less component syntax annotation data of the second domain than of the first domain.
4. The component parsing method of any one of claims 1-3, wherein the at least one text sequence processing task associated with a component parsing task includes one or a combination of a target text sequence prediction task, a part-of-speech tag prediction task, a named entity tag prediction task, and a semantic role tag prediction task;
and under the condition that the at least one text sequence processing task associated with the component syntactic analysis task is multiple, performing joint training on the pre-training language model by taking the weighted sum of the optimization targets corresponding to the various text sequence processing tasks as an overall optimization target.
5. The component syntax analysis method according to claim 4, wherein,
the target text sequence prediction task generates a target text sequence meeting the user requirement based on the training text sequence;
the part-of-speech tag prediction task generates a corresponding part-of-speech tag sequence based on the training text sequence;
the named entity tag prediction task generates a corresponding named entity tag sequence based on the training text sequence;
the semantic role label prediction task generates a corresponding semantic role label sequence based on the training text sequence.
6. The component syntax analysis method according to any one of claims 1-3, wherein said component syntax analysis method further comprises:
and preprocessing the training data in the first training dataset so that the truth labels of each training data item for the various text sequence processing tasks have the same sequence length.
7. The component syntax analysis method according to claim 1 or 2, wherein,
the training data in the first training dataset are obtained by text-encoding pre-training data with a Transformer encoder model, and the pre-training data comprise at least pre-training data of the first domain.
8. The component syntax analysis method according to claim 1 or 3, wherein,
the component syntax analysis task generates a corresponding component syntax tag sequence based on the training text sequence.
9. An apparatus for component syntactic analysis in mixed domains, the apparatus comprising a processor configured to perform the steps of the component syntactic analysis method for mixed domains of any one of claims 1-8.
10. A non-transitory computer-readable medium having instructions stored thereon, wherein the instructions, when executed by a processor, perform the steps of the component syntactic analysis method for mixed domains of any one of claims 1-8.
CN202410049989.9A 2024-01-15 2024-01-15 Component syntactic analysis method, device and medium for mixed field Active CN117574878B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410049989.9A CN117574878B (en) 2024-01-15 2024-01-15 Component syntactic analysis method, device and medium for mixed field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410049989.9A CN117574878B (en) 2024-01-15 2024-01-15 Component syntactic analysis method, device and medium for mixed field

Publications (2)

Publication Number Publication Date
CN117574878A true CN117574878A (en) 2024-02-20
CN117574878B CN117574878B (en) 2024-05-17

Family

ID=89895776

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410049989.9A Active CN117574878B (en) 2024-01-15 2024-01-15 Component syntactic analysis method, device and medium for mixed field

Country Status (1)

Country Link
CN (1) CN117574878B (en)

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070078643A1 (en) * 2003-11-25 2007-04-05 Sedogbo Celestin Method for formation of domain-specific grammar from subspecified grammar
CN102945230A (en) * 2012-10-17 2013-02-27 刘运通 Natural language knowledge acquisition method based on semantic matching driving
US20180329892A1 (en) * 2017-05-02 2018-11-15 Dassault Systemes Captioning a region of an image
CN107832476A (en) * 2017-12-01 2018-03-23 北京百度网讯科技有限公司 A kind of understanding method of search sequence, device, equipment and storage medium
JP2019192247A (en) * 2018-04-20 2019-10-31 株式会社Nttドコモ Sentence labeling method and sentence labeling device
WO2020107765A1 (en) * 2018-11-30 2020-06-04 深圳前海微众银行股份有限公司 Statement analysis processing method, apparatus and device, and computer-readable storage medium
CN111368548A (en) * 2018-12-07 2020-07-03 北京京东尚科信息技术有限公司 Semantic recognition method and device, electronic equipment and computer-readable storage medium
CN110263324A (en) * 2019-05-16 2019-09-20 华为技术有限公司 Text handling method, model training method and device
CN112989796A (en) * 2021-03-10 2021-06-18 北京大学 Text named entity information identification method based on syntactic guidance
CN115034224A (en) * 2022-01-26 2022-09-09 华东师范大学 News event detection method and system integrating representation of multiple text semantic structure diagrams
CN114626463A (en) * 2022-03-16 2022-06-14 腾讯科技(深圳)有限公司 Language model training method, text matching method and related device
CN117252261A (en) * 2023-09-25 2023-12-19 深圳前海微众银行股份有限公司 Knowledge graph construction method, electronic equipment and storage medium
CN117312564A (en) * 2023-10-17 2023-12-29 中国电信股份有限公司技术创新中心 Text classification method, classification device, electronic equipment and storage medium
CN117390189A (en) * 2023-11-05 2024-01-12 北京工业大学 Neutral text generation method based on pre-classifier
CN117195922A (en) * 2023-11-07 2023-12-08 四川语言桥信息技术有限公司 Human-in-loop neural machine translation method, system and readable storage medium

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
XUEFENG BAI ET AL.: "Investigating Typed Syntactic Dependencies for Targeted Sentiment Classification Using Graph Attention Neural Network", IEEE, vol. 29, 2 December 2020 (2020-12-02), XP011829612, DOI: 10.1109/TASLP.2020.3042009 *
张晓孪; 王西锋: "Research and Implementation of Semantic Role Labeling in Chinese Question Answering Systems" (in Chinese), Science Technology and Engineering, no. 10, 15 May 2008 (2008-05-15)
李业刚; 孙福振; 李鉴柏; 吕新宇: "A Survey of Semantic Role Labeling Research" (in Chinese), Journal of Shandong University of Technology (Natural Science Edition), no. 06, 25 November 2011 (2011-11-25)
李雁群; 何云琪; 钱龙华; 周国栋: "Automatic Construction of a Chinese Nested Named Entity Recognition Corpus Based on Wikipedia" (in Chinese), Computer Engineering, no. 11, 15 November 2018 (2018-11-15)
石岳峰 et al.: "Applications of Deep Learning in Argument Mining Tasks" (in Chinese), Journal of Chinese Information Processing, vol. 36, no. 7, 31 July 2022 (2022-07-31)
黄子怡 et al.: "A Joint Learning Method for Constituent Syntactic and AMR Semantic Parsing" (in Chinese), Journal of Chinese Information Processing, vol. 36, no. 7, 31 July 2022 (2022-07-31)

Also Published As

Publication number Publication date
CN117574878B (en) 2024-05-17

Similar Documents

Publication Publication Date Title
Lee et al. Fully character-level neural machine translation without explicit segmentation
CN113254610B (en) Multi-round conversation generation method for patent consultation
Pramanik et al. Text normalization using memory augmented neural networks
Xu et al. NADAQ: natural language database querying based on deep learning
Goyal et al. Natural language generation through character-based rnns with finite-state prior knowledge
CN114936287A (en) Knowledge injection method for pre-training language model and corresponding interactive system
CN114742016A (en) Chapter-level event extraction method and device based on multi-granularity entity differential composition
Sabane et al. Enhancing Low Resource NER using Assisting Language and Transfer Learning
He et al. Infrrd. ai at SemEval-2022 Task 11: A system for named entity recognition using data augmentation, transformer-based sequence labeling model, and EnsembleCRF
Dong et al. Relational distance and document-level contrastive pre-training based relation extraction model
CN117574878B (en) Component syntactic analysis method, device and medium for mixed field
Hu et al. RST discourse parsing as text-to-text generation
Ahammad et al. Improved neural machine translation using Natural Language Processing (NLP)
Xu et al. A Multi-Task Instruction with Chain of Thought Prompting Generative Framework for Few-Shot Named Entity Recognition
CN117576710B (en) Method and device for generating natural language text based on graph for big data analysis
Cao et al. Predict, pretrained, select and answer: Interpretable and scalable complex question answering over knowledge bases
Yu et al. Adaptive cross-lingual question generation with minimal resources
Tank et al. Abstractive text summarization using adversarial learning and deep neural network
Gong Analysis and Application of the Business English Translation Query and Decision Model with Big Data Corpus
CN112966520B (en) Natural language generation method and device
CN116245114B (en) End-to-end task type dialogue system based on dialogue state guidance
Zhou et al. Increasing naturalness of human–machine dialogue: The users’ choices inference of options in machine-raised questions
Shen et al. Medical Text Entity Study based on BERT-BiLSTM-MHA-CRF Model
Wu et al. Optimization of hierarchical reinforcement learning relationship extraction model
Chiplunkar et al. Prediction of pos tagging for unknown words for specific Hindi and Marathi language

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant