CN112131873A - Part-of-speech tagging method and device for text - Google Patents
- Publication number
- CN112131873A, CN202011063051.0A
- Authority
- CN
- China
- Prior art keywords
- text
- labeled
- speech tagging
- speech
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The invention provides a part of speech tagging method and a part of speech tagging device for a text, wherein the method comprises the following steps: when a text processing instruction is received, determining a task type corresponding to the text processing instruction; under the condition that the task type is a part-of-speech tagging task, acquiring a text set to be tagged specified by the text processing instruction, wherein the text set to be tagged comprises at least one text to be tagged; determining a corpus field to which a text to be processed in a text set to be labeled belongs, and determining a preset part-of-speech labeling model corresponding to the corpus field; preprocessing each text to be labeled in the text set to be labeled; and sequentially inputting each preprocessed text to be labeled into the part-of-speech labeling model to obtain a part-of-speech labeling result corresponding to each text to be labeled. By applying the method provided by the invention, part-of-speech tagging can be performed by applying the part-of-speech tagging model corresponding to the corpus field to which the text to be tagged belongs, so that the accuracy of part-of-speech tagging is improved.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a part-of-speech tagging method and device for a text.
Background
With the development of science and technology, natural language processing has become an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable efficient communication between people and computers using natural language. Part-of-speech tagging of text is often required when executing complex natural language tasks, for example in the field of intelligent question answering.
Part-of-speech tagging refers to assigning a part-of-speech tag to each element of a text to be processed. In the prior art, text features are usually formulated manually, and tags are assigned by identifying those features in the text. Manually defined text features easily describe the text inaccurately, so the accuracy of part-of-speech tagging is low.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a part-of-speech tagging method for a text, which can improve the accuracy of part-of-speech tagging.
The invention also provides a part-of-speech tagging device of the text, which is used for ensuring the realization and the application of the method in practice.
A part-of-speech tagging method for text comprises the following steps:
when a text processing instruction is received, determining a task type corresponding to the text processing instruction;
under the condition that the task type is a part-of-speech tagging task, acquiring a text set to be tagged specified by the text processing instruction, wherein the text set to be tagged comprises at least one text to be tagged;
determining a corpus field to which a text to be processed in the text set to be labeled belongs, and determining a part-of-speech labeling model corresponding to the corpus field;
preprocessing each text to be labeled in the text set to be labeled;
and sequentially inputting each preprocessed text to be labeled into the part-of-speech labeling model to obtain a part-of-speech labeling result corresponding to each text to be labeled.
Optionally, the method includes a process of setting the part-of-speech tagging model, including:
acquiring an initial part-of-speech tagging model and a sample data set of the corpus field; each sample data in the sample data set is text data which is labeled by part of speech in advance; the sample data set is divided into a training sample set, a verification sample set and a test sample set;
sequentially training the initial part of speech tagging model by applying each sample data in the training sample set, and verifying the trained initial part of speech tagging model based on the sample data in the verification sample set to obtain an alternative part of speech tagging model;
and testing the alternative part-of-speech tagging model by applying the test sample set to obtain a test result, judging the part-of-speech tagging accuracy of the alternative part-of-speech tagging model according to the test result, and if the part-of-speech tagging accuracy is greater than a preset accuracy threshold, taking the alternative part-of-speech tagging model as the part-of-speech tagging model corresponding to the corpus field.
Optionally, in the method, the preprocessing is performed on each text to be labeled in the text set to be labeled, and includes:
splitting each text to be labeled in the text set to be labeled to obtain each text block of each text to be labeled; each of the text blocks includes at least one character;
and for each text to be labeled, mapping each text block of the text to be labeled so as to complete the preprocessing of the text to be labeled.
Optionally, in the method, sequentially inputting each preprocessed text to be labeled into the preset part-of-speech labeling model to obtain a part-of-speech labeling result corresponding to each text to be labeled includes:
inputting each preprocessed text to be labeled into a preset part-of-speech labeling model; the part-of-speech tagging model is obtained by sequentially stacking a plurality of encoders;
triggering each encoder in the part-of-speech tagging model to sequentially process the input preprocessed to-be-tagged texts to obtain part-of-speech tagging results corresponding to each to-be-processed text, wherein each preprocessed to-be-tagged text is input to a first encoder in the part-of-speech tagging model, and output of each encoder is used as input of a next encoder.
Optionally, the determining a corpus field to which a text to be processed in the text set to be labeled belongs includes:
acquiring text attribute information of the text set to be labeled;
and determining the corpus field to which the text to be processed in the text set to be labeled belongs based on the text attribute information.
A part-of-speech tagging apparatus for text, comprising:
the receiving unit is used for determining a task type corresponding to a text processing instruction when the text processing instruction is received;
the acquiring unit is used for acquiring a text set to be labeled specified by the text processing instruction under the condition that the task type is a part-of-speech labeling task, wherein the text set to be labeled comprises at least one text to be labeled;
the determining unit is used for determining the corpus field to which the text to be processed in the text set to be labeled belongs and determining a part-of-speech labeling model corresponding to the corpus field;
the preprocessing unit is used for preprocessing each text to be labeled in the text set to be labeled;
and the labeling unit is used for sequentially inputting each preprocessed text to be labeled into a preset part-of-speech labeling model to obtain a part-of-speech labeling result corresponding to each text to be labeled.
The above apparatus, optionally, further comprises: a model setting unit; the model setting unit is used for acquiring an initial part-of-speech tagging model and a sample data set of the corpus field; each sample data in the sample data set is text data which is labeled by part of speech in advance; the sample data set is divided into a training sample set, a verification sample set and a test sample set; sequentially training the initial part of speech tagging model by applying each sample data in the training sample set, and verifying the trained initial part of speech tagging model based on the sample data in the verification sample set to obtain an alternative part of speech tagging model; and testing the alternative part-of-speech tagging model by applying the test sample set to obtain a test result, judging the part-of-speech tagging accuracy of the alternative part-of-speech tagging model according to the test result, and if the part-of-speech tagging accuracy is greater than a preset accuracy threshold, taking the alternative part-of-speech tagging model as the part-of-speech tagging model corresponding to the corpus field.
The above apparatus, optionally, the preprocessing unit includes:
the splitting subunit is used for splitting each text to be labeled in the text set to be labeled to obtain each text block of each text to be labeled; each of the text blocks includes at least one character;
and the mapping subunit is used for mapping each text block of each text to be labeled so as to complete the preprocessing of the text to be labeled.
The above apparatus, optionally, the labeling unit includes:
the input subunit is used for inputting each preprocessed text to be labeled into a preset part-of-speech labeling model; the part-of-speech tagging model is obtained by sequentially stacking a plurality of encoders;
and the triggering subunit is used for triggering each encoder in the part-of-speech tagging model to sequentially process the input preprocessed to-be-tagged texts to obtain part-of-speech tagging results corresponding to each to-be-processed text, wherein each preprocessed to-be-tagged text is input to a first encoder in the part-of-speech tagging model, and the output of each encoder is used as the input of a next encoder.
The above apparatus, optionally, the determining unit includes:
the acquiring subunit is used for acquiring the text attribute information of the text set to be labeled;
and the determining subunit is used for determining the corpus field to which the text to be processed in the text set to be labeled belongs based on the text attribute information.
Compared with the prior art, the invention has the following advantages:
the invention provides a part of speech tagging method and a part of speech tagging device for a text, wherein the method comprises the following steps: when a text processing instruction is received, determining a task type corresponding to the text processing instruction; under the condition that the task type is a part-of-speech tagging task, acquiring a text set to be tagged specified by the text processing instruction, wherein the text set to be tagged comprises at least one text to be tagged; determining a corpus field to which a text to be processed in the text set to be labeled belongs, and determining a part-of-speech labeling model corresponding to the corpus field; preprocessing each text to be labeled in the text set to be labeled; and sequentially inputting each preprocessed text to be labeled into the part-of-speech labeling model to obtain a part-of-speech labeling result corresponding to each text to be labeled. By applying the method provided by the invention, part-of-speech tagging can be performed by applying the part-of-speech tagging model corresponding to the corpus field to which the text to be tagged belongs, so that the accuracy of part-of-speech tagging is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a flowchart of a method for part-of-speech tagging of a text according to the present invention;
FIG. 2 is a flowchart of a part-of-speech tagging model setting process provided by the present invention;
FIG. 3 is a flowchart illustrating a process of processing a preprocessed text to be labeled according to the present invention;
FIG. 4 is an exemplary diagram of an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a part-of-speech tagging apparatus for text according to the present invention;
FIG. 6 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention is operational with numerous general purpose or special purpose computing device environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multi-processor apparatus, distributed computing environments that include any of the above devices or equipment, and the like.
The embodiment of the invention provides a part-of-speech tagging method for a text, which can be applied to various system platforms, wherein an execution main body of the method can be a computer terminal or a processor of various mobile devices, and a flow chart of the method is shown in fig. 1 and specifically comprises the following steps:
S101: when a text processing instruction is received, determining a task type corresponding to the text processing instruction.
In the method provided by the embodiment of the present invention, the task type may be any of various natural language processing task types, for example a text classification task, an entity recognition task, a part-of-speech tagging task, or a sentence relation judgment task; different tasks correspond to different task processing operations.
S102: under the condition that the task type is a part-of-speech tagging task, acquiring a text set to be tagged specified by the text processing instruction, wherein the text set to be tagged comprises at least one text to be tagged.
In the method provided by the embodiment of the present invention, the text set to be labeled includes one or more texts to be labeled.
The instruction information of the text processing instruction may include the text set to be labeled itself, or a set identifier or storage address information of the text set to be labeled; in the latter case, the text set to be labeled specified by the text processing instruction can be obtained based on the set identifier or the storage address information.
The text to be labeled can be texts in various languages, such as a chinese text, an english text, and the like.
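As a minimal sketch of how the text set to be labeled might be obtained from the instruction information in S102, the following assumes a hypothetical instruction structure and two in-memory lookup tables that stand in for real storage; none of these names come from the embodiment.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical in-memory stores standing in for a real storage back end.
TEXT_SETS_BY_ID = {"set-001": ["Zhang San pays the electricity fee."]}
TEXT_SETS_BY_ADDRESS = {"/data/grid/batch1": ["The line was inspected."]}


@dataclass
class TextProcessingInstruction:
    """Illustrative shape of the instruction information (an assumption)."""
    task_type: str
    texts: Optional[List[str]] = None       # text set carried inline
    set_id: Optional[str] = None            # set identifier
    storage_address: Optional[str] = None   # storage address information


def resolve_text_set(instr: TextProcessingInstruction) -> List[str]:
    """Return the text set to be labeled that the instruction specifies."""
    if instr.texts is not None:
        return list(instr.texts)
    if instr.set_id is not None:
        return list(TEXT_SETS_BY_ID[instr.set_id])
    if instr.storage_address is not None:
        return list(TEXT_SETS_BY_ADDRESS[instr.storage_address])
    raise ValueError("instruction does not specify a text set to be labeled")
```

The inline text set is checked first here only for simplicity; the embodiment does not prescribe a lookup order.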
S103: determining a corpus field to which a text to be processed in the text set to be labeled belongs, and determining the preset part-of-speech labeling model corresponding to the corpus field.
In the method provided by the embodiment of the present invention, each to-be-processed text in the to-be-labeled text set has its corresponding corpus field, and the corpus fields to which each to-be-processed text belongs may be the same or different.
The corpus field may be the corpus field of any industry, such as the power grid corpus field, the e-commerce corpus field, or the communication corpus field.
Specifically, part-of-speech tagging models corresponding to the corpus fields may be determined in pre-established part-of-speech tagging models, and if the corpus fields to which the texts to be processed belong are different, part-of-speech tagging models corresponding to the corpus fields may be respectively determined.
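The model determination in S103 can be sketched as a lookup in a registry of pre-established models, with the texts grouped by the model that will process them. The registry contents and field names below are hypothetical placeholders.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical registry: corpus field -> identifier of a pre-established model.
MODEL_REGISTRY = {
    "power_grid": "pos-model-power-grid",
    "e_commerce": "pos-model-e-commerce",
}


def group_texts_by_model(
    texts_with_fields: Iterable[Tuple[str, str]],
) -> Dict[str, List[str]]:
    """Group texts by the model that corresponds to their corpus field."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for text, field in texts_with_fields:
        if field not in MODEL_REGISTRY:
            raise KeyError(f"no part-of-speech tagging model for field {field!r}")
        grouped[MODEL_REGISTRY[field]].append(text)
    return dict(grouped)
```

When all texts share one corpus field, the result contains a single model; when the fields differ, each text is routed to the model of its own field.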
S104: preprocessing each text to be labeled in the text set to be labeled.
In the method provided by the embodiment of the invention, the text to be annotated is preprocessed to obtain the preprocessed text to be annotated, and the preprocessed text to be annotated is matched with the input interface of the part-of-speech annotation model.
The preprocessed text to be labeled can be in a vector form.
S105: sequentially inputting each preprocessed text to be labeled into the preset part-of-speech labeling model to obtain a part-of-speech labeling result corresponding to each text to be labeled.
In the method provided by the embodiment of the present invention, the part-of-speech tagging model may be various natural language processing models, for example, a BERT model.
In the embodiment of the present invention, part-of-speech tagging models corresponding to different corpus fields are different, and under the condition that the corpus fields of the texts to be tagged are different, the texts to be tagged can be sequentially input into the part-of-speech tagging models corresponding to the respective corpus fields to which the texts to be tagged belong, so as to obtain part-of-speech tagging results corresponding to each text to be tagged.
Specifically, in the part-of-speech tagging result each element of the text to be processed carries part-of-speech information. For example, if the text to be processed is "Zhang San pays the electricity fee", the part-of-speech tagging result may be "Zhang San n, pays v, electricity fee n", where n represents a noun and v represents a verb.
The embodiment of the invention provides a part-of-speech tagging method for a text, which comprises the following steps: when a text processing instruction is received, determining a task type corresponding to the text processing instruction; under the condition that the task type is a part-of-speech tagging task, acquiring a text set to be tagged specified by the text processing instruction, wherein the text set to be tagged comprises at least one text to be tagged; determining a corpus field to which a text to be processed in the text set to be labeled belongs, and determining a part-of-speech labeling model corresponding to the corpus field; preprocessing each text to be labeled in the text set to be labeled; and sequentially inputting each preprocessed text to be labeled into the part-of-speech labeling model to obtain a part-of-speech labeling result corresponding to each text to be labeled. By applying the method provided by the embodiment of the invention, part-of-speech tagging can be performed by applying the part-of-speech tagging model corresponding to the corpus field to which the text to be tagged belongs, so that the accuracy of part-of-speech tagging is improved.
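Steps S101 to S105 can be sketched end to end as follows. This is only an illustration under stated assumptions: the instruction is a plain dictionary, the corpus field is carried in the instruction, preprocessing is a whitespace split, and a toy lexicon lookup stands in for a trained part-of-speech tagging model.

```python
from typing import Callable, Dict, List, Tuple

Tagged = List[Tuple[str, str]]

# Toy lexicon standing in for a trained model (illustrative only).
GRID_LEXICON = {"ZhangSan": "n", "pays": "v", "fee": "n"}


def grid_model(tokens: List[str]) -> Tagged:
    """Stand-in tagger: look each token up, tag unknown tokens as 'x'."""
    return [(tok, GRID_LEXICON.get(tok, "x")) for tok in tokens]


# Hypothetical mapping from corpus field to its tagging model.
MODELS: Dict[str, Callable[[List[str]], Tagged]] = {"power_grid": grid_model}


def handle_instruction(instruction: dict) -> List[Tagged]:
    # S101: determine the task type of the instruction.
    if instruction["task_type"] != "pos_tagging":
        raise NotImplementedError("only the part-of-speech tagging task is sketched")
    # S102: obtain the text set to be labeled.
    texts = instruction["texts"]
    # S103: determine the corpus field and the corresponding model.
    model = MODELS[instruction["corpus_field"]]
    results = []
    for text in texts:
        tokens = text.split()          # S104: preprocessing (simplified)
        results.append(model(tokens))  # S105: tag each text in turn
    return results
```

For the instruction `{"task_type": "pos_tagging", "corpus_field": "power_grid", "texts": ["ZhangSan pays fee"]}`, the sketch returns one list of (token, tag) pairs per text.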
In the method provided in the embodiment of the present invention, based on the implementation process, specifically, the setting process of the part-of-speech tagging model, as shown in fig. 2, includes:
S201: acquiring an initial part-of-speech tagging model and a sample data set of the corpus field; each sample data in the sample data set is text data which is labeled by part of speech in advance; the sample data set is divided into a training sample set, a verification sample set and a test sample set.
In the method provided by the embodiment of the invention, part-of-speech tagging can be performed on the text data by using the part-of-speech tagging tool, and a technician reviews tagging results of the part-of-speech tagging tool and modifies tagging results which do not meet requirements.
Each sample data in the sample data set can be divided into a training sample set, a verification sample set and a test sample set according to a preset proportion.
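A minimal sketch of dividing the sample data set according to a preset proportion is given below. The 8:1:1 ratio, the shuffling, and the fixed seed are assumptions made for illustration; the embodiment does not specify them.

```python
import random
from typing import List, Sequence, Tuple


def split_samples(
    samples: Sequence,
    ratios: Tuple[float, float, float] = (0.8, 0.1, 0.1),  # assumed proportion
    seed: int = 0,
) -> Tuple[List, List, List]:
    """Split samples into training, verification and test sample sets."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # remainder goes to the test set
    return train, val, test
```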
S202: training the initial part-of-speech tagging model in sequence by applying each sample data in the training sample set, and verifying the trained initial part-of-speech tagging model based on the sample data in the verification sample set to obtain an alternative part-of-speech tagging model.
The trained initial part-of-speech tagging model is verified to determine whether its model parameters meet the requirements; under the condition that the model parameters meet the requirements, the trained initial part-of-speech tagging model is taken as the alternative part-of-speech tagging model.
S203: testing the alternative part-of-speech tagging model by applying the test sample set to obtain a test result, judging the part-of-speech tagging accuracy of the alternative part-of-speech tagging model according to the test result, and if the part-of-speech tagging accuracy is greater than a preset accuracy threshold, taking the alternative part-of-speech tagging model as the part-of-speech tagging model corresponding to the corpus field.
Under the condition that the part-of-speech tagging accuracy is not greater than the accuracy threshold, the parameters of the alternative part-of-speech tagging model are adjusted until the part-of-speech tagging accuracy of the alternative part-of-speech tagging model is greater than the accuracy threshold.
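The acceptance check in S203 can be sketched as a token-level accuracy computation followed by a strict comparison against the threshold. Token-level accuracy and the default threshold of 0.95 are assumptions; the embodiment does not state how the accuracy is computed or what threshold is used.

```python
from typing import List, Sequence


def tagging_accuracy(
    predicted: Sequence[Sequence[str]], gold: Sequence[Sequence[str]]
) -> float:
    """Fraction of tokens whose predicted tag equals the pre-labeled tag."""
    total = 0
    correct = 0
    for pred_tags, gold_tags in zip(predicted, gold):
        if len(pred_tags) != len(gold_tags):
            raise ValueError("predicted and gold tag sequences differ in length")
        total += len(gold_tags)
        correct += sum(p == g for p, g in zip(pred_tags, gold_tags))
    return correct / total if total else 0.0


def accept_model(
    predicted: Sequence[Sequence[str]],
    gold: Sequence[Sequence[str]],
    threshold: float = 0.95,  # assumed preset accuracy threshold
) -> bool:
    """Accept the alternative model only if accuracy is strictly greater."""
    return tagging_accuracy(predicted, gold) > threshold
```

If `accept_model` returns False, the parameters of the alternative model would be adjusted and the test repeated, as described above.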
In the method provided in the embodiment of the present invention, based on the implementation process, specifically, the preprocessing each text to be labeled in the text set to be labeled includes:
splitting each text to be labeled in the text set to be labeled to obtain each text block of each text to be labeled; each of the text blocks includes at least one character;
and for each text to be labeled, mapping each text block of the text to be labeled so as to complete the preprocessing of the text to be labeled.
Each text to be labeled in the text set to be labeled can be split by adopting a preset word segmentation method to obtain each text block of each text to be labeled, and the text blocks can be single characters or words.
In the method provided by the embodiment of the present invention, the text blocks in the text to be processed may be mapped to obtain a dictionary index set corresponding to the text to be processed, where the dictionary index set includes the dictionary index of each text block of the text to be processed and may be used as an input of the model. If the length of the text to be processed is greater than a preset length threshold, the head and/or the tail of the text to be processed can be randomly removed so that its length is equal to the length threshold. If the length of the text to be processed is smaller than the length threshold, a padding (completion) operation can be performed on the text to be processed so that its length after padding is equal to the length threshold.
Specifically, each text block may be converted into a dictionary index (i.e., id) as a model input by using the dictionary (vocab) that comes with the pre-trained BERT model.
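The splitting, mapping, truncation and padding described above can be sketched as follows. The toy vocabulary, the whitespace split, and the deterministic tail truncation are simplifying assumptions; a real BERT vocabulary is much larger, BERT inputs normally also carry special tokens, and the embodiment allows the head and/or the tail to be removed.

```python
from typing import List

# Toy vocabulary standing in for a pre-trained model's dictionary (assumption).
VOCAB = {
    "[PAD]": 0, "[UNK]": 1,
    "zhang": 2, "san": 3, "pays": 4, "the": 5, "electricity": 6, "fee": 7,
}


def preprocess(text: str, max_len: int = 8) -> List[int]:
    """Split a text into blocks, map them to dictionary ids, fix the length."""
    blocks = text.lower().split()                        # split into text blocks
    ids = [VOCAB.get(b, VOCAB["[UNK]"]) for b in blocks]  # map to dictionary ids
    ids = ids[:max_len]                                  # truncate the tail if too long
    ids += [VOCAB["[PAD]"]] * (max_len - len(ids))       # pad if too short
    return ids
```

For example, `preprocess("Zhang San pays the electricity fee", 8)` yields six vocabulary ids followed by two padding ids.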
In the method provided in the embodiment of the present invention, based on the foregoing implementation process, specifically, the processing each preprocessed text to be annotated by applying a preset part-of-speech tagging model is shown in fig. 3, and includes:
S301: inputting each preprocessed text to be labeled into a preset part-of-speech labeling model; the part-of-speech tagging model is obtained by sequentially stacking a plurality of encoders.
In the method provided by the embodiment of the present invention, the encoder may be a bidirectional Transformer encoder.
S302: triggering each encoder in the part-of-speech tagging model to sequentially process the input preprocessed to-be-tagged texts to obtain part-of-speech tagging results corresponding to the to-be-processed texts, wherein each preprocessed to-be-tagged text is input to a first encoder in the part-of-speech tagging model, and the output of each encoder is used as the input of a next encoder.
In the method provided by the embodiment of the present invention, the output of the last encoder in the part-of-speech tagging model is the part-of-speech tagging result corresponding to the text to be processed, and the next preprocessed text to be processed can be input to the part-of-speech tagging model under the condition that the part-of-speech result corresponding to the text to be processed currently input to the part-of-speech tagging model is obtained.
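The data flow through the stacked encoders in S301 and S302 can be sketched with plain callables. The two toy encoders below are placeholders that only show the chaining; they are not Transformer layers and do not produce part-of-speech tags.

```python
from typing import Callable, List, Sequence

Encoder = Callable[[List[float]], List[float]]


def run_stack(encoders: Sequence[Encoder], model_input: List[float]) -> List[float]:
    """Feed the input to the first encoder and chain each output onward."""
    hidden = model_input
    for encoder in encoders:
        hidden = encoder(hidden)  # output of one encoder is input of the next
    return hidden                 # output of the last encoder


def run_batch(
    encoders: Sequence[Encoder], inputs: Sequence[List[float]]
) -> List[List[float]]:
    """Process texts one at a time: the next text enters only after the
    result for the current text has been obtained."""
    return [run_stack(encoders, x) for x in inputs]


# Toy encoders used only to make the chaining visible.
toy_encoders: List[Encoder] = [
    lambda xs: [x + 1 for x in xs],
    lambda xs: [x * 2 for x in xs],
]
```

With `toy_encoders`, the input `[1, 2]` first becomes `[2, 3]` and then `[4, 6]`, which illustrates that the order of the stack matters.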
In the method provided by the embodiment of the present invention, based on the implementation process, specifically, the determining a corpus field to which the text to be processed in the text set to be labeled belongs includes:
acquiring text attribute information of the text set to be labeled;
and determining the corpus field to which the text to be processed in the text set to be labeled belongs based on the text attribute information.
In the method provided by the embodiment of the present invention, the corpus field to which the text to be processed in the text set to be labeled belongs may be determined in a preset configuration file based on the text attribute information, and the configuration file records the corresponding relationship between the text attribute information and the corpus field.
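A minimal sketch of such a configuration lookup follows. The JSON layout, the attribute name `source_system`, and its values are hypothetical; the embodiment only states that the configuration file records the correspondence between text attribute information and corpus fields.

```python
import json
from typing import Optional

# Hypothetical configuration content; in practice it would be read from a file.
CONFIG_JSON = """
{
  "rules": [
    {"attribute": "source_system", "value": "grid_dispatch", "corpus_field": "power_grid"},
    {"attribute": "source_system", "value": "online_shop", "corpus_field": "e_commerce"}
  ]
}
"""


def corpus_field_for(attributes: dict, config: dict) -> Optional[str]:
    """Return the corpus field of the first rule matching the attributes."""
    for rule in config["rules"]:
        if attributes.get(rule["attribute"]) == rule["value"]:
            return rule["corpus_field"]
    return None  # no rule matched; the caller decides how to handle this


CONFIG = json.loads(CONFIG_JSON)
```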
Referring to fig. 4, an exemplary diagram of an implementation scenario provided by the present invention is shown, where the implementation scenario provided by the embodiment of the present invention includes a server 401 and a terminal 402.
In practice, the terminal 402 shown in fig. 4 may be an electronic device such as a mobile phone, a tablet computer, a personal computer, and the like. The server 401 may be one server, a server cluster composed of a plurality of servers, or a cloud computing service center. The server 401 and the terminal 402 establish a communication connection through a network.
Before the implementation of the embodiment of the present invention, preparation work may be performed, and the preparation work includes: A. corpus labeling in the power grid field; B. loading the BERT model and fine-tuning it; C. model deployment and application.
The process of corpus labeling in the power grid field can be as follows: collecting and sorting text corpora related to the power grid field; and manually labeling the text with word segmentation and part of speech by means of a labeling tool.
The process of loading the BERT model and fine-tuning it may be: loading the open-source BERT model; processing the labeled corpus so that it fits the input interface of the model; dividing the data into a training set, a verification set and a test set; performing fine-tuning training by using a CPU/GPU; and storing the final model in a structured manner.
The process of model deployment and application may be: deploying the model and opening a relevant interface for users to call.
Specifically, the model may be deployed in the server 401, and the terminal 402 may send a text processing instruction to the server 401 by calling the interface. When the server 401 receives the text processing instruction, it determines the task type corresponding to the text processing instruction; under the condition that the task type is a part-of-speech tagging task, it acquires the text set to be tagged specified by the text processing instruction; it determines the corpus field to which the text to be processed in the text set to be labeled belongs, and if the corpus field is the power grid corpus field, takes the deployed model as the part-of-speech labeling model corresponding to the text processing instruction; it preprocesses each text to be labeled in the text set to be labeled; and it sequentially inputs each preprocessed text to be labeled into the part-of-speech labeling model to obtain a part-of-speech labeling result corresponding to each text to be processed.
The network involved in the embodiments of the present invention is a medium that provides communication links, and may include various types of connections, such as wired or wireless communication links, or fiber optic cables.
Corresponding to the method described in fig. 1, an embodiment of the present invention further provides a part-of-speech tagging apparatus for a text, which is used for specifically implementing the method in fig. 1, and the part-of-speech tagging apparatus for a text provided in the embodiment of the present invention may be applied to a computer terminal or various mobile devices, and a schematic structural diagram of the apparatus is shown in fig. 5, and specifically includes:
a receiving unit 501, configured to determine a task type corresponding to a text processing instruction when the text processing instruction is received;
an obtaining unit 502, configured to obtain a to-be-labeled text set specified by the text processing instruction when the task type is a part-of-speech labeling task, where the to-be-labeled text set includes at least one to-be-labeled text;
a determining unit 503, configured to determine a corpus field to which a text to be processed in the text set to be labeled belongs, and determine a part-of-speech labeling model corresponding to the corpus field;
a preprocessing unit 504, configured to preprocess each text to be labeled in the text set to be labeled;
and a labeling unit 505, configured to sequentially input each preprocessed text to be labeled into the part-of-speech labeling model, so as to obtain a part-of-speech labeling result corresponding to each text to be labeled.
The embodiment of the invention provides a part-of-speech tagging device for a text, which is used for determining a task type corresponding to a text processing instruction when the text processing instruction is received; under the condition that the task type is a part-of-speech tagging task, acquiring a text set to be tagged specified by the text processing instruction, wherein the text set to be tagged comprises at least one text to be tagged; determining a corpus field to which a text to be processed in the text set to be labeled belongs, and determining a part-of-speech labeling model corresponding to the corpus field; preprocessing each text to be labeled in the text set to be labeled; and sequentially inputting each preprocessed text to be labeled into the part-of-speech labeling model to obtain a part-of-speech labeling result corresponding to each text to be labeled. By applying the device provided by the embodiment of the invention, part-of-speech tagging can be performed by applying the part-of-speech tagging model corresponding to the corpus field to which the text to be tagged belongs, so that the accuracy of part-of-speech tagging is improved.
In an embodiment of the present invention, based on the above scheme, the part-of-speech tagging apparatus for text optionally further includes a model setting unit. The model setting unit is configured to: acquire an initial part-of-speech tagging model and a sample data set of the corpus field, where each sample data in the sample data set is text data that has been part-of-speech labeled in advance, and the sample data set is divided into a training sample set, a verification sample set and a test sample set; train the initial part-of-speech tagging model by applying each sample data in the training sample set in turn, and verify the trained initial part-of-speech tagging model based on the sample data in the verification sample set to obtain an alternative part-of-speech tagging model; and test the alternative part-of-speech tagging model by applying the test sample set to obtain a test result, determine the part-of-speech tagging accuracy of the alternative part-of-speech tagging model from the test result, and, if the part-of-speech tagging accuracy is greater than a preset accuracy threshold, take the alternative part-of-speech tagging model as the part-of-speech tagging model corresponding to the corpus field.
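The final acceptance test can be illustrated with a short sketch. The names `tagging_accuracy` and `accept_model`, the token-level definition of accuracy and the default threshold of 0.9 are assumptions for the example; the description only states that the accuracy must be greater than a preset threshold.

```python
def tagging_accuracy(predicted, gold):
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    total = 0
    correct = 0
    for pred_tags, gold_tags in zip(predicted, gold):
        if len(pred_tags) != len(gold_tags):
            raise ValueError("predicted and gold sequences must have equal length")
        total += len(gold_tags)
        correct += sum(p == g for p, g in zip(pred_tags, gold_tags))
    if total == 0:
        raise ValueError("the test set contains no tokens")
    return correct / total


def accept_model(model, test_set, threshold=0.9):
    """Test an alternative model and decide whether it becomes the field's model.

    test_set is a list of (tokens, gold_tags) pairs; model maps tokens to tags.
    The threshold value is an illustrative assumption.
    """
    predicted = [model(tokens) for tokens, _ in test_set]
    gold = [tags for _, tags in test_set]
    accuracy = tagging_accuracy(predicted, gold)
    # Strictly greater than the threshold, following the wording of the description.
    return accuracy > threshold, accuracy
```

An alternative model that fails this check would not be put into use for the corpus field; what happens next (for example, further training) is not specified in this passage.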
In an embodiment of the present invention, based on the above scheme, the preprocessing unit 504 optionally includes:
a splitting subunit, configured to split each text to be labeled in the text set to be labeled to obtain the text blocks of each text to be labeled, where each of the text blocks includes at least one character;
and a mapping subunit, configured to map each text block of each text to be labeled, so as to complete the preprocessing of the text to be labeled.
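The splitting and mapping steps can be sketched as follows, under two loudly stated assumptions: each text block is a single non-whitespace character, and mapping means looking the block up in a vocabulary of integer ids with an `[UNK]` fallback. The description only requires that a block contain at least one character and does not say what the blocks are mapped to.

```python
def split_text(text):
    """Split a text into blocks; here each block is one non-whitespace character."""
    return [ch for ch in text if not ch.isspace()]


def build_vocab(texts):
    """Assign an integer id to every block seen in the given texts (id 0 is reserved)."""
    vocab = {"[UNK]": 0}
    for text in texts:
        for block in split_text(text):
            # setdefault keeps the first id assigned to a block.
            vocab.setdefault(block, len(vocab))
    return vocab


def preprocess(text, vocab):
    """Map each block of a text to its id, falling back to the unknown id."""
    return [vocab.get(block, vocab["[UNK]"]) for block in split_text(text)]
```

A pretrained BERT model ships with its own tokenizer and vocabulary, so in practice the hand-built vocabulary above would be replaced by that tokenizer.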
In an embodiment of the present invention, based on the above scheme, the labeling unit 505 optionally includes:
an input subunit, configured to input each preprocessed text to be labeled into the part-of-speech labeling model, where the part-of-speech labeling model is obtained by sequentially stacking a plurality of encoders;
and a triggering subunit, configured to trigger each encoder in the part-of-speech labeling model to process the input preprocessed text to be labeled in sequence, so as to obtain a part-of-speech labeling result corresponding to each text to be labeled, where each preprocessed text to be labeled is input to the first encoder in the part-of-speech labeling model, and the output of each encoder is used as the input of the next encoder.
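The data flow through the stacked encoders can be shown with a toy sketch. The class `StackedTagger`, the list-of-floats vectors and the `head` function that turns each final vector into a tag are all assumptions for illustration; real BERT encoders are Transformer layers with self-attention, and the toy encoders used in the usage note are only stand-ins that make the ordering visible.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Encoder = Callable[[List[Vector]], List[Vector]]


class StackedTagger:
    """Chain encoders so that the output of each one is the input of the next."""

    def __init__(self, encoders: Sequence[Encoder], head: Callable[[Vector], str]):
        if not encoders:
            raise ValueError("at least one encoder is required")
        self.encoders = list(encoders)
        self.head = head  # hypothetical final step: one vector in, one tag out

    def tag(self, inputs: List[Vector]) -> List[str]:
        hidden = inputs  # the preprocessed text enters the first encoder
        for encoder in self.encoders:
            hidden = encoder(hidden)
        return [self.head(vector) for vector in hidden]
```

For example, with `add_one = lambda seq: [[x + 1 for x in v] for v in seq]` and `double = lambda seq: [[x * 2 for x in v] for v in seq]`, stacking `[add_one, double]` and stacking `[double, add_one]` give different hidden vectors for the same input, which is why the order of the stack matters.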
In an embodiment of the present invention, based on the above scheme, the determining unit 503 optionally includes:
an acquiring subunit, configured to acquire the text attribute information of the text set to be labeled;
and a determining subunit, configured to determine, based on the text attribute information, the corpus field to which the text to be processed in the text set to be labeled belongs.
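One way the corpus field could be derived from text attribute information is sketched below. The attribute format is not specified in this passage, so the dictionary keys `field` and `keywords`, the table `FIELD_KEYWORDS` and the keyword-overlap rule are all hypothetical choices made for the example.

```python
# Hypothetical keyword sets per corpus field; real systems would define their own.
FIELD_KEYWORDS = {
    "electric_network": {"transformer", "substation", "grid"},
    "general": set(),
}


def determine_corpus_field(attributes, default="general"):
    """Pick a corpus field from attribute information attached to a text set."""
    # An explicitly declared, known field takes precedence.
    explicit = attributes.get("field")
    if explicit in FIELD_KEYWORDS:
        return explicit
    # Otherwise choose the field whose keyword set overlaps the attributes most.
    keywords = {word.lower() for word in attributes.get("keywords", [])}
    best_field, best_hits = default, 0
    for field, field_words in FIELD_KEYWORDS.items():
        hits = len(keywords & field_words)
        if hits > best_hits:
            best_field, best_hits = field, hits
    return best_field
```

The returned field name would then be used to look up the part-of-speech labeling model that corresponds to that corpus field.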
The specific principles and implementation processes of the units and modules in the part-of-speech tagging apparatus for a text disclosed in the embodiment of the present invention are the same as those of the part-of-speech tagging method for a text disclosed in the embodiment of the present invention. Reference may be made to the corresponding parts of that method, which are not described herein again.
An embodiment of the present invention further provides a storage medium that includes stored instructions, where, when the instructions are run, the device in which the storage medium is located is controlled to execute the part-of-speech tagging method for a text described above.
An embodiment of the present invention further provides an electronic device, whose schematic structural diagram is shown in fig. 6. The electronic device specifically includes a memory 601 and one or more instructions 602, where the one or more instructions 602 are stored in the memory 601 and are configured to be executed by one or more processors 603 to perform the following operations:
when a text processing instruction is received, determining a task type corresponding to the text processing instruction;
under the condition that the task type is a part-of-speech tagging task, acquiring a text set to be tagged specified by the text processing instruction, wherein the text set to be tagged comprises at least one text to be tagged;
determining a corpus field to which a text to be processed in the text set to be labeled belongs, and determining a part-of-speech labeling model corresponding to the corpus field;
preprocessing each text to be labeled in the text set to be labeled;
and sequentially inputting each preprocessed text to be labeled into the part-of-speech labeling model to obtain a part-of-speech labeling result corresponding to each text to be labeled.
It should be noted that the embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the apparatus embodiment is basically similar to the method embodiment, its description is brief, and reference may be made to the relevant parts of the description of the method embodiment.
Finally, it should also be noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functions of the units may be implemented in the same software and/or hardware or in a plurality of software and/or hardware when implementing the invention.
From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software plus a necessary general-purpose hardware platform. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and which includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the method according to the embodiments or some parts of the embodiments.
The part-of-speech tagging method for a text provided by the present invention has been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the embodiments is only intended to help in understanding the method and its core idea. Meanwhile, a person skilled in the art may, following the idea of the present invention, vary the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.
Claims (10)
1. A part-of-speech tagging method for a text, comprising:
when a text processing instruction is received, determining a task type corresponding to the text processing instruction;
under the condition that the task type is a part-of-speech tagging task, acquiring a text set to be tagged specified by the text processing instruction, wherein the text set to be tagged comprises at least one text to be tagged;
determining a corpus field to which a text to be processed in the text set to be labeled belongs, and determining a part-of-speech labeling model corresponding to the corpus field;
preprocessing each text to be labeled in the text set to be labeled;
and sequentially inputting each preprocessed text to be labeled into the part-of-speech labeling model to obtain a part-of-speech labeling result corresponding to each text to be labeled.
2. The method according to claim 1, wherein the setting process of the part-of-speech tagging model comprises:
acquiring an initial part-of-speech tagging model and a sample data set of the corpus field; each sample data in the sample data set is text data which is labeled by part of speech in advance; the sample data set is divided into a training sample set, a verification sample set and a test sample set;
sequentially training the initial part-of-speech tagging model by applying each sample data in the training sample set, and verifying the trained initial part-of-speech tagging model based on the sample data in the verification sample set to obtain an alternative part-of-speech tagging model;
and testing the alternative part-of-speech tagging model by applying the test sample set to obtain a test result, judging the part-of-speech tagging accuracy of the alternative part-of-speech tagging model according to the test result, and if the part-of-speech tagging accuracy is greater than a preset accuracy threshold, taking the alternative part-of-speech tagging model as the part-of-speech tagging model corresponding to the corpus field.
3. The method according to claim 1, wherein the preprocessing each text to be labeled in the text set to be labeled comprises:
splitting each text to be labeled in the text set to be labeled to obtain each text block of each text to be labeled; each of the text blocks includes at least one character;
and for each text to be labeled, mapping each text block of the text to be labeled so as to complete the preprocessing of the text to be labeled.
4. The method of claim 1, wherein sequentially inputting each preprocessed text to be labeled into a preset part-of-speech labeling model to obtain a part-of-speech labeling result corresponding to each text to be labeled comprises:
inputting each preprocessed text to be labeled into a preset part-of-speech labeling model; the part-of-speech tagging model is obtained by sequentially stacking a plurality of encoders;
triggering each encoder in the part-of-speech tagging model to sequentially process the input preprocessed to-be-labeled texts to obtain part-of-speech tagging results corresponding to each text to be labeled, wherein each preprocessed to-be-labeled text is input to a first encoder in the part-of-speech tagging model, and the output of each encoder is used as the input of a next encoder.
5. The method according to claim 1, wherein the determining a corpus field to which the text to be processed in the text set to be labeled belongs comprises:
acquiring text attribute information of the text set to be labeled;
and determining the corpus field to which the text to be processed in the text set to be labeled belongs based on the text attribute information.
6. A part-of-speech tagging apparatus for text, comprising:
the receiving unit is used for determining a task type corresponding to a text processing instruction when the text processing instruction is received;
the acquiring unit is used for acquiring a text set to be labeled specified by the text processing instruction under the condition that the task type is a part-of-speech labeling task, wherein the text set to be labeled comprises at least one text to be labeled;
the determining unit is used for determining the corpus field to which the text to be processed in the text set to be labeled belongs and determining a part-of-speech labeling model corresponding to the corpus field;
the preprocessing unit is used for preprocessing each text to be labeled in the text set to be labeled;
and the labeling unit is used for sequentially inputting each preprocessed text to be labeled into a preset part-of-speech labeling model to obtain a part-of-speech labeling result corresponding to each text to be labeled.
7. The apparatus of claim 6, further comprising: a model setting unit; the model setting unit is used for acquiring an initial part-of-speech tagging model and a sample data set of the corpus field; each sample data in the sample data set is text data which is labeled by part of speech in advance; the sample data set is divided into a training sample set, a verification sample set and a test sample set; sequentially training the initial part-of-speech tagging model by applying each sample data in the training sample set, and verifying the trained initial part-of-speech tagging model based on the sample data in the verification sample set to obtain an alternative part-of-speech tagging model; and testing the alternative part-of-speech tagging model by applying the test sample set to obtain a test result, judging the part-of-speech tagging accuracy of the alternative part-of-speech tagging model according to the test result, and if the part-of-speech tagging accuracy is greater than a preset accuracy threshold, taking the alternative part-of-speech tagging model as the part-of-speech tagging model corresponding to the corpus field.
8. The apparatus of claim 6, wherein the pre-processing unit comprises:
the splitting subunit is used for splitting each text to be labeled in the text set to be labeled to obtain each text block of each text to be labeled; each of the text blocks includes at least one character;
and the mapping subunit is used for mapping each text block of each text to be labeled so as to complete the preprocessing of the text to be labeled.
9. The apparatus of claim 6, wherein the labeling unit comprises:
the input subunit is used for inputting each preprocessed text to be labeled into a preset part-of-speech labeling model; the part-of-speech tagging model is obtained by sequentially stacking a plurality of encoders;
and the triggering subunit is used for triggering each encoder in the part-of-speech tagging model to sequentially process the input preprocessed to-be-labeled texts to obtain part-of-speech tagging results corresponding to each text to be labeled, wherein each preprocessed to-be-labeled text is input to a first encoder in the part-of-speech tagging model, and the output of each encoder is used as the input of a next encoder.
10. The apparatus of claim 6, wherein the determining unit comprises:
the acquiring subunit is used for acquiring the text attribute information of the text set to be labeled;
and the determining subunit is used for determining the corpus field to which the text to be processed in the text set to be labeled belongs based on the text attribute information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011063051.0A CN112131873A (en) | 2020-09-30 | 2020-09-30 | Part-of-speech tagging method and device for text |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011063051.0A CN112131873A (en) | 2020-09-30 | 2020-09-30 | Part-of-speech tagging method and device for text |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112131873A (en) | 2020-12-25 |
Family
ID=73843616
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011063051.0A Pending CN112131873A (en) | 2020-09-30 | 2020-09-30 | Part-of-speech tagging method and device for text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112131873A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107844476A (en) * | 2017-10-19 | 2018-03-27 | 广州索答信息科技有限公司 | A kind of part-of-speech tagging method of enhancing |
CN108170674A (en) * | 2017-12-27 | 2018-06-15 | 东软集团股份有限公司 | Part-of-speech tagging method and apparatus, program product and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107491534B (en) | Information processing method and device | |
CN110444198B (en) | Retrieval method, retrieval device, computer equipment and storage medium | |
CN109284399B (en) | Similarity prediction model training method and device and computer readable storage medium | |
CN107437417B (en) | Voice data enhancement method and device based on recurrent neural network voice recognition | |
CN111274815A (en) | Method and device for mining entity attention points in text | |
CN110019742B (en) | Method and device for processing information | |
JP2020030408A (en) | Method, apparatus, device and medium for identifying key phrase in audio | |
US20220358292A1 (en) | Method and apparatus for recognizing entity, electronic device and storage medium | |
CN111177350A (en) | Method, device and system for forming dialect of intelligent voice robot | |
CN113239204A (en) | Text classification method and device, electronic equipment and computer-readable storage medium | |
WO2023045186A1 (en) | Intention recognition method and apparatus, and electronic device and storage medium | |
CN113051895A (en) | Method, apparatus, electronic device, medium, and program product for speech recognition | |
CN110807097A (en) | Method and device for analyzing data | |
CN116701604A (en) | Question and answer corpus construction method and device, question and answer method, equipment and medium | |
US20230052906A1 (en) | Entity Recognition Method and Apparatus, and Computer Program Product | |
CN114118049B (en) | Information acquisition method, device, electronic equipment and storage medium | |
CN114528851B (en) | Reply sentence determination method, reply sentence determination device, electronic equipment and storage medium | |
CN113779202B (en) | Named entity recognition method and device, computer equipment and storage medium | |
CN112131873A (en) | Part-of-speech tagging method and device for text | |
CN112925889B (en) | Natural language processing method, device, electronic equipment and storage medium | |
CN112732423B (en) | Process migration method, device, equipment and medium | |
CN112101003B (en) | Sentence text segmentation method, device and equipment and computer readable storage medium | |
CN114691716A (en) | SQL statement conversion method, device, equipment and computer readable storage medium | |
CN113806230A (en) | Software testing method, device, equipment and medium based on case voice | |
CN110083807B (en) | Contract modification influence automatic prediction method, device, medium and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||