CN112131873A - Part-of-speech tagging method and device for text - Google Patents


Publication number
CN112131873A
Authority
CN
China
Prior art keywords
text
labeled
speech tagging
speech
model
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011063051.0A
Other languages
Chinese (zh)
Inventor
陈政波
卢文达
周洋
王剑
冯烛明
冯珺
包迅格
靖稳峰
孙嘉伟
刘宏
胡辉
苟蛟龙
郭刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Meritdata Technology Co ltd
Xian Jiaotong University
Information and Telecommunication Branch of State Grid Zhejiang Electric Power Co Ltd
Original Assignee
Meritdata Technology Co ltd
Xian Jiaotong University
Information and Telecommunication Branch of State Grid Zhejiang Electric Power Co Ltd
Application filed by Meritdata Technology Co ltd, Xian Jiaotong University, Information and Telecommunication Branch of State Grid Zhejiang Electric Power Co Ltd
Priority to CN202011063051.0A
Publication of CN112131873A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a part-of-speech tagging method and device for text. The method comprises: when a text processing instruction is received, determining the task type corresponding to the text processing instruction; if the task type is a part-of-speech tagging task, acquiring the text set to be labeled specified by the text processing instruction, wherein the text set to be labeled comprises at least one text to be labeled; determining the corpus field to which a text to be processed in the text set to be labeled belongs, and determining the preset part-of-speech tagging model corresponding to that corpus field; preprocessing each text to be labeled in the text set to be labeled; and sequentially inputting each preprocessed text to be labeled into the part-of-speech tagging model to obtain the part-of-speech tagging result corresponding to each text to be labeled. Because each text is tagged by the part-of-speech tagging model corresponding to the corpus field to which it belongs, the accuracy of part-of-speech tagging is improved.

Description

Part-of-speech tagging method and device for text
Technical Field
The invention relates to the technical field of computers, in particular to a part-of-speech tagging method and device for a text.
Background
Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between people and computers using natural language. When executing complex natural language tasks, for example in the field of intelligent question answering, part-of-speech tagging of text is often required.
Part-of-speech tagging refers to attaching a part-of-speech tag to each element of a text to be processed. In the prior art, text features are usually formulated manually, and tags are assigned by identifying those features in the text. Manually customized text features easily describe the text inaccurately, so the accuracy of part-of-speech tagging is low.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a part-of-speech tagging method for a text, which can improve the accuracy of part-of-speech tagging.
The invention also provides a part-of-speech tagging device of the text, which is used for ensuring the realization and the application of the method in practice.
A part-of-speech tagging method for text comprises the following steps:
when a text processing instruction is received, determining a task type corresponding to the text processing instruction;
under the condition that the task type is a part-of-speech tagging task, acquiring a text set to be tagged specified by the text processing instruction, wherein the text set to be tagged comprises at least one text to be tagged;
determining a corpus field to which a text to be processed in the text set to be labeled belongs, and determining a part-of-speech labeling model corresponding to the corpus field;
preprocessing each text to be labeled in the text set to be labeled;
and sequentially inputting each preprocessed text to be labeled into the part-of-speech labeling model to obtain a part-of-speech labeling result corresponding to each text to be labeled.
Optionally, the method includes a process of setting the part-of-speech tagging model, including:
acquiring an initial part-of-speech tagging model and a sample data set of the corpus field; each sample data in the sample data set is text data which is labeled by part of speech in advance; the sample data set is divided into a training sample set, a verification sample set and a test sample set;
sequentially training the initial part of speech tagging model by applying each sample data in the training sample set, and verifying the trained initial part of speech tagging model based on the sample data in the verification sample set to obtain an alternative part of speech tagging model;
and testing the alternative part-of-speech tagging model by applying the test sample set to obtain a test result, judging the part-of-speech tagging accuracy of the alternative part-of-speech tagging model according to the test result, and if the part-of-speech tagging accuracy is greater than a preset accuracy threshold, taking the alternative part-of-speech tagging model as the part-of-speech tagging model corresponding to the corpus field.
Optionally, in the method, the preprocessing is performed on each text to be labeled in the text set to be labeled, and includes:
splitting each text to be labeled in the text set to be labeled to obtain each text block of each text to be labeled; each of the text blocks includes at least one character;
and for each text to be labeled, mapping each text block of the text to be labeled so as to complete the preprocessing of the text to be labeled.
Optionally, in the method, the sequentially inputting each preprocessed text to be labeled into a preset part-of-speech labeling model to obtain a part-of-speech labeling result corresponding to each text to be labeled includes:
inputting each preprocessed text to be labeled into a preset part-of-speech labeling model; the part-of-speech tagging model is obtained by sequentially stacking a plurality of encoders;
triggering each encoder in the part-of-speech tagging model to sequentially process the input preprocessed to-be-tagged texts to obtain part-of-speech tagging results corresponding to each to-be-processed text, wherein each preprocessed to-be-tagged text is input to a first encoder in the part-of-speech tagging model, and output of each encoder is used as input of a next encoder.
Optionally, the determining a corpus field to which a text to be processed in the text set to be labeled belongs includes:
acquiring text attribute information of the text set to be labeled;
and determining the corpus field to which the text to be processed in the text set to be labeled belongs based on the text attribute information.
A part-of-speech tagging apparatus for text, comprising:
the receiving unit is used for determining a task type corresponding to a text processing instruction when the text processing instruction is received;
the acquiring unit is used for acquiring a text set to be labeled specified by the text processing instruction under the condition that the task type is a part-of-speech labeling task, wherein the text set to be labeled comprises at least one text to be labeled;
the determining unit is used for determining the corpus field to which the text to be processed in the text set to be labeled belongs and determining a part-of-speech labeling model corresponding to the corpus field;
the preprocessing unit is used for preprocessing each text to be labeled in the text set to be labeled;
and the labeling unit is used for sequentially inputting each preprocessed text to be labeled into a preset part-of-speech labeling model to obtain a part-of-speech labeling result corresponding to each text to be labeled.
The above apparatus, optionally, further comprises: a model setting unit; the model setting unit is used for acquiring an initial part-of-speech tagging model and a sample data set of the corpus field; each sample data in the sample data set is text data which is labeled by part of speech in advance; the sample data set is divided into a training sample set, a verification sample set and a test sample set; sequentially training the initial part of speech tagging model by applying each sample data in the training sample set, and verifying the trained initial part of speech tagging model based on the sample data in the verification sample set to obtain an alternative part of speech tagging model; and testing the alternative part-of-speech tagging model by applying the test sample set to obtain a test result, judging the part-of-speech tagging accuracy of the alternative part-of-speech tagging model according to the test result, and if the part-of-speech tagging accuracy is greater than a preset accuracy threshold, taking the alternative part-of-speech tagging model as the part-of-speech tagging model corresponding to the corpus field.
The above apparatus, optionally, the preprocessing unit includes:
the splitting subunit is used for splitting each text to be labeled in the text set to be labeled to obtain each text block of each text to be labeled; each of the text blocks includes at least one character;
and the mapping subunit is used for mapping each text block of each text to be labeled so as to complete the preprocessing of the text to be labeled.
The above apparatus, optionally, the labeling unit includes:
the input subunit is used for inputting each preprocessed text to be labeled into a preset part-of-speech labeling model; the part-of-speech tagging model is obtained by sequentially stacking a plurality of encoders;
and the triggering subunit is used for triggering each encoder in the part-of-speech tagging model to sequentially process the input preprocessed to-be-tagged texts to obtain part-of-speech tagging results corresponding to each to-be-processed text, wherein each preprocessed to-be-tagged text is input to a first encoder in the part-of-speech tagging model, and the output of each encoder is used as the input of a next encoder.
The above apparatus, optionally, the determining unit includes:
the acquiring subunit is used for acquiring the text attribute information of the text set to be labeled;
and the determining subunit is used for determining the corpus field to which the text to be processed in the text set to be labeled belongs based on the text attribute information.
Compared with the prior art, the invention has the following advantages:
The invention provides a part-of-speech tagging method and device for text. The method comprises: when a text processing instruction is received, determining the task type corresponding to the text processing instruction; if the task type is a part-of-speech tagging task, acquiring the text set to be labeled specified by the text processing instruction, wherein the text set to be labeled comprises at least one text to be labeled; determining the corpus field to which a text to be processed in the text set to be labeled belongs, and determining the part-of-speech tagging model corresponding to that corpus field; preprocessing each text to be labeled in the text set to be labeled; and sequentially inputting each preprocessed text to be labeled into the part-of-speech tagging model to obtain the part-of-speech tagging result corresponding to each text to be labeled. Because each text is tagged by the part-of-speech tagging model corresponding to the corpus field to which it belongs, the accuracy of part-of-speech tagging is improved.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description are only embodiments of the present invention; those skilled in the art can obtain other drawings from the provided drawings without creative effort.
FIG. 1 is a flowchart of a method for part-of-speech tagging of a text according to the present invention;
FIG. 2 is a flowchart of a part-of-speech tagging model setting process provided by the present invention;
FIG. 3 is a flowchart illustrating a process of processing a preprocessed text to be labeled according to the present invention;
FIG. 4 is an exemplary diagram of an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a part-of-speech tagging apparatus for text according to the present invention;
FIG. 6 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention is operational with numerous general purpose or special purpose computing device environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multi-processor apparatus, distributed computing environments that include any of the above devices or equipment, and the like.
The embodiment of the invention provides a part-of-speech tagging method for text. The method can be applied to various system platforms, and its execution subject may be a computer terminal or a processor of various mobile devices. A flowchart of the method is shown in FIG. 1, and the method specifically comprises the following steps:
S101: when a text processing instruction is received, determining a task type corresponding to the text processing instruction.
In the method provided by the embodiment of the present invention, the task type may be any of various natural language processing task types, for example a text classification task, an entity recognition task, a part-of-speech tagging task, or a sentence relation judgment task; different tasks correspond to different task processing operations.
S102: when the task type is a part-of-speech tagging task, acquiring a text set to be labeled specified by the text processing instruction, wherein the text set to be labeled comprises at least one text to be labeled.
In the method provided by the embodiment of the present invention, the text set to be labeled includes one or more texts to be labeled.
The instruction information of the text processing instruction may include the text set to be labeled or a set identifier of the text set to be labeled, and storage address information of the text set to be labeled; the text set to be annotated specified by the text processing instruction can be obtained based on the set identifier or the storage address information.
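The acquisition described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the instruction is modeled as a dictionary, and the key names `texts`, `set_id` and `storage_address`, the in-memory `store`, and the function `resolve_text_set` are all hypothetical.

```python
from typing import Dict, List


def resolve_text_set(instruction: dict, store: Dict[str, List[str]]) -> List[str]:
    """Return the texts to be labeled named by a text processing instruction."""
    if "texts" in instruction:
        # Case 1: the instruction carries the text set itself.
        texts = list(instruction["texts"])
    else:
        # Case 2: look the set up by its identifier or its storage address.
        key = instruction.get("set_id") or instruction.get("storage_address")
        if key is None or key not in store:
            raise KeyError("text set to be labeled not found")
        texts = list(store[key])
    if not texts:
        # The set must contain at least one text to be labeled.
        raise ValueError("the text set must contain at least one text")
    return texts


# Hypothetical storage keyed by set identifier.
store = {"set-001": ["Zhang San pays the electricity fee"]}
print(resolve_text_set({"set_id": "set-001"}, store))
```

A real deployment would replace `store` with whatever storage the set identifier or storage address actually refers to.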
The text to be labeled can be texts in various languages, such as a chinese text, an english text, and the like.
S103: determining a corpus field to which a text to be processed in the text set to be labeled belongs, and determining a preset part-of-speech tagging model corresponding to the corpus field.
In the method provided by the embodiment of the present invention, each to-be-processed text in the to-be-labeled text set has its corresponding corpus field, and the corpus fields to which each to-be-processed text belongs may be the same or different.
The corpus field may be the corpus field of any industry, such as the power grid corpus field, the e-commerce corpus field, or the communication corpus field.
Specifically, the part-of-speech tagging model corresponding to a corpus field may be selected from a set of pre-established part-of-speech tagging models; if the texts to be processed belong to different corpus fields, the part-of-speech tagging model corresponding to each of those corpus fields is determined separately.
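The selection of a model per corpus field can be illustrated with a small lookup. The sketch below assumes that the corpus field of each text is already known; the registry `MODEL_REGISTRY`, the field names, and the function `assign_models` are hypothetical, and plain strings stand in for model objects.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical registry of pre-established models, keyed by corpus field.
MODEL_REGISTRY: Dict[str, str] = {
    "power_grid": "pos-model-power-grid",
    "e_commerce": "pos-model-e-commerce",
}


def assign_models(texts_with_fields: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group texts by the model that corresponds to their corpus field."""
    batches: Dict[str, List[str]] = defaultdict(list)
    for text, field in texts_with_fields:
        if field not in MODEL_REGISTRY:
            raise KeyError(f"no part-of-speech tagging model for corpus field {field!r}")
        batches[MODEL_REGISTRY[field]].append(text)
    return dict(batches)


batches = assign_models([
    ("text A", "power_grid"),
    ("text B", "e_commerce"),
    ("text C", "power_grid"),
])
print(batches)
```

Grouping by model keeps texts from different corpus fields apart, so each text is later passed only to the model of its own field.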
S104: preprocessing each text to be labeled in the text set to be labeled.
In the method provided by the embodiment of the invention, each text to be labeled is preprocessed so that the preprocessed text matches the input interface of the part-of-speech tagging model.
The preprocessed text to be labeled may be in vector form.
S105: sequentially inputting each preprocessed text to be labeled into the preset part-of-speech tagging model to obtain a part-of-speech tagging result corresponding to each text to be labeled.
In the method provided by the embodiment of the present invention, the part-of-speech tagging model may be various natural language processing models, for example, a BERT model.
In the embodiment of the present invention, part-of-speech tagging models corresponding to different corpus fields are different, and under the condition that the corpus fields of the texts to be tagged are different, the texts to be tagged can be sequentially input into the part-of-speech tagging models corresponding to the respective corpus fields to which the texts to be tagged belong, so as to obtain part-of-speech tagging results corresponding to each text to be tagged.
Specifically, in the part-of-speech tagging result each element of the text to be processed carries part-of-speech information. For example, if the text to be processed is "Zhang San pays the electricity fee", the part-of-speech tagging result may be "Zhang San n, pays v, electricity fee n", where n represents a noun and v represents a verb.
The embodiment of the invention provides a part-of-speech tagging method for a text, which comprises the following steps: when a text processing instruction is received, determining a task type corresponding to the text processing instruction; under the condition that the task type is a part-of-speech tagging task, acquiring a text set to be tagged specified by the text processing instruction, wherein the text set to be tagged comprises at least one text to be tagged; determining a corpus field to which a text to be processed in the text set to be labeled belongs, and determining a part-of-speech labeling model corresponding to the corpus field; preprocessing each text to be labeled in the text set to be labeled; and sequentially inputting each preprocessed text to be labeled into the part-of-speech labeling model to obtain a part-of-speech labeling result corresponding to each text to be labeled. By applying the method provided by the embodiment of the invention, part-of-speech tagging can be performed by applying the part-of-speech tagging model corresponding to the corpus field to which the text to be tagged belongs, so that the accuracy of part-of-speech tagging is improved.
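Steps S101 to S105 can be sketched end to end as follows. This is an illustration under stated assumptions rather than the claimed implementation: the instruction is a dictionary that already carries the corpus field, whitespace splitting stands in for preprocessing, and `toy_grid_tagger` is a lookup table standing in for a trained model. All names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Tagger = Callable[[List[str]], List[str]]


def toy_grid_tagger(blocks: List[str]) -> List[str]:
    # Stand-in for a trained model: tags known verbs as "v", everything else as "n".
    lexicon = {"pays": "v"}
    return [lexicon.get(block, "n") for block in blocks]


# Hypothetical mapping from corpus field to its tagging model.
MODELS: Dict[str, Tagger] = {"power_grid": toy_grid_tagger}


def preprocess(text: str) -> List[str]:
    # S104 placeholder: split on whitespace.
    return text.split()


def handle_instruction(instruction: dict) -> List[List[Tuple[str, str]]]:
    # S101: determine the task type.
    if instruction.get("task_type") != "pos_tagging":
        raise ValueError("not a part-of-speech tagging task")
    # S102: acquire the text set to be labeled.
    texts = instruction["texts"]
    # S103: determine the corpus field and the corresponding model.
    model = MODELS[instruction["corpus_field"]]
    results = []
    for text in texts:
        blocks = preprocess(text)                # S104
        tags = model(blocks)                     # S105
        results.append(list(zip(blocks, tags)))
    return results


result = handle_instruction({
    "task_type": "pos_tagging",
    "corpus_field": "power_grid",
    "texts": ["Zhang pays fee"],
})
print(result)
```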
In the method provided in the embodiment of the present invention, based on the above implementation process, the setting process of the part-of-speech tagging model, as shown in FIG. 2, specifically includes:
S201: acquiring an initial part-of-speech tagging model and a sample data set of the corpus field; each sample data in the sample data set is text data which is labeled by part of speech in advance; the sample data set is divided into a training sample set, a verification sample set and a test sample set.
In the method provided by the embodiment of the invention, part-of-speech tagging can be performed on the text data by using the part-of-speech tagging tool, and a technician reviews tagging results of the part-of-speech tagging tool and modifies tagging results which do not meet requirements.
Each sample data in the sample data set can be divided into a training sample set, a verification sample set and a test sample set according to a preset proportion.
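A division by preset proportion might look like the sketch below. The 8:1:1 ratio, the shuffling, the fixed seed, and the function name `split_samples` are assumptions made for illustration; the description does not specify them.

```python
import random
from typing import List, Sequence, Tuple


def split_samples(
    samples: Sequence,
    ratios: Tuple[float, float, float] = (0.8, 0.1, 0.1),
    seed: int = 0,
) -> Tuple[List, List, List]:
    """Shuffle the samples and split them into train, verification and test sets."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # the remainder goes to the test set
    return train, val, test


train, val, test = split_samples(range(10))
print(len(train), len(val), len(test))
```

Every sample lands in exactly one of the three sets, so the test set remains unseen during training and verification.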
S202: sequentially training the initial part-of-speech tagging model with each sample data in the training sample set, and verifying the trained initial part-of-speech tagging model based on the sample data in the verification sample set to obtain an alternative part-of-speech tagging model.
The trained initial part-of-speech tagging model is verified to determine whether its model parameters meet the requirements; if they do, the trained initial part-of-speech tagging model is taken as the alternative part-of-speech tagging model.
S203: testing the alternative part-of-speech tagging model with the test sample set to obtain a test result, determining the part-of-speech tagging accuracy of the alternative part-of-speech tagging model according to the test result, and if the part-of-speech tagging accuracy is greater than a preset accuracy threshold, taking the alternative part-of-speech tagging model as the part-of-speech tagging model corresponding to the corpus field.
If the part-of-speech tagging accuracy is not greater than the accuracy threshold, the parameters of the alternative part-of-speech tagging model are adjusted until its part-of-speech tagging accuracy is greater than the accuracy threshold.
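The acceptance test in S203 can be sketched as below. The description does not say how accuracy is computed; this sketch assumes token-level accuracy, that is, the fraction of elements whose predicted tag equals the pre-labeled tag. The function names and the threshold values are hypothetical.

```python
from typing import List, Sequence


def tagging_accuracy(predicted: Sequence[List[str]], gold: Sequence[List[str]]) -> float:
    """Fraction of elements whose predicted tag matches the pre-labeled tag."""
    correct = 0
    total = 0
    for pred_tags, gold_tags in zip(predicted, gold):
        if len(pred_tags) != len(gold_tags):
            raise ValueError("predicted and gold sequences differ in length")
        for p, g in zip(pred_tags, gold_tags):
            correct += int(p == g)
            total += 1
    return correct / total if total else 0.0


def accept_model(accuracy: float, threshold: float) -> bool:
    # The model is accepted only when accuracy is strictly greater than the threshold.
    return accuracy > threshold


predicted = [["n", "v", "n"], ["n", "n"]]
gold = [["n", "v", "n"], ["v", "n"]]
accuracy = tagging_accuracy(predicted, gold)
print(accuracy, accept_model(accuracy, threshold=0.75))
```

When `accept_model` returns False, the parameters would be adjusted and the test repeated, as described above.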
In the method provided in the embodiment of the present invention, based on the implementation process, specifically, the preprocessing each text to be labeled in the text set to be labeled includes:
splitting each text to be labeled in the text set to be labeled to obtain each text block of each text to be labeled; each of the text blocks includes at least one character;
and for each text to be labeled, mapping each text block of the text to be labeled so as to complete the preprocessing of the text to be labeled.
Each text to be labeled in the text set to be labeled can be split by a preset word segmentation method to obtain the text blocks of each text to be labeled; a text block may be a single character or a word.
In the method provided by the embodiment of the present invention, the text blocks of a text to be processed may be mapped to obtain a dictionary index set corresponding to that text; the dictionary index set contains the dictionary index of each text block and may be used as the input of the model. If the length of the text to be processed is greater than a preset length threshold, text blocks may be randomly removed from the head and/or the tail of the text so that its length equals the length threshold; if the length is smaller than the length threshold, the text may be padded so that the padded length equals the length threshold.
Specifically, each text block may be converted into a dictionary index (i.e., an id) by means of the vocabulary (vocab) that comes with the pre-trained BERT model, and the indices are used as the model input.
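The mapping and length adjustment can be sketched as follows. The tiny vocabulary, the function name `to_model_input`, and the way the random trim is drawn are assumptions for illustration; a real system would use the vocabulary file of the chosen pre-trained model. `[PAD]` and `[UNK]` follow the special-token naming used by BERT vocabularies.

```python
import random
from typing import Dict, List, Optional

PAD, UNK = "[PAD]", "[UNK]"


def to_model_input(
    blocks: List[str],
    vocab: Dict[str, int],
    max_len: int,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Map text blocks to dictionary indices, then trim or pad to max_len."""
    if rng is None:
        rng = random.Random()
    ids = [vocab.get(block, vocab[UNK]) for block in blocks]
    excess = len(ids) - max_len
    if excess > 0:
        # Drop a random number of blocks from the head; the rest comes off the tail.
        head = rng.randint(0, excess)
        ids = ids[head:head + max_len]
    else:
        # Pad short inputs up to the length threshold.
        ids = ids + [vocab[PAD]] * (-excess)
    return ids


# Hypothetical toy vocabulary.
vocab = {PAD: 0, UNK: 1, "电": 2, "费": 3}
padded = to_model_input(["电", "费"], vocab, max_len=4)
trimmed = to_model_input(list("电费电费电费"), vocab, max_len=4)
print(padded, trimmed)
```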
In the method provided in the embodiment of the present invention, based on the foregoing implementation process, the processing of each preprocessed text to be labeled by the preset part-of-speech tagging model, as shown in FIG. 3, specifically includes:
S301: inputting each preprocessed text to be labeled into the preset part-of-speech tagging model, wherein the part-of-speech tagging model is obtained by sequentially stacking a plurality of encoders.
In the method provided by the embodiment of the present invention, the encoder may be a bidirectional Transformer encoder.
S302: triggering each encoder in the part-of-speech tagging model to sequentially process the input preprocessed text to be labeled to obtain the part-of-speech tagging result corresponding to each text to be processed, wherein each preprocessed text to be labeled is the input of the first encoder in the part-of-speech tagging model, and the output of each encoder is used as the input of the next encoder.
In the method provided by the embodiment of the present invention, the output of the last encoder in the part-of-speech tagging model is the part-of-speech tagging result corresponding to the text to be processed, and the next preprocessed text to be processed can be input to the part-of-speech tagging model under the condition that the part-of-speech result corresponding to the text to be processed currently input to the part-of-speech tagging model is obtained.
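The data flow through the stacked encoders can be illustrated with a simple chain. The layers below are toy functions, not Transformer encoders; they only show that the first layer receives the preprocessed input and that each layer's output becomes the next layer's input. The names `run_stack`, `add_one` and `double` are hypothetical.

```python
from functools import reduce
from typing import Callable, List, Sequence

Layer = Callable[[List[float]], List[float]]


def run_stack(layers: Sequence[Layer], model_input: List[float]) -> List[float]:
    """Feed the input to the first layer and chain each output into the next layer."""
    return reduce(lambda hidden, layer: layer(hidden), layers, model_input)


def add_one(hidden: List[float]) -> List[float]:
    # Toy stand-in for an encoder layer.
    return [value + 1.0 for value in hidden]


def double(hidden: List[float]) -> List[float]:
    # Toy stand-in for an encoder layer.
    return [value * 2.0 for value in hidden]


output = run_stack([add_one, double], [1.0, 2.0])
print(output)  # [4.0, 6.0]
```

Because the layers are applied in order, swapping them changes the result, which mirrors the fixed order of the stacked encoders.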
In the method provided by the embodiment of the present invention, based on the above implementation process, the determining of the corpus field to which a text to be processed in the text set to be labeled belongs specifically includes:
acquiring text attribute information of the text set to be labeled;
and determining, based on the text attribute information, the corpus field to which the text to be processed in the text set to be labeled belongs.
In the method provided by the embodiment of the present invention, the corpus field to which the text to be processed in the text set to be labeled belongs may be looked up in a preset configuration file based on the text attribute information; the configuration file records the correspondence between text attribute information and corpus fields.
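A configuration lookup of this kind might be sketched as follows. The description does not define the configuration format or which attributes are used; the JSON layout, the attribute name `source_system`, its values, and the function names are hypothetical.

```python
import json
from typing import Dict, Optional

# Hypothetical configuration content: attribute name -> attribute value -> corpus field.
CONFIG_TEXT = """
{
  "source_system": {
    "grid_dispatch": "power_grid",
    "online_shop": "e_commerce"
  }
}
"""


def load_field_map(config_text: str) -> Dict[str, Dict[str, str]]:
    return json.loads(config_text)


def corpus_field_for(
    attributes: Dict[str, str],
    field_map: Dict[str, Dict[str, str]],
) -> Optional[str]:
    """Return the corpus field recorded for the given text attributes, if any."""
    for attr_name, value_to_field in field_map.items():
        value = attributes.get(attr_name)
        if value in value_to_field:
            return value_to_field[value]
    return None


field_map = load_field_map(CONFIG_TEXT)
print(corpus_field_for({"source_system": "grid_dispatch"}, field_map))
```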
Referring to fig. 4, an exemplary diagram of an implementation scenario provided by the present invention is shown, where the implementation scenario provided by the embodiment of the present invention includes a server 401 and a terminal 402.
In practice, the terminal 402 shown in fig. 4 may be an electronic device such as a mobile phone, a tablet computer, a personal computer, and the like. The server 401 may be one server, a server cluster composed of a plurality of servers, or a cloud computing service center. The server 401 and the terminal 402 establish a communication connection through a network.
Before the embodiment of the present invention is implemented, preparation work may be performed. The preparation work includes: A. corpus labeling in the power grid field; B. loading the BERT model and fine-tuning it; C. model deployment and application.
The corpus labeling in the power grid field may proceed as follows: collecting and sorting text corpora related to the power grid field; and manually annotating the text with word segmentation and part-of-speech labels using a labeling tool.
The process of loading the BERT model and fine-tuning may be: loading the open-source BERT model; processing the labeled corpus so that it fits the input interface of the model; dividing the data into a training set, a verification set and a test set; performing fine-tuning training on a CPU/GPU; and storing the final model in a structured form.
The process of model deployment and application may be: deploying the model and opening the relevant interface for users to call.
Specifically, the model may be deployed on the server 401, and the terminal 402 may send a text processing instruction to the server 401 by calling the interface. When the server 401 receives the text processing instruction, it determines the task type corresponding to the text processing instruction; if the task type is a part-of-speech tagging task, it acquires the text set to be labeled specified by the text processing instruction; it determines the corpus field to which the text to be processed in the text set to be labeled belongs and, if the corpus field is the power grid corpus field, takes the deployed model as the part-of-speech tagging model corresponding to the text processing instruction; it preprocesses each text to be labeled in the text set to be labeled; and it sequentially inputs each preprocessed text to be labeled into the part-of-speech tagging model to obtain a part-of-speech tagging result corresponding to each text to be processed.
The network involved in the embodiments of the present invention is a medium that provides communication links and may include various connection types, such as wired links, wireless communication links, or fiber optic cables.
Corresponding to the method described in fig. 1, an embodiment of the present invention further provides a part-of-speech tagging apparatus for a text, which is used for specifically implementing the method in fig. 1, and the part-of-speech tagging apparatus for a text provided in the embodiment of the present invention may be applied to a computer terminal or various mobile devices, and a schematic structural diagram of the apparatus is shown in fig. 5, and specifically includes:
a receiving unit 501, configured to determine a task type corresponding to a text processing instruction when the text processing instruction is received;
an obtaining unit 502, configured to obtain a to-be-labeled text set specified by the text processing instruction when the task type is a part-of-speech labeling task, where the to-be-labeled text set includes at least one to-be-labeled text;
a determining unit 503, configured to determine a corpus field to which a text to be processed in the text set to be labeled belongs, and determine a part-of-speech labeling model corresponding to the corpus field;
a preprocessing unit 504, configured to preprocess each text to be labeled in the text set to be labeled;
a labeling unit 505, configured to sequentially input each preprocessed text to be labeled into a preset part-of-speech labeling model, so as to obtain a part-of-speech labeling result corresponding to each text to be labeled.
The part-of-speech tagging apparatus for text provided by the embodiment of the invention determines, when a text processing instruction is received, the task type corresponding to the text processing instruction. If the task type is a part-of-speech tagging task, it acquires the text set to be labeled specified by the text processing instruction, the set comprising at least one text to be labeled. It then determines the corpus field to which the text to be processed in the text set to be labeled belongs and the part-of-speech labeling model corresponding to that corpus field, preprocesses each text to be labeled in the set, and sequentially inputs each preprocessed text to be labeled into the part-of-speech labeling model to obtain a part-of-speech labeling result corresponding to each text to be labeled. Because the apparatus applies the part-of-speech labeling model that corresponds to the corpus field of the text to be labeled, the accuracy of part-of-speech tagging is improved.
In an embodiment of the present invention, based on the above scheme, the part-of-speech tagging apparatus for text optionally further includes a model setting unit. The model setting unit is used for: acquiring an initial part-of-speech tagging model and a sample data set of the corpus field, where each sample data in the sample data set is text data that has been part-of-speech tagged in advance and the sample data set is divided into a training sample set, a verification sample set and a test sample set; sequentially training the initial part-of-speech tagging model with each sample data in the training sample set, and verifying the trained model on the sample data in the verification sample set to obtain an alternative part-of-speech tagging model; and testing the alternative part-of-speech tagging model on the test sample set to obtain a test result, determining the part-of-speech tagging accuracy of the alternative model from the test result, and, if the accuracy is greater than a preset accuracy threshold, taking the alternative part-of-speech tagging model as the part-of-speech tagging model corresponding to the corpus field.
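The data split and the accuracy gate performed by the model setting unit can be illustrated with a short sketch. Training itself is omitted. The 80/10/10 split ratio, the per-block accuracy measure and all function names are assumptions made for illustration; the source states only that the sample set is divided into three subsets and that the accuracy must exceed a preset threshold.

```python
import random
from typing import Callable, List, Sequence, Tuple

# A sample is a pre-tagged text: a list of (text block, tag) pairs.
Sample = List[Tuple[str, str]]


def split_samples(samples: Sequence[Sample], seed: int = 0,
                  ratios: Tuple[float, float] = (0.8, 0.1)):
    """Shuffle and split into training, verification and test sets.

    The 80/10/10 ratio is an assumption; the source does not state one.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def tagging_accuracy(tagger: Callable[[List[str]], List[str]],
                     test_set: Sequence[Sample]) -> float:
    """Fraction of text blocks in the test set whose predicted tag is correct."""
    correct = total = 0
    for sample in test_set:
        blocks = [block for block, _ in sample]
        gold = [tag for _, tag in sample]
        predicted = tagger(blocks)
        correct += sum(p == g for p, g in zip(predicted, gold))
        total += len(gold)
    return correct / total if total else 0.0


def accept_model(tagger, test_set, threshold: float) -> bool:
    """Keep the candidate only if accuracy is strictly greater than the threshold."""
    return tagging_accuracy(tagger, test_set) > threshold
```

The comparison is strict (`>`), matching the wording "greater than a preset accuracy threshold".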
In an embodiment of the present invention, based on the above scheme, optionally, the preprocessing unit 504 includes:
a splitting subunit, configured to split each text to be labeled in the text set to be labeled to obtain the text blocks of each text to be labeled, each text block including at least one character;
a mapping subunit, configured to map each text block of each text to be labeled so as to complete the preprocessing of that text.
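The two preprocessing steps, splitting and mapping, can be sketched as below. This passage does not say what the text blocks are mapped to, so the sketch assumes single-character blocks mapped to integer IDs through a vocabulary table; `UNK_ID`, the vocabulary and the function names are hypothetical.

```python
from typing import Dict, List

UNK_ID = 0  # hypothetical ID for text blocks missing from the vocabulary


def split_into_blocks(text: str) -> List[str]:
    """Split a text into blocks of one character each, skipping whitespace."""
    return [ch for ch in text if not ch.isspace()]


def map_blocks(blocks: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map each text block to its integer ID, falling back to UNK_ID."""
    return [vocab.get(block, UNK_ID) for block in blocks]


def preprocess(text: str, vocab: Dict[str, int]) -> List[int]:
    """Split, then map: the two preprocessing steps in order."""
    return map_blocks(split_into_blocks(text), vocab)
```

For example, with a vocabulary `{"变": 1, "压": 2, "器": 3}`, calling `preprocess("变压器", vocab)` yields one ID per character.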
In an embodiment of the present invention, based on the above scheme, optionally, the labeling unit 505 includes:
an input subunit, configured to input each preprocessed text to be labeled into a preset part-of-speech labeling model, the part-of-speech tagging model being obtained by sequentially stacking a plurality of encoders;
a triggering subunit, configured to trigger each encoder in the part-of-speech tagging model to sequentially process the input preprocessed text to be labeled, so as to obtain the part-of-speech labeling result corresponding to each text to be labeled, where each preprocessed text to be labeled is input to the first encoder in the part-of-speech tagging model and the output of each encoder is used as the input of the next encoder.
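The chaining of stacked encoders, where the first encoder receives the preprocessed text and each later encoder receives the previous output, can be sketched as a simple loop. What each encoder computes is not specified in this passage, so the toy encoders below (a scale and shift of every feature value) are placeholders, and all names are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Encoder = Callable[[List[Vector]], List[Vector]]


def make_toy_encoder(scale: float, shift: float) -> Encoder:
    """Build a toy encoder that rescales and shifts every feature value."""
    def encode(sequence: List[Vector]) -> List[Vector]:
        return [[scale * value + shift for value in vector] for vector in sequence]
    return encode


def run_encoder_stack(encoders: Sequence[Encoder],
                      sequence: List[Vector]) -> List[Vector]:
    """Feed the input to the first encoder and chain each output to the next."""
    hidden = sequence
    for encoder in encoders:
        hidden = encoder(hidden)
    return hidden
```

A final step that turns the last encoder's output into part-of-speech tags is left out of this sketch.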
In an embodiment of the present invention, based on the above scheme, optionally, the determining unit 503 includes:
an acquiring subunit, configured to acquire the text attribute information of the text set to be labeled;
a determining subunit, configured to determine, based on the text attribute information, the corpus field to which the text to be processed in the text set to be labeled belongs.
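Determining the corpus field from text attribute information could look like the lookup below. This passage does not say what the attribute information contains, so the `field` and `source` attribute keys and the `SOURCE_TO_FIELD` table are purely hypothetical illustrations.

```python
from typing import Dict, Optional

# Hypothetical mapping from a source-system attribute value to a corpus field.
SOURCE_TO_FIELD: Dict[str, str] = {
    "grid_dispatch_log": "electric_network",
    "equipment_defect_report": "electric_network",
}


def corpus_field_from_attributes(attributes: Dict[str, str]) -> Optional[str]:
    """Return the corpus field for a text set, or None if it cannot be determined."""
    # An explicit field attribute, when present, takes priority.
    if "field" in attributes:
        return attributes["field"]
    # Otherwise fall back to the (hypothetical) source-system attribute.
    return SOURCE_TO_FIELD.get(attributes.get("source", ""))
```

The returned field name can then be used as the key for selecting the corresponding part-of-speech labeling model.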
The specific principles and implementation processes of the units and modules in the part-of-speech tagging apparatus for text disclosed in the embodiment of the present invention are the same as those of the part-of-speech tagging method for text disclosed in the embodiment of the present invention. Reference may be made to the corresponding parts of the method, which are not repeated here.
The embodiment of the invention also provides a storage medium comprising stored instructions, wherein, when the instructions run, the device on which the storage medium is located is controlled to execute the part-of-speech tagging method for text described above.
An embodiment of the present invention further provides an electronic device, whose structural diagram is shown in fig. 6. The electronic device specifically includes a memory 601 and one or more instructions 602, where the one or more instructions 602 are stored in the memory 601 and are configured to be executed by one or more processors 603 to perform the following operations:
when a text processing instruction is received, determining a task type corresponding to the text processing instruction;
under the condition that the task type is a part-of-speech tagging task, acquiring a text set to be tagged specified by the text processing instruction, wherein the text set to be tagged comprises at least one text to be tagged;
determining a corpus field to which a text to be processed in the text set to be labeled belongs, and determining a part-of-speech labeling model corresponding to the corpus field;
preprocessing each text to be labeled in the text set to be labeled;
and sequentially inputting each preprocessed text to be labeled into the part-of-speech labeling model to obtain a part-of-speech labeling result corresponding to each text to be labeled.
It should be noted that the embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the apparatus embodiment is basically similar to the method embodiment, it is described relatively briefly, and the relevant points can be found in the corresponding description of the method embodiment.
Finally, it should also be noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
For convenience of description, the above apparatus is described as being divided into various units by function, each described separately. Of course, when implementing the invention, the functions of the units may be implemented in one or more pieces of software and/or hardware.
From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and which includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the method according to the embodiments or some parts of the embodiments.
The part-of-speech tagging method for text provided by the invention has been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the embodiments is intended only to help in understanding the method and core idea of the invention. Meanwhile, a person skilled in the art may, following the idea of the invention, vary the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the invention.

Claims (10)

1. A part-of-speech tagging method for a text, comprising:
when a text processing instruction is received, determining a task type corresponding to the text processing instruction;
under the condition that the task type is a part-of-speech tagging task, acquiring a text set to be tagged specified by the text processing instruction, wherein the text set to be tagged comprises at least one text to be tagged;
determining a corpus field to which a text to be processed in the text set to be labeled belongs, and determining a part-of-speech labeling model corresponding to the corpus field;
preprocessing each text to be labeled in the text set to be labeled;
and sequentially inputting each preprocessed text to be labeled into the part-of-speech labeling model to obtain a part-of-speech labeling result corresponding to each text to be labeled.
2. The method according to claim 1, wherein the setting process of the part-of-speech tagging model comprises:
acquiring an initial part-of-speech tagging model and a sample data set of the corpus field; each sample data in the sample data set is text data which is labeled by part of speech in advance; the sample data set is divided into a training sample set, a verification sample set and a test sample set;
sequentially training the initial part-of-speech tagging model by applying each sample data in the training sample set, and verifying the trained initial part-of-speech tagging model based on the sample data in the verification sample set to obtain an alternative part-of-speech tagging model;
and testing the alternative part-of-speech tagging model by applying the test sample set to obtain a test result, judging the part-of-speech tagging accuracy of the alternative part-of-speech tagging model according to the test result, and if the part-of-speech tagging accuracy is greater than a preset accuracy threshold, taking the alternative part-of-speech tagging model as the part-of-speech tagging model corresponding to the corpus field.
3. The method according to claim 1, wherein the preprocessing each text to be labeled in the text set to be labeled comprises:
splitting each text to be labeled in the text set to be labeled to obtain each text block of each text to be labeled; each of the text blocks includes at least one character;
and for each text to be labeled, mapping each text block of the text to be labeled so as to complete the preprocessing of the text to be labeled.
4. The method of claim 1, wherein sequentially inputting each preprocessed text to be labeled into a preset part-of-speech labeling model to obtain a part-of-speech labeling result corresponding to each text to be labeled comprises:
inputting each preprocessed text to be labeled into a preset part-of-speech labeling model; the part-of-speech tagging model is obtained by sequentially stacking a plurality of encoders;
triggering each encoder in the part-of-speech tagging model to sequentially process the input preprocessed to-be-tagged texts to obtain part-of-speech tagging results corresponding to each to-be-processed text, wherein each preprocessed to-be-tagged text is input to a first encoder in the part-of-speech tagging model, and output of each encoder is used as input of a next encoder.
5. The method according to claim 1, wherein the determining a corpus field to which the text to be processed in the text set to be labeled belongs comprises:
acquiring text attribute information of the text set to be labeled;
and determining the corpus field to which the text to be processed in the text set to be labeled belongs based on the text attribute information.
6. A part-of-speech tagging apparatus for text, comprising:
the receiving unit is used for determining a task type corresponding to a text processing instruction when the text processing instruction is received;
the acquiring unit is used for acquiring a text set to be labeled specified by the text processing instruction under the condition that the task type is a part-of-speech labeling task, wherein the text set to be labeled comprises at least one text to be labeled;
the determining unit is used for determining the corpus field to which the text to be processed in the text set to be labeled belongs and determining a part-of-speech labeling model corresponding to the corpus field;
the preprocessing unit is used for preprocessing each text to be labeled in the text set to be labeled;
and the labeling unit is used for sequentially inputting each preprocessed text to be labeled into a preset part-of-speech labeling model to obtain a part-of-speech labeling result corresponding to each text to be labeled.
7. The apparatus of claim 6, further comprising: a model setting unit; the model setting unit is used for acquiring an initial part-of-speech tagging model and a sample data set of the corpus field; each sample data in the sample data set is text data which is labeled by part of speech in advance; the sample data set is divided into a training sample set, a verification sample set and a test sample set; sequentially training the initial part-of-speech tagging model by applying each sample data in the training sample set, and verifying the trained initial part-of-speech tagging model based on the sample data in the verification sample set to obtain an alternative part-of-speech tagging model; and testing the alternative part-of-speech tagging model by applying the test sample set to obtain a test result, judging the part-of-speech tagging accuracy of the alternative part-of-speech tagging model according to the test result, and if the part-of-speech tagging accuracy is greater than a preset accuracy threshold, taking the alternative part-of-speech tagging model as the part-of-speech tagging model corresponding to the corpus field.
8. The apparatus of claim 6, wherein the pre-processing unit comprises:
the splitting subunit is used for splitting each text to be labeled in the text set to be labeled to obtain each text block of each text to be labeled; each of the text blocks includes at least one character;
and the mapping subunit is used for mapping each text block of each text to be labeled so as to complete the preprocessing of the text to be labeled.
9. The apparatus of claim 6, wherein the labeling unit comprises:
the input subunit is used for inputting each preprocessed text to be labeled into a preset part-of-speech labeling model; the part-of-speech tagging model is obtained by sequentially stacking a plurality of encoders;
and the triggering subunit is used for triggering each encoder in the part-of-speech tagging model to sequentially process the input preprocessed to-be-tagged texts to obtain part-of-speech tagging results corresponding to each to-be-processed text, wherein each preprocessed to-be-tagged text is input to a first encoder in the part-of-speech tagging model, and the output of each encoder is used as the input of a next encoder.
10. The apparatus of claim 6, wherein the determining unit comprises:
the acquiring subunit is used for acquiring the text attribute information of the text set to be labeled;
and the determining subunit is used for determining the corpus field to which the text to be processed in the text set to be labeled belongs based on the text attribute information.
CN202011063051.0A 2020-09-30 2020-09-30 Part-of-speech tagging method and device for text Pending CN112131873A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011063051.0A CN112131873A (en) 2020-09-30 2020-09-30 Part-of-speech tagging method and device for text

Publications (1)

Publication Number Publication Date
CN112131873A (en) 2020-12-25

Family

ID=73843616

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011063051.0A Pending CN112131873A (en) 2020-09-30 2020-09-30 Part-of-speech tagging method and device for text

Country Status (1)

Country Link
CN (1) CN112131873A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107844476A (en) * 2017-10-19 2018-03-27 广州索答信息科技有限公司 A kind of part-of-speech tagging method of enhancing
CN108170674A (en) * 2017-12-27 2018-06-15 东软集团股份有限公司 Part-of-speech tagging method and apparatus, program product and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination