CN113468878A - Part-of-speech tagging method and device, electronic equipment and storage medium - Google Patents

Part-of-speech tagging method and device, electronic equipment and storage medium

Info

Publication number
CN113468878A
Authority
CN
China
Prior art keywords
slice
text
participle
kth
vector representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110787205.9A
Other languages
Chinese (zh)
Inventor
李扬名
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110787205.9A priority Critical patent/CN113468878A/en
Publication of CN113468878A publication Critical patent/CN113468878A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the present application disclose a part-of-speech tagging method and apparatus, an electronic device, and a storage medium, applied to the technical field of natural language processing. The method includes: obtaining an undivided text and a slice set of a text to be labeled, where the slice set includes the k-1 slices already divided from the text to be labeled and the undivided text is the text to be labeled excluding those k-1 slices; determining the dependency relationship of the kth slice of the text to be labeled according to the vector representation of the undivided text and the vector representation of the (k-1)th slice in the slice set; obtaining a candidate participle set of the kth slice from the undivided text; and determining the kth slice from the candidate participle set according to the dependency relationship of the kth slice and the vector representation of each candidate participle in the candidate participle set, where the kth slice indicates the participle corresponding to the kth slice and the part of speech of that participle. The method and apparatus can improve part-of-speech tagging efficiency.

Description

Part-of-speech tagging method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a part-of-speech tagging method and apparatus, an electronic device, and a storage medium.
Background
At present, a text can be preprocessed with natural language processing technology to obtain its words and the part of speech of each word, and the part-of-speech-tagged text can then be used in downstream tasks such as text semantic analysis. Because the tagging result directly affects those downstream tasks, part-of-speech tagging of the text is very important. Existing part-of-speech tagging techniques generally perform sequence labeling on the characters of a text, extract runs of consecutive characters as words based on that labeling, and then analyze the part of speech of each word. How to improve part-of-speech tagging efficiency during text preprocessing has therefore become an urgent problem.
Disclosure of Invention
The embodiment of the application provides a part-of-speech tagging method and device, electronic equipment and a storage medium, and the part-of-speech tagging efficiency can be improved.
In one aspect, an embodiment of the present application provides a part-of-speech tagging method, where the method includes:
acquiring an undivided text and a slice set of a text to be labeled; the slice set comprises the k-1 slices already divided from the text to be labeled, and the undivided text is the text to be labeled excluding the k-1 slices;
determining the dependency relationship of the kth slice in the text to be labeled according to the vector representation of the undivided text and the vector representation of the (k-1)th slice in the slice set; the dependency relationship of the kth slice is used to represent the correlation of the kth slice with the slice set;
acquiring a candidate participle set of the kth slice from the undivided text;
determining the kth slice from the candidate participle set according to the dependency relationship of the kth slice and the vector representation of each candidate participle in the candidate participle set; the kth slice is used for indicating the participle corresponding to the kth slice and the part of speech of the participle.
In one aspect, an embodiment of the present application provides a part-of-speech tagging apparatus, where the apparatus includes:
the obtaining module is used for obtaining an undivided text and a slice set of a text to be labeled; the slice set comprises the k-1 slices already divided from the text to be labeled, and the undivided text is the text to be labeled excluding the k-1 slices;
the determining module is used for determining the dependency relationship of the kth slice in the text to be labeled according to the vector representation of the undivided text and the vector representation of the (k-1)th slice in the slice set; the dependency relationship of the kth slice is used to represent the correlation of the kth slice with the slice set;
the obtaining module is further configured to obtain a candidate participle set of the kth slice from the undivided text;
the determining module is further configured to determine the kth slice from the candidate participle set according to the dependency relationship of the kth slice and the vector representation of each candidate participle in the candidate participle set; the kth slice is used for indicating the participle corresponding to the kth slice and the part of speech of the participle.
In one aspect, an embodiment of the present application provides an electronic device, which includes a processor and a memory, where the memory is used to store a computer program, and the computer program includes program instructions, and the processor is configured to call the program instructions to perform some or all of the steps in the above method.
In one aspect, the present application provides a computer-readable storage medium, which stores a computer program, where the computer program includes program instructions, and the program instructions, when executed by a processor, are used to perform some or all of the steps of the above method.
Accordingly, according to an aspect of the present application, there is provided a computer program product or computer program comprising program instructions stored in a computer readable storage medium. The processor of the computer device reads the program instructions from the computer-readable storage medium, and the processor executes the program instructions to cause the computer device to execute the part-of-speech tagging method provided above.
In the embodiments of the present application, an undivided text and a slice set of a text to be labeled can be obtained; the dependency relationship of the kth slice of the text to be labeled is determined according to the vector representation of the undivided text and the vector representation of the (k-1)th slice in the slice set; a candidate participle set of the kth slice is obtained from the undivided text; and the kth slice is determined from the candidate participle set according to the dependency relationship of the kth slice and the vector representation of each candidate participle in the set. With this method, the kth slice of the text to be labeled is determined directly from the undivided text, and the corresponding participle and its part of speech are obtained from the kth slice. This accomplishes part-of-speech tagging while reducing time complexity and workload, which improves tagging efficiency. Furthermore, determining the dependency relationship of the kth slice captures, during division, the correlation with the previously divided slice set when predicting the participle corresponding to the slice and its part of speech, which improves tagging accuracy.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an application architecture according to an embodiment of the present application;
Fig. 2 is a schematic flow chart of a part-of-speech tagging method according to an embodiment of the present application;
Fig. 3 is a schematic flow chart of a part-of-speech tagging method according to an embodiment of the present application;
Fig. 4 is a scene schematic diagram for acquiring a candidate participle set according to an embodiment of the present application;
Fig. 5 is a scene schematic diagram for acquiring a slice set according to an embodiment of the present application;
Fig. 6 is a scene schematic diagram of part-of-speech tagging based on a model according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of a part-of-speech tagging apparatus according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
The part-of-speech tagging method provided by the embodiment of the application can be realized in electronic equipment, and the electronic equipment can be a server or terminal equipment. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server. The terminal device may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like.
The embodiments of the present application relate to the technical field of artificial intelligence, and in particular to natural language processing for text processing. Natural language processing (NLP) is an important direction in computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language, and it integrates linguistics, computer science, and mathematics. Research in this field involves natural language, i.e. the language people use every day, so it is closely related to linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering by robots, knowledge graphs, and the like. By executing the technical solution of the present application, the electronic device can perform text processing that divides a text into words and tags the part of speech of each word.
In some embodiments, the electronic device tags the parts of speech of a text to be labeled through multiple rounds of iterative division. Taking one round as an example, the electronic device obtains the undivided text of that round and the slice set produced by the previous rounds, and determines the first slice of the undivided text based on both. The slice indicates a participle and the part of speech of that participle, and the participle is the first (leftmost) participle in the undivided text. Part-of-speech tagging of the text to be labeled is completed from the slices obtained in all rounds, that is, from their participles and parts of speech. Optionally, the electronic device may carry out this tagging through a part-of-speech tagging model.
For example, referring to Fig. 1, Fig. 1 is a schematic diagram of an application architecture provided in an embodiment of the present application, through which the part-of-speech tagging method provided in the present application can be executed. As shown in Fig. 1, the architecture may include an electronic device and a trained part-of-speech tagging model deployed in the electronic device. A text to be labeled is input into the electronic device, and the electronic device divides it over multiple rounds using the trained part-of-speech tagging model. In each round, the slices obtained in previous rounds and the remaining undivided text are used to determine the leftmost (i.e. first) slice of the undivided text, and each divided slice represents a participle and the part of speech of that participle. Multiple rounds of iterative division yield the final slice set of the text to be labeled, whose slices represent the participles of the text and the part of speech of each participle. For example, the 2nd slice may indicate that its participle is "building" and that the part of speech of this participle is noun.
It should be understood that fig. 1 merely illustrates a possible application architecture of the present application, and does not limit the specific architecture of the present application, that is, the present application may also provide other forms of application architectures.
Optionally, in some embodiments, the electronic device may execute the part-of-speech tagging method according to actual service requirements to improve the part-of-speech tagging efficiency of a text. The technical solution can be applied to any scenario that requires part-of-speech tagging of text, for example semantic analysis. In the upstream stage of a semantic analysis task (e.g. text preprocessing), the text to be labeled (here, the text to be semantically analyzed) is divided over multiple rounds. In the kth round, the electronic device obtains the undivided text of that round and the slice set containing the k-1 slices produced by the previous k-1 rounds, and determines the first slice of the undivided text (i.e. the kth slice of the text to be labeled) from the vector representation of the undivided text and the vector representations of the k-1 slices in the slice set. This continues until the last slice of the text to be labeled has been divided. The technical solution improves part-of-speech tagging efficiency and thereby the overall efficiency of the semantic analysis scenario.
Optionally, data related to the present application, such as a slice set of a text to be annotated, may be stored in a database, or may be stored in a blockchain, such as by a blockchain distributed system, which is not limited in the present application.
It is to be understood that the foregoing scenarios are only examples, and do not constitute a limitation on application scenarios of the technical solutions provided in the embodiments of the present application, and the technical solutions of the present application may also be applied to other scenarios. For example, as can be known by those skilled in the art, with the evolution of system architecture and the emergence of new service scenarios, the technical solution provided in the embodiments of the present application is also applicable to similar technical problems.
Based on the above description, the present application provides a part-of-speech tagging method, which can be performed by the above-mentioned electronic device. Referring to fig. 2, fig. 2 is a schematic flow chart of a part-of-speech tagging method according to an embodiment of the present application. As shown in fig. 2, a flow of the part-of-speech tagging method according to the embodiment of the present application may include the following steps:
S201, obtaining an undivided text and a slice set of a text to be labeled; the slice set comprises the k-1 slices already divided from the text to be labeled, and the undivided text is the text to be labeled excluding the k-1 slices.
The text to be labeled can be composed of one text sentence or a plurality of text sentences, and one or a plurality of slices can be obtained by dividing the text sentences. When the text to be labeled is composed of a plurality of text sentences, the electronic device may split the text to be labeled into the plurality of text sentences, and execute the technical scheme of the present application in units of one text sentence. Namely, when the part of speech tagging is carried out on the text to be tagged formed by a plurality of text sentences or the text to be tagged formed by one text sentence, the process and the principle are the same. Therefore, the following description will be given by taking a part-of-speech tagging process when a text to be tagged is a text sentence as an example.
In a possible implementation, the electronic device performs R rounds of iterative division on the text to be labeled to obtain R slices, each of which indicates a participle and the part of speech of that participle. Every round follows the same process and principle, so the kth round is described here as an example, where k is a positive integer and k is less than or equal to R. In the kth round, the electronic device obtains the undivided text and the slice set of the text to be labeled. The undivided text is determined by the text to be labeled and the division result of the previous k-1 rounds: it is the text to be labeled excluding the participles of the k-1 slices. The division result of the previous k-1 rounds includes the slice set, which contains the k-1 slices already divided from the text to be labeled.
For example, suppose the text to be labeled is "flood discharge building and related engineering" and k is 2. The division result of the previous k-1 rounds (i.e. the first round) indicates that the 1st slice is [flood discharge], which represents a participle and its part of speech. The undivided text is therefore "building and related engineering", and the slice set is {[flood discharge]}.
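The round-by-round procedure can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the model described in the application: `choose_leftmost` is a hypothetical callback standing in for whatever component selects the first slice of the undivided text, and the lookup table and all tags other than the first are placeholders invented for the example.

```python
from typing import Callable, List, Tuple

# A slice is (i, j, label): 1-based inclusive positions plus a part-of-speech tag.
Slice = Tuple[int, int, str]


def tag_text(n: int, choose_leftmost: Callable[[int, int, List[Slice]], Slice]) -> List[Slice]:
    """Divide positions 1..n into slices, taking one leftmost slice per round."""
    slices: List[Slice] = []
    start = 1  # first position of the undivided text
    while start <= n:
        i, j, label = choose_leftmost(start, n, slices)
        if i != start or not (start <= j <= n):
            raise ValueError("the chosen slice must begin the undivided text")
        slices.append((i, j, label))
        start = j + 1  # the undivided text shrinks from the left
    return slices


# Toy chooser (assumption): fixed span ends and placeholder tags per start position.
TOY_ENDS = {1: (2, "NN"), 3: (4, "NN"), 5: (5, "CC"), 6: (7, "JJ"), 8: (9, "NN")}


def toy_chooser(start: int, n: int, history: List[Slice]) -> Slice:
    end, label = TOY_ENDS[start]
    return (start, end, label)


result = tag_text(9, toy_chooser)
```

Each round consumes exactly one slice from the left edge of the undivided text, so a text divided into R slices takes R rounds.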
In some embodiments, the electronic device adds a position label to each word in the text to be labeled. When obtaining the slice set, it can derive the position label of each slice in the slice set and the position label of the undivided text from the position labels of the individual words. The position label of a slice can stand for the slice itself: a slice can be written as (i, j), where i is the position label, in the text to be labeled, of the first word of the slice and j is the position label of its last word, so the slice is the run of words whose position labels are (i, i+1, ..., j).
For example, the text to be labeled is "flood discharge building and related engineering", where the position labels of its words are (1, 2, 3, 4, 5, 6, 7, 8, 9) in sequence. If the slice is "flood discharge", its position label is (1, 2); the position label of the undivided text "building and related engineering" is (3, 9); and the undivided text is the text formed by the words at position labels 3 through 9 of the text to be labeled.
Further, because the part of speech of the participle corresponding to a slice is obtained during division, the representation of a divided slice may include both the position label of the participle and a part-of-speech label, i.e. (i, j, l), where l is the part-of-speech label of the slice. For example, each part of speech in the part-of-speech set corresponds to one label. If the part of speech of the participle corresponding to the slice [flood discharge] is XX and the label corresponding to XX is NN, the slice [flood discharge] can be represented as (1, 2, NN).
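The (i, j, l) representation and the derivation of the undivided text from it can be sketched as follows. The names `Slice` and `undivided_span` are illustrative assumptions, not identifiers from the application.

```python
from typing import List, NamedTuple, Optional, Tuple


class Slice(NamedTuple):
    i: int      # position label of the slice's first word (1-based)
    j: int      # position label of the slice's last word (inclusive)
    label: str  # part-of-speech label of the corresponding participle


def undivided_span(n: int, slices: List[Slice]) -> Optional[Tuple[int, int]]:
    """Return (start, n) for the undivided text, or None if nothing is left."""
    start = slices[-1].j + 1 if slices else 1
    return (start, n) if start <= n else None


first = Slice(1, 2, "NN")
remaining = undivided_span(9, [first])
```

With the nine-position example text and the single slice (1, 2, NN), `remaining` is (3, 9), matching the position label of the undivided text given above.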
S202, determining the dependency relationship of the kth slice in the text to be labeled according to the vector representation of the undivided text and the vector representation of the (k-1)th slice in the slice set.
Wherein the dependency of the kth slice may be used to represent the correlation of the kth slice with the set of slices.
In a possible embodiment, one slice corresponds to one participle, so the k-1 slices in the slice set correspond to k-1 participles, and the participle corresponding to the (k-1)th slice is the (k-1)th participle. The electronic device may obtain the vector representation of the (k-1)th slice as follows: obtain the context representation of the (k-1)th participle, determine the vector representation of the (k-1)th participle from that context representation, and determine the vector representation of the (k-1)th slice from the vector representation of the (k-1)th participle and the vector representation of its part of speech. In other words, the vector representation of the (k-1)th slice is the fusion of the vector representation of the (k-1)th participle with the vector representation of the part of speech of the (k-1)th participle.
In some embodiments, the electronic device obtains the context representation of the (k-1)th participle as follows. It constructs a word vector matrix $E^{t}$ (also called a word embedding matrix) and uses it to obtain the vector representation of each word in the text to be labeled, namely $e_t = E^{t}(x_t)$. It then applies a bidirectional long short-term memory network (bidirectional LSTM; an LSTM is an improved recurrent neural network whose gate mechanism improves its ability to model long-term dependencies) to these vector representations to obtain the context representation $h_t$ of each word, namely:
$$h_t = \overrightarrow{h}_t \oplus \overleftarrow{h}_t$$
where $\overrightarrow{h}_t$ denotes the following-text (lower) representation of the t-th word in the text to be labeled, $\overleftarrow{h}_t$ denotes the preceding-text (upper) representation of the t-th word in the text to be labeled, and $\oplus$ denotes fusing (i.e. splicing) the two to obtain the context representation $h_t$ of the t-th word.
Obtaining the context representation of each word thus amounts to encoding the text to be labeled.
Therefore, let the (k-1)th slice be (i, j, l), where i and j are the position labels of the slice in the text to be labeled, and write $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ for the lower and upper representations of the t-th word. The electronic device determines the vector representation $h^{p}_{i,j}$ of the (k-1)th participle from the context representation of the (k-1)th participle corresponding to the (k-1)th slice, specifically:
$$h^{p}_{i,j} = \left(\overrightarrow{h}_{j} - \overrightarrow{h}_{i-1}\right) \oplus \left(\overleftarrow{h}_{i} - \overleftarrow{h}_{j+1}\right)$$
where $\overrightarrow{h}_{j} - \overrightarrow{h}_{i-1}$ is the lower representation of the (k-1)th participle, namely the difference between the lower representation of the jth word (the last word of the (k-1)th participle) and the lower representation of the (i-1)th word in the text to be labeled; $\overleftarrow{h}_{i} - \overleftarrow{h}_{j+1}$ is the upper representation of the (k-1)th participle, namely the difference between the upper representation of the ith word (the first word of the (k-1)th participle) and the upper representation of the (j+1)th word in the text to be labeled; and the vector representation $h^{p}_{i,j}$ of the (k-1)th participle is the result of fusing the lower representation of the (k-1)th participle with the upper representation of the (k-1)th participle.
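The difference-based participle vector can be illustrated with plain lists. This is a sketch under assumptions: the state sequences `fwd` and `bwd` stand in for the two directional representations produced by the encoder, the zero rows at index 0 and index n + 1 are padding added here so that positions i - 1 and j + 1 always exist, and the numbers are arbitrary.

```python
from typing import List, Sequence

Vector = List[float]


def sub(a: Sequence[float], b: Sequence[float]) -> Vector:
    """Element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]


def span_vector(fwd: Sequence[Sequence[float]],
                bwd: Sequence[Sequence[float]],
                i: int, j: int) -> Vector:
    """Vector for positions i..j (1-based, inclusive).

    fwd and bwd each hold n + 2 rows; rows 0 and n + 1 are zero padding
    (an assumption of this sketch).
    """
    first_half = sub(fwd[j], fwd[i - 1])   # state at last word minus state before the span
    second_half = sub(bwd[i], bwd[j + 1])  # state at first word minus state after the span
    return first_half + second_half        # list concatenation plays the role of splicing


# Toy one-dimensional states for a three-position text, padded at both ends.
fwd = [[0.0], [1.0], [3.0], [6.0], [0.0]]
bwd = [[0.0], [6.0], [5.0], [3.0], [0.0]]
vec = span_vector(fwd, bwd, 2, 3)
```

Because every span vector is computed from the same per-word states by two subtractions, the text only needs to be encoded once, however many spans are considered.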
In some embodiments, the electronic device determines the vector representation $h^{s}_{k-1}$ of the (k-1)th slice from the vector representation of the (k-1)th participle and the vector representation of the part of speech of the (k-1)th participle corresponding to the (k-1)th slice, specifically:
$$h^{s}_{k-1} = h^{p}_{i_{k-1},\,j_{k-1}} \oplus E^{l}(l_{k-1})$$
where $i_{k-1}$ is the position label of the first word of the (k-1)th participle in the text to be labeled, $j_{k-1}$ is the position label of the last word of the (k-1)th participle in the text to be labeled, $h^{p}_{i_{k-1},\,j_{k-1}}$ is the vector representation of the (k-1)th participle, $E^{l}(l_{k-1})$ is the vector representation of the part of speech of the (k-1)th participle, and the vector representation $h^{s}_{k-1}$ of the (k-1)th slice is the result of fusing the vector representation of the (k-1)th participle with the vector representation of the part of speech of the (k-1)th participle.
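A small sketch of this fusion step, assuming that fusion means splicing (concatenation) as stated for the context representation. The table `POS_EMBEDDING` is a hypothetical part-of-speech embedding with arbitrary values.

```python
from typing import Dict, List

# Hypothetical part-of-speech embedding table; the values are arbitrary.
POS_EMBEDDING: Dict[str, List[float]] = {
    "NN": [1.0, 0.0],
    "VV": [0.0, 1.0],
}


def slice_vector(participle_vec: List[float], pos_label: str) -> List[float]:
    """Splice the participle vector with the embedding of its part-of-speech label."""
    return participle_vec + POS_EMBEDDING[pos_label]


vec = slice_vector([0.5, -0.5], "NN")
```

The resulting vector carries both what the participle is and how it was tagged, which is what later rounds condition on.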
In some embodiments, the electronic device denotes the vector representation of the undivided text as $h^{u}_{i_k,n}$, where $i_k$ is the position label of the first word of the undivided text and n is the position label of the last word of the undivided text, which is also the position label of the last word of the text to be labeled. Because each round of division splits off the leftmost slice of the undivided text, the position label of the first word of the undivided text in the text to be labeled is also the position label of the first word of the participle corresponding to the kth slice. The vector representation of the undivided text can be obtained in the same way as the vector representation of the (k-1)th slice, which is not repeated here.
In one possible implementation, the electronic device also uses the dependency relationship of the (k-1)th slice when determining the dependency relationship of the kth slice. That is, the electronic device determines the dependency relationship of the kth slice in the text to be labeled, which is the dependency relationship of the first slice of the undivided text, according to the vector representation of the undivided text, the vector representation of the (k-1)th slice, and the dependency relationship of the (k-1)th slice. For example, a target vector representation is obtained from the vector representation of the undivided text and the vector representation of the (k-1)th slice, and the dependency relationship of the kth slice is then determined from the target vector representation and the dependency relationship of the (k-1)th slice.
It can be understood that when determining the dependency relationship of the kth slice, the dependency relationship of the (k-1) th slice needs to be obtained, and the obtaining manner of the dependency relationship of the (k-1) th slice can be the same as the obtaining manner of the dependency relationship of the kth slice, that is, the dependency relationship of the (k-1) th slice can be determined according to the undivided text of the (k-1) th round, the slice set of the (k-1) th round and the dependency relationship of the (k-2) th slice. And the slice set of the k-1 round comprises k-2 slices, and the undivided text of the k-1 round is the text except the k-2 slices in the text to be labeled.
In some embodiments, the dependency relationship $h^{d}_{k}$ of the kth slice is specifically:
$$h^{d}_{k} = f\left(h^{u}_{i_k,n} \oplus h^{s}_{k-1},\; h^{d}_{k-1}\right)$$
where $h^{u}_{i_k,n} \oplus h^{s}_{k-1}$ is the target vector representation obtained by fusing the vector representation $h^{u}_{i_k,n}$ of the undivided text with the vector representation $h^{s}_{k-1}$ of the (k-1)th slice, $h^{d}_{k-1}$ is the dependency relationship of the (k-1)th slice, and $f$ denotes the mapping that produces the dependency relationship of the kth slice from these two inputs.
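The recurrence over slices can be sketched as below. The passage does not fix the update function here, so `next_dependency` uses a deliberately simple stand-in (element-wise tanh of the sum of its inputs); that choice, the function names, and the toy numbers are all assumptions for illustration only.

```python
import math
from typing import List


def fuse(undivided_vec: List[float], prev_slice_vec: List[float]) -> List[float]:
    """Target vector: splice the undivided-text vector with the (k-1)th slice vector."""
    return undivided_vec + prev_slice_vec


def next_dependency(prev_dependency: List[float], target: List[float]) -> List[float]:
    """Stand-in update (assumption): element-wise tanh of the sum of both inputs."""
    if len(prev_dependency) != len(target):
        raise ValueError("this toy update needs equal-length vectors")
    return [math.tanh(p + t) for p, t in zip(prev_dependency, target)]


target = fuse([0.0, 1.0], [0.0])
dependency_k = next_dependency([0.0, 0.0, 0.0], target)
```

The point of the sketch is the data flow: the state for round k is computed from the state for round k-1 plus the fused target vector, so information about all earlier slices is carried forward.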
It can be understood that determining the kth slice relies on the division results of the previous k-1 slices. Depending on the context in the text to be labeled, the same slice may admit several possible divisions and several possible parts of speech. Each round of division therefore takes into account the correlation between the slice set (the slices divided earlier) and the slice divided in the current round, which makes both the division of the participle corresponding to the slice and the tagging of its part of speech more accurate. This avoids division errors caused by dividing the participle of each slice independently, and tagging errors caused by other possible parts of speech misleading an independently tagged part of speech.
S203, acquiring a candidate participle set of the kth slice from the undivided text.
In a possible implementation, the electronic device may obtain a candidate participle set of the kth slice from the undivided text. The candidate participle set covers all possible cases of the participle corresponding to the kth slice; that is, the participle corresponding to the kth slice is one of the candidate participles in the set. The kth slice of the text to be labeled can also be called the first slice of the undivided text.
Because the kth slice is the leftmost slice in the undivided text, the electronic device may obtain the candidate participle set of the kth slice by taking from the undivided text every contiguous subsequence that contains its first word, and using those subsequences as the candidate participle set. In other words, the candidate participle set consists of all contiguous prefixes of the undivided text. For example, if the undivided text is "related engineering", which occupies four positions, the candidates are the four prefixes that start at its first position: the first position alone, the first two positions, the first three positions, and the whole of "related engineering". These form the candidate participle set of the first slice in the undivided text.
In one possible embodiment, each word of the text to be labeled is given a position label. The position label of the undivided text is $(i_k, n)$, and the position label of the participle corresponding to the kth slice can be represented as $(i_k, j_k)$. Therefore, the position labels of the candidate participles in the candidate participle set of the kth slice can be represented as $S_k = \{(i_k, i_k), (i_k, i_k+1), (i_k, i_k+2), \ldots, (i_k, n)\}$.
For example, if the undivided text is "related engineering", the position label of the undivided text is (6, 9), and k is 4, then the position labels of the candidate participles in the candidate participle set of the 4th slice are $S_4 = \{(6,6), (6,7), (6,8), (6,9)\}$.
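The candidate set construction can be illustrated with a short sketch. The function names `candidate_spans` and `remaining_text` are hypothetical; position labels are 1-based and inclusive, as in the example above.

```python
def candidate_spans(i_k, n):
    """All position labels (i_k, j) with i_k <= j <= n, i.e. every
    contiguous span of the undivided text that starts at its first word."""
    if i_k > n:
        return []  # the undivided text is empty
    return [(i_k, j) for j in range(i_k, n + 1)]

def remaining_text(chosen_span, n):
    """Position label of the undivided text once a span has been chosen,
    or None when nothing is left to divide."""
    _, j = chosen_span
    return (j + 1, n) if j < n else None
```

With an undivided text labeled (6, 9), `candidate_spans(6, 9)` yields the four labels listed for the 4th slice, so each round considers at most as many candidates as there are words left.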
S204, determining the kth slice from the candidate participle set according to the dependency relationship of the kth slice and the vector representation of each candidate participle in the candidate participle set.
The kth slice may be used to indicate the participle corresponding to the kth slice and the part of speech of the participle.
In one possible implementation, the electronic device may determine, according to the dependency relationship of the kth slice and the vector representation of each candidate participle, a probability that each candidate participle is a participle corresponding to the kth slice, that is, a probability that each candidate participle is a kth participle, and determine the kth participle from the candidate participle set according to the probability of each candidate participle. And the electronic device may determine the candidate participle with the highest probability as the kth participle.
In some embodiments, the electronic device may determine a part-of-speech of the kth participle from the dependency of the kth slice and the vector representation of the kth participle, and determine the kth slice from the determined kth participle and the part-of-speech of the kth participle.
To determine the part of speech of the kth participle according to the dependency relationship of the kth slice and the vector representation of the kth participle, the electronic device may specifically obtain a part-of-speech set, determine the probability that the kth participle has each part of speech in the part-of-speech set, and determine the part of speech of the kth participle from the part-of-speech set according to those probabilities. The electronic device may determine the part of speech with the highest probability as the part of speech of the kth participle.
Optionally, the electronic device may first calculate a probability of each candidate participle in the candidate participle set, determine a kth participle based on the probability of each candidate participle, then calculate a probability that the kth participle corresponds to each part of speech, and determine a part of speech of the kth participle based on the probability of each part of speech; or the probability of each candidate participle in the candidate participle set and the probability of each part of speech corresponding to each candidate participle may be simultaneously calculated, and then the part of speech of the kth participle and the kth participle are determined, so as to obtain the kth slice. It can be understood that the process of obtaining the slice set by performing R-round division on the text to be labeled is the process of decoding the labeled text.
In some embodiments, the electronic device may determine, according to the dependency relationship of the kth slice and the vector representation of each candidate participle, the probability that each candidate participle is the participle corresponding to the kth slice by:
[formula image BDA0003161398530000111]
wherein the symbol in image BDA0003161398530000112 represents the probability that the candidate participle is the kth participle; the symbol in image BDA0003161398530000113 represents the vector representation of the candidate participle; $(i, j)$ represents the position label of a candidate participle; $S_k$ represents the candidate participle set of the kth slice; and $V_S$ may be a value, a vector, a matrix, or a function of k (e.g., $V_S = f_S(k)$) set by relevant service personnel based on empirical values, or, when the method is implemented by a model, $V_S$ may be a model parameter of that model.
The electronic device may determine the probability of each part of speech in the part-of-speech set corresponding to the kth participle by:
[formula image BDA0003161398530000114]
wherein the part-of-speech set is $L$; the symbol in image BDA0003161398530000115 represents the probability of a part of speech corresponding to the kth participle; $E_l(l)$ represents the vector representation of a part of speech; and $V_l$ may be a value, a vector, a matrix, or a function of k (e.g., $V_l = f_l(k)$) set by relevant service personnel based on empirical values, or, when the method is implemented by a model, $V_l$ may be a model parameter of that model. In addition, $V_l$ and $V_S$ may be the same or different.
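A hedged sketch of the selection step for candidate participles follows. The exact scoring formula appears only as an image in the original publication, so the bilinear score between the dependency vector and each candidate vector, normalized with a softmax, is an assumption made here for illustration; `softmax`, `bilinear`, `distribution` and `argmax` are hypothetical helper names, and `V` plays the role of a parameter such as $V_S$. The same softmax-and-argmax pattern would apply to the part-of-speech step, with scores that additionally depend on the vector representation of the kth participle.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bilinear(u, V, w):
    """Compute u^T V w for a row-major matrix V (assumed score form)."""
    return sum(u_i * sum(v * w_j for v, w_j in zip(row, w))
               for u_i, row in zip(u, V))

def distribution(dependency, vectors, V):
    """Probability of each candidate given the dependency of the k-th slice."""
    return softmax([bilinear(dependency, V, vec) for vec in vectors])

def argmax(probs):
    """Index of the highest-probability option."""
    return max(range(len(probs)), key=probs.__getitem__)
```

Taking the argmax corresponds to choosing the candidate with the highest probability, as the description states.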
In a possible implementation manner, a trained part-of-speech tagging model may be deployed in the electronic device, and part or all of the steps provided above may be implemented by using the part-of-speech tagging model. That is, the part-of-speech tagging model may be used to perform an encoding process and a decoding process on a text to be tagged, and finally to output the final slice set, containing the participles and their parts of speech, obtained after R rounds of division of the text to be tagged.
In the embodiment of the application, the electronic device can acquire an undivided text and a slice set in a text to be labeled, where the slice set comprises the k-1 slices already divided in the text to be labeled and the undivided text is the text other than those k-1 slices. The electronic device determines the dependency relationship of the kth slice in the text to be labeled according to the vector representation of the undivided text and the vector representation of the (k-1)th slice in the slice set, acquires the candidate participle set of the kth slice from the undivided text, and determines the kth slice from the candidate participle set according to the dependency relationship of the kth slice and the vector representation of each candidate participle in the candidate participle set. By implementing the method provided by the embodiment of the application, in the kth round of division, the leftmost slice of the undivided text and its part of speech can be determined, realizing iterative, incremental part-of-speech tagging; this can reduce time complexity and workload and improve part-of-speech tagging efficiency. Further, by determining the dependency relationship of the kth slice, the correlation with the previously divided slice set can be captured during division when predicting the slice and its part of speech, which can improve part-of-speech tagging accuracy.
Referring to fig. 3, fig. 3 is a flowchart illustrating a part-of-speech tagging method according to an embodiment of the present application, where the method can be executed by the above-mentioned electronic device. As shown in fig. 3, a flow of the part-of-speech tagging method in the embodiment of the present application may include the following steps:
S301, obtaining an undivided text and a slice set in a text to be labeled; the slice set comprises k-1 divided slices in the text to be labeled, and the undivided text is the text other than the k-1 slices in the text to be labeled.
In a possible implementation manner, the electronic device can implement part-of-speech tagging of the text to be tagged through the deployed part-of-speech tagging model, so that the electronic device can train the part-of-speech tagging model to be trained to obtain the trained part-of-speech tagging model. The part-of-speech tagging model can comprise a coding part and a decoding part, the coding part can be a bidirectional LSTM model, the decoding part can be a unidirectional LSTM model, the electronic equipment can utilize a model of the coding part in the part-of-speech tagging model to code the sample text to obtain context representation of each word in the sample text, and the model of the decoding part in the part-of-speech tagging model is utilized to train the part-of-speech tagging model to be trained on the basis of the context representation of each word in the sample text, the sample text set and the sample slice set of each sample text to obtain the trained part-of-speech tagging model. Wherein the sample slice set may indicate all sample tokens of the sample text and sample parts-of-speech of the sample tokens.
In some embodiments, to train the part-of-speech tagging model to be trained, the electronic device may obtain the sample text set and the sample slice set corresponding to each sample text in the sample text set, and train the model with the sample text as input and the corresponding sample slice set as output to obtain the part-of-speech tagging model. The part-of-speech tagging model includes trained model parameters, such as the model parameter $W_s$ involved in calculating the probability that a candidate participle is the participle corresponding to a slice. The sample text is also called the real text and the sample slice set is also called the real slices; all the participles of the sample text and their parts of speech indicated by the sample slice set are the real participles in the real text and the real parts of speech of those participles. In training, a target loss function can be determined according to the input sample text and the output sample slice set, and the model parameters of the part-of-speech tagging model to be trained are corrected by using the target loss function to obtain the trained part-of-speech tagging model. It can be understood that training drives the probability of the sample participle corresponding to each sample slice, and the probability of the sample part of speech of that sample participle, toward their maximum.
The sample text set includes a plurality of sample texts, and one sample text may be one text sentence. The sample text may be represented as $X = [x_1, x_2, x_3, \ldots, x_n]$, and the sample slice set corresponding to the sample text may be represented as $Y = [y_1, y_2, y_3, \ldots, y_m]$, where n denotes the total number of words of the sample text and m denotes the total number of slices corresponding to the sample text. It is understood that the sample slice set of the sample text consists of the real slices of the sample text. A slice in the sample slice set may be represented as a tuple $y_z = (i_z, j_z, l_z)$, where $(i_z, j_z)$ represents the position label, in the sample text, of the sample participle corresponding to slice $y_z$, and $l_z$ represents the part-of-speech label corresponding to the sample part of speech of the sample participle, namely the part-of-speech label corresponding to the real part of speech.
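As an illustration of the slice representation, the sketch below builds the tuples $(i_z, j_z, l_z)$ from a gold segmentation given as participle lengths and tags. The helper name `to_slices` and the input format are assumptions made for this example; positions are 1-based and inclusive.

```python
def to_slices(tagged_lengths):
    """Convert (participle_length, part_of_speech_tag) pairs, given in text
    order, into slice triples (i, j, l) with 1-based inclusive positions."""
    slices = []
    start = 1
    for length, tag in tagged_lengths:
        if length < 1:
            raise ValueError("each participle must cover at least one word")
        end = start + length - 1
        slices.append((start, end, tag))
        start = end + 1  # the next participle begins right after this one
    return slices
```

Because the slices are contiguous and ordered, the end position of the last slice equals n, the total number of words in the sample text.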
Optionally, the target loss function may be a cross-entropy loss function, and specifically may be:
[formula image BDA0003161398530000131]
wherein the symbol in image BDA0003161398530000132 represents the probability of the sample participle corresponding to the z-th sample slice in the sample slice set, and the symbol in image BDA0003161398530000133 represents the probability of the sample part of speech of the sample participle corresponding to the z-th sample slice in the sample slice set.
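The loss formula itself is available only as an image, so the following is a sketch under an assumption: a standard cross-entropy (negative log-likelihood) that sums, over the sample slices, the log probability of the gold participle and the log probability of its gold part of speech. The function name `slice_set_loss` is hypothetical.

```python
import math

def slice_set_loss(seg_probs, pos_probs):
    """Negative log-likelihood of the gold slices (assumed loss form).

    seg_probs[z]: model probability of the gold participle of sample slice z.
    pos_probs[z]: model probability of the gold part of speech of that participle.
    """
    if len(seg_probs) != len(pos_probs):
        raise ValueError("one probability pair per sample slice is required")
    return -sum(math.log(p) + math.log(q)
                for p, q in zip(seg_probs, pos_probs))
```

Minimizing this quantity pushes both probabilities toward 1, which matches the statement that training drives them toward their maximum.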
In a possible implementation manner, after obtaining the trained part-of-speech tagging model, the electronic device may input the text to be tagged into the part-of-speech tagging model, and perform R-round iterative partitioning on the text to be tagged by using the part-of-speech tagging model to obtain a slice set including R slices and parts of speech corresponding to each slice, where the process and principle of each round of partitioning are the same. In the kth round of division, the specific implementation of the electronic device obtaining the non-divided text and the slice set in the text to be annotated through the part-of-speech annotation model may refer to the related description in step S201, and details are not described here again.
S302, according to the vector representation of the undivided text and the vector representation of the (k-1) th slice in the slice set, determining the dependency relationship of the (k) th slice in the text to be labeled.
In a possible implementation manner, when k is greater than 1, the electronic device determines the dependency relationship of the kth slice in the text to be annotated, which may refer to the relevant description of step S202 and is not described herein again.
In some embodiments, when k is equal to 1, the vector representation of the k-1 th slice may be a preset first initial vector representation, the dependency relationship of the k-1 th slice is a preset second initial vector representation, and the undivided text is the text to be annotated.
When k is 1, that is, when the first round of division is performed, the slice set is empty. The vector representation of the (k-1)th slice may therefore be a preset first initial vector representation ($V_1$), the dependency relationship of the (k-1)th slice may be a preset second initial vector representation ($V_2$), and the undivided text is the input text to be labeled. The first initial vector representation and the second initial vector representation may be the same or different; they may be set by relevant service personnel according to empirical values, or obtained as model parameters in the model training process.
Thus in the kth round of partitioning, the vector representation of the kth slice can be expressed as:
[formula image BDA0003161398530000141]
That is, when k is 1, the dependency relationship of the kth slice can be expressed as:
[formula image BDA0003161398530000142]
wherein $V_1$ represents the vector representation of the 0th slice, $V_2$ represents the dependency relationship of the 0th slice, and the symbol in image BDA0003161398530000143 represents the undivided text (i.e., the text to be labeled).
S303, acquiring a candidate word segmentation set of the kth slice from the undivided text. For a specific implementation of step S303, reference may be made to the related description of step S203, which is not described herein again.
S304, determining the kth slice from the candidate participle set according to the dependency relationship of the kth slice and the vector representation of each candidate participle in the candidate participle set.
In one possible embodiment, the vector representation of the candidate participle may be obtained in the same manner as the text vector representation of the kth slice. Thus, in the kth round of partitioning, the electronic device may utilize the part-of-speech tagging model to determine the probability of each candidate participle from the dependency relationship of the kth slice and the vector representation of each candidate participle in the candidate participle set of the kth slice, namely:
[formula image BDA0003161398530000144]
wherein the symbol in image BDA0003161398530000145 represents the probability of the candidate participle corresponding to the kth slice; the symbol in image BDA0003161398530000146 represents the vector representation of the candidate participle; $(i, j)$ represents the position label of a candidate participle; $S_k$ represents the candidate participle set of the kth slice; and $W_s$ represents the parameter, among the model parameters of the part-of-speech tagging model, used for calculating the candidate participle probability.
The electronic device may determine the part of speech of the kth participle corresponding to the kth slice according to the dependency relationship of the kth slice and the vector representation of the kth participle. Specifically, using the part-of-speech tagging model, the electronic device may obtain a part-of-speech set and the vector representation of each part of speech in the part-of-speech set; determine, according to the dependency relationship of the kth slice, the vector representation of the kth participle, and the vector representation of each part of speech, the probability that the part of speech of the kth participle is each part of speech; and determine the part of speech of the kth participle from the part-of-speech set according to those probabilities. For example, the probability of each part of speech may be determined as:
[formula image BDA0003161398530000151]
wherein the part-of-speech set is $L$; the symbol in image BDA0003161398530000152 represents the probability of a part of speech corresponding to the kth participle; $W_l$ represents the parameter, among the model parameters of the part-of-speech tagging model, used for calculating the part-of-speech probability; and $E_l(l)$ represents the vector representation of a part of speech.
It is understood that, in the model training process, when the probability of the sample slice and the probability of the sample part of speech are obtained, the above formulas (formula 2.4 and formula 2.5) can be used for calculation.
In one possible implementation manner, the electronic device may determine, according to the probability of each candidate participle, a kth participle from the candidate participle set in a manner that a candidate participle with the highest probability is determined as the kth participle; and the electronic device determines the part of speech of the kth participle from the part of speech set according to the probability that the part of speech of the kth participle is each part of speech, wherein the part of speech with the highest probability may be determined as the part of speech of the kth participle. Namely:
[formula image BDA0003161398530000153]
wherein the symbol in image BDA0003161398530000154 represents the kth slice, that is, the position label and the part-of-speech label of the kth slice predicted by the part-of-speech tagging model in the kth round of division.
For example, referring to fig. 4, fig. 4 is a schematic view of a scenario for obtaining a candidate participle set according to an embodiment of the present application. The undivided text of the kth round is K1, with position label (4,9); the continuous subsets of K1 that include its first word are obtained and determined as the candidate participle set S1 of the kth slice. If the 3rd candidate participle is the participle corresponding to the kth slice, the undivided text of the (k+1)th round is K2, with position label (7,9); the continuous subsets of K2 that include its first word are obtained and determined as the candidate participle set S2 of the (k+1)th slice.
For another example, referring to fig. 5, fig. 5 is a schematic diagram of a scenario for obtaining a slice set provided in an embodiment of the present application, where the text to be labeled K1 is "flood discharge building and related engineering". When the part-of-speech tagging model performs the 1st round of division, the undivided text is K1 and the slice set is empty. A candidate participle set S1 of the 1st slice is obtained, and the probability of each candidate participle is obtained based on the vector representation of K1, the dependency relationship of the 0th slice, and the vector representation of each candidate participle. Based on these probabilities, the participle corresponding to the 1st slice is determined from the candidate participle set to be "flood discharge"; the probability that the part of speech of this participle is each part of speech in the part-of-speech set is then determined, and the part of speech of the participle is determined based on those probabilities. The 1st slice is thus [flood discharge], which can be represented as (1,2,NN). When the 2nd round of division is performed, the undivided text is K2 and the slice set is [flood discharge]. A candidate participle set S2 of the 2nd slice is obtained, and the probability of each candidate participle is obtained based on the vector representation of K2, the dependency relationship of the 1st slice, and the vector representation of each candidate participle. The participle corresponding to the 2nd slice is determined to be "building", and its part of speech is determined in the same way, so the 2nd slice is [building], which can be represented as (3,4,NN). Division continues iteratively until the undivided text is empty. In the 5th round of division, the undivided text is K5 and the slice set is [flood discharge], [building], [and], [related]. A candidate participle set S5 of the 5th slice is obtained, and the probability of each candidate participle is obtained based on the vector representation of K5, the dependency relationship of the 4th slice, and the vector representation of each candidate participle. The participle corresponding to the 5th slice is determined to be "engineering", and its part of speech is determined in the same way, so the 5th slice is [engineering], which can be represented as (8,9,NN). The undivided text is now empty, that is, the part-of-speech tagging of the text to be labeled is complete. The slice set that is finally output may therefore be represented as Y = [(1,2,NN), (3,4,NN), ..., (8,9,NN)].
For another example, referring to fig. 6, fig. 6 is a scene diagram of performing part-of-speech tagging based on a model provided in an embodiment of the present application. The part-of-speech tagging model includes an encoding portion and a decoding portion; the encoding portion may use a bidirectional LSTM model and the decoding portion may use a unidirectional LSTM model. A text X to be tagged is input into the part-of-speech tagging model, X is encoded by the bidirectional LSTM model to obtain a context representation of each word, and X is decoded by the unidirectional LSTM model, which performs R rounds of iterative division on X. The kth slice is predicted and output in the kth round of division. In the next round, the slice set and the undivided text resulting from the previous k rounds are obtained, and the (k+1)th round of division is performed based on the slice set, the undivided text, and the dependency relationship of the (k+1)th slice. This continues until the undivided text is empty, and the final prediction output is a slice set Y containing all slices in the text to be tagged.
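The round-by-round decoding loop can be sketched as below. The two callables `pick_span` and `pick_tag` are hypothetical stand-ins for the model's highest-probability choices of participle and part of speech; they are not part of the original description, and the function name `greedy_decode` is likewise an assumption.

```python
def greedy_decode(n, pick_span, pick_tag):
    """Split positions 1..n into slices, left to right.

    pick_span(candidates, slices) -> one chosen (i, j) from candidates.
    pick_tag(span, slices) -> part-of-speech tag for the chosen span.
    Both callables are hypothetical stand-ins for the model.
    """
    slices = []
    start = 1
    while start <= n:  # the undivided text is non-empty
        candidates = [(start, j) for j in range(start, n + 1)]
        i, j = pick_span(candidates, slices)
        if (i, j) not in candidates:
            raise ValueError("chosen span must come from the candidate set")
        slices.append((i, j, pick_tag((i, j), slices)))
        start = j + 1  # the rest becomes the next round's undivided text
    return slices
```

The number of loop iterations equals the number of slices produced, which corresponds to the R rounds of division; each round only looks at spans starting at the first undivided word.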
In the embodiment of the application, the electronic device can acquire an undivided text and a slice set in a text to be labeled, where the slice set comprises the k-1 slices already divided in the text to be labeled and the undivided text is the text other than those k-1 slices. The electronic device determines the dependency relationship of the kth slice in the text to be labeled according to the vector representation of the undivided text and the vector representation of the (k-1)th slice in the slice set, acquires the candidate participle set of the kth slice from the undivided text, and determines the kth slice from the candidate participle set according to the dependency relationship of the kth slice and the vector representation of each candidate participle in the candidate participle set. By implementing the method provided by the embodiment of the application, in the kth round of division, the part-of-speech tagging model is used to determine the leftmost slice of the undivided text and its part of speech, realizing iterative, incremental part-of-speech tagging; this can reduce time complexity and workload and improve part-of-speech tagging efficiency. Further, by determining the dependency relationship of the kth slice, the correlation with the previously divided slice set can be captured during division when predicting the slice and its part of speech, which can improve part-of-speech tagging accuracy.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a part-of-speech tagging apparatus provided in the present application. It should be noted that the part-of-speech tagging apparatus shown in fig. 7 is used for executing the methods of the embodiments shown in fig. 2 and fig. 3 of the present application. For convenience of description, only the portions related to the embodiment of the present application are shown; for technical details not disclosed here, refer to the embodiments shown in fig. 2 and fig. 3 of the present application. The part-of-speech tagging apparatus 700 may include: an obtaining module 701 and a determining module 702. Wherein:
an obtaining module 701, configured to obtain an undivided text and a slice set in a text to be labeled; the slice set comprises k-1 divided slices in the text to be labeled, and the text which is not divided is the text except the k-1 slices in the text to be labeled;
a determining module 702, configured to determine, according to the vector representation of the undivided text and the vector representation of the (k-1) th slice in the slice set, a dependency relationship of the (k) th slice in the text to be labeled; the dependency of the kth slice is used to represent the correlation of the kth slice with the set of slices;
the obtaining module 701 is further configured to obtain a candidate participle set of the kth slice in the non-divided text;
the determining module 702 is further configured to determine the kth slice from the candidate participle set according to the dependency relationship of the kth slice and the vector representation of each candidate participle in the candidate participle set; the k slice is used for indicating the participle corresponding to the k slice and the part of speech of the participle.
In a possible implementation, the obtaining module 701 is further configured to:
acquiring the dependency relationship of the (k-1) th slice;
the determining module 702, when configured to determine the dependency relationship of the kth slice in the text to be labeled according to the vector representation of the undivided text and the vector representation of the (k-1)th slice in the slice set, is specifically configured to:
obtaining target vector representation according to the vector representation of the undivided text and the vector representation of the (k-1) th slice;
and determining the dependency relationship of the kth slice in the text to be labeled according to the dependency relationship between the target vector representation and the kth-1 slice.
In one possible embodiment, the participle corresponding to the kth slice is the kth participle; the determining module 702, when determining the kth slice from the candidate participle set according to the dependency relationship of the kth slice and the vector representation of each candidate participle in the candidate participle set, is specifically configured to:
determining the probability of each candidate participle being the kth participle according to the dependency relationship of the kth slice and the vector representation of each candidate participle;
determining the kth participle from the candidate participle set according to the probability of each candidate participle;
determining the part of speech of the kth participle according to the dependency relationship of the kth slice and the vector representation of the kth participle;
and determining the k slice according to the k participle and the part of speech of the k participle.
In a possible implementation, the determining module 702, when configured to determine the part of speech of the kth participle according to the dependency relationship of the kth slice and the vector representation of the kth participle, is specifically configured to:
acquiring a part-of-speech set and vector representation of each part-of-speech in the part-of-speech set;
determining the probability that the part of speech of the kth participle is the each part of speech according to the dependency relationship of the kth slice, the vector representation of the kth participle and the vector representation of each part of speech;
and determining the part of speech of the kth participle from the part-of-speech set according to the probability that the part of speech of the kth participle is each part of speech.
In one possible embodiment, the participle corresponding to the k-1 th slice is a k-1 th participle; the determining module 702 is further configured to:
obtaining context representation of the (k-1) th participle;
determining a vector representation of the (k-1) th participle according to the context representation of the (k-1) th participle;
determining the vector representation of the k-1 th slice according to the vector representation of the k-1 th participle and the vector representation of the part of speech of the k-1 th participle; and the vector representation of the k-1 th slice is used for representing a fusion vector obtained by fusing the vector representation of the k-1 th participle and the vector representation of the part of speech of the k-1 th participle.
In a possible embodiment, if k is equal to 1, the vector of the k-1 th slice is represented by a preset first initial vector, the dependency relationship of the k-1 th slice is represented by a preset second initial vector, and the undivided text is the text to be labeled;
the determining module 702, when determining the dependency relationship of the kth slice in the text to be labeled according to the vector representation of the undivided text and the vector representation of the (k-1)th slice in the slice set, is specifically configured to:
and determining the dependency relationship of the kth slice in the text to be labeled according to the vector representation of the text to be labeled, the first initial vector representation and the second initial vector representation.
In a possible implementation manner, the obtaining module 701, when configured to obtain the candidate participle set of the kth slice in the undivided text, is specifically configured to:
and acquiring a continuous subset comprising a first word in the non-divided text from the non-divided text, and determining the continuous subset as a candidate word segmentation set of the k slice.
In the embodiments of the present application, the obtaining module obtains an undivided text and a slice set in a text to be labeled, wherein the slice set comprises the k-1 slices already divided from the text to be labeled, and the undivided text is the text other than the k-1 slices in the text to be labeled; the determining module determines the dependency relationship of the kth slice in the text to be labeled according to the vector representation of the undivided text and the vector representation of the (k-1)th slice in the slice set; the obtaining module further obtains a candidate participle set of the kth slice from the undivided text; the determining module determines the kth slice from the candidate participle set according to the dependency relationship of the kth slice and the vector representation of each candidate participle in the candidate participle set; the kth slice indicates the participle corresponding to the kth slice and the part of speech of that participle. With this apparatus, the kth round of division determines the leftmost slice of the undivided text together with its part of speech, so that part-of-speech tagging proceeds iteratively and incrementally, which can reduce time complexity and workload and improve tagging efficiency. Furthermore, determining the dependency relationship of the kth slice allows the correlation with the previously divided slice set to be captured when predicting the part of speech of the slice, which can improve the accuracy of part-of-speech tagging.
The functional modules in the embodiments of the present application may be integrated into one module, each module may exist alone physically, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module, which is not limited in this application.
Referring to fig. 8, fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 8, the electronic device 800 includes at least one processor 801 and a memory 802. Optionally, the electronic device may further include a network interface. Data may be exchanged among the processor 801, the memory 802 and the network interface. The network interface is controlled by the processor 801 to send and receive messages, the memory 802 is used to store a computer program comprising program instructions, and the processor 801 is used to execute the program instructions stored in the memory 802. The processor 801 is configured to invoke the program instructions to perform the methods described above.
The memory 802 may include a volatile memory (volatile memory), such as a random-access memory (RAM); the memory 802 may also include a non-volatile memory (non-volatile memory), such as a flash memory (flash memory), a solid-state drive (SSD), etc.; the memory 802 may also comprise a combination of the above-described types of memory.
The processor 801 may be a Central Processing Unit (CPU). In one embodiment, processor 801 may also be a Graphics Processing Unit (GPU). The processor 801 may also be a combination of a CPU and a GPU.
In one possible embodiment, the memory 802 is used to store program instructions that the processor 801 may call to perform the following steps:
acquiring an undivided text and a slice set in a text to be labeled; the slice set comprises the k-1 slices already divided from the text to be labeled, and the undivided text is the text other than the k-1 slices in the text to be labeled;
determining the dependency relationship of the kth slice in the text to be labeled according to the vector representation of the undivided text and the vector representation of the (k-1)th slice in the slice set; the dependency relationship of the kth slice represents the correlation of the kth slice with the slice set;
acquiring a candidate participle set of the kth slice from the undivided text;
determining the kth slice from the candidate participle set according to the dependency relationship of the kth slice and the vector representation of each candidate participle in the candidate participle set; the kth slice indicates the participle corresponding to the kth slice and the part of speech of that participle.
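As a non-limiting illustration, the four steps above can be read as one round of a left-to-right loop that repeats until no undivided text remains. The sketch below is a minimal reading under stated assumptions: the names `tag_incrementally`, `choose_participle`, `choose_pos`, `LEXICON`, `longest_known` and `lookup_pos` are hypothetical, the candidate set is taken to be every prefix of the undivided text, and the learned scoring that uses the dependency relationship is replaced by toy lexicon lookups.

```python
from typing import Callable, Dict, List, Tuple

Slice = Tuple[str, str]  # (participle, part of speech)


def tag_incrementally(
    text: str,
    choose_participle: Callable[[List[str], List[Slice]], str],
    choose_pos: Callable[[str, List[Slice]], str],
) -> List[Slice]:
    """Divide `text` from the left, one slice per round (assumed reading)."""
    slices: List[Slice] = []   # the slice set: the k-1 slices divided so far
    undivided = text           # for k = 1 this is the whole text to be labeled
    while undivided:
        # Candidate participles: contiguous spans starting at the first unit.
        candidates = [undivided[:end] for end in range(1, len(undivided) + 1)]
        # The hooks receive the slice history, standing in for the
        # dependency relationship of the kth slice.
        word = choose_participle(candidates, slices)
        if word not in candidates:
            raise ValueError("chosen participle must be one of the candidates")
        slices.append((word, choose_pos(word, slices)))
        undivided = undivided[len(word):]  # remaining undivided text
    return slices


# Toy stand-ins for the learned scoring steps (hypothetical lexicon).
LEXICON: Dict[str, str] = {"i": "PRON", "love": "VERB", "cats": "NOUN"}


def longest_known(candidates: List[str], history: List[Slice]) -> str:
    known = [c for c in candidates if c in LEXICON]
    return max(known, key=len) if known else candidates[0]


def lookup_pos(word: str, history: List[Slice]) -> str:
    return LEXICON.get(word, "X")


result = tag_incrementally("ilovecats", longest_known, lookup_pos)
```

Each round consumes at least one unit of the undivided text, so the loop runs for at most as many rounds as the text has units.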
In one possible implementation, the processor 801 is further configured to:
acquiring the dependency relationship of the (k-1) th slice;
the determining of the dependency relationship of the kth slice in the text to be labeled according to the vector representation of the undivided text and the vector representation of the (k-1)th slice in the slice set includes:
obtaining target vector representation according to the vector representation of the undivided text and the vector representation of the (k-1) th slice;
and determining the dependency relationship of the kth slice in the text to be labeled according to the target vector representation and the dependency relationship of the (k-1)th slice.
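As a non-limiting illustration, this embodiment resembles a recurrent update: a target vector is formed from the undivided text and the previous slice, and is then combined with the previous dependency relationship. The sketch below assumes concatenation for the target vector and a simple tanh recurrent cell for the update; the text does not fix either choice, and the names `target_vector`, `next_dependency`, `w_in` and `w_rec` are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def target_vector(text_vec: Vector, prev_slice_vec: Vector) -> Vector:
    # Assumption: the target vector is the concatenation of the two inputs.
    return list(text_vec) + list(prev_slice_vec)


def next_dependency(target: Vector, prev_dep: Vector,
                    w_in: Matrix, w_rec: Matrix) -> Vector:
    # Assumption: an Elman-style cell, d_k = tanh(W_in t_k + W_rec d_{k-1}).
    a = matvec(w_in, target)
    b = matvec(w_rec, prev_dep)
    return [math.tanh(x + y) for x, y in zip(a, b)]


# k = 1: the previous slice vector and previous dependency are preset
# initial vectors (zeros here, as an illustrative choice).
first_initial = [0.0]
second_initial = [0.0]
t1 = target_vector([0.5], first_initial)
d1 = next_dependency(t1, second_initial, w_in=[[1.0, 1.0]], w_rec=[[1.0]])
```

Feeding `d1` back in as `prev_dep` for the next round is what lets the dependency relationship carry information about all previously divided slices.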
In one possible embodiment, the participle corresponding to the kth slice is the kth participle; the processor 801, when determining the kth slice from the candidate participle set according to the dependency relationship of the kth slice and the vector representation of each candidate participle in the candidate participle set, is specifically configured to perform:
determining the probability of each candidate participle being the kth participle according to the dependency relationship of the kth slice and the vector representation of each candidate participle;
determining the kth participle from the candidate participle set according to the probability of each candidate participle;
determining the part of speech of the kth participle according to the dependency relationship of the kth slice and the vector representation of the kth participle;
and determining the kth slice according to the kth participle and the part of speech of the kth participle.
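As a non-limiting illustration, the first two steps of this embodiment can be sketched as scoring each candidate participle against the dependency relationship and normalising the scores into probabilities. The dot-product score and the softmax normalisation are assumptions made for the example, and `select_participle` is a hypothetical name.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def select_participle(
    dependency: Vector, candidate_vectors: Dict[str, Vector]
) -> Tuple[str, Dict[str, float]]:
    """Return the most probable candidate and all candidate probabilities.

    Assumes a non-empty candidate set and vectors of equal dimension.
    """
    words = list(candidate_vectors)
    # Assumed score: dot product of the dependency and the candidate vector.
    scores = [
        sum(d * c for d, c in zip(dependency, candidate_vectors[w]))
        for w in words
    ]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = {w: e / total for w, e in zip(words, exps)}
    best = max(words, key=probs.__getitem__)
    return best, probs


best, probs = select_participle(
    [1.0, -1.0],
    {"a": [0.0, 1.0], "ab": [2.0, 0.0], "abc": [0.5, 0.5]},
)
```

The selected participle is then paired with its predicted part of speech to form the kth slice.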
In a possible implementation, the processor 801, when configured to determine the part of speech of the kth participle according to the dependency relationship of the kth slice and the vector representation of the kth participle, is specifically configured to:
acquiring a part-of-speech set and vector representation of each part-of-speech in the part-of-speech set;
determining the probability that the part of speech of the kth participle is each part of speech according to the dependency relationship of the kth slice, the vector representation of the kth participle and the vector representation of each part of speech;
and determining the part of speech of the kth participle from the part-of-speech set according to the probability that the part of speech of the kth participle is each part of speech in the set.
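As a non-limiting illustration, the part-of-speech step can be sketched as a classifier over the part-of-speech set. The sketch assumes that the dependency relationship, the participle vector and every part-of-speech vector share one dimension, that the score is a sum of dot products, and that probabilities come from a softmax; none of these choices is fixed by the text, and `tag_participle` is a hypothetical name.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def tag_participle(
    dependency: Vector, word_vec: Vector, pos_vectors: Dict[str, Vector]
) -> Tuple[str, Dict[str, float]]:
    """Return the most probable part of speech and the full distribution."""
    tags = list(pos_vectors)
    # Assumed score: the part-of-speech vector is compared with both the
    # dependency relationship and the participle vector.
    scores = [
        dot(dependency, pos_vectors[t]) + dot(word_vec, pos_vectors[t])
        for t in tags
    ]
    probs = softmax(scores)
    best = max(range(len(tags)), key=probs.__getitem__)
    return tags[best], dict(zip(tags, probs))


tag, distribution = tag_participle(
    [1.0, 0.0], [0.5, 0.0], {"NOUN": [1.0, 0.0], "VERB": [0.0, 1.0]}
)
```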
In one possible embodiment, the participle corresponding to the (k-1)th slice is the (k-1)th participle; the processor 801 is further configured to perform:
obtaining a context representation of the (k-1)th participle;
determining a vector representation of the (k-1)th participle according to the context representation of the (k-1)th participle;
determining the vector representation of the (k-1)th slice according to the vector representation of the (k-1)th participle and the vector representation of the part of speech of the (k-1)th participle; the vector representation of the (k-1)th slice represents a fusion vector obtained by fusing the vector representation of the (k-1)th participle and the vector representation of the part of speech of the (k-1)th participle.
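As a non-limiting illustration, the slice representation can be sketched in two steps: pool the context representations covered by the participle into one vector, then fuse that vector with the part-of-speech vector. Mean pooling and concatenation are assumptions chosen for the example, since the text only requires that a fusion vector be obtained; `participle_vector` and `slice_vector` are hypothetical names.

```python
from typing import List, Sequence

Vector = List[float]


def participle_vector(context_reprs: Sequence[Vector]) -> Vector:
    """Mean-pool the context representations of the participle's units.

    Assumes at least one context vector, all of equal dimension.
    """
    count = len(context_reprs)
    dim = len(context_reprs[0])
    return [sum(vec[i] for vec in context_reprs) / count for i in range(dim)]


def slice_vector(word_vec: Vector, pos_vec: Vector) -> Vector:
    # Assumed fusion: concatenate the participle and part-of-speech vectors.
    return list(word_vec) + list(pos_vec)


word_vec = participle_vector([[1.0, 2.0], [3.0, 4.0]])
fused = slice_vector(word_vec, [0.5])
```

Other fusion operators, such as element-wise addition or a learned projection, would fit the same description.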
In a possible embodiment, if k is equal to 1, the vector representation of the (k-1)th slice is a preset first initial vector representation, the dependency relationship of the (k-1)th slice is a preset second initial vector representation, and the undivided text is the text to be labeled;
the processor 801, when determining the dependency relationship of the kth slice in the text to be labeled according to the vector representation of the undivided text and the vector representation of the (k-1)th slice in the slice set, is specifically configured to perform:
determining the dependency relationship of the kth slice in the text to be labeled according to the vector representation of the text to be labeled, the first initial vector representation and the second initial vector representation.
In a possible implementation, the processor 801, when configured to obtain the candidate participle set of the kth slice from the undivided text, is specifically configured to perform:
acquiring, from the undivided text, a continuous subset that includes the first word of the undivided text, and determining the continuous subset as the candidate participle set of the kth slice.
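As a non-limiting illustration, one reading of this step is that every contiguous span beginning at the first unit of the undivided text is a candidate participle, which makes the candidate set the list of prefixes. The sketch below follows that reading; the function name `candidate_participles` and the optional `max_len` cap are hypothetical additions.

```python
from typing import List, Optional


def candidate_participles(
    undivided: str, max_len: Optional[int] = None
) -> List[str]:
    """List every prefix of the undivided text, shortest first.

    `max_len` optionally caps the candidate length (an assumed
    practical limit, not something stated in the text).
    """
    limit = len(undivided) if max_len is None else min(max_len, len(undivided))
    return [undivided[:end] for end in range(1, limit + 1)]


prefixes = candidate_participles("abcd")
capped = candidate_participles("abcd", max_len=2)
```

Because every candidate starts at the leftmost position, the number of candidates per round grows only linearly with the length of the undivided text.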
In a specific implementation, the part-of-speech tagging apparatus 700, the processor 801, the memory 802 and the like described above may perform the implementations described in the foregoing method embodiments, as well as the other implementations described in the embodiments of the present application, which are not described herein again.
Also provided in the embodiments of the present application is a computer-readable storage medium storing a computer program, where the computer program includes program instructions, and the program instructions, when executed by a processor, cause the processor to perform some or all of the steps performed in the above-mentioned method embodiments. Optionally, the computer storage medium may be volatile or non-volatile. The computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
Reference herein to "a plurality" means two or more. The term "and/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware. The program can be stored in a computer storage medium, which can be a computer-readable storage medium, and when executed, the program can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
While the present disclosure has been described with reference to particular embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present disclosure.

Claims (10)

1. A part-of-speech tagging method, characterized in that the method comprises:
acquiring an undivided text and a slice set in a text to be labeled; the slice set comprises the k-1 slices already divided from the text to be labeled, and the undivided text is the text other than the k-1 slices in the text to be labeled;
determining the dependency relationship of the kth slice in the text to be labeled according to the vector representation of the undivided text and the vector representation of the (k-1)th slice in the slice set; the dependency relationship of the kth slice represents the correlation of the kth slice with the slice set;
acquiring a candidate participle set of the kth slice from the undivided text;
determining the kth slice from the candidate participle set according to the dependency relationship of the kth slice and the vector representation of each candidate participle in the candidate participle set; the kth slice indicates the participle corresponding to the kth slice and the part of speech of that participle.
2. The method of claim 1, further comprising:
acquiring the dependency relationship of the (k-1) th slice;
wherein the determining of the dependency relationship of the kth slice in the text to be labeled according to the vector representation of the undivided text and the vector representation of the (k-1)th slice in the slice set comprises:
obtaining a target vector representation according to the vector representation of the undivided text and the vector representation of the (k-1)th slice;
and determining the dependency relationship of the kth slice in the text to be labeled according to the target vector representation and the dependency relationship of the (k-1)th slice.
3. The method according to claim 1, wherein the participle corresponding to the kth slice is the kth participle; and the determining of the kth slice from the candidate participle set according to the dependency relationship of the kth slice and the vector representation of each candidate participle in the candidate participle set comprises:
determining the probability of each candidate participle being the kth participle according to the dependency relationship of the kth slice and the vector representation of each candidate participle;
determining the kth participle from the candidate participle set according to the probability of each candidate participle;
determining the part of speech of the kth participle according to the dependency relationship of the kth slice and the vector representation of the kth participle;
and determining the kth slice according to the kth participle and the part of speech of the kth participle.
4. The method of claim 3, wherein determining the part of speech of the kth participle from the dependency of the kth slice and the vector representation of the kth participle comprises:
acquiring a part-of-speech set and vector representation of each part-of-speech in the part-of-speech set;
determining the probability that the part of speech of the kth participle is each part of speech according to the dependency relationship of the kth slice, the vector representation of the kth participle and the vector representation of each part of speech;
and determining the part of speech of the kth participle from the part-of-speech set according to the probability that the part of speech of the kth participle is each part of speech in the set.
5. The method according to claim 1, wherein the participle corresponding to the (k-1)th slice is the (k-1)th participle; the method further comprises:
obtaining a context representation of the (k-1)th participle;
determining a vector representation of the (k-1)th participle according to the context representation of the (k-1)th participle;
determining the vector representation of the (k-1)th slice according to the vector representation of the (k-1)th participle and the vector representation of the part of speech of the (k-1)th participle; the vector representation of the (k-1)th slice represents a fusion vector obtained by fusing the vector representation of the (k-1)th participle and the vector representation of the part of speech of the (k-1)th participle.
6. The method according to claim 1, wherein if k is equal to 1, the vector representation of the (k-1)th slice is a preset first initial vector representation, the dependency relationship of the (k-1)th slice is a preset second initial vector representation, and the undivided text is the text to be labeled;
and the determining of the dependency relationship of the kth slice in the text to be labeled according to the vector representation of the undivided text and the vector representation of the (k-1)th slice in the slice set comprises:
determining the dependency relationship of the kth slice in the text to be labeled according to the vector representation of the text to be labeled, the first initial vector representation and the second initial vector representation.
7. The method of claim 1, wherein the acquiring of the candidate participle set of the kth slice from the undivided text comprises:
acquiring, from the undivided text, a continuous subset that includes the first word of the undivided text, and determining the continuous subset as the candidate participle set of the kth slice.
8. A part-of-speech tagging apparatus, the apparatus comprising:
an obtaining module, configured to obtain an undivided text and a slice set in a text to be labeled; the slice set comprises the k-1 slices already divided from the text to be labeled, and the undivided text is the text other than the k-1 slices in the text to be labeled;
a determining module, configured to determine the dependency relationship of the kth slice in the text to be labeled according to the vector representation of the undivided text and the vector representation of the (k-1)th slice in the slice set; the dependency relationship of the kth slice represents the correlation of the kth slice with the slice set;
the obtaining module is further configured to obtain a candidate participle set of the kth slice from the undivided text;
the determining module is further configured to determine the kth slice from the candidate participle set according to the dependency relationship of the kth slice and the vector representation of each candidate participle in the candidate participle set; the kth slice indicates the participle corresponding to the kth slice and the part of speech of that participle.
9. An electronic device comprising a processor and a memory, wherein the memory is configured to store a computer program comprising program instructions, and wherein the processor is configured to invoke the program instructions to perform the method of any of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to carry out the method according to any one of claims 1-7.
CN202110787205.9A 2021-07-13 2021-07-13 Part-of-speech tagging method and device, electronic equipment and storage medium Pending CN113468878A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110787205.9A CN113468878A (en) 2021-07-13 2021-07-13 Part-of-speech tagging method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113468878A true CN113468878A (en) 2021-10-01

Family

ID=77879923

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110787205.9A Pending CN113468878A (en) 2021-07-13 2021-07-13 Part-of-speech tagging method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113468878A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344406A (en) * 2018-09-30 2019-02-15 阿里巴巴集团控股有限公司 Part-of-speech tagging method, apparatus and electronic equipment
CN109388801A (en) * 2018-09-30 2019-02-26 阿里巴巴集团控股有限公司 The determination method, apparatus and electronic equipment of similar set of words
US20190073351A1 (en) * 2016-03-18 2019-03-07 Google Llc Generating dependency parses of text segments using neural networks
CN110276066A (en) * 2018-03-16 2019-09-24 北京国双科技有限公司 The analysis method and relevant apparatus of entity associated relationship
CN111160030A (en) * 2019-12-11 2020-05-15 北京明略软件系统有限公司 Information extraction method, device and storage medium
CN111274358A (en) * 2020-01-20 2020-06-12 腾讯科技(深圳)有限公司 Text processing method and device, electronic equipment and storage medium
CN111523302A (en) * 2020-07-06 2020-08-11 成都晓多科技有限公司 Syntax analysis method and device, storage medium and electronic equipment
CN111832297A (en) * 2020-06-15 2020-10-27 北京小米松果电子有限公司 Part-of-speech tagging method and device and computer-readable storage medium
CN112069812A (en) * 2020-08-28 2020-12-11 喜大(上海)网络科技有限公司 Word segmentation method, device, equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination